Understanding Regular Expressions

Abhishek Duppati

Introduction

This article explains what regular expressions are and why we use them when we already have string operations. We look at how long it takes to learn them, what elements they contain, and which languages support them. We also see where we can run an expression to test or visualize it. Some important regular expressions are included as references.

What is a Regular Expression?

A Regular Expression, or regex, is a pattern that we search for in text. A regex helps with matching, locating, and managing text.

What is the Use of Regex?

A regex can save you a massive amount of time when you need to parse large amounts of text data.

Why Do We Need Regex When We Have String Operations?

It depends on how we use it and on what type of data, because regular expressions can be slower than plain string operations. On the other hand, much depends on how cleverly you write the pattern and apply it to your data. Moreover, regex is mostly used not for performance but to handle complex logic with very little code.
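
To make the "complex logic with very little code" point concrete, here is a minimal sketch in Python (my choice for illustration; the sample strings are made up). It pulls numbers out of a string first with plain string operations and then with a regex.

```python
import re

text = "Order 66 shipped 3 items on day 12"

# String operations only: split on whitespace and keep all-digit tokens.
numbers_by_split = [token for token in text.split() if token.isdigit()]

# Regex: one pattern describes "a run of digits".
numbers_by_regex = re.findall(r"\d+", text)

print(numbers_by_split)  # ['66', '3', '12']
print(numbers_by_regex)  # ['66', '3', '12']

# When digits are glued to other characters, the split-based version
# finds nothing, while the same regex still works.
glued = "id=42;qty=7"
glued_by_split = [token for token in glued.split() if token.isdigit()]
glued_by_regex = re.findall(r"\d+", glued)

print(glued_by_split)  # []
print(glued_by_regex)  # ['42', '7']
```

The string-operation version could be extended to handle the second input, but every new input shape needs more code, while the pattern stays the same.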

How Long Does it Take to Learn Regex?

I would say you can learn the basics in about 30 minutes. Beyond that, it is a never-ending learning process. What I mean is that even after you learn the syntax needed for regex, which we will look at below, you’ll keep learning, because almost every piece of code we come across needs its own regular expression for the pattern it has to match.

What is a Regular Language?

A regular (or rational) language is a formal language that can be expressed with a regular expression. Equivalently, it is a language recognized by a finite automaton. A formal language is a set of words whose letters are taken from an alphabet and that are formed according to a specific set of rules.

Can We Only Use Regex in C#?

Regex is supported in many languages and tools, including C#, Java, Perl, JavaScript, MySQL, and Oracle. MSSQL has only plain SQL operators and functions such as LIKE and PATINDEX, which are sufficient for simple cases; for anything they cannot express, Eval SQL.NET provides SQL regex functions (IsMatch, Match, Matches, Replace, Split).

What is the Regex Made of?

A regex is built from elements such as basic syntax, positions, characters, special characters, escape sequences, groups and ranges, quantifiers, assertions, string replacement, and pattern modifiers.

Regular Expression Elements

Basic Syntax:
/…/	Start and end regex delimiters
()	Grouping
|	Alternation

Position:
^	Start of string (or of a line in multiline mode)
$	End of string (or of a line in multiline mode)
\A	Start of string
\Z	End of string
\b	Word boundary
\B	Non-word boundary
\<	Start of word
\>	End of word

Groups and Ranges:
.	Any character except \n
(…)	Capturing group
(a|b)	a or b
(?:…)	Non-capturing group
[abc]	a, b or c
[^abc]	Not a, b or c
[a-z]	Lowercase letters from a to z
[A-Z]	Uppercase letters from A to Z
[0-9]	Digits from 0 to 9

Character:
\s	Whitespace
\S	Non-whitespace
\w	Word character
\W	Non-word character
\d	Digit
\D	Non-digit
\x	Hexadecimal digit (flavor-dependent; in many flavors \x only introduces a hex escape such as \xaa)
\0	Octal digit (flavor-dependent; in many flavors \0 only introduces an octal escape such as \0nn)
[\b]	Backspace character

Quantifiers:
*	Zero or more
+	One or more
?	Zero or one
{2}	Exactly two
{2,}	Two or more
{2,6}	Between 2 and 6 (2, 3, 4, 5 or 6)

Special Characters:
\f	Form feed
\n	Newline
\r	Carriage return
\t	Tab
\v	Vertical tab
\xaa	Hex character aa
\0nn	Octal character nn (each n between 0 and 7)

String Replacement:
$+	Last matched group
$&	Entire matched string
$`	Text before the match
$'	Text after the match
$1	First group
$n	nth group

Escape Sequences:
\Q	Begin literal sequence
\E	End literal sequence
\	Escape the following character, such as {}^$.|*+?

Assertions:
?=	Lookahead assertion
?<=	Lookbehind assertion
?!	Negative lookahead
?<!	Negative lookbehind
?>	Once-only subexpression (atomic group)
?()	Condition (if-then)
?()|	Condition (if-then-else)
?#	Comment

Pattern Modifiers (Flags):
g	Global match
s	Single-line mode (. matches everything, including line breaks)
m	Multiline mode (^ and $ match the start and end of each line)
e	Evaluate replacement
i	Case insensitive (ignore case)
U	Ungreedy mode
x	Allow comments and whitespace in the pattern

POSIX (Portable Operating System Interface) character classes:
[:alpha:]	All letters
[:upper:]	Uppercase letters
[:lower:]	Lowercase letters
[:alnum:]	Digits and letters
[:digit:]	Digits
[:xdigit:]	Hexadecimal digits
[:punct:]	Punctuation
[:word:]	Digits, letters and underscore
[:blank:]	Space and tab
[:space:]	Whitespace characters
[:cntrl:]	Control characters
[:graph:]	Printed characters
[:print:]	Printed characters and spaces
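
The flag and replacement syntax in the table varies between languages. As a rough illustration, here is how a few of the entries look in Python's re module (my choice of language; the sample text is made up). Python has no g flag because findall and sub already process every match, the i, m and s flags are spelled re.IGNORECASE, re.MULTILINE and re.DOTALL, and a group is referenced as \1 in a replacement rather than $1.

```python
import re

text = "First line\nsecond LINE"

# i flag: ignore case.
ignore_case = re.findall(r"line", text, flags=re.IGNORECASE)
print(ignore_case)  # ['line', 'LINE']

# m flag: ^ and $ also match at the start and end of each line.
first_words_default = re.findall(r"^\w+", text)
first_words_multiline = re.findall(r"^\w+", text, flags=re.MULTILINE)
print(first_words_default)    # ['First']
print(first_words_multiline)  # ['First', 'second']

# s flag: the dot also matches line breaks.
dot_default = re.search(r"First.*LINE", text)
dot_all = re.search(r"First.*LINE", text, flags=re.DOTALL)
print(dot_default is None)  # True
print(dot_all is not None)  # True

# Replacement with group references: swap two words.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")
print(swapped)  # world hello
```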

Matching Regex with Test String

In the examples below, the leading slash (/) and the trailing /g are the delimiters and flag that an online regex tester places around the expression; they can be changed as needed. I tested each expression against a test string in an online regex tester.

/[abc]+/g : Matches one or more consecutive characters out of a, b or c (case sensitive)

To see how a regular expression is put together, you can use any of the regex visualizers available online.

/[^abc]/g : Matches any character except a, b or c

/[a-z]/g : Matches any character between a and z

Here, a-z is a single character in the range between a and z (case sensitive).

/[^a-z]/ : Matches a character not in the range a-z

/[a-zA-Z]+/g : Matches one or more characters in the range a-z or A-Z

A-Z is a single character in the range between A and Z (case sensitive).
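
The character classes above can be tried out with a short Python sketch (Python and the sample strings are my own choices for illustration). re.findall returns every match, which plays the role of the g flag.

```python
import re

# [abc]+ : runs of one or more of a, b or c (uppercase C is not included).
runs = re.findall(r"[abc]+", "Cab fare: abc123")
print(runs)  # ['ab', 'a', 'abc']

# [^abc] : any single character except a, b or c.
not_abc = re.findall(r"[^abc]", "cab!")
print(not_abc)  # ['!']

# [a-z] : a single lowercase letter.
lower = re.findall(r"[a-z]", "aBc")
print(lower)  # ['a', 'c']

# [^a-z] : a single character that is not a lowercase letter.
not_lower = re.findall(r"[^a-z]", "aBc1")
print(not_lower)  # ['B', '1']

# [a-zA-Z]+ : runs of letters in either case.
words = re.findall(r"[a-zA-Z]+", "Hello, World 42")
print(words)  # ['Hello', 'World']
```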

/.+/ : Matches one or more of any character

. matches any single character except line terminators; with the s flag it matches line terminators too.

/\s/g : Matches any whitespace character

To require whitespace, we use [\s] or \s.

\s matches any whitespace character and is roughly equivalent to [ \r\n\t\f\v]

/\d/g : Matches any Digit

/\D+/g : Matches one or more non-digit characters

/\w+/g : Matches one or more word characters

\w matches any letter, digit, or underscore. Equivalent to [a-zA-Z0-9_].

/\W+/g : Matches one or more non-word characters
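
As a quick check of the shorthand classes above, here is a small Python sketch on a made-up string. One caveat that is specific to Python 3: on text strings, \w and \d also match non-ASCII letters and digits unless the re.ASCII flag is given, so the [a-zA-Z0-9_] equivalence holds exactly only with that flag.

```python
import re

text = "Room 101, floor_2"

digits = re.findall(r"\d", text)          # single digits
non_digits = re.findall(r"\D+", text)     # runs of non-digits
word_runs = re.findall(r"\w+", text)      # runs of letters, digits, underscore
non_word_runs = re.findall(r"\W+", text)  # runs of everything else
spaces = re.findall(r"\s", text)          # single whitespace characters

print(digits)         # ['1', '0', '1', '2']
print(non_digits)     # ['Room ', ', floor_']
print(word_runs)      # ['Room', '101', 'floor_2']
print(non_word_runs)  # [' ', ', ']
print(spaces)         # [' ', ' ']
```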

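(…) : Captures everything enclosed

Captures the text matched by the pattern inside the parentheses; the enclosed characters must appear in that sequence.

(a|b) : Matches either a or b

bh? : Matches b followed by zero or one h

If you want anything to be optional, put a ? after it. The ? applies only to the element immediately before it.

ab* : Matches a followed by zero or more b

Here, * applies only to b. To repeat the whole sequence ab, group it: (ab)*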
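
A quantifier binds to the single element before it unless that element is a group. The Python sketch below (sample strings are made up) shows this for bh? and ab*, along with a capturing group and an alternation.

```python
import re

# Capturing groups: each (...) captures the text its pattern matched.
page_range = re.search(r"(\d+)-(\d+)", "pages 10-25").groups()
print(page_range)  # ('10', '25')

# Alternation inside a non-capturing group: a or e between "gr" and "y".
spellings = re.findall(r"gr(?:a|e)y", "gray grey griy")
print(spellings)  # ['gray', 'grey']

# bh? : the ? makes only the h optional.
optional_h = re.findall(r"bh?", "b bh bhh")
print(optional_h)  # ['b', 'bh', 'bh']

# ab* : the * repeats only the b, so "abab" is not matched as a whole.
ab_star_abbb = re.fullmatch(r"ab*", "abbb") is not None
ab_star_abab = re.fullmatch(r"ab*", "abab") is not None
grouped_abab = re.fullmatch(r"(ab)*", "abab") is not None
print(ab_star_abbb)  # True
print(ab_star_abab)  # False
print(grouped_abab)  # True
```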

/^\w+/ : Here, ^ matches the start of a string

/\w+$/ : Here, $ matches the end of a string

/d\b/g : Here, \b is a word boundary

For example, in the word diseased, d\b matches the last letter d, whereas \bd matches the first letter d.

/r\B/g : Here, \B is a non-word boundary; it matches at every position where \b does not match.
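
The anchors and boundaries above match positions rather than characters. A small Python sketch makes this visible by printing the index at which each match starts ("river" is my own extra sample word).

```python
import re

sentence = "hello big world"

start_word = re.findall(r"^\w+", sentence)  # word at the start of the string
end_word = re.findall(r"\w+$", sentence)    # word at the end of the string
print(start_word)  # ['hello']
print(end_word)    # ['world']

# Index of each match in "diseased" (indices run from 0 to 7).
last_d = [m.start() for m in re.finditer(r"d\b", "diseased")]
first_d = [m.start() for m in re.finditer(r"\bd", "diseased")]
print(last_d)   # [7]
print(first_d)  # [0]

# r\B : an r that is NOT followed by a word boundary.
# The first r in "river" qualifies; the final r sits at a boundary.
inner_r = [m.start() for m in re.finditer(r"r\B", "river")]
print(inner_r)  # [0]
```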

Important Regular Expressions to know:

To match duplicate words in a string: /(\b\w+\b)(?=.*\b\1\b)/
To match a Username: /^[a-z0-9_-]{3,16}$/
To match a Password: /^[a-z0-9_-]{6,18}$/
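
Here is a minimal Python sketch of the duplicate-word and username patterns, using made-up inputs. The duplicate pattern reports a word each time the same word appears again later in the string, so the last occurrence of a word is never reported.

```python
import re

DUPLICATE_WORD = re.compile(r"(\b\w+\b)(?=.*\b\1\b)")
USERNAME = re.compile(r"^[a-z0-9_-]{3,16}$")

# findall returns the captured word for every match.
repeated = DUPLICATE_WORD.findall("one two one three two")
print(repeated)  # ['one', 'two']

# Only lowercase letters, digits, _ and - are allowed, 3 to 16 characters.
candidates = ["john_doe-99", "Jo", "ab", "x" * 17]
valid = [name for name in candidates if USERNAME.search(name)]
print(valid)  # ['john_doe-99']
```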

Password Strength

For Complex: (Should have 1 lowercase letter, 1 uppercase letter, 1 number, 1 special character and be at least 8 characters)

/(?=(.*[0-9]))(?=.*[\!@#$%^&*()\\[\]{}\-_+=~`|:;"'<>,./?])(?=.*[a-z])(?=(.*[A-Z]))(?=(.*)).{8,}/

For Moderate: (Should have 1 lowercase letter, 1 uppercase letter, 1 number, and be at least 8 characters)

/(?=(.*[0-9]))((?=.*[A-Za-z0-9])(?=.*[A-Z])(?=.*[a-z]))^.{8,}$/
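
The two patterns can be combined into a simple strength check. This is only a sketch in Python: the strength function and its labels are hypothetical names of my own, and the complex pattern is written as a triple-quoted raw string because it contains both kinds of quotation marks.

```python
import re

# The patterns are copied from the text above without the / delimiters.
COMPLEX = re.compile(
    r"""(?=(.*[0-9]))(?=.*[\!@#$%^&*()\\[\]{}\-_+=~`|:;"'<>,./?])(?=.*[a-z])(?=(.*[A-Z]))(?=(.*)).{8,}"""
)
MODERATE = re.compile(
    r"(?=(.*[0-9]))((?=.*[A-Za-z0-9])(?=.*[A-Z])(?=.*[a-z]))^.{8,}$"
)


def strength(password: str) -> str:
    """Classify a password with the two patterns (hypothetical helper)."""
    if COMPLEX.search(password):
        return "complex"
    if MODERATE.search(password):
        return "moderate"
    return "weak"


print(strength("Passw0rd!"))  # complex
print(strength("Passw0rd"))   # moderate
print(strength("password"))   # weak
```

Each (?=…) lookahead checks one requirement without consuming any characters, which is why the requirements can be listed in any order.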

To match an Email: /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/
To match a Hex Value: /^#?([a-f0-9]{6}|[a-f0-9]{3})$/
To match a Slug: /^[a-z0-9-]+$/
To match a URL:

http(s) Protocol: /https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#()?&//=]*)/
Optional Protocol: /(https?:\/\/)?(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)/
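
The email, hex value, slug and http(s) URL patterns above can be exercised with a short Python sketch. The inputs are made up, and the patterns are used exactly as written, so they accept lowercase email addresses only.

```python
import re

EMAIL = re.compile(r"^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$")
HEX_VALUE = re.compile(r"^#?([a-f0-9]{6}|[a-f0-9]{3})$")
SLUG = re.compile(r"^[a-z0-9-]+$")
URL = re.compile(
    r"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b"
    r"([-a-zA-Z0-9@:%_\+.~#()?&//=]*)"
)

# The three email groups are the user name, the domain and the suffix.
email_parts = EMAIL.search("john.doe@example.com").groups()
print(email_parts)  # ('john.doe', 'example', 'com')

# Uppercase letters are rejected because the pattern lists a-z only.
uppercase_email = EMAIL.search("John@Example.com")
print(uppercase_email)  # None

hex_values = [v for v in ["#1a2b3c", "fff", "#12345", "#GGG"] if HEX_VALUE.search(v)]
print(hex_values)  # ['#1a2b3c', 'fff']

slugs = [s for s in ["my-first-post", "My Post"] if SLUG.search(s)]
print(slugs)  # ['my-first-post']

# The URL pattern is not anchored, so it can find a URL inside a sentence.
found_url = URL.search("Visit https://www.example.com/docs?page=2 today").group(0)
print(found_url)  # https://www.example.com/docs?page=2
```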

To match an IP Address: /^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/
To match an HTML Tag: /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/
To match a whole text or line that does not contain a word hello: /^(?!.*?hello).*$/
To match a line other than those that end with .hello: /.*(?<!\.hello)$/ (drop the \. to exclude any line that ends with hello)
To match only when a line contains none of several words: /^(?!.*(hello|hola|Salve|Bonjour|Shalom))/
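
The IP address pattern and the negative lookahead and lookbehind patterns can be compared side by side in a small Python sketch (inputs are made up). Note that the lookbehind pattern, as written, rejects only lines that end with .hello including the dot.

```python
import re

IPV4 = re.compile(
    r"^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
    r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$"
)
NO_HELLO = re.compile(r"^(?!.*?hello).*$")
NOT_DOT_HELLO_END = re.compile(r".*(?<!\.hello)$")
NO_GREETING = re.compile(r"^(?!.*(hello|hola|Salve|Bonjour|Shalom))")

ips = [ip for ip in ["192.168.1.1", "256.1.1.1", "10.0.0"] if IPV4.search(ip)]
print(ips)  # ['192.168.1.1']

lines = ["say hello", "goodbye", "file.hello", "hola amigo"]

# Lines that do not contain "hello" anywhere.
without_hello = [line for line in lines if NO_HELLO.search(line)]
print(without_hello)  # ['goodbye', 'hola amigo']

# Lines that do not end with ".hello" ("say hello" still passes).
not_dot_hello = [line for line in lines if NOT_DOT_HELLO_END.search(line)]
print(not_dot_hello)  # ['say hello', 'goodbye', 'hola amigo']

# Lines that contain none of the listed greetings.
no_greeting = [line for line in lines if NO_GREETING.search(line)]
print(no_greeting)  # ['goodbye']
```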
To match Time:

Time Format HH:MM 12-hour, optional leading 0

/^(0?[1-9]|1[0-2]):[0-5][0-9]$/

Time Format HH:MM 12-hour, optional leading 0, Meridiems (AM/PM)

/((1[0-2]|0?[1-9]):([0-5][0-9]) ?([AaPp][Mm]))/

Time Format HH:MM 24-hour with leading 0

/^(0[0-9]|1[0-9]|2[0-3]):[0-5][0-9]$/

Time Format HH:MM 24-hour, optional leading 0

/^([0-9]|0[0-9]|1[0-9]|2[0-3]):[0-5][0-9]$/

Time Format HH:MM:SS 24-hour

/(?:[01]\d|2[0123]):(?:[012345]\d):(?:[012345]\d)/
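
To compare the time formats, the Python sketch below runs the same made-up samples through the anchored patterns and uses the two unanchored patterns to find a time inside a longer string.

```python
import re

H12 = re.compile(r"^(0?[1-9]|1[0-2]):[0-5][0-9]$")
H12_MERIDIEM = re.compile(r"((1[0-2]|0?[1-9]):([0-5][0-9]) ?([AaPp][Mm]))")
H24_LEADING_ZERO = re.compile(r"^(0[0-9]|1[0-9]|2[0-3]):[0-5][0-9]$")
H24_OPTIONAL_ZERO = re.compile(r"^([0-9]|0[0-9]|1[0-9]|2[0-3]):[0-5][0-9]$")
H24_SECONDS = re.compile(r"(?:[01]\d|2[0123]):(?:[012345]\d):(?:[012345]\d)")

samples = ["9:05", "09:05", "12:59", "13:00", "23:59", "24:00"]

h12 = [s for s in samples if H12.search(s)]
h24_leading = [s for s in samples if H24_LEADING_ZERO.search(s)]
h24_optional = [s for s in samples if H24_OPTIONAL_ZERO.search(s)]
print(h12)           # ['9:05', '09:05', '12:59']
print(h24_leading)   # ['09:05', '12:59', '13:00', '23:59']
print(h24_optional)  # ['9:05', '09:05', '12:59', '13:00', '23:59']

# Groups: whole time, hour, minutes, meridiem.
meridiem_parts = H12_MERIDIEM.search("Meet at 7:45 pm").groups()
print(meridiem_parts)  # ('7:45 pm', '7', '45', 'pm')

with_seconds = H24_SECONDS.search("log 08:15:59 ok").group(0)
print(with_seconds)  # 08:15:59
```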

To match a city (for example, Hyderabad) in an address line, without the surrounding spaces and commas:

Regular Expression: /[^\s,][^,]*(?=,[^,]*$)/
Text String: 500001 Telangana, Hyderabad, India
Explanation:
[^\s,] matches one character that is neither whitespace nor a comma.
[^,]* matches zero or more characters other than a comma.
(?=,[^,]*$) is a positive lookahead that requires a comma followed by zero or more non-comma characters up to the end of the string ($).
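
The explanation above can be checked with a few lines of Python. The pattern picks the field just before the last comma, because the lookahead allows exactly one more comma before the end of the string.

```python
import re

CITY = re.compile(r"[^\s,][^,]*(?=,[^,]*$)")

address = "500001 Telangana, Hyderabad, India"
city = CITY.search(address).group(0)
print(city)  # Hyderabad
```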

To select a line with 3 commas out of a text document that also includes lines with 1 and 2 commas:

Regular Expression: /.*,.*,.*,/g
Text String:
lovely day, lovely day, lovely day
lovely day, lovely day, lovely day, lovely day
lovely day, lovely day
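
Running the expression in Python shows that the match stops at the third comma instead of covering the whole line. The second pattern below is a variant of my own, not part of the original list, that selects the complete line when it contains exactly three commas.

```python
import re

text = (
    "lovely day, lovely day, lovely day\n"
    "lovely day, lovely day, lovely day, lovely day\n"
    "lovely day, lovely day"
)

# Original pattern: only the second line matches, up to its third comma.
up_to_third_comma = re.findall(r".*,.*,.*,", text)
print(up_to_third_comma)  # ['lovely day, lovely day, lovely day,']

# Variant (an assumption of mine): whole lines with exactly three commas.
whole_lines = re.findall(r"^(?:[^,\n]*,){3}[^,\n]*$", text, flags=re.MULTILINE)
print(whole_lines)  # ['lovely day, lovely day, lovely day, lovely day']
```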

To match groups that have the from_ prefix in text:

Regular Expression: /(?:^|\s)from_(.*?)(?:\s|$)/g
Text String: Stay safe from_Covid-19 from_Corona from_Virus
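
One thing to watch for with this expression: when I run it in Python, each match consumes the space that follows it, so an item that comes directly after another match has no leading whitespace left to match and is skipped. The second pattern is a variant of my own that uses a lookbehind instead of consuming the surrounding whitespace.

```python
import re

text = "Stay safe from_Covid-19 from_Corona from_Virus"

# Original pattern: from_Corona is skipped because the space before it
# was already consumed by the previous match.
original = re.findall(r"(?:^|\s)from_(.*?)(?:\s|$)", text)
print(original)  # ['Covid-19', 'Virus']

# Variant (an assumption of mine): (?<!\S) requires that from_ is not
# preceded by a non-whitespace character, without consuming anything.
variant = re.findall(r"(?<!\S)from_(\S+)", text)
print(variant)  # ['Covid-19', 'Corona', 'Virus']
```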

Conclusion

Use regex mostly when you need to find a complex pattern in a string and have no other efficient way to do it.

Hope this helps! Happy Coding!