.NET Regular Expressions Demystified - Part Two

Before you read this second part, I recommend reading Part 1 so that you can follow along. If you already have some understanding of regular expressions, you can start from this article, but I still recommend checking out Part 1. You may learn something new.
 
Prerequisites
 
You must have a basic understanding of C# and Object-Oriented Programming.
 
Let’s Continue
 
In the previous part, we learned some basic constructs and concepts of regular expressions, and why they matter in real-world applications. We discussed the regular expression engine, the basics of the Regex class, and some handy examples to practice with.
 
Now, we are ready to continue with the .NET Regular Expressions Demystified series.
 

Common Confusing Point

 
A common mistake that some developers make is copying a regular expression from somewhere on the internet and using it directly in a C# program. Please remember that there are many different regular expression syntaxes, designed for and used by different programming languages. If your regular expression is not returning what you expect, make sure that you are using the syntax supported by .NET.
 

Character Classes

 
As the name suggests, a character class is a single unit that represents a set of characters; it matches any one character from that set.
 
In Regex, character classes are defined by character type: for example, all word characters belong to one class and all digit characters belong to another.
 
There are two types of character classes in .NET regular expressions.
  1. Predefined classes, which are already defined for you.
  2. Custom classes, which you define yourself.

Predefined Classes

 
Here are some of the most commonly used character classes that are already defined and available.
 
\w Matches any word character: a letter, a digit, or the underscore. For ASCII input, its range is equivalent to [A-Za-z_0-9], that is, all uppercase and lowercase letters, the digits 0-9, and the underscore. By default, .NET also matches Unicode letters and digits; with RegexOptions.ECMAScript, the class is exactly [a-zA-Z_0-9].
\W Matches any non-word character, such as ‘*’, ‘?’, ‘&’, and ‘$’. This class is the negation of \w. For ASCII input, its range is equivalent to [^A-Za-z_0-9]. The caret at the front represents negation, which means this class matches any character that is not in [A-Za-z_0-9].
\s Matches any single whitespace character. Normally, this class represents a space, newline, carriage return, tab, or vertical tab.
\S Matches any single non-whitespace character. This class is the negation of \s. It matches any character that is not a space, newline, carriage return, tab, or vertical tab.
\d Matches any single decimal digit. For ASCII input, its range is equivalent to [0-9]; by default, .NET also matches decimal digits from other Unicode scripts.
\D Matches any single non-digit character. This class is the negation of \d. For ASCII input, its range is equivalent to [^0-9].
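The predefined classes above can be tried out with a small sketch like the following. The sample text and the FindAll helper are my own illustrative assumptions, not part of the Regex API.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PredefinedClassDemo
{
    // Hypothetical helper: returns every match of the pattern in the input, in order.
    public static string[] FindAll(string input, string pattern)
    {
        MatchCollection matches = Regex.Matches(input, pattern);
        string[] values = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            values[i] = matches[i].Value;
        }
        return values;
    }

    public static void Main()
    {
        string input = "Room 42, Floor 7"; // made-up sample text

        // \d+ collects runs of digits.
        Console.WriteLine(string.Join("|", FindAll(input, @"\d+"))); // 42|7

        // \w+ collects runs of word characters (letters, digits, underscore).
        Console.WriteLine(string.Join("|", FindAll(input, @"\w+"))); // Room|42|Floor|7

        // \s matches each whitespace character separately.
        Console.WriteLine(FindAll(input, @"\s").Length); // 3
    }
}
```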
 
Example
 
@"^\(?\d{3}[) -]?\d{9}\b" finds a 12-digit standard Pakistan phone number in the following three formats.
  • "092 341787878"
  • "092-341787878"
  • "(092)341787878"
     
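Here is a minimal sketch of how the phone number pattern could be checked against the three formats. The class and method names are my own, chosen only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PhoneNumberDemo
{
    // The pattern from the example above.
    private static readonly Regex PhonePattern = new Regex(@"^\(?\d{3}[) -]?\d{9}\b");

    // Hypothetical helper: true when the input starts with a number in one of the formats.
    public static bool IsPhoneNumber(string input)
    {
        return PhonePattern.IsMatch(input);
    }

    public static void Main()
    {
        string[] inputs = { "092 341787878", "092-341787878", "(092)341787878", "92-3417878" };

        foreach (string input in inputs)
        {
            // The first three inputs print True; the last one is too short and prints False.
            Console.WriteLine(input + " -> " + IsPhoneNumber(input));
        }
    }
}
```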

Constructs for defining Custom Classes

 
The .NET regular expression syntax provides some basic constructs that you can use to create your own classes. If the predefined character classes do not meet your needs, you can define your own with a simple syntax. To define a character class, all you need is an opening square bracket ‘[’, a closing square bracket ‘]’, and some text inside, which defines either a range of characters (a-z) or individual characters. For example, the class “[abc]” matches any single lowercase ‘a’, ‘b’, or ‘c’ character. You can define your character classes in three different ways.
  1. [characters]:
     
    Matches any single character that appears inside the brackets.
     
  2. [^characters]:
     
    Matches any single character that does not appear inside the brackets.
     
  3. [first-last]:
     
    Matches any single character in the range from the first to the last character, inclusive.
Why we define custom classes
 
We define our own character classes because we want more control and flexibility. For example, if you have to find all the vowels in a document, you define the class “[aeiou]” for it. If you want to find all the characters from ‘e’ to ‘j’, you define another class for it, “[e-j]”.
 
 
Example
 
"[aeiou]" matches any vowel.
 
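A small sketch of the vowel class, together with the “[e-j]” range mentioned earlier, might look like this. The sample text and the Collect helper are assumptions made for the demonstration.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class CustomClassDemo
{
    // Hypothetical helper: concatenates every single-character match into one string.
    public static string Collect(string input, string pattern)
    {
        StringBuilder found = new StringBuilder();
        foreach (Match match in Regex.Matches(input, pattern))
        {
            found.Append(match.Value);
        }
        return found.ToString();
    }

    public static void Main()
    {
        string input = "regular expression"; // made-up sample text

        // [aeiou] matches each lowercase vowel.
        Console.WriteLine(Collect(input, "[aeiou]")); // euaeeio

        // [e-j] matches each character in the range e through j.
        Console.WriteLine(Collect(input, "[e-j]")); // egeei
    }
}
```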
 
Negation Example
 
"[^aeiou]" matches any character, which is not a vowel.
 
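The negated class can be sketched in the same way. Keep in mind that “[^aeiou]” matches anything that is not a lowercase vowel, including spaces, digits, and punctuation. The helper name below is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NegationDemo
{
    // Hypothetical helper: removes every character that is not a lowercase vowel.
    public static string KeepVowels(string input)
    {
        return Regex.Replace(input, "[^aeiou]", "");
    }

    public static void Main()
    {
        // The consonants and the space are matched by [^aeiou] and removed.
        Console.WriteLine(KeepVowels("hello world")); // eoo

        // Digits and punctuation are non-vowels too, so they also match.
        Console.WriteLine(Regex.Matches("a1 b!", "[^aeiou]").Count); // 4
    }
}
```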
 

Anchors

 
Anchors are also called atomic zero-width assertions. An anchor doesn’t represent a character; it represents a position in the string and matches there without consuming any character. For example, the caret ‘^’ anchor represents the beginning of the string, and the dollar sign ‘$’ represents the end of the string. The ‘\b’ anchor represents a word boundary, that is, the beginning or the end of a word in the string. Given below is a list of some anchors, which I referenced from MSDN.
 
^ Matches at the beginning of the whole string, or at the beginning of each line in multiline mode.
$ Matches at the end of the whole string or before a final \n, or at the end of each line in multiline mode.
\A Matches at the beginning of the string only. (input string as a whole)
\z Matches at the very end of the string only. (input string as a whole)
\b Matches at a word boundary, which is the beginning or the end of a word.
\Z Matches at the end of the string or before the \n at the end of the string.
 
Example: @"\bM\w*\b" finds any word that starts with an uppercase M.
 
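The word-boundary example can be sketched as below. I have also added a short comparison of ‘$’, ‘\Z’, and ‘\z’ on an input that ends with a newline. The sample sentence and the helper name are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // Hypothetical helper: returns every word that starts with an uppercase M.
    public static List<string> WordsStartingWithM(string input)
    {
        List<string> words = new List<string>();
        foreach (Match match in Regex.Matches(input, @"\bM\w*\b"))
        {
            words.Add(match.Value);
        }
        return words;
    }

    public static void Main()
    {
        // "met" is skipped because matching is case-sensitive by default.
        Console.WriteLine(string.Join("|", WordsStartingWithM("Mary met Mike on Monday"))); // Mary|Mike|Monday

        string input = "end\n";
        Console.WriteLine(Regex.IsMatch(input, @"end$"));  // True: $ also matches before a final \n
        Console.WriteLine(Regex.IsMatch(input, @"end\Z")); // True: \Z behaves the same way here
        Console.WriteLine(Regex.IsMatch(input, @"end\z")); // False: \z matches only at the very end
    }
}
```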
 

Quantifiers

 
Quantifiers instruct the regular expression engine to repeat the previous element a certain number of times.
 
+ Matches the previous element 1 or more times.
* Matches the previous element 0 or more times.
? Matches the previous element 0 or 1 time.
{n} Matches the previous element exactly n times.
{n,} Matches the previous element n or more times.
{n,m} Matches the previous element at least n times but no more than m times.
 
A simple regular expression for finding a 4-digit number may look like this: “\d\d\d\d”. If we use one of the quantifiers above, it becomes “\d{4}”. Both expressions do the same job, but quantifiers make our regular expressions more concise and readable.
 
Normally, quantifiers are greedy: they try to match as many times as possible. If you follow one of them with ‘?’, it becomes lazy and matches as few times as possible.
 
For example, suppose you want to match ‘foo’ in the word ‘fool’. The regular expression “fo+” finds exactly the match ‘foo’. However, if you follow the ‘+’ quantifier with a ‘?’ sign, as in “fo+?”, your expression finds the match ‘fo’, because putting ‘?’ after any quantifier makes it match as few characters as possible.
 
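The greedy and lazy behaviour described above can be checked with a short sketch. The FirstMatch helper and the "Year 2024" sample are my own assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Hypothetical helper: returns the text of the first match, or an empty string if there is none.
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        // Greedy: + takes as many 'o' characters as it can.
        Console.WriteLine(FirstMatch("fool", "fo+")); // foo

        // Lazy: +? stops after the first 'o'.
        Console.WriteLine(FirstMatch("fool", "fo+?")); // fo

        // \d{4} and \d\d\d\d find the same 4-digit number.
        Console.WriteLine(FirstMatch("Year 2024", @"\d{4}"));     // 2024
        Console.WriteLine(FirstMatch("Year 2024", @"\d\d\d\d")); // 2024
    }
}
```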
 

Grouping Constructs

 
As the name suggests, grouping constructs are used to group one or more elements of a regular expression so that they can be used as a single unit. Grouping is used to create subexpressions inside the main regular expression.
 
Grouping lets us treat several elements as one unit that can be repeated or captured. A group is created with an opening and a closing parenthesis around a subexpression, as in the expression “(\b[0-9]{3}\w+\b)?\w{3}\b”.
 
 
By grouping, we can delimit subexpressions and apply repetition or other special treatment to them, as in (\d*[0-9]{3})?
 
Example
 
The pattern “^\d{5}(-\d{4})?$” matches a valid 5-digit Pakistan postal code, optionally followed by a hyphen and 4 more digits.
 
The first part of the expression, “^\d{5}”, finds a 5-digit number. The second part tries to find a hyphen “-” followed by a 4-digit number. The question mark ‘?’ denotes that this last part, grouped by parentheses, may or may not be present, because the ‘?’ quantifier means zero or one occurrence of the preceding element.
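As a rough sketch, the optional group can be inspected through the Groups collection of the match. The Describe helper and the sample inputs are assumptions made for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupingDemo
{
    // The postal code pattern from the example above.
    private static readonly Regex PostalCode = new Regex(@"^\d{5}(-\d{4})?$");

    // Hypothetical helper: reports whether the input matched and whether the optional group took part.
    public static string Describe(string input)
    {
        Match match = PostalCode.Match(input);
        if (!match.Success)
        {
            return "no match";
        }

        // Groups[1] is the parenthesized part; Success is false when it matched nothing.
        Group extension = match.Groups[1];
        return extension.Success
            ? "match with extension " + extension.Value
            : "match without extension";
    }

    public static void Main()
    {
        Console.WriteLine(Describe("54000"));      // match without extension
        Console.WriteLine(Describe("54000-1234")); // match with extension -1234
        Console.WriteLine(Describe("5400"));       // no match
    }
}
```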
 
Alternation Constructs
 
Alternation constructs are used to define two or more subexpressions, any one of which may match. The pipe ‘|’ sign separates the alternative subexpressions.
 
The regular expression engine evaluates the alternative subexpressions from left to right and reports a match as soon as one of them succeeds. While designing your expressions for alternation, you should write the more specific subexpressions on the left and the more general ones on the right to get the expected results.
 
 
Example
 
The expression “^\d{5}|\d{3}$” matches either a 5-digit number at the start of the input or a 3-digit number at the end of it. Note that alternation has the lowest precedence, so ‘^’ belongs only to the first alternative and ‘$’ only to the second; to anchor both alternatives, group them as “^(\d{5}|\d{3})$”. I wrote the 5-digit expression on the left side because the evaluation starts from the left. First, the engine tries to find a 5-digit number, and if it does not find one, it moves on to the next subexpression to find a match for the 3-digit one. If I reversed the order of the subexpressions, the first alternative would match the first three digits of a 5-digit number, because every 5-digit number also starts with 3 digits.
  using System;
  using System.Text.RegularExpressions;

  public class AlternationDemo
  {
      static void Main(string[] args)
      {
          // Either five digits at the start of the input or three digits at the end.
          string pattern = @"^\d{5}|\d{3}$";
          string input = "12345 345";
          Regex regex = new Regex(pattern);
          MatchCollection matchCollection = regex.Matches(input);
          Console.WriteLine("\tMatches");
          // Print every match found in the input.
          foreach (Match match in matchCollection)
          {
              Console.WriteLine(match.Value);
          }
          Console.ReadLine(); // Keep the console window open.
      }
  }
Result
 
The program prints the heading “Matches” followed by the two matches, 12345 and 345.
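To see how the anchors and the order of alternatives affect the outcome, here is a small sketch that compares a few variants of the pattern. The grouped variant “^(\d{5}|\d{3})$” and the helper names are my own additions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AlternationOrderDemo
{
    // Hypothetical helper: returns the text of the first match, or an empty string if there is none.
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    // Hypothetical helper: true only when the whole input is a 5-digit or a 3-digit number.
    public static bool IsWholeNumber(string input)
    {
        return Regex.IsMatch(input, @"^(\d{5}|\d{3})$");
    }

    public static void Main()
    {
        // The leftmost alternative that succeeds wins.
        Console.WriteLine(FirstMatch("12345", @"\d{5}|\d{3}")); // 12345
        Console.WriteLine(FirstMatch("12345", @"\d{3}|\d{5}")); // 123

        // Without grouping, "1234" still matches through the second alternative (234 at the end).
        Console.WriteLine(Regex.IsMatch("1234", @"^\d{5}|\d{3}$")); // True

        // With grouping, both anchors apply to both alternatives.
        Console.WriteLine(IsWholeNumber("1234"));  // False
        Console.WriteLine(IsWholeNumber("12345")); // True
        Console.WriteLine(IsWholeNumber("345"));   // True
    }
}
```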
 

Conclusion

 
In these two parts of the regular expression series, we just scratched the surface of regular expressions. We learned some basic concepts around regular expressions and the regular expression engine.
 
In the first part, we learned the basic syntax of regular expressions and how we can use the Regex class to utilize the power of regular expressions in .NET-based applications.
 
In the second part, we learned about character classes, anchors, quantifiers, grouping, and alternation.
 
The purpose of this series was to get you started with the essential knowledge of regular expressions so that you can write and experiment with your own. This is just the beginning. Now, you are ready to learn more advanced features of regular expressions. If you are curious and want to gain some serious knowledge about regular expressions, go ahead and check Microsoft's official documentation about .NET Regular Expressions at MSDN.