Regular Expressions Part 2: Grouping, Quantifiers, and Character Classes

Before reading this article, I highly recommend reading the previous part of this series.

Welcome to the second part of a series of articles on the fascinating world of regular expressions! The journey continues with extensive coverage of the syntax of this mysterious creature. We will explore character classes, grouping, and quantifiers, so hold on tight, folks!

So, what is a character class? A character class is nothing more than a collection of characters in a certain category. We can create character classes in a number of ways, and each way is useful in many cases and scenarios.

For example, if we want to find a character that is part of a word, we type a backslash (\) followed by a lowercase w.

These character classes are known as short-hand character classes, because they are already built into the regular expression engine.

Here are some of the most popular short-hand character classes.

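  • \w This finds a character in the word category (a through z, A through Z, 0 through 9, and the underscore character). By default, .NET also counts letters and digits from other Unicode scripts as word characters.
  • \b This finds a word boundary, that is, the position between a word character and a non-word character (or the start or end of the text). Strictly speaking, \b is not a character class: it matches a position and consumes no characters, so it matches the spot next to a space rather than the space itself.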
  • \s This finds a white space character, such as a tab, a linefeed, or a space.
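
To see these short-hand classes in action, here is a minimal sketch. The class name ShorthandDemo, the helper CountMatches, and the sample text are made up for illustration; only Regex.Matches comes from the .NET library.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ShorthandDemo
{
    // Hypothetical helper: counts how many times a pattern matches in the input.
    public static int CountMatches(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    public static void Main()
    {
        string sample = "big_cat 42";

        Console.WriteLine(CountMatches(sample, @"\w")); // 9: seven in "big_cat", two in "42"
        Console.WriteLine(CountMatches(sample, @"\s")); // 1: the single space
        Console.WriteLine(CountMatches(sample, @"\b")); // 4: the edges of "big_cat" and "42"
    }
}
```

Notice that \b is counted four times even though the text contains only one space. It matches positions at the edges of words, not the characters themselves.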

Character classes are also created using square brackets. One example is [a-z], which tells the engine to find a single character from a through z using what is known as a range. One can create a range by placing a dash between two characters. Another example is [a-d], which tells the engine to find a character from a through d. One can also place two or more ranges inside a single pair of square brackets. One example is [A-Za-z], which tells the engine to find any uppercase or lowercase letter (A through Z and a through z, that is).
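
Here is a small sketch of ranges at work. The names RangeDemo and JoinMatches and the sample text are assumptions made for this example; they are not part of .NET.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class RangeDemo
{
    // Hypothetical helper: joins every match of the pattern into one string.
    public static string JoinMatches(string input, string pattern)
    {
        return string.Join("", Regex.Matches(input, pattern)
                                    .Cast<Match>()
                                    .Select(m => m.Value));
    }

    public static void Main()
    {
        string sample = "Cab 42 Dad";

        Console.WriteLine(JoinMatches(sample, "[a-d]"));    // abad
        Console.WriteLine(JoinMatches(sample, "[A-Za-z]")); // CabDad
    }
}
```

The range [a-d] skips the uppercase C and D, because ranges are case-sensitive unless we ask the engine to ignore case.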

Character classes are generic, which means that developers do not need to spell out every exact word they want to match. Listing exact words would not be practical, since there are millions of them and they are often misspelled, and regular expressions would not be as powerful.

We also have the ability to tell the engine to match all characters except for certain ones. One way to do this is to place a caret (^) right after the left square bracket. One example is [^aeiou], which tells the engine to match any single character except the lowercase vowels. That includes consonants, uppercase letters, digits, spaces, and punctuation.
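
As a rough illustration of negation, the sketch below removes everything that [^aeiou] matches, so only the lowercase vowels remain. The class name NegatedClassDemo and the helper KeepLowercaseVowels are invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NegatedClassDemo
{
    // Hypothetical helper: deletes every character that is NOT a lowercase vowel.
    public static string KeepLowercaseVowels(string input)
    {
        return Regex.Replace(input, "[^aeiou]", "");
    }

    public static void Main()
    {
        Console.WriteLine(KeepLowercaseVowels("regular expression")); // euaeeio
        Console.WriteLine(KeepLowercaseVowels("A1 e!"));              // e
    }
}
```

In the second call, the uppercase A, the digit, the space, and the exclamation mark are all removed, because none of them is a lowercase vowel.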

Just as there are short-hand character classes to match certain characters, there are short-hand character classes to not match them.

  • \W This matches all but word characters.
  • \S This matches all but white space characters.
  • \B This matches any position that is not a word boundary, such as the position between two letters inside a word.
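
The following sketch tries out the negated short-hand classes. The class name NegatedShorthandDemo and the sample strings are assumptions for illustration; the Regex methods are standard .NET.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NegatedShorthandDemo
{
    public static void Main()
    {
        string sample = "hi there!";

        // \W matches the space and the exclamation mark.
        Console.WriteLine(Regex.Replace(sample, @"\W", "-")); // hi-there-

        // \S matches everything except the space.
        Console.WriteLine(Regex.Matches(sample, @"\S").Count); // 8

        // \Bcat only matches "cat" when it does not start a word.
        Console.WriteLine(Regex.IsMatch("bobcat", @"\Bcat")); // True
        Console.WriteLine(Regex.IsMatch("cat", @"\Bcat"));    // False
    }
}
```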

The next topic we will look at is the ability to break our regular expressions into groups. To create a group, we only need to surround our character classes and/or individual words with parentheses. One example is ([a-z]). The .NET Framework allows us to create as many of these groups as we want; however, it is recommended that regular expressions be as short as possible, for readability purposes.

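Try not to put parentheses within parentheses. Nested groups are valid, but every pair of parentheses still creates its own group, and the groups are numbered by the position of their opening parenthesis. That can create confusion for the person reading the code.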

One example of what not to do is (ab(ade([1-3]))).

You can, however, have something like (\b)([b-f]).
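
To make the numbering of groups concrete, here is a minimal sketch. The class name GroupDemo, the helper Capture, and the sample inputs are hypothetical; Match.Groups is the standard .NET way to read what each group captured.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupDemo
{
    // Hypothetical helper: returns the text captured by the group with the given number.
    public static string Capture(string input, string pattern, int groupNumber)
    {
        return Regex.Match(input, pattern).Groups[groupNumber].Value;
    }

    public static void Main()
    {
        // Two groups side by side: a letter, then a digit.
        Console.WriteLine(Capture("b7", "([a-z])([0-9])", 1)); // b
        Console.WriteLine(Capture("b7", "([a-z])([0-9])", 2)); // 7

        // Nested groups are numbered by their opening parenthesis.
        string nested = "(ab(ade([1-3])))";
        Console.WriteLine(Capture("abade2", nested, 1)); // abade2
        Console.WriteLine(Capture("abade2", nested, 2)); // ade2
        Console.WriteLine(Capture("abade2", nested, 3)); // 2
    }
}
```

With the nested pattern, the reader has to count opening parentheses to work out which group holds which piece of text, which is why flat groups are usually easier to follow.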

Character classes would not be as powerful if we didn't have the ability to tell the regular expression engine to match certain characters that are repeated a number of times. These pieces of syntax are known as quantifiers and, like character classes, we can create them in a number of ways.

* This means zero or more times.

One example is [a-s]*, which will match the letters a through s zero or more times. Because zero times is allowed, this pattern can also match an empty string.

+ This means one or more times.

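One example is [12]+, which will match a 1 or a 2 one or more times, such as 1, 22, or 2112. A character class always matches a single character, so to match the number twelve one or more times, we need a group instead: (12)+.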

We can also create quantifiers using braces ({}, that is). One example is [14]{1,}, which will match a 1 or a 4 one or more times, exactly like [14]+. To repeat the number fourteen itself, group it first: (14){1,}.

We can also place another number after the comma. One example is [14]{1,3}, which will match a 1 or a 4 from one to three times.
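
The quantifiers above can be tried out with a short sketch like this one. The class name QuantifierDemo, the helper FirstMatch, and the sample inputs are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Hypothetical helper: returns the first match, or an empty string if there is none.
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        // A character class followed by + repeats single characters.
        Console.WriteLine(FirstMatch("2112x", "[12]+"));     // 2112

        // A group followed by + repeats the whole sequence "12".
        Console.WriteLine(FirstMatch("121212", "(12)+"));    // 121212

        // {1,3} stops after three characters, even if more are available.
        Console.WriteLine(FirstMatch("41414", "[14]{1,3}")); // 414

        // * accepts zero repetitions, so it succeeds with an empty match.
        Match empty = Regex.Match("xyz", "[a-s]*");
        Console.WriteLine(empty.Success);      // True
        Console.WriteLine(empty.Value.Length); // 0
    }
}
```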

In future articles, we will explore more on saying "no" to the regular expression engine, which is known as negation. For now, I'd like to try something on C# Corner. The next paragraphs contain programming exercises to help us apply what we have learned, as well as bonus questions. You are welcome to choose which exercises you would like to try. I'd love to see your answers in the comments and we can all have some fun! So, for now, enjoy the exercises and happy coding to you!

Exercise 1: Think carefully about this regular expression and describe what it does.

    ([2-4])

Exercise 2: Here's a list of words.

    feel, fail, fall, fruit, free.

    Here is a regular expression.

    f[^r]

    Which words will the regular expression match?

Exercise 3: What is a character class?

Exercise 4: True or false: the following code is valid C# when working with regular expressions. If it is, explain what it does, including the text inside the quotation marks. Hint: You're welcome to search msdn.microsoft.com for this one only.

    var myExpression = new Regex(@"dogs");

Exercise 5: How many times will this regular expression be matched? If any, what will it match? How many groups are in the regular expression, if any?

    (Kevin{2,5})

Exercise 6: What is a group and how do you create one? What are their purposes?

Exercise 7: What are quantifiers and why would you use them?

Exercise 8: Is it a good idea to nest groups in regular expressions?

Exercise 9: Here is a list of words that will be referred to as list one.

    dog, drink, doubt, dogma, drool, desk, disk, dubious, dozen

    Write a regular expression that will match words that contain the letter d. The d must not be followed by an o.

Exercise 10: Write a regular expression that will match two or more uppercase letters in a row.