Text Scraping Using Regex in C#

Text can be scraped from a website in many ways. Here, I will show two simple ones:

  • WebClient class
  • WebRequest / WebResponse classes

I have used both classes in the sample project.

I have scraped the following highlighted text from a website.

[Image: the website with the scraped text highlighted]

Using the WebClient

I will start the code explanation with the WebClient class.

The WebClient class provides common methods for sending data to and receiving data from any local, intranet, or Internet resource identified by a URI.

The WebClient class has various methods for downloading data from a URL, but I will use the method called DownloadString.

The DownloadString method downloads a resource from any local, intranet, or Internet location identified by a URI and returns it as a string.

Check the following code:

    // Requires: using System.Net;
    WebClient wb = new WebClient();
    // The URL to scrape is typed into the text box.
    String searchquery = textBox1.Text;
    String scrapdata;
    // Download the page source as a single string.
    scrapdata = wb.DownloadString(searchquery);

The preceding code does the following.

A wb object is created from the WebClient class. It calls the DownloadString method to download the resource at the URI, and the returned string is assigned to scrapdata. The code downloads only the source code (the HTML) of the page. We then pull the specific text out of that source code using a Regex.
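
The download step above can be wrapped in a small helper, sketched below. This is only an illustration, not the sample project's code: the class name PageDownloader, the method name DownloadPage, and the URL https://example.com/ are assumptions made for the example. The sketch disposes the WebClient with a using block and catches WebException, which DownloadString throws when the download fails.

```csharp
using System;
using System.Net;

public static class PageDownloader
{
    // Hypothetical helper: downloads the resource at the given URI and
    // returns it as a string, or null if the download fails.
    public static string DownloadPage(string url)
    {
        // WebClient implements IDisposable, so release it with a using block.
        using (WebClient wb = new WebClient())
        {
            try
            {
                return wb.DownloadString(url);
            }
            catch (WebException ex)
            {
                Console.Error.WriteLine("Download failed: " + ex.Message);
                return null;
            }
        }
    }

    public static void Main()
    {
        // Placeholder URL (an assumption); the article reads it from textBox1.Text.
        string scrapdata = DownloadPage("https://example.com/");
        Console.WriteLine(scrapdata == null
            ? "No data."
            : scrapdata.Length + " characters downloaded.");
    }
}
```

Returning null on failure keeps the sketch short; a real form would more likely show the error to the user.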

Regexes

A regular expression is a pattern that can be matched against an input text.

Before using the Regex, include the following namespace in the C# code.

    using System.Text.RegularExpressions;

I will use Regex.Matches.

Regex.Matches finds every instance of a pattern in the input and returns them as a MatchCollection of Match objects. This is useful for extracting values based on a pattern when many matches are expected.

Here, Regex.Matches extracts the text between specific tags. See the code below:

    MatchCollection data = Regex.Matches(scrapdata, @"<title>\s*(.+?)\s*</title>", RegexOptions.Singleline);

This extracts the data between the title tags.

The RegexOptions.Singleline option, or the s inline option, causes the regular expression engine to treat the input string as if it consists of a single line. It does this by changing the behavior of the period (.) language element so that it matches every character, instead of matching every character except the newline character (\n or \u000A).
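
To see what the Singleline option changes, the sketch below runs the same pattern with and without it. The input string is made up for the illustration (the title text is split across two lines), and the class and member names are assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Same pattern as in the article.
    public const string TitlePattern = @"<title>\s*(.+?)\s*</title>";

    // Made-up input: the title text spans two lines.
    public const string Html = "<head><title>\n  Text\n  Scraping\n</title></head>";

    public static void Main()
    {
        // Without Singleline, "." cannot cross the line break inside the title.
        MatchCollection defaultMatches = Regex.Matches(Html, TitlePattern);

        // With Singleline, "." also matches "\n", so the title is found.
        MatchCollection singlelineMatches =
            Regex.Matches(Html, TitlePattern, RegexOptions.Singleline);

        Console.WriteLine(defaultMatches.Count);     // 0
        Console.WriteLine(singlelineMatches.Count);  // 1
    }
}
```

With Singleline, the captured group keeps the inner line break ("Text", a newline, two spaces, "Scraping"), because only the whitespace next to the tags is consumed by \s*.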

Then, I use a foreach loop to go through the matches and find the exact value.

See the code below:

    foreach (Match m in data)
    {
        // Groups[1] holds the text captured by (.+?).
        String downtitle = m.Groups[1].Value;
        MessageBox.Show(downtitle);
    }

Two matching values are found. See the image below:

[Image: code]

So, I have extracted the second matching value using the foreach loop.
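
As a rough sketch of how the second value can be picked out, the example below collects every captured title into a list and reads the entry at index 1. The page source here is invented so that it contains two title elements, and ExtractTitles is a hypothetical helper name, not part of the sample project.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TitleMatches
{
    // Hypothetical helper: returns the captured text of every <title> element.
    public static List<string> ExtractTitles(string html)
    {
        List<string> titles = new List<string>();
        MatchCollection data = Regex.Matches(
            html, @"<title>\s*(.+?)\s*</title>", RegexOptions.Singleline);

        foreach (Match m in data)
        {
            // Groups[0] is the whole match; Groups[1] is the text captured by (.+?).
            titles.Add(m.Groups[1].Value);
        }
        return titles;
    }

    public static void Main()
    {
        // Made-up page source containing two <title> elements.
        string scrapdata = "<title>Site name</title><svg><title> Logo </title></svg>";

        List<string> titles = ExtractTitles(scrapdata);
        Console.WriteLine(titles.Count);  // 2
        Console.WriteLine(titles[1]);     // Logo
    }
}
```

Indexing assumes that at least two matches exist, so checking titles.Count first avoids an ArgumentOutOfRangeException on pages with a single title.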

Using the WebRequest / WebResponse Class

WebRequest is an abstract base class, so we do not instantiate it directly. Instead, we work with it through its derived classes.
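
A small sketch of what "derived classes" means in practice: the static WebRequest.Create method looks at the URI scheme and hands back a matching subclass. Both URIs below are placeholders chosen for the illustration, and no request is actually sent.

```csharp
using System;
using System.Net;

public static class RequestTypes
{
    public static void Main()
    {
        // Create only builds the request object; nothing is sent yet.
        WebRequest http = WebRequest.Create("http://example.com/");
        WebRequest file = WebRequest.Create("file:///tmp/page.html");

        Console.WriteLine(http is HttpWebRequest);  // True
        Console.WriteLine(file is FileWebRequest);  // True
    }
}
```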

We call the static Create method of WebRequest to get an instance. GetResponse then returns a WebResponse, and its GetResponseStream method returns a data stream. The following is the code:

    // Requires: using System.IO; using System.Net;
    // Create a request for the URL.
    WebRequest request = WebRequest.Create(searchquery);
    // If required by the server, set the credentials.
    request.Credentials = CredentialCache.DefaultCredentials;
    // Get the response.
    WebResponse response = request.GetResponse();
    // Get the stream containing the content returned by the server.
    Stream dataStream = response.GetResponseStream();
    // Open the stream using a StreamReader and read the content.
    StreamReader reader = new StreamReader(dataStream);
    string responseFromServer = reader.ReadToEnd();
    // Release the reader and the response.
    reader.Close();
    response.Close();

The source code has been downloaded to the responseFromServer string. Now, we can pass that string to Regex.Matches to extract the specific data, as before.
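
Putting the two steps together, a minimal sketch might look like the following. The names RequestScraper, ReadAll, and Download, and the URL https://example.com/, are assumptions for the example and are not taken from the sample project. The using blocks close the response, the stream, and the reader even if reading fails.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

public static class RequestScraper
{
    // Hypothetical helper: reads a whole stream into a string.
    public static string ReadAll(Stream dataStream)
    {
        using (StreamReader reader = new StreamReader(dataStream))
        {
            return reader.ReadToEnd();
        }
    }

    // Hypothetical helper: requests the URL and returns the response body.
    public static string Download(string url)
    {
        WebRequest request = WebRequest.Create(url);
        // Only needed if the server asks for credentials.
        request.Credentials = CredentialCache.DefaultCredentials;

        using (WebResponse response = request.GetResponse())
        using (Stream dataStream = response.GetResponseStream())
        {
            return ReadAll(dataStream);
        }
    }

    public static void Main()
    {
        // Placeholder URL (an assumption); the article takes it from the text box.
        string responseFromServer = Download("https://example.com/");

        // Same extraction as with WebClient: print every captured title.
        foreach (Match m in Regex.Matches(responseFromServer,
                     @"<title>\s*(.+?)\s*</title>", RegexOptions.Singleline))
        {
            Console.WriteLine(m.Groups[1].Value);
        }
    }
}
```

Because the regex step only needs a string, it does not matter whether that string came from WebClient or from WebRequest / WebResponse.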
