Screen Scraping using System.Net

Shantanu

Purpose

Screen scraping extracts data from a web page by requesting the page and parsing the response, rather than reading the data through a more direct interface. It is useful when no direct access to the data is available.

Implementation

Suppose we want to scrape email addresses from live web sites. This is a two-step process: first fetch the page, then parse the response contents to extract the email addresses.

We have created a class called Extract with a method called GetPage(), which fetches the page at a given URL.

The System.Net namespace contains classes that can be used to fetch the page. The classes we are going to use are HttpWebRequest and HttpWebResponse.

Below is the GetPage() method:

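// Requires: using System; using System.IO; using System.Net;
public string GetPage()
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(this.URL.Trim());
    request.Method = "GET";
    // Accept (not ContentType) tells the server which response type we want.
    request.Accept = "text/html";
    // WebProxy.GetDefaultProxy() is obsolete; use the system default proxy instead.
    request.Proxy = WebRequest.DefaultWebProxy;
    if (request.Proxy != null)
    {
        request.Proxy.Credentials = CredentialCache.DefaultCredentials;
    }

    string Input = string.Empty;
    try
    {
        // The using blocks dispose the response and the reader, so no explicit Close() is needed.
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            if (response.StatusCode == HttpStatusCode.OK)
            {
                using (StreamReader sr = new StreamReader(response.GetResponseStream()))
                {
                    Input = sr.ReadToEnd();
                }
            }
        }
    }
    catch (Exception)
    {
        // Best-effort fetch: any failure yields an empty string.
        // Real code should log the exception instead of discarding it.
    }
    return Input;
}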
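GetPage() passes the URL straight to Create(), which throws for malformed input before the try block is reached, and the cast to HttpWebRequest fails for non-HTTP schemes. A minimal sketch of a pre-check is shown here; the class UrlCheck and the method TryGetHttpUri are hypothetical names that are not part of the original Extract class.

```csharp
using System;

public static class UrlCheck
{
    // Hypothetical helper, not part of the original Extract class.
    // Returns true (and the parsed Uri) only for absolute http/https URLs.
    public static bool TryGetHttpUri(string url, out Uri uri)
    {
        uri = null;
        if (string.IsNullOrWhiteSpace(url))
        {
            return false;
        }

        Uri parsed;
        if (!Uri.TryCreate(url.Trim(), UriKind.Absolute, out parsed))
        {
            return false; // not a well-formed absolute URI
        }

        if (parsed.Scheme != Uri.UriSchemeHttp && parsed.Scheme != Uri.UriSchemeHttps)
        {
            return false; // e.g. ftp:// or file://, which HttpWebRequest cannot handle
        }

        uri = parsed;
        return true;
    }
}
```

A caller could run this check first and only call GetPage() when it returns true, showing a message to the user otherwise.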

The next step is to extract the email addresses from the returned page. This is done with regular expressions in the ExtractEmails() method. The regex used does not cover every valid email address, but it works for most of them. It is only meant to demonstrate the technique.

public const string ALLOWED_CHARS = @"[a-zA-Z0-9_-]";

public const string REGEX_EMAILS = @"^((?<emails>" + ALLOWED_CHARS + @"+(\." + ALLOWED_CHARS + @"+)*@" + ALLOWED_CHARS + @"+(\." + ALLOWED_CHARS + @"+)*)*(.|\n|\r\n)*?)+$";

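// Requires: using System.Text.RegularExpressions;
public void ExtractEmails(string Input)
{
    // The pattern is anchored to the whole input, so a single Match holds
    // every email address as a separate capture of the "emails" group.
    Match m = Regex.Match(Input, REGEX_EMAILS);
    if (m.Success)
    {
        foreach (Capture c in m.Groups["emails"].Captures)
        {
            Emails.Add(c.Value.Trim());
        }
    }
}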
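As an alternative to matching the whole page with one anchored pattern, the same email sub-pattern can be applied repeatedly with Regex.Matches, which avoids the outer wrapper and tends to be cheaper on large pages. The following is only a sketch of that variation, not the method used by the tool; the class EmailScanner and the method FindEmails are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EmailScanner
{
    // Same character class as ALLOWED_CHARS in the article.
    private const string Allowed = @"[a-zA-Z0-9_-]";

    // The email sub-pattern only, without the ^(...)+$ whole-input wrapper.
    private const string EmailPattern =
        Allowed + @"+(\." + Allowed + @"+)*@" + Allowed + @"+(\." + Allowed + @"+)*";

    // Hypothetical helper: returns every substring that matches the pattern,
    // in the order it appears in the input.
    public static List<string> FindEmails(string input)
    {
        List<string> found = new List<string>();
        foreach (Match m in Regex.Matches(input ?? string.Empty, EmailPattern))
        {
            found.Add(m.Value);
        }
        return found;
    }
}
```

Like the original pattern, this is a demonstration and does not cover every valid email address.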

The extracted email addresses are then displayed on the screen, separated by the defined separator.

Below is a snapshot of the Win Email Extractor tool I built to demonstrate the technique. As an example, we scrape the email addresses of US Senators from a public site that lists them.

Below is the link to the public site:
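The display step can be sketched as a simple join over the collected addresses. The class EmailFormatter and the method JoinEmails below are hypothetical names, and dropping duplicates (ignoring case) is an added assumption, since the article does not say how the tool treats repeated addresses.

```csharp
using System;
using System.Collections.Generic;

public static class EmailFormatter
{
    // Hypothetical helper: joins addresses with the given separator.
    // Assumption: blank entries and case-insensitive duplicates are dropped,
    // keeping the first occurrence of each address.
    public static string JoinEmails(IEnumerable<string> emails, string separator)
    {
        List<string> unique = new List<string>();
        HashSet<string> seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        foreach (string email in emails)
        {
            string trimmed = email.Trim();
            if (trimmed.Length > 0 && seen.Add(trimmed))
            {
                unique.Add(trimmed);
            }
        }
        return string.Join(separator, unique.ToArray());
    }
}
```

The result is a single string that can be assigned to a text box or written to a file.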

http://www.surrogacy.com/legals/senators.html

Before using the tool, check that the web site can be opened in a browser on the machine where the tool is running. Proxy or firewall settings can prevent the tool from working.