Screen Scraping using System.Net

Posted by Shantanu, January 17, 2011

Purpose

Screen scraping extracts data from a web page by requesting the page and parsing the response, instead of using a more direct form of access. It is useful when direct access to the data is not available.

Implementation

Suppose we want to scrape email addresses from live web sites. This is a two-step process: first fetch the page, then parse the response contents to extract the email addresses.

We have created a class called Extract (shown below) with a method called GetPage(), which fetches the page at a given URL.

[Image 1.gif: the Extract class]
The System.Net namespace contains classes that can be used to fetch the page. The classes we are going to use are HttpWebRequest and HttpWebResponse.

Below is the GetPage() method:

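        // Requires: using System; using System.IO; using System.Net;
        public string GetPage()
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(this.URL.Trim());
            request.Method = "GET";
            // Accept states what we want back. ContentType would describe a
            // request body, which a GET request does not have.
            request.Accept = "text/html";
            // WebProxy.GetDefaultProxy() is obsolete; read the system proxy settings instead.
            IWebProxy proxy = WebRequest.GetSystemWebProxy();
            proxy.Credentials = CredentialCache.DefaultCredentials;
            request.Proxy = proxy;
            string Input = string.Empty;
            try
            {
                // The using blocks dispose the response and the reader, so no explicit Close() is needed.
                using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
                {
                    if (response.StatusCode == HttpStatusCode.OK)
                    {
                        using (StreamReader sr = new StreamReader(response.GetResponseStream()))
                        {
                            Input = sr.ReadToEnd();
                        }
                    }
                }
            }
            catch (Exception ex)
            {
                // Record the failure instead of hiding it; the caller still gets an empty string.
                System.Diagnostics.Trace.WriteLine("GetPage failed for " + this.URL + ": " + ex.Message);
            }
            return Input;
        }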

The next step is to extract the email addresses from the returned page. This is done with regular expressions in the ExtractEmails() method. The regex used does not cover every valid email pattern, but it works with most email addresses. It is only used to demonstrate the technique.

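        // Requires: using System.Text.RegularExpressions;
        // The hyphen is placed last in the character class so it is always read as a literal.
        public const string ALLOWED_CHARS = @"[a-zA-Z0-9_-]";
        // Matches one address at a time: name(.name)*@name(.name)*
        public const string REGEX_EMAILS = ALLOWED_CHARS + @"+(\." + ALLOWED_CHARS + @"+)*@" + ALLOWED_CHARS + @"+(\." + ALLOWED_CHARS + @"+)*";

        public void ExtractEmails(string Input)
        {
            // Regex.Matches scans the page for every occurrence, so the pattern does not
            // have to match the whole document in one anchored attempt, which can be
            // slow and memory-hungry on large pages.
            foreach (Match m in Regex.Matches(Input, REGEX_EMAILS))
            {
                Emails.Add(m.Value);
            }
        }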
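
To try the extraction step in isolation, without fetching a live page, a minimal sketch along the following lines may help. The class name EmailExtractionSketch and the method name FindEmails are hypothetical names chosen for this illustration and are not part of the Extract class. The pattern is, like the one above, only a rough approximation of real email syntax.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the article's Extract class.
public static class EmailExtractionSketch
{
    // Simplified pattern: letters, digits, underscores, or hyphens, optionally
    // dot-separated, on both sides of a single '@'. It will miss some valid addresses.
    private const string EmailPattern =
        @"[A-Za-z0-9_-]+(\.[A-Za-z0-9_-]+)*@[A-Za-z0-9_-]+(\.[A-Za-z0-9_-]+)*";

    // Returns every substring of 'html' that matches the pattern, in document order.
    public static List<string> FindEmails(string html)
    {
        List<string> found = new List<string>();
        foreach (Match m in Regex.Matches(html, EmailPattern))
        {
            found.Add(m.Value);
        }
        return found;
    }
}
```

Calling FindEmails on a fragment that contains a mailto: link and a plain-text address should return both addresses in the order they appear, because each match is found independently of the markup around it.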

The extracted email addresses are then displayed on the screen, separated by the defined separator.
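
The article does not show the display code, so the following is only a sketch of how the joining step might look. The class name EmailDisplaySketch and the method name Format are hypothetical, and skipping duplicate addresses is an assumption added here, not something the tool is stated to do.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of the article's tool.
public static class EmailDisplaySketch
{
    // Joins the addresses with 'separator', skipping case-insensitive
    // duplicates (an assumption of this sketch) and keeping first-seen order.
    public static string Format(IEnumerable<string> emails, string separator)
    {
        HashSet<string> seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        List<string> unique = new List<string>();
        foreach (string email in emails)
        {
            // HashSet.Add returns false when the value is already present.
            if (seen.Add(email))
            {
                unique.Add(email);
            }
        }
        return string.Join(separator, unique.ToArray());
    }
}
```

The resulting string can be assigned to a text box or label. Passing a different separator, such as a semicolon or a line break, changes only the layout and not the extraction.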

Below is a snapshot of the Win Email Extractor tool I built to demonstrate the technique. As an example, we scrape the email addresses of US Senators from a public site that hosts them.

Below is the link to the public site:


[Image 2.gif: snapshot of the Win Email Extractor tool]
Before using the tool, check that the website can be opened in a browser on the machine where the tool will run. Proxy or firewall issues can prevent the tool from working.
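
When the page opens in the browser but the tool still returns nothing, a small diagnostic can help narrow down the cause. The sketch below is an illustration only: ConnectivitySketch, Describe, and Check are hypothetical names, the hint texts are this sketch's own wording, and it assumes a well-formed http or https URL on the .NET Framework, as the rest of the article does.

```csharp
using System;
using System.Net;

// Hypothetical diagnostic helper for illustration; not part of the article's tool.
public static class ConnectivitySketch
{
    // Maps a WebExceptionStatus to a short hint about the likely cause.
    public static string Describe(WebExceptionStatus status)
    {
        switch (status)
        {
            case WebExceptionStatus.NameResolutionFailure:
                return "The host name could not be resolved (check the URL and DNS).";
            case WebExceptionStatus.ProxyNameResolutionFailure:
            case WebExceptionStatus.RequestProhibitedByProxy:
                return "The proxy could not be reached or refused the request.";
            case WebExceptionStatus.ConnectFailure:
                return "The server could not be contacted (a firewall may be blocking it).";
            case WebExceptionStatus.Timeout:
                return "The request timed out.";
            case WebExceptionStatus.ProtocolError:
                return "The server answered with an HTTP error status.";
            default:
                return "The request failed: " + status;
        }
    }

    // Tries a GET request. Returns null on success, or a hint describing the failure.
    public static string Check(string url, int timeoutMilliseconds)
    {
        try
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
            request.Method = "GET";
            request.Timeout = timeoutMilliseconds;
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            {
                return null;
            }
        }
        catch (WebException ex)
        {
            return Describe(ex.Status);
        }
    }
}
```

Showing the returned hint to the user, instead of an empty result, makes it easier to tell a proxy or firewall problem apart from a page that simply contains no addresses.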


Hi, your code is not working and I do not know why. When I try to search for emails, the program shuts down within seconds. Why is this happening?

Posted by polas anderson Mar 11, 2012

Exactly. Most is not all. Therefore, I see no reason to change the file extension. It should be .rar as that is the type of file it is. It would have saved me time by knowing it wouldn't work with Windows Explorer. That's why type extensions matter and why changing them to something that is wrong just doesn't make sense. No matter. It was still a well written article and thanks for it.

Posted by Kurt Bank Jan 20, 2011

I renamed it to .zip since most people use zip archivers and since most zip archivers open .rar (and vice-versa). Unfortunately, the zip utility in your explorer does not. And I do not have a zip archiver to post a zip version. You can download WinRar from http://download.cnet.com/WinRAR-32-bit/3000-2250_4-10007677.html for free.

Posted by Shantanu Jan 20, 2011

I don't understand why this web site uses rar files. Since the Windows Explorer supports zip files but not rar files, I think only zip should be used. For those that are interested, 7-Zip will open rar files. http://sourceforge.net/projects/sevenzip

Posted by Sam Hobbs Jan 20, 2011

Uh, I'm sorry, but that makes no sense. An extension should reflect the type of file it is. It's not a popularity contest. I don't have WinZip or WinRar anymore as I have no need for them. Since zip is now integrated into Windows Explorer I just use it. Simple is good. It doesn't support a rar, only zip. Surprisingly, it also doesn't support a rar file renamed to a zip?!?! Duh.

Posted by Kurt Bank Jan 20, 2011