
Simple Web Scraper Loads Website Contents

Posted by Rumman Siddiqui | ASP.NET Programming | December 16, 2012
A web scraper is a tool that downloads the contents of a web page. Because it retrieves the page's raw HTML, I format the result to make it readable.

Introduction

You can use this code in a console application or in a Windows or web application. I used a console application because this is an introductory example.

In the console application, add the following namespaces:

using System; // for Console
using System.Net; // to handle internet operations
using System.IO; // to use streams
using System.Text.RegularExpressions; // to format the loaded data

Loading the content

Create a WebRequest for the target URL, then call GetResponse to obtain the WebResponse:

string url = "http://www.google.com/";
WebRequest request = WebRequest.Create(url);
WebResponse response = request.GetResponse();

Wrap the response stream in a StreamReader, read the whole response into a string variable, and then close the reader:

StreamReader sr = new StreamReader(response.GetResponseStream());
string result = sr.ReadToEnd();
sr.Close();
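
The loading steps above can be combined into one helper. The following is a minimal sketch, not the article's original code: the PageLoader class and its ReadAll and Load methods are hypothetical names, and UTF-8 is assumed as the page encoding. It relies on using blocks so that the response and the reader are closed even if reading throws.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

// Hypothetical helper that wraps the request/response/reader steps.
public static class PageLoader
{
    // Reads an entire stream as text. UTF-8 is an assumption; a real page
    // may declare a different encoding in its headers or markup.
    public static string ReadAll(Stream stream)
    {
        using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    // Downloads the raw HTML of the page at the given URL.
    public static string Load(string url)
    {
        WebRequest request = WebRequest.Create(url);
        using (WebResponse response = request.GetResponse())
        using (Stream body = response.GetResponseStream())
        {
            return ReadAll(body);
        }
    }
}
```

Calling PageLoader.Load("http://www.google.com/") should return the same raw HTML as the individual steps. On newer .NET versions WebRequest is marked obsolete in favor of HttpClient, so treat this as a sketch for the classic API used in this article.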

To view the unformatted result, simply write it to the console:

Console.WriteLine(result);

Formatting the result

To format the result, use the static Regex.Replace method to strip the markup step by step. Comments are removed first so that markup inside them cannot confuse the later patterns:

result = Regex.Replace(result, "<!--.*?-->", "", RegexOptions.Singleline); // Remove HTML comments
result = Regex.Replace(result, "<script.*?</script>", "", RegexOptions.Singleline | RegexOptions.IgnoreCase); // Remove scripts
result = Regex.Replace(result, "<style.*?</style>", "", RegexOptions.Singleline | RegexOptions.IgnoreCase); // Remove inline stylesheets
result = Regex.Replace(result, "<![^>]*>", ""); // Remove the doctype declaration
result = Regex.Replace(result, "</?[a-z][a-z0-9]*[^<>]*>", "", RegexOptions.IgnoreCase); // Remove HTML tags, including uppercase ones
result = Regex.Replace(result, "\\s+", " "); // Collapse runs of whitespace into a single space
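
The script and style patterns depend on RegexOptions.Singleline, because without it the dot does not match a newline and a multi-line script block would survive. The sketch below illustrates this; the SinglelineDemo class, its Sample string, and the RemoveScripts method are hypothetical names made up for the illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo of why the script pattern needs RegexOptions.Singleline.
public static class SinglelineDemo
{
    // Made-up sample page fragment with a script block that spans lines.
    public const string Sample =
        "<p>before</p><script>\nvar x = 1;\n</script><p>after</p>";

    // Applies the article's script-removal pattern with the given options.
    public static string RemoveScripts(string html, RegexOptions options)
    {
        return Regex.Replace(html, "<script.*?</script>", "", options);
    }
}
```

With RegexOptions.None the sample comes back unchanged, since the pattern cannot cross the line breaks inside the script. With RegexOptions.Singleline the whole script block is removed and only the two paragraphs remain.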

Now print the formatted result to the console:

Console.WriteLine(result);
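
The formatting steps can also be wrapped in a reusable method. This is a sketch under stated assumptions rather than the article's code: HtmlCleaner and StripHtml are hypothetical names, and the method makes two choices the steps above do not. It replaces each tag with a space so that adjacent elements do not run together, and it decodes HTML entities with WebUtility.HtmlDecode.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper that bundles the regex-based formatting steps.
public static class HtmlCleaner
{
    private const RegexOptions Options =
        RegexOptions.Singleline | RegexOptions.IgnoreCase;

    public static string StripHtml(string html)
    {
        string text = html;
        text = Regex.Replace(text, "<!--.*?-->", "", Options);           // comments
        text = Regex.Replace(text, "<script.*?</script>", "", Options);  // scripts
        text = Regex.Replace(text, "<style.*?</style>", "", Options);    // stylesheets
        text = Regex.Replace(text, "<![^>]*>", "", Options);             // doctype
        // Assumption: a space is used here so neighboring elements stay separated.
        text = Regex.Replace(text, "</?[a-z][a-z0-9]*[^<>]*>", " ", Options);
        // Assumption: entities such as &amp; should appear as plain characters.
        text = WebUtility.HtmlDecode(text);
        text = Regex.Replace(text, "\\s+", " ");                         // collapse whitespace
        return text.Trim();
    }
}
```

With this in place, the final step could become Console.WriteLine(HtmlCleaner.StripHtml(result)). Regular expressions are only an approximation of HTML parsing, so unusual markup may still leave fragments behind.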


Mahesh, thanks for the suggestion. I have moved it to ASP.NET category.

Posted by Rumman Siddiqui Dec 17, 2012

Welcome Rumman. Can you move it to ASP.NET or other related category? Thanks!

Posted by Mahesh Chand Dec 16, 2012

This looks simple and useful. The sample code is general and useful for most web pages. An alternative would be to use the DOM to "scrape" the data. That would provide a more flexible and powerful formatting ability but requires more work to customize the processing of the data.

Posted by Sam Hobbs Dec 16, 2012