Simple Web Scraper Loads Website Contents

Introduction

You can use the scraped data in console applications or in Windows/web applications. I used a console application since this article is introductory.

In the console application, add the following namespaces:

using System.Net; // to handle internet operations
using System.IO; // to use streams
using System.Text.RegularExpressions; // To format the loaded data

Loading the content

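Create the WebRequest object for the target URL, then call GetResponse to obtain the WebResponse object. See:

string url = "http://www.google.com/"; // the page to load
WebRequest request = WebRequest.Create(url); // Create is a static method defined on WebRequest
WebResponse response = request.GetResponse(); // sends the request and waits for the reply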
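
GetResponse throws a WebException when the server cannot be reached or replies with an error status, so a scraper would usually guard the call. The following is only a minimal sketch of that idea. The class name FetchDemo, the helper name TryFetch, and the 10-second timeout are assumptions made for illustration, not part of the walkthrough above.

```csharp
using System;
using System.Net;

class FetchDemo
{
    // Hypothetical helper: returns the response, or null if the request failed.
    static WebResponse TryFetch(string url)
    {
        WebRequest request = WebRequest.Create(url);
        request.Timeout = 10000; // milliseconds; an arbitrary value for this sketch

        try
        {
            return request.GetResponse();
        }
        catch (WebException ex)
        {
            // ex.Status describes the failure (timeout, name resolution, protocol error, ...).
            Console.WriteLine("Request failed: " + ex.Status);

            // For protocol errors the server's reply, including its status code, is attached.
            HttpWebResponse errorResponse = ex.Response as HttpWebResponse;
            if (errorResponse != null)
            {
                Console.WriteLine("HTTP status: " + (int)errorResponse.StatusCode);
                errorResponse.Close();
            }
            return null;
        }
    }

    static void Main()
    {
        WebResponse response = TryFetch("http://www.google.com/");
        if (response != null)
        {
            Console.WriteLine("Content type: " + response.ContentType);
            response.Close();
        }
    }
}
```

Returning null keeps the sketch short; a larger program might rethrow or return a result object instead.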

Create a StreamReader object to read the response from the website, store the content in a string variable, and then close the stream. See:

StreamReader sr = new StreamReader(response.GetResponseStream()); // wraps the response body
string result = sr.ReadToEnd(); // reads the whole page into memory
sr.Close(); // also closes the underlying response stream
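
Since WebResponse, Stream, and StreamReader all implement IDisposable, the same steps can also be wrapped in using blocks so that the stream and connection are released even if reading throws. This is a minimal sketch of that variant. The class name LoadDemo and the helper name LoadPage are hypothetical, chosen only for this example.

```csharp
using System;
using System.IO;
using System.Net;

class LoadDemo
{
    // Hypothetical helper: downloads the body at the given URL as a string.
    static string LoadPage(string url)
    {
        WebRequest request = WebRequest.Create(url);

        // Each using block disposes its resource when the block exits,
        // whether normally or because of an exception.
        using (WebResponse response = request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(stream))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        string result = LoadPage("http://www.google.com/");
        Console.WriteLine(result.Length + " characters loaded");
    }
}
```

With this shape there is no explicit Close call to forget, which matters once the scraper loads more than one page.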

To view the unformatted result, simply write it to the console.

Console.WriteLine(result);

Formatting the result

To format the result, we will use the static Replace method of the Regex class. See:

result = Regex.Replace(result, "<script.*?</script>", "", RegexOptions.Singleline | RegexOptions.IgnoreCase); // Remove scripts
result = Regex.Replace(result, "<style.*?</style>", "", RegexOptions.Singleline | RegexOptions.IgnoreCase); // Remove inline stylesheets
result = Regex.Replace(result, "</?[a-z][a-z0-9]*[^<>]*>", "", RegexOptions.IgnoreCase); // Remove HTML tags, upper- or lowercase
result = Regex.Replace(result, "<!--.*?-->", "", RegexOptions.Singleline); // Remove HTML comments
result = Regex.Replace(result, "<!.*?>", "", RegexOptions.Singleline); // Remove doctype
result = Regex.Replace(result, "\\s+", " "); // Collapse runs of whitespace into a single space
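
To see what these replacements do without making a network call, they can be tried on a small hard-coded page. The sketch below follows the steps above in the same order; here the tag pattern is case-insensitive and whitespace runs are collapsed with \s+. The class name StripDemo, the helper name StripHtml, and the sample markup are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

class StripDemo
{
    // Hypothetical helper: reduces an HTML document to its visible text.
    static string StripHtml(string html)
    {
        // Singleline lets '.' match line breaks; IgnoreCase matches <BODY> as well as <body>.
        RegexOptions options = RegexOptions.Singleline | RegexOptions.IgnoreCase;

        string text = Regex.Replace(html, "<script.*?</script>", "", options); // scripts
        text = Regex.Replace(text, "<style.*?</style>", "", options);          // stylesheets
        text = Regex.Replace(text, "</?[a-z][a-z0-9]*[^<>]*>", "", options);   // tags
        text = Regex.Replace(text, "<!--.*?-->", "", options);                 // comments
        text = Regex.Replace(text, "<!.*?>", "", options);                     // doctype
        text = Regex.Replace(text, "\\s+", " ");                               // whitespace runs
        return text.Trim();
    }

    static void Main()
    {
        string sample =
            "<!DOCTYPE html>\n" +
            "<html><head><style>p { color: red; }</style></head>\n" +
            "<BODY><!-- greeting -->\n" +
            "<p>Hello, <b>world</b>!</p>\n" +
            "<script>alert('hi');</script></BODY></html>";

        Console.WriteLine(StripHtml(sample)); // prints: Hello, world!
    }
}
```

Regular expressions are adequate for a quick cleanup like this, but they do not understand HTML structure, so unusual markup may leave fragments behind.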

Now print the formatted result to the console. See:

Console.WriteLine(result);

