Scraping Web Site Dynamic Data Using Watin
In this article, I have created a demonstration web site with a category page and an item listing page for each category. I will scrape this web site using Watin, a testing tool for .NET.
The main objective of this article is to demonstrate how web pages can be scraped with a testing tool such as Watin.
Generally, web pages are scraped with the HttpWebRequest and HttpWebResponse classes of the .NET Framework. However, when the application performs server-side navigation (for example, paging through postbacks), fetching page data with HttpWebRequest becomes more difficult: we need some tricks to fetch the data of the next page.
The same thing can be done easily and quickly with Watin. My objective here is not to challenge the HttpWebRequest and HttpWebResponse classes but to show how effectively we can scrape a web site using a testing tool like Watin.
In this article I have used third-party tools like NUnit and Watin to demonstrate this example. Please refer to the following brief introduction for each tool and respective URL for further reference.
- About Watin: Watin is a third-party web application testing tool designed for .NET. You can obtain more information about this tool by visiting this site: http://www.watin.org/
- About NUnit: NUnit is a third-party unit-testing framework for all .NET languages. More information can be gathered by visiting this site: http://www.nunit.org
Using the code
This article consists of two applications, described briefly below.
The first application is a web application created in Visual Studio 2010 (.NET 4.0). It is a demonstration web site with category and item listing pages, and it needs to be deployed on a local or remote IIS server.
The second application is a Windows-based class library project created in Visual Studio 2010 (.NET 4.0) that uses the Watin DLL.
The following prerequisite software is required to run this demonstration:
- .NET Framework 4.0
- NUnit 2.6.2
Configure the web application
Please perform the following procedure to configure the web application:
- Deploy the web application in IIS, assign .NET Framework 4.0 to it, and check that the application runs correctly on your workstation.
Configure the Web Scraping application
Perform the following procedure to configure the Web Scraping application:
- Open the configuration file (App.config or WatinWebScraping.dll.config) of this application and change the values of the following configuration keys as specified:
- WebApplicationPath: This is the application path where the demonstration web site is deployed. In the current application, I have deployed it on the localhost, therefore I have given the path http://localhost/WebApplication/CategoryListing.aspx.
- ScraperEnginePhysicalLocation: This is the physical location where the scraper application resides. I have set this path to "D:\WebScraping\Web Scraper\WatinWebScraping\WatinWebScraping" and use it to store the scraped data in a text file.
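As a rough illustration, the two keys might look like this in App.config. This is only a sketch: it assumes that both keys are stored in the standard appSettings section, which the article does not state explicitly. The values are the ones mentioned above.

```xml
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <appSettings>
    <!-- Assumption: both keys are plain appSettings entries -->
    <add key="WebApplicationPath" value="http://localhost/WebApplication/CategoryListing.aspx" />
    <add key="ScraperEnginePhysicalLocation" value="D:\WebScraping\Web Scraper\WatinWebScraping\WatinWebScraping" />
  </appSettings>
</configuration>
```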
Code Snippet of Demo Web Application
The following briefly describes the code that resides in each page:
- CategoryListing.aspx: contains just a listing of categories in the form of hyperlinks.
- ItemListing.aspx: In this page, I have used a GridView control bound to an XmlDataSource instead of a database (for easy configuration).
Sample code snippet
Refer to the following code snippet for reference:
<asp:XmlDataSource ID="xmlSource" runat="server" DataFile="~/XMLDataBase/MenFashion.xml" />
<asp:GridView ID="gvItemListing" runat="server" DataSourceID="xmlSource" AutoGenerateColumns="false"
    AllowPaging="true" PageSize="5" Width="100%" PagerSettings-Position="Bottom">
    <%-- Column definitions omitted --%>
</asp:GridView>
Code Snippet of WatinWebScraper Application
Here I will explain the following:
- Initializing Watin and NUnit in the application.
- Using Watin for web site navigation.
- Using the regular expression features (Regex and MatchCollection) of .NET to fetch the respective data from the HTML page source.
- Executing this application using NUnit.
Please refer to the following explanation of the respective sections.
Initializing Watin and NUnit in the application
To use Watin and NUnit in the application, add references to "nunit.framework.dll", "Interop.SHDocVw.dll" and "WatiN.Core.dll".
Now import the "NUnit.Framework" and "WatiN.Core" namespaces in the class file.
Because we will run the scraper through NUnit, the class must be marked with "[TestFixture]" and the scraping method with "[Test]" and "[STAThread]".
You can get more details of these attributes by referring to the http://nunit.org web site.
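Put together, the skeleton of such a class might look like the sketch below. The namespace, class name and method name are hypothetical, and the hard-coded URL stands in for the WebApplicationPath configuration value. NUnit 2.5 and later also offer a RequiresSTA attribute for tests that must run in a single-threaded apartment; it is not used in this article.

```csharp
using System;
using NUnit.Framework;
using WatiN.Core;

namespace WatinWebScraping
{
    // Hypothetical class and method names, for illustration only.
    [TestFixture]
    public class CategoryScraper
    {
        // Assumption: in the real project this value is read from the
        // WebApplicationPath configuration key.
        private const string webSitePath =
            "http://localhost/WebApplication/CategoryListing.aspx";

        [Test]
        [STAThread]
        public void ScrapeAllCategories()
        {
            // IE is disposable, so the using block closes the browser
            // window when the test ends.
            using (IE ieInstance = new IE(webSitePath))
            {
                ieInstance.WaitForComplete();
                string categoryPageSource = ieInstance.Html;
                Assert.IsNotNull(categoryPageSource);
            }
        }
    }
}
```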
Using Watin for web scraping
// Create an instance of the IE browser
IE ieInstance = new IE(webSitePath);
// This will open the IE browser in maximized mode
ieInstance.ShowWindow(WatiN.Core.Native.Windows.NativeMethods.WindowShowStyle.Maximize);
// The Watin window can be hidden from the user while scraping.
// The following line is currently commented out; un-comment it to hide the window.
//ieInstance.Visible = false;
// This will wait for the browser to finish loading the page
ieInstance.WaitForComplete();
// This will store the page source in the categoryPageSource variable
string categoryPageSource = ieInstance.Html;
Using Regular Expression features (Regex and MatchCollection) of .NET to fetch respective data from the HTML page source
I have used regular expressions to fetch the categories, to iterate over the items in each category, and to find the paging links so that every page of items in a category is fetched.
Please refer to the following regular expressions used for category fetching, item fetching and page navigation respectively.
Category Regular Expression
The following regular expression will fetch the URLs of all categories from the CategoryListing.aspx page; the application then visits each of them in a loop.
Item Regular Expression
The following regular expression will fetch ProductID, Product Name and Product Price for the respective item residing in the given page.
Paging Regular Expression
The following regular expression will fetch respective pages from the ItemListing.aspx page.
To use these regular expressions in this application, I have used the "Regex" class of the "System.Text.RegularExpressions" namespace. The Regex constructor takes the pattern together with options such as "RegexOptions.Compiled", "RegexOptions.IgnoreCase", "RegexOptions.IgnorePatternWhitespace" and "RegexOptions.CultureInvariant".
Refer to the following code snippet for that:
// Regular expression for the Category listing page
private const string _categoryRegEx = @"<A\s.*?class=bold\s.*?href=""(?<href>.*?)"">";
Regex categoryMatches = new Regex(_categoryRegEx, RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace | RegexOptions.CultureInvariant);
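Two of these options change how a pattern is read, which is easy to overlook. The small sketch below (class name and sample strings are made up for illustration) shows that IgnoreCase lets an upper-case pattern match lower-case markup, and that IgnorePatternWhitespace drops literal spaces from the pattern, so whitespace has to be written as \s.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexOptionsDemo
{
    private const RegexOptions Options =
        RegexOptions.Compiled | RegexOptions.IgnoreCase |
        RegexOptions.IgnorePatternWhitespace | RegexOptions.CultureInvariant;

    public static bool Matches(string pattern, string input)
    {
        return new Regex(pattern, Options).IsMatch(input);
    }

    public static void Main()
    {
        string sample = "<a href=\"x\">";

        // IgnoreCase: the upper-case "A" also matches the lower-case tag.
        Console.WriteLine(Matches(@"<A\s", sample));    // True

        // IgnorePatternWhitespace: the literal space is removed from the
        // pattern, which then means "<Ahref" and no longer matches.
        Console.WriteLine(Matches(@"<A href", sample)); // False
    }
}
```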
To fetch the records, call the Matches method of the Regex. It returns a MatchCollection containing every successful match found in the HTML source captured earlier (string categoryPageSource = ieInstance.Html;) in the "Using Watin for web scraping" section. Refer to the following code for reference:
MatchCollection categoryMatchCollection = categoryMatches.Matches(categoryPageSource);
Use a foreach loop to process each individual match. Refer to the following code for reference:
foreach (Match categoryMatch in categoryMatchCollection)
If the regular expression defines groups, use the Groups property of the match, which returns a GroupCollection, to fetch the groups of the respective result. Refer to the following code for reference:
GroupCollection categoryGroup = categoryMatch.Groups;
A GroupCollection contains all the groups captured by a match. To fetch a category URL, I have used the named group "href". Refer to the following code for reference:
string itemListingURL = Convert.ToString(categoryGroup["href"].Value);
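The steps above (Regex, MatchCollection, foreach and the "href" group) can be tried without a browser. The sketch below runs a category pattern against a hand-written HTML string. The sample markup and query strings are assumptions made for illustration, and the pattern uses \s after the tag name on the assumption that whitespace follows "<A" in the page source.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CategoryLinkExtractor
{
    private const string CategoryPattern =
        @"<A\s.*?class=bold\s.*?href=""(?<href>.*?)"">";

    public static List<string> ExtractHrefs(string categoryPageSource)
    {
        Regex categoryRegex = new Regex(CategoryPattern,
            RegexOptions.Compiled | RegexOptions.IgnoreCase |
            RegexOptions.IgnorePatternWhitespace | RegexOptions.CultureInvariant);

        List<string> hrefs = new List<string>();
        foreach (Match categoryMatch in categoryRegex.Matches(categoryPageSource))
        {
            GroupCollection categoryGroup = categoryMatch.Groups;
            hrefs.Add(categoryGroup["href"].Value);
        }
        return hrefs;
    }

    public static void Main()
    {
        // Hypothetical markup, written the way IE tends to serialize it
        // (upper-case tag names, unquoted class attribute).
        string sample =
            "<A class=bold href=\"ItemListing.aspx?category=Men\">Men</A>" +
            "<A class=bold href=\"ItemListing.aspx?category=Women\">Women</A>";

        foreach (string href in ExtractHrefs(sample))
        {
            Console.WriteLine(href);
        }
    }
}
```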
Now the itemListingURL variable contains the href of the respective category, and Watin can navigate to this URL.
The itemListingPath variable contains the full path of the item listing page for the respective category.
I have used the "WaitForComplete" method to wait until the respective page has loaded completely. Refer to the following code for reference:
ieInstance.GoTo(itemListingPath);
ieInstance.WaitForComplete();
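How itemListingPath is built from itemListingURL is not shown here. One possible way, assuming the href is relative to the category page, is to let System.Uri resolve it. The class and method names below are hypothetical.

```csharp
using System;

public static class ListingUrlBuilder
{
    // Resolves a (possibly relative) href against the category page URL.
    public static string BuildItemListingPath(string webSitePath, string itemListingURL)
    {
        Uri categoryPage = new Uri(webSitePath);
        return new Uri(categoryPage, itemListingURL).AbsoluteUri;
    }

    public static void Main()
    {
        string itemListingPath = BuildItemListingPath(
            "http://localhost/WebApplication/CategoryListing.aspx",
            "ItemListing.aspx?category=Men");

        // Prints http://localhost/WebApplication/ItemListing.aspx?category=Men
        Console.WriteLine(itemListingPath);
    }
}
```

The resulting string is what would then be passed to the browser's GoTo method, followed by WaitForComplete.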
Using the code above, the application will navigate to the item listing page. A similar operation needs to be performed for fetching items.
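For items, the same Regex, MatchCollection and GroupCollection steps apply, only with more named groups. The sketch below is illustrative only: the pattern and the sample rows are hypothetical and assume a plain three-cell table row per item, which may differ from the markup the GridView actually renders.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ItemRowExtractor
{
    // Hypothetical pattern: one table row with exactly three plain cells.
    private const string ItemPattern =
        @"<TR>\s*<TD>(?<ProductID>[^<]*)</TD>\s*" +
        @"<TD>(?<ProductName>[^<]*)</TD>\s*" +
        @"<TD>(?<ProductPrice>[^<]*)</TD>\s*</TR>";

    public static MatchCollection ExtractItems(string itemListingPageSource)
    {
        Regex itemRegex = new Regex(ItemPattern,
            RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
        return itemRegex.Matches(itemListingPageSource);
    }

    public static void Main()
    {
        // Made-up sample rows, not data from the demonstration site.
        string sampleRows =
            "<TR><TD>101</TD><TD>Cotton Shirt</TD><TD>25.00</TD></TR>" +
            "<TR><TD>102</TD><TD>Denim Jeans</TD><TD>40.00</TD></TR>";

        foreach (Match itemMatch in ExtractItems(sampleRows))
        {
            GroupCollection itemGroup = itemMatch.Groups;
            Console.WriteLine("{0}\t{1}\t{2}",
                itemGroup["ProductID"].Value,
                itemGroup["ProductName"].Value,
                itemGroup["ProductPrice"].Value);
        }
    }
}
```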
Once all items on the current page are fetched, the application has to move to the next page. Watin provides a Click method to click the link of a specific page.
The link to click can also be located by other criteria. You can get more information on all the available criteria by visiting http://watin.org.
In this article, I have used "Find.ByText" to find a link by its text and then click it. You can also use a regular expression with such criteria.
Refer to the following code for reference.
// Fetches the page number of the current page.
string linkText = Convert.ToString(pagingGroup[_pageNumber].Value);
// Clicks the given link. For example, if linkText contains "2", Watin clicks the link of the second page.
ieInstance.Link(Find.ByText(linkText)).Click();
// Wait for the operation to complete
ieInstance.WaitForComplete();
// Store the result of the page in the itemListingPageSource variable
itemListingPageSource = ieInstance.Html;
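The paging pattern that fills pagingGroup is not reproduced here, so the following is only a sketch of how page numbers could be collected. It assumes that the GridView pager renders postback links with a "Page$n" argument and that _pageNumber holds the name of a capture group; the pattern, the group name and the sample markup are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PagerLinkExtractor
{
    // Hypothetical group name and pattern.
    private const string _pageNumber = "pageNumber";
    private const string PagingPattern =
        @"__doPostBack\('[^']*','Page\$(?<pageNumber>\d+)'\)";

    public static List<string> ExtractPageNumbers(string itemListingPageSource)
    {
        Regex pagingRegex = new Regex(PagingPattern,
            RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

        List<string> pageNumbers = new List<string>();
        foreach (Match pagingMatch in pagingRegex.Matches(itemListingPageSource))
        {
            GroupCollection pagingGroup = pagingMatch.Groups;
            pageNumbers.Add(pagingGroup[_pageNumber].Value);
        }
        return pageNumbers;
    }

    public static void Main()
    {
        // Made-up pager markup: page 1 is current, pages 2 and 3 are links.
        string samplePager =
            "<SPAN>1</SPAN>" +
            "<A href=\"javascript:__doPostBack('gvItemListing','Page$2')\">2</A>" +
            "<A href=\"javascript:__doPostBack('gvItemListing','Page$3')\">3</A>";

        // Prints 2,3
        Console.WriteLine(string.Join(",", ExtractPageNumbers(samplePager)));
    }
}
```

Each value returned this way could serve as the linkText that is passed to Find.ByText.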
Once the items in a web page are scraped, the application stores them in "Output.txt" using the StreamWriter class of the System.IO namespace.
Now open the "Output.txt" file and observe that it contains all the items of Men, Women and Children Categories.
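Writing the file can be sketched as follows. The helper name, the tab-separated line format and the use of the temporary folder are assumptions for illustration; in the article's setup the directory would be the ScraperEnginePhysicalLocation value.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class OutputWriter
{
    // Writes one line per scraped item to Output.txt in the given folder
    // and returns the full path of the file.
    public static string WriteItems(string outputDirectory, IEnumerable<string> itemLines)
    {
        string outputPath = Path.Combine(outputDirectory, "Output.txt");

        // "false" overwrites an existing file instead of appending to it.
        using (StreamWriter writer = new StreamWriter(outputPath, false))
        {
            foreach (string line in itemLines)
            {
                writer.WriteLine(line);
            }
        }
        return outputPath;
    }

    public static void Main()
    {
        // Made-up item line; the temp folder keeps the sketch runnable anywhere.
        string path = WriteItems(Path.GetTempPath(),
            new[] { "101\tCotton Shirt\t25.00" });
        Console.WriteLine(File.ReadAllText(path));
    }
}
```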
Execute this application using NUnit
To execute the WatinWebScraper application, perform the following procedure:
- Open the NUnit application.
- Click "File" > "Open Project" and navigate to the DLL file ("WatinWebScraping.dll") of the Watin Web Scraper application. Refer to the following image for reference.
- Click the "Run" button as shown in the image above. The application starts scraping the demo web application by navigating to each category, and all the respective items are stored in the "Output.txt" file.