
Website Recursive Url Parser

Posted by Jean Paul Articles | ASP.NET Programming March 28, 2011
This article shares a piece of code that some developers may find useful.

Plenty of C# code is available for extracting the HTTP URLs from a given string. It is harder to find code that will:

  • Accept a URL as an argument and parse the content of that page
  • Fetch every URL found in the content and parse the content of each of those pages
  • Repeat the process until all URLs have been fetched.
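
The three steps above can be sketched roughly as follows. This is only an illustration and not the SpiderLogic code: the names CrawlSketch, Crawl and getLinks are hypothetical, the page download is replaced by a delegate so the sample runs without network access, and a queue is used in place of recursion. The set of already seen URLs is what stops two pages that link to each other from being fetched forever.

```csharp
using System;
using System.Collections.Generic;

public static class CrawlSketch
{
    // Visits startUrl and every URL reachable from it, each exactly once.
    // getLinks is a hypothetical stand-in for "download the page and
    // extract the URLs inside it".
    public static IList<string> Crawl(string startUrl, Func<string, IEnumerable<string>> getLinks)
    {
        var seen = new HashSet<string> { startUrl };
        var pending = new Queue<string>();
        pending.Enqueue(startUrl);
        var result = new List<string>();

        while (pending.Count > 0)
        {
            string current = pending.Dequeue();
            result.Add(current);

            foreach (string link in getLinks(current))
            {
                // HashSet.Add returns false for a URL that was already seen,
                // so each page is queued at most once.
                if (seen.Add(link))
                    pending.Enqueue(link);
            }
        }

        return result;
    }

    public static void Main()
    {
        // A tiny in-memory "site" (made-up URLs) in place of real HTTP requests.
        var site = new Dictionary<string, string[]>
        {
            ["http://example.com/"] = new[] { "http://example.com/a", "http://example.com/b" },
            ["http://example.com/a"] = new[] { "http://example.com/b", "http://example.com/" },
            ["http://example.com/b"] = new string[0],
        };

        var urls = Crawl("http://example.com/",
            u => site.TryGetValue(u, out var links) ? links : new string[0]);

        foreach (string url in urls)
            Console.WriteLine(url);
    }
}
```

In this sample the start page is listed first, followed by the two pages it links to, even though page "a" links back to the start page.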

Scenario

Taking the website http://valuestocks.in (a stock market site) as an example, I would like to get all the URLs inside the website recursively.

Design

The main class is SpiderLogic which contains all necessary methods and properties.

[Image: Url1.gif]

The GetUrls() method parses the website and returns the URLs. It has two overloads.

The first takes two arguments: the URL, and a Boolean indicating whether recursive parsing is needed.

E.g.: GetUrls("http://www.google.com", true);

The second takes three arguments: the URL, the base URL, and the recursive Boolean.

This overload is intended for cases where the URL is a sub-level of the base URL and the web page contains relative paths. The base URL is needed to construct valid absolute URLs from those paths.

E.g.: GetUrls("http://www.whereincity.com/india-kids/baby-names/", "http://www.whereincity.com/", true);
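
To see why a base URL is needed for relative paths, here is a minimal sketch that uses the framework's Uri.TryCreate(Uri, string, out Uri) overload. The class and method names (UrlResolveSketch, ToAbsolute) and the relative paths in Main are made up for illustration; the article's own GetAbsoluteUri() helper is not shown and may work differently.

```csharp
using System;

public static class UrlResolveSketch
{
    // Returns the absolute form of link, or null if it cannot be resolved.
    public static string ToAbsolute(string baseUrl, string link)
    {
        if (!Uri.TryCreate(baseUrl, UriKind.Absolute, out Uri baseUri))
            return null;

        // This overload resolves a relative link against baseUri.
        // A link that is already absolute is returned as it is.
        return Uri.TryCreate(baseUri, link, out Uri resolved)
            ? resolved.AbsoluteUri
            : null;
    }

    public static void Main()
    {
        // Relative path resolved against the site root.
        Console.WriteLine(ToAbsolute("http://www.whereincity.com/", "india-kids/baby-names/"));

        // "../" climbs one level up from the sub-level page.
        Console.WriteLine(ToAbsolute("http://www.whereincity.com/india-kids/baby-names/", "../index.html"));
    }
}
```

The first call yields http://www.whereincity.com/india-kids/baby-names/ and the second yields http://www.whereincity.com/india-kids/index.html. Without a base, neither relative path could be turned into a URL that can be requested.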

Method Body of GetUrls()

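public IList<string> GetUrls(string url, string baseUrl, bool recursive)
{
    if (recursive)
    {
        // Discard the results of any previous call before crawling again
        _urls.Clear();
        RecursivelyGenerateUrls(url, baseUrl);

        return _urls;
    }

    // Non-recursive: return only the urls found on the given page
    return InternalGetUrls(url, baseUrl);
}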


InternalGetUrls()

Another method of interest is InternalGetUrls(), which fetches the content of the URL, extracts the URLs inside it, and converts them to absolute URLs.

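// baseUrl is the page to download; absoluteBaseUrl is the base url used
// to filter absolute urls and to resolve relative ones.
private IList<string> InternalGetUrls(string baseUrl, string absoluteBaseUrl)
{
    IList<string> list = new List<string>();

    // Return an empty list if the page url itself is malformed
    Uri uri = null;
    if (!Uri.TryCreate(baseUrl, UriKind.RelativeOrAbsolute, out uri))
        return list;

    // Download the page content
    string siteContent = GetHttpResponse(baseUrl);

    // Extract every url string found in the content
    var allUrls = GetAllUrls(siteContent);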

    foreach (string uriString in allUrls)
    {
        uri = null;
        if (Uri.TryCreate(uriString, UriKind.RelativeOrAbsolute, out uri))
        {
            if (uri.IsAbsoluteUri)
            {
                // Keep only urls under the base url. Remove this check if urls
                // on other domains or javascript: urls are also needed.
                if (uri.OriginalString.StartsWith(absoluteBaseUrl))
                {
                    list.Add(uriString);
                }
            }
            else
            {
                string newUri = GetAbsoluteUri(uri, absoluteBaseUrl, uriString);
                if (!string.IsNullOrEmpty(newUri))
                    list.Add(newUri);
            }
        }
        else
        {
            // TryCreate failed, so uri is null here and GetAbsoluteUri
            // has only uriString to work with
            if (!uriString.StartsWith(absoluteBaseUrl))
            {
                string newUri = GetAbsoluteUri(uri, absoluteBaseUrl, uriString);
                if (!string.IsNullOrEmpty(newUri))
                    list.Add(newUri);
            }
        }
    }

    return list;
}
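
The body of GetAllUrls() is not listed above. As a rough idea of what a regular-expression based extractor could look like, here is a minimal sketch. The class name, method name and pattern are assumptions, not the article's code, and the pattern only handles quoted href attributes, so it will miss unquoted values and other less common markup.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractSketch
{
    // Matches href="..." or href='...' and captures the quoted value.
    private static readonly Regex HrefPattern = new Regex(
        "href\\s*=\\s*[\"'](?<url>[^\"']+)[\"']",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns the raw href values in the order they appear in the html.
    public static IList<string> ExtractHrefs(string html)
    {
        var urls = new List<string>();
        foreach (Match match in HrefPattern.Matches(html))
            urls.Add(match.Groups["url"].Value);
        return urls;
    }

    public static void Main()
    {
        // Sample markup with one relative and one absolute link.
        string html = "<a href=\"/about.html\">About</a> <a HREF='http://example.com/x'>X</a>";

        foreach (string url in ExtractHrefs(html))
            Console.WriteLine(url);
    }
}
```

The extracted values are returned exactly as written in the page, which is why a method such as InternalGetUrls() still has to convert the relative ones to absolute URLs afterwards.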

Handling Exceptions

There is an OnException delegate that can be used to receive the exceptions that occur while parsing.
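
The signature of OnException is not shown in this article, so the following is only a sketch of the general pattern under an assumed Action<Exception> shape. SpiderSketch and ProcessAll are hypothetical names. The idea is that a failure on one URL is reported to the caller through the delegate and the remaining URLs are still processed.

```csharp
using System;
using System.Collections.Generic;

public class SpiderSketch
{
    // Assumed shape of the callback; the real OnException may differ.
    public Action<Exception> OnException { get; set; }

    // Runs work for each url and returns how many calls succeeded.
    // A failure is passed to OnException and the loop moves on.
    public int ProcessAll(IEnumerable<string> urls, Action<string> work)
    {
        int succeeded = 0;
        foreach (string url in urls)
        {
            try
            {
                work(url);
                succeeded++;
            }
            catch (Exception ex)
            {
                // Does nothing when no handler has been attached.
                OnException?.Invoke(ex);
            }
        }
        return succeeded;
    }

    public static void Main()
    {
        var spider = new SpiderSketch();
        spider.OnException = ex => Console.WriteLine("Failed: " + ex.Message);

        int ok = spider.ProcessAll(
            new[] { "http://example.com/", "not a url" },
            url =>
            {
                // Stand-in for real parsing work: reject malformed urls.
                if (!Uri.IsWellFormedUriString(url, UriKind.Absolute))
                    throw new FormatException("Invalid URL: " + url);
            });

        Console.WriteLine(ok + " succeeded");
    }
}
```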

Tester Application

A tester Windows application is included with the source code of the article. You can try executing it.

The form accepts a base URL as input. Clicking the Go button parses the content of that URL and extracts all the URLs in it. For recursive parsing, check the Is Recursive check box.

[Image: Url2.gif]

Next Part

In the next part of the article, I would like to create a URL verifier website that verifies all the URLs in a website. I agree that a search will turn up free providers of such a service. My aim is to learn and to develop custom code that is extensible and reusable across multiple projects by the community.


Thanks a lot Sam. I prefer using the shldoc library as the SpiderLogic is a class library. Surely I will refer your articles. Thanks a lot for helping me. Have a nice day.

Posted by Jean Paul Mar 29, 2011

Very good. So I think what you can do is to create an InternetExplorer object and put the HTML in there. If you are not using Windows Forms then I think you can use the classes and interfaces in the ShlDocVw library or something like that. There are some articles in this web site that will help with that, including my article.

Posted by Sam Hobbs Mar 28, 2011

That is a cool idea Sam to use HTML DOM. I found that the regular expression is failing when given a particular website. I cannot find the reason. So in this case we should be able to run over with HTML DOM. Thanks for sharing the info. I will compare the speed of both as well.

Posted by Jean Paul Mar 28, 2011

I know that many people use RegularExpressions to parse HTML. Thank you for showing how to get all the urls in a website using RegularExpressions. I prefer to use the DOM though. The HTML can be put in a HtmlDocument and the Links Property can be used to get all the links in a web page. A lot of your code, such as recursion, is still useful even if the HtmlDocument class was used instead of RegularExpressions.

Posted by Sam Hobbs Mar 28, 2011