Website Recursive Url Parser


In this article I would like to share a piece of code that may be useful to other developers.

We can find plenty of C# code that parses the http urls in a given string, but it is difficult to find code that will:

  • Accept a url as an argument and parse the site content
  • Fetch all urls in the site content and parse the site content of each url
  • Repeat the above process until all urls are fetched.

Scenario

Taking the website http://valuestocks.in (a stock market site) as an example, I would like to get all the urls inside the website recursively.

Design

The main class is SpiderLogic, which contains all the necessary methods and properties.

[Image: Url1.gif - the SpiderLogic class]

The GetUrls() method is used to parse the website and return the urls. There are two overloads for this method.

The first overload takes two arguments: the url and a Boolean indicating whether recursive parsing is needed.

E.g.: GetUrls("http://www.google.com", true);

The second overload takes three arguments: the url, the base url, and the recursive Boolean.

This overload is intended for cases where the url is a sub-level of the base url and the web page contains relative paths. In order to construct valid absolute urls, the base url argument is necessary.

E.g.: GetUrls("http://www.whereincity.com/india-kids/baby-names/ ", http://www.whereincity.com/ , true);

Method Body of GetUrls()

public IList<string> GetUrls(string url, string baseUrl, bool recursive)
{
    if (recursive)
    {
        _urls.Clear();
        RecursivelyGenerateUrls(url, baseUrl);

        return _urls;
    }
    else
        return InternalGetUrls(url, baseUrl);
}
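
The recursive overload delegates to RecursivelyGenerateUrls(), whose body is not listed above. A minimal sketch of how such a method could look is given below; it assumes _urls is a List<string> field on SpiderLogic and reuses InternalGetUrls() for each page, which may differ from the implementation shipped with the article's source code.

// Hypothetical sketch of the recursive step - assumes _urls is a
// List<string> field used to collect results and to avoid visiting
// the same url twice.
private void RecursivelyGenerateUrls(string baseUrl, string absoluteBaseUrl)
{
    foreach (string url in InternalGetUrls(baseUrl, absoluteBaseUrl))
    {
        if (_urls.Contains(url))
            continue;               // Already visited - prevents infinite loops

        _urls.Add(url);
        RecursivelyGenerateUrls(url, absoluteBaseUrl);  // Descend into the page
    }
}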


InternalGetUrls()

Another method of interest is InternalGetUrls(), which fetches the content of the url, parses the urls inside it, and constructs the absolute urls.

private IList<string> InternalGetUrls(string baseUrl, string absoluteBaseUrl)
{
    IList<string> list = new List<string>();

    Uri uri = null;
    if (!Uri.TryCreate(baseUrl, UriKind.RelativeOrAbsolute, out uri))
        return list;

    // Get the http content
    string siteContent = GetHttpResponse(baseUrl);

    var allUrls = GetAllUrls(siteContent);

    foreach (string uriString in allUrls)
    {
        uri = null;
        if (Uri.TryCreate(uriString, UriKind.RelativeOrAbsolute, out uri))
        {
            if (uri.IsAbsoluteUri)
            {
                if (uri.OriginalString.StartsWith(absoluteBaseUrl)) // Remove this check if urls from a different domain or javascript: urls are needed
                {
                    list.Add(uriString);
                }
            }
            else
            {
                string newUri = GetAbsoluteUri(uri, absoluteBaseUrl, uriString);
                if (!string.IsNullOrEmpty(newUri))
                    list.Add(newUri);
            }
        }
        else
        {
            if (!uriString.StartsWith(absoluteBaseUrl))
            {
                string newUri = GetAbsoluteUri(uri, absoluteBaseUrl, uriString);
                if (!string.IsNullOrEmpty(newUri))
                    list.Add(newUri);
            }
        }
    }

    return list;
}
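
InternalGetUrls() relies on a few helpers that are not listed here: GetHttpResponse(), GetAllUrls() and GetAbsoluteUri(). The real implementations ship with the article's source code; the sketch below only illustrates how the first two could be written, with WebClient, the regular expression and the Action<Exception> signature for OnException all being assumptions on my part. The class is marked partial purely for illustration.

using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper sketches - the actual implementations in the
// article's source code may differ.
public partial class SpiderLogic
{
    // Assumed signature for the exception callback described below.
    public Action<Exception> OnException { get; set; }

    private string GetHttpResponse(string url)
    {
        try
        {
            using (var client = new WebClient())
                return client.DownloadString(url);   // Fetch the raw HTML
        }
        catch (Exception ex)
        {
            if (OnException != null)
                OnException(ex);                     // Report the error, keep crawling
            return string.Empty;
        }
    }

    private IList<string> GetAllUrls(string siteContent)
    {
        var result = new List<string>();

        // Assumed pattern: pull the value of every href="..." attribute.
        foreach (Match match in Regex.Matches(siteContent,
            "href\\s*=\\s*\"([^\"]+)\"", RegexOptions.IgnoreCase))
        {
            result.Add(match.Groups[1].Value);
        }

        return result;
    }
}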

Handling Exceptions

There is an OnException delegate that can be used to receive the exceptions that occur while parsing.
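
The exact delegate signature is not listed in the article, so the short example below assumes an Action<Exception> callback; adjust it to the signature in the source code if it differs.

var spider = new SpiderLogic();

// Attach a simple logging handler. If OnException is declared as an
// event in the actual code, use += instead of plain assignment.
spider.OnException = ex => Console.WriteLine("Error while parsing: " + ex.Message);

IList<string> urls = spider.GetUrls("http://valuestocks.in", true);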

Tester Application

A tester Windows application is included with the source code of the article. You can try executing it.

The form accepts a base url as input; clicking the Go button parses the content of the url and extracts all the urls in it. If you need recursive parsing, check the Is Recursive check box.

[Image: Url2.gif - the tester application form]

Next Part

In the next part of the article, I would like to create a url verifier website that verifies all the urls in a website. I agree that a quick search will turn up free providers of such a service; my aim is to learn and develop custom code that is extensible and reusable across multiple projects by the community.

