Creating Link Extractor and Filter in C#: Part 2

Introduction

Creating Link Extractor and Filter: Part 1

This article completes my previous article, Creating Link Extractor and Filter. There we covered how to extract all the links from a page and show them in a list box. In this article we will create the filters. A filter narrows the links down by a supplied parameter, which makes it much easier to find the link you want. So let's start creating the filter for our links.

Creating the Link Filter

As I always say before coding anything, understand clearly what you intend to build. So let's define our requirements first for creating the filter.

Logic for Link Filter

  1. First we need a filter for the links.
  2. There are several ways to get the filter from the user. A text box is one option, but the user might type something invalid. Check boxes are a second option and look good, because the user can select multiple filters. Radio buttons are a third option, but then the user can select only one filter. So we will go with the second option, check boxes.
  3. Since the user can select multiple filters, we keep them in a list. When a box is checked its filter is added to the list; when it is unchecked the filter is removed.
  4. Now we have the filter list. The next question is when to run the filtering. We have two options. The first is to provide a button that calls the filtering method. The second is to attach a change listener to the check boxes so that the filtering runs whenever the selection changes.
  5. Both methods have their own advantages. For this article I'll be implementing both ways.
  6. Now the catch comes. How to filter?
  7. To filter the links, first we need to understand them. Almost all useful links end with some kind of extension, and the filter list also contains extensions. So if we compare the extension of each link with the extensions in the filter list, we are done.
  8. So the task becomes how to get the extension of a link. For that we can use the Path.GetExtension() method from the System.IO namespace. It returns the extension (including the leading dot) of whatever file path or link you pass in.
  9. Now we have the extension. Next we compare it with the extensions in the filter list. If we find a match, the link is added to the filtered list; otherwise it is skipped.
  10. At the end we just need to show the filtered list to the user. While doing all this we also need to keep a copy of all the links of the page, so that if the user removes the filter we don't need to request the page again. This is a performance optimization.
  11. So all done. It's time to convert the preceding logic into code. Let's start coding.
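
Before moving to the form, steps 7 to 9 can be tried out in a small console program. This is only a sketch of the idea, not the code used in the project: the class name LinkFilterSketch, the method FilterLinks() and the sample URLs are made up for illustration. Only Path.GetExtension() and the "empty filter list means show everything" rule come from the logic above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class LinkFilterSketch
{
    // Hypothetical helper: returns the links whose extension is in the filter list.
    // An empty filter list means "no filtering", so every link is returned.
    public static List<string> FilterLinks(List<string> links, List<string> filters)
    {
        var result = new List<string>();
        foreach (string link in links)
        {
            // Path.GetExtension returns e.g. ".pdf", or "" when there is no extension.
            string extension = Path.GetExtension(link);
            if (filters.Count == 0 || filters.Contains(extension))
            {
                result.Add(link);
            }
        }
        return result;
    }

    public static void Main()
    {
        // Sample data, invented for this example.
        var links = new List<string>
        {
            "http://example.com/docs/guide.pdf",
            "http://example.com/files/source.zip",
            "http://example.com/about"
        };
        var filters = new List<string> { ".pdf" };

        // Prints only the guide.pdf link.
        foreach (string link in FilterLinks(links, filters))
        {
            Console.WriteLine(link);
        }
    }
}
```

Keeping the filtering in a method that takes and returns plain lists makes it easy to test without the form; the form code later in this article does the same work directly against its own fields.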

Coding the filter

  1. Open the project that we created in Part 1.
  2. Add one group box to your form and set its caption to "Choose Filters". You can find the group box under the Containers section of the Toolbox. To set the caption, open the Properties window and edit the Text field.

    img01.jpg
     
  3. Now add two checkboxes to the "Choose Filters" group box. To add them, open the Toolbox and drop them onto the group box one by one. Set their text to "PDF" and "ZIP".

    img 02.jpg
     
  4. Now add one button to your form and set its text to "Sort Links".

    img 03.jpg
     
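  5. Now double-click each of the two checkboxes. A CheckedChanged handler will be generated for each one in your cs file.
  6. Write the following code in handlers 1 and 2 respectively. Both use a list named filterParameters, declared as a field of the form: List<string> filterParameters = new List<string>();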

        Check Box 1 (PDF)

       if (filterParameters.Contains(".pdf"))
       {
           filterParameters.Remove(".pdf");
       }
       else
       {
           filterParameters.Add(".pdf");
       }
       sortLinks();

        Check Box 2 (ZIP)

       if (filterParameters.Contains(".zip"))
       {
           filterParameters.Remove(".zip");
       }
       else
       {
           filterParameters.Add(".zip");
       }
       sortLinks();

  7. sortLinks() is the method that will filter our links. In the code above, each handler adds its extension to the list when the user checks the box and removes it when the user unchecks the box. Initially all the checkboxes are unchecked and the list is empty, so the list always matches the boxes that are currently checked.

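  8. Now before we implement sortLinks() we need to keep a backup of all the links that we grabbed. Declare the list as a field of the form (not as a local variable, because sortLinks() must be able to read it) and fill it at the end of extractLink(), once the links have been added to the list box:

    List<string> allLinks = new List<string>();

    foreach (var item in checkedListBox1.Items)
    {
        allLinks.Add(item.ToString());
    }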
     

  9. Now it's time to implement the sortLinks() method. It is very simple if we follow the logic stated above. The code will look like this:

    private void sortLinks()
    {
        List<string> temp = new List<string>();
        foreach (var item in allLinks)
        {
            if (filterParameters.Contains(Path.GetExtension(item)) || filterParameters.Count == 0)
            {
                temp.Add(item);
            }
        }
        updateList(temp);
    }

    The second condition restores the full list when no filter is selected. Path.GetExtension() lives in the System.IO namespace, so make sure "using System.IO;" is at the top of the file. The updateList() method refreshes the checked list box with the filtered links.
     

  10. To call this method from the "Sort Links" button, just double-click that button and add the sortLinks() call to its click handler.

       sortLinks();

  11. The updateList() method contains only this loop:

    checkedListBox1.Items.Clear();
    foreach (var item in temp)
    {
        checkedListBox1.Items.Add(item);
    }

     

  12. To make it even more useful, add a click listener to your checked list box and put the following code in the click handler:

    // SelectedItem is null when the click does not land on an item.
    if (checkedListBox1.SelectedItem != null)
    {
        Clipboard.SetText(checkedListBox1.SelectedItem.ToString());
    }

    This copies the URL you clicked to the clipboard.
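
The check handlers in step 6 decide whether to add or remove an extension by looking at what the list already contains. An alternative, sketched below, is to pass the check box state in and make the list agree with it. SetFilter() and the class FilterStateSketch are hypothetical names introduced for this example; they are not part of the project above.

```csharp
using System;
using System.Collections.Generic;

public static class FilterStateSketch
{
    // Hypothetical helper: makes the filter list agree with a check box state.
    public static void SetFilter(List<string> filters, string extension, bool isChecked)
    {
        if (isChecked)
        {
            // Add the extension only once, even if the method is called twice.
            if (!filters.Contains(extension))
            {
                filters.Add(extension);
            }
        }
        else
        {
            // List<T>.Remove does nothing if the extension is not present.
            filters.Remove(extension);
        }
    }

    public static void Main()
    {
        var filters = new List<string>();
        SetFilter(filters, ".pdf", true);
        SetFilter(filters, ".pdf", true);   // repeated call adds no duplicate
        SetFilter(filters, ".zip", true);
        SetFilter(filters, ".pdf", false);
        Console.WriteLine(string.Join(", ", filters)); // prints ".zip"
    }
}
```

In the form, a handler could then call SetFilter(filterParameters, ".pdf", checkBox1.Checked) followed by sortLinks(). The benefit is that the list cannot drift out of step with the boxes, for example if a box is pre-checked in the designer.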


Output

op1.JPG

op2.JPG


op3.JPG


op4.JPG


op5.JPG


You can find the complete code below:

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Windows.Forms;

namespace linkGrabber
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        // Grab button: load the page in a WebBrowser and wait for it to finish.
        private void button1_Click(object sender, EventArgs e)
        {
            checkedListBox1.Items.Clear();
            allLinks.Clear();
            WebBrowser wb = new WebBrowser();
            wb.Url = new Uri(textBox1.Text);
            wb.DocumentCompleted += wb_DocumentCompleted;
        }

        // Runs once the page has loaded; extracts the links and detaches itself.
        void wb_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
        {
            HtmlDocument source = ((WebBrowser)sender).Document;
            extractLink(source);
            ((WebBrowser)sender).DocumentCompleted -= wb_DocumentCompleted;
        }

        // Backup of every link on the page, so filters can change without a new request.
        List<string> allLinks = new List<string>();

        private void extractLink(HtmlDocument source)
        {
            HtmlElementCollection anchorList = source.GetElementsByTagName("a");
            foreach (HtmlElement item in anchorList)
            {
                string href = item.GetAttribute("href");
                // Skip links that are already listed (compare the URL text, not the element).
                if (checkedListBox1.Items.Contains(href))
                {
                    continue;
                }
                checkedListBox1.Items.Add(href);
            }
            foreach (var item in checkedListBox1.Items)
            {
                allLinks.Add(item.ToString());
            }
        }

        // Extensions currently selected by the user, for example ".pdf".
        List<string> filterParameters = new List<string>();

        // PDF check box: toggle ".pdf" in the filter list, then refresh.
        private void checkBox1_CheckedChanged(object sender, EventArgs e)
        {
            if (filterParameters.Contains(".pdf"))
            {
                filterParameters.Remove(".pdf");
            }
            else
            {
                filterParameters.Add(".pdf");
            }
            sortLinks();
        }

        // ZIP check box: toggle ".zip" in the filter list, then refresh.
        private void checkBox2_CheckedChanged(object sender, EventArgs e)
        {
            if (filterParameters.Contains(".zip"))
            {
                filterParameters.Remove(".zip");
            }
            else
            {
                filterParameters.Add(".zip");
            }
            sortLinks();
        }

        // Keeps the links whose extension is selected; with no filter, keeps all of them.
        private void sortLinks()
        {
            List<string> temp = new List<string>();
            foreach (var item in allLinks)
            {
                if (filterParameters.Contains(Path.GetExtension(item)) || filterParameters.Count == 0)
                {
                    temp.Add(item);
                }
            }
            updateList(temp);
        }

        // Replaces the contents of the list box with the given links.
        private void updateList(List<string> temp)
        {
            checkedListBox1.Items.Clear();
            foreach (var item in temp)
            {
                checkedListBox1.Items.Add(item);
            }
        }

        // Copies the clicked URL to the clipboard.
        private void checkedListBox1_Click(object sender, EventArgs e)
        {
            // SelectedItem is null when the click does not land on an item.
            if (checkedListBox1.SelectedItem != null)
            {
                Clipboard.SetText(checkedListBox1.SelectedItem.ToString());
            }
        }
    }
}
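
One limitation of comparing Path.GetExtension(item) with the filter list directly is that the comparison is case-sensitive and sees the whole URL text, so a link such as "Guide.PDF?download=1" would not match ".pdf". The sketch below shows one possible way around this, assuming the links are absolute URLs: take the extension from the path part of the URL and compare it case-insensitively. GetUrlExtension() and the class UrlExtensionSketch are hypothetical names, and the sample URL is invented.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class UrlExtensionSketch
{
    // Hypothetical helper: returns the extension of the path part of a URL,
    // ignoring any query string or fragment.
    public static string GetUrlExtension(string link)
    {
        Uri uri;
        if (Uri.TryCreate(link, UriKind.Absolute, out uri))
        {
            // AbsolutePath excludes "?query" and "#fragment".
            return Path.GetExtension(uri.AbsolutePath);
        }
        // Not an absolute URL: fall back to the plain text.
        return Path.GetExtension(link);
    }

    public static void Main()
    {
        // A case-insensitive set, so ".PDF" and ".pdf" are treated as the same filter.
        var filters = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { ".pdf" };

        string link = "http://example.com/docs/Guide.PDF?download=1";
        string extension = GetUrlExtension(link);

        Console.WriteLine(extension);                   // prints ".PDF"
        Console.WriteLine(filters.Contains(extension)); // prints "True"
    }
}
```

To use this idea in the form, filterParameters would become a HashSet<string> created with StringComparer.OrdinalIgnoreCase, and sortLinks() would call GetUrlExtension(item) instead of Path.GetExtension(item).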

 

 

 

Summary

Our link extractor and filter application is complete. You can now grab the download links from any site that provides direct file links. You can extend it by adding more filters, or by adding a file downloader so that files can be downloaded straight from the list. You can also process links that do not point directly to a file but make a server request instead. If you extend this project, don't forget to share it in the comments. Thank you for reading this article; if you liked it, feel free to share it and leave a comment.

