Having trouble with the DocumentComplete event of the AxWebBrowser component for a crawler

Dec 2 2009 2:55 PM
Hi folks,

This is only the second time I have asked a question on C# Corner, so pardon me if I am not very clear about what I want.

Briefly, this is what I want to do.

I want to write a crawler that crawls a website and extracts all of its images. I understand that using the mshtml parser would make the work a lot easier.

So here's what I am trying to do.

I have a form where the user enters a web address and then clicks the Submit button.

Once the button is clicked, the browser control navigates to the web page, and a DocumentComplete event is raised when the page has loaded.

When that happens, the loaded document is read as an mshtml document. From it I collect all the images on the page and save them to disk. I also grab the 'alt' text of each image and write it into a new text file. A counter, the variable 'i', automatically generates the names of the saved files.

Now comes the crawling part.

The browser now has to navigate to each of the elements in the list and download all the images found at that link.
But this is not possible if the browser navigates directly to the .gif link of an image, because then it is just an image stored somewhere on the server and not a proper HTML page.

To overcome this, I also collect the anchor elements of the first page. I then navigate to each of these anchor elements and parse the resulting page to see whether it contains any images. I also have a condition that an image is processed only when both its height and its width are above 300.
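
The two checks described above could be sketched roughly as follows. This is only an illustration under assumptions: the LinkFilter class, its method names, and the list of extensions are hypothetical and are not part of the code posted below.

```csharp
using System;

// Hypothetical helper, not part of the posted form code.
public static class LinkFilter
{
    // Assumed list of image extensions; extend as needed.
    private static readonly string[] ImageExtensions =
        { ".gif", ".jpg", ".jpeg", ".png", ".bmp" };

    // Returns true when the URL path ends in a known image extension,
    // i.e. the link points at a bare image file rather than an HTML page.
    public static bool PointsToImage(string url)
    {
        Uri uri;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri))
        {
            return false;
        }

        // AbsolutePath excludes the query string, so "page.html?pic=a.gif" is not matched.
        foreach (string extension in ImageExtensions)
        {
            if (uri.AbsolutePath.EndsWith(extension, StringComparison.OrdinalIgnoreCase))
            {
                return true;
            }
        }
        return false;
    }

    // The size condition: both dimensions must exceed the minimum.
    public static bool IsLargeEnough(int width, int height, int minimum)
    {
        return width > minimum && height > minimum;
    }
}
```

With something like this, a link would be added to the link list only when PointsToImage returns false, and an image would be saved only when IsLargeEnough(img.width, img.height, 300) returns true.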

Now here's the problem.

In order to parse the links in the linkList I have to navigate to each of them. So, in a foreach loop, I call Navigate for each of the anchor elements in the linkList.

I know this is a very lousy way of crawling, and I doubt it can even be called crawling, because all it would do is crawl breadthwise and it may never return to the linkList at all! But hey, I am absolutely new to programming and am trying to figure this stuff out for the first time.

I was assuming that whenever the browser navigates to a link in the linkList, the DocumentComplete event would be raised and the whole process would repeat.

Believe it or not, it does work. I am able to parse up to 30 pages of the following site: "http://www.saibaba.com". The problem seems to be that each document takes some time to load before its DocumentComplete event is raised, so a lot of the links in the list get skipped.
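
One possible way around the skipped links, sketched here under assumptions, is to stop looping over the list and instead start exactly one navigation at a time, picking the next link only after the previous page has raised its event. The CrawlQueue class below is hypothetical and not part of the posted code; it only does the bookkeeping and leaves the browser control out so that it can be tried on its own.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: hands out one URL at a time and remembers every URL
// that has ever been queued, so that no page is visited twice.
public class CrawlQueue
{
    private readonly Queue<string> pending = new Queue<string>();
    private readonly HashSet<string> seen = new HashSet<string>(StringComparer.Ordinal);

    // Number of URLs still waiting to be visited.
    public int Count
    {
        get { return pending.Count; }
    }

    // Adds a URL unless it is empty or was queued before. Returns true if it was added.
    public bool Enqueue(string url)
    {
        if (string.IsNullOrEmpty(url) || !seen.Add(url))
        {
            return false;
        }
        pending.Enqueue(url);
        return true;
    }

    // Gets the next URL to visit, or returns false when nothing is left.
    public bool TryDequeue(out string url)
    {
        if (pending.Count == 0)
        {
            url = null;
            return false;
        }
        url = pending.Dequeue();
        return true;
    }
}

// Intended use inside the DocumentComplete handler (sketch only):
//     queue.Enqueue(link.href);                         // for every anchor found on the page
//     string next;
//     if (queue.TryDequeue(out next)) NavigateUrl(next); // one navigation per completed page
```

Because the next Navigate call is made only from inside the handler, each page gets the chance to finish loading before the following one is requested. Note that DocumentComplete is also raised for each frame of a framed page, so a real handler would probably need to check for that as well.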

I am wondering whether I am following the right procedure. Some people in the community, and other material I found while researching, say that using HttpWebRequest and HttpWebResponse is a better way of doing this, because the AxWebBrowser component waits until the whole page is loaded and is tied to a UI component.
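
As far as I understand the suggestion, it would look roughly like the sketch below: download the HTML with HttpWebRequest and pull the image addresses out of the text, with no browser control involved. The PageFetcher class and its method names are hypothetical, the regular expression is a deliberate simplification and not a real HTML parser, and the page is assumed to be readable with the StreamReader default encoding.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the posted form code.
public static class PageFetcher
{
    // Simplified pattern for <img ... src="...">; it will miss unusual markup.
    private static readonly Regex ImgSrc = new Regex(
        "<img\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']+)[\"']",
        RegexOptions.IgnoreCase);

    // Downloads the raw HTML of a page without involving any UI control.
    public static string FetchHtml(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    // Returns the absolute URL of every image referenced in the HTML,
    // resolving relative src values against the address of the page.
    public static List<string> ExtractImageUrls(string html, Uri baseUri)
    {
        List<string> result = new List<string>();
        foreach (Match match in ImgSrc.Matches(html))
        {
            Uri absolute;
            if (Uri.TryCreate(baseUri, match.Groups[1].Value, out absolute))
            {
                result.Add(absolute.AbsoluteUri);
            }
        }
        return result;
    }
}
```

A call such as PageFetcher.ExtractImageUrls(PageFetcher.FetchHtml(url), new Uri(url)) would then give the list of images for one page. Unlike the browser control, this does not run scripts or load frames, and it gives no image width or height, so the size condition would have to be checked after downloading each image.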

I also have no experience with threading. It would be of great help if someone could suggest a better way of doing this crawl. Thanks in advance; I hope some benevolent angel sees this post soon and helps me out.

Regards.
Vinayak.

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using System.Net;
using System.IO;
using mshtml;


namespace SaiCrawler
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();

        }

        public int i = 1;

        // imgList is going to contain a list of all images.
        public List<string> imgList = new List<string>();

        //linkList is going to contain a list of all hyperlinks
        public List<string> linkList = new List<string>();

        // Called when a document finishes loading in the browser control.
        private void axWebBrowser1_DocumentComplete(object sender, AxSHDocVw.DWebBrowserEvents2_DocumentCompleteEvent e)
        {
            // Get the document that the browser control has just loaded.
            mshtml.IHTMLDocument2 doc = axWebBrowser1.Document as mshtml.IHTMLDocument2;
            if (doc == null)
            {
                return; // nothing to parse
            }

            // imgCollection and linkCollection hold the images and links of the loaded page.
            IHTMLElementCollection imgCollection = doc.images;
            IHTMLElementCollection linkCollection = doc.links;

            try
            {
                // Save every image whose width and height are both above 300, and
                // write the alternate text of that image into a separate text file.
                foreach (IHTMLImgElement img in imgCollection)
                {
                    if (!string.IsNullOrEmpty(img.src) && img.width > 300 && img.height > 300)
                    {
                        using (TextWriter tw = new StreamWriter("image" + i + ".txt"))
                        using (PictureBox picture = new PictureBox())
                        {
                            picture.Load(img.src);
                            // Pass the format explicitly; the file extension alone does not select it.
                            picture.Image.Save(i + ".jpeg", System.Drawing.Imaging.ImageFormat.Jpeg);
                            imgList.Add(img.src);
                            tw.WriteLine(img.alt);
                            Console.WriteLine(img.href);
                        }
                        i++;
                    }
                }

                // Collect the links once per page, skipping duplicates.
                // doc.links can also contain <area> elements, hence the 'as' cast.
                foreach (object item in linkCollection)
                {
                    IHTMLAnchorElement link = item as IHTMLAnchorElement;
                    if (link != null && !string.IsNullOrEmpty(link.href) && !linkList.Contains(link.href))
                    {
                        linkList.Add(link.href);
                    }
                }
            }
            catch (Exception exc)
            {
                MessageBox.Show(exc.ToString());
            }

            // Navigate to each collected link in turn. Note that Navigate returns
            // immediately, without waiting for the page to load.
            foreach (string s in linkList)
            {
                NavigateUrl(s);
            }

        }

        private void btnSubmit_Click(object sender, EventArgs e)
        {

            NavigateUrl(txtUrl.Text);

        }

        public void NavigateUrl(string url)
        {
            object miss = Type.Missing;

            axWebBrowser1.Navigate(url, ref miss, ref miss, ref miss, ref miss);
        }


    }
}

