Scrape HTML and sort the data in Excel (C#)

Question

Hello, I am trying to scrape a website (it is public data) and store the results in Excel or some other database. I am new to all this. I was able to download the HTML source to a text or Excel file, but the data is very unorganized, and I want to organize it into a readable format. These are the things I am trying to do:
1) Get data from the part of the website inside <div id="container">. There are links within that div, so I want to follow all of them and fetch data from the linked pages.
2) The collected data should be readable and formatted.
I could get the contents of the first page but could not follow the links. Could you please suggest how I should proceed? I looked around and found that HtmlAgilityPack is one way to do this, but I have never used it before and I am stuck. I have included the code that I have written so far.
Thank you
This is my form class:
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Windows.Forms;
using System.Net;
using System.IO;

namespace tryScrape1
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void button1_Click(object sender, EventArgs e)
        {
            try
            {
                // Gets the URL entered by the user from the textbox
                string url = textBox1.Text;

                // Sets up the output path for the scraped data
                string directory = @"c:\temp\";
                string filename = "scrapped_data.xls";
                string path = Path.Combine(directory, filename);

                // Downloads the HTML source of the page
                string sourceCode = GetSource.getSourceCode(url);

                // Marks the start point of the scrape
                int startIndex = sourceCode.IndexOf("paddingbig");

                // Marks the end point of the HTML to scrape
                int endIndex = sourceCode.IndexOf("321,820");

                // IndexOf returns -1 when a marker is missing, which would make Substring throw
                if (startIndex < 0 || endIndex < startIndex)
                {
                    MessageBox.Show("Could not find the start/end markers in the page.");
                    return;
                }

                // Gets the string between startIndex and endIndex
                sourceCode = sourceCode.Substring(startIndex, endIndex - startIndex);

                // Writes the extracted HTML to the file; 'using' closes the writer even on error
                using (StreamWriter sWriter = new StreamWriter(path))
                {
                    sWriter.Write(sourceCode);
                }

                MessageBox.Show("Contents have been scraped!");
                textBox1.Clear();
            }
            // Reached if the textbox is blank, the URL is invalid, or the page cannot be scraped
            catch (Exception ex)
            {
                MessageBox.Show("Could not scrape the URL: " + ex.Message);
            }
        }
    }
}

GetSource Class:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text;
using System.Threading.Tasks;
using System.IO;

namespace tryScrape1
{
    class GetSource
    {
        public static string getSourceCode(string url)
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

            // 'using' disposes the response and the reader even if reading throws
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            using (StreamReader streamReader = new StreamReader(response.GetResponseStream()))
            {
                return streamReader.ReadToEnd();
            }
        }
    }
}

Richard Porter · Answer

Hi, yes, I believe HtmlAgilityPack would simplify this task for you. I would recommend checking out some articles that demonstrate HtmlAgilityPack's usage, for example: http://www.codeproject.com/Articles/659019/Scraping-HTML-DOM-elements-using-HtmlAgilityPack-H

In short, to retrieve all the pages that are linked from the targeted page, you would need to do something like the following:

    var url = new Uri("http://www.example.com/");
    var web = new HtmlWeb();
    var homePageDocument = web.Load(url.ToString());
    // Select only anchors that actually carry an href attribute
    var homePageAnchorTags = homePageDocument.DocumentNode.SelectNodes("//a[@href]");

    List<HtmlDocument> linkedPages = new List<HtmlDocument>();
    foreach (Uri anchorUrl in homePageAnchorTags
        // Resolve relative links against the page URL before comparing hosts
        .Select(a => new Uri(url, a.Attributes["href"].Value))
        .Where(u => url.Host == u.Host))
    {
        linkedPages.Add(web.Load(anchorUrl.ToString()));
    }

Regarding the second requirement, two options come to mind.

The first is a known workaround in which you save the HTML-formatted content into a file with an Excel extension. This can work because MS Excel can read HTML, but when opening the file you will first get a warning that the file's format and extension do not match.

The second approach is to create a real Excel file, which I'm afraid is not a simple task. You would need to parse the HTML-formatted content and convert each element to a corresponding workbook element. You can, however, simplify this task as well by using an Excel library for C#. This library can take care of converting HTML-formatted content to an Excel file in .NET.
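Returning to the first requirement: the question asks only for the links inside <div id="container">, not every link on the page. A minimal sketch of narrowing the selection with XPath might look like the following. The class and method names (`ContainerLinks`, `Extract`) are made up for illustration, and the code assumes the HtmlAgilityPack NuGet package is referenced and that the div really has the id `container` as described in the question.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package, not part of the .NET base class library

// Hypothetical helper, not part of HtmlAgilityPack itself.
public static class ContainerLinks
{
    // Returns absolute http/https URLs of all <a href> elements inside <div id="container">.
    public static List<Uri> Extract(string html, Uri pageUrl)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<Uri>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//div[@id='container']//a[@href]");
        if (anchors == null)
        {
            return links;
        }

        foreach (var anchor in anchors)
        {
            // Attribute values come back as written in the page, so decode entities such as &amp;.
            string href = HtmlEntity.DeEntitize(anchor.GetAttributeValue("href", ""));

            // Resolve relative links against the page URL; skip unparsable or non-web links.
            Uri absolute;
            if (Uri.TryCreate(pageUrl, href, out absolute)
                && (absolute.Scheme == Uri.UriSchemeHttp || absolute.Scheme == Uri.UriSchemeHttps))
            {
                links.Add(absolute);
            }
        }

        return links;
    }
}
```

Each returned `Uri` could then be passed to `web.Load(uri.ToString())` as in the loop above, or to the `GetSource.getSourceCode` method from the question.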

Sabin Sapkota · Answer

Thank you for your input. What I have managed so far is to scrape all the data from the links I wanted into a text file. Now I am working on parsing it to fit my needs (which, of course, I have been sweating over). I will look into the library that you referenced and try to work it out, and I will post an update on how things turn out. Thank you.
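Once the scraped text has been parsed into rows, the HTML-with-an-Excel-extension workaround described in the earlier answer can be tried without any extra library. The sketch below is only an illustration under that assumption: the names `HtmlTableExport`, `BuildHtmlTable` and `Save` are invented here, and the rows are assumed to already be available as string arrays.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text;

// Hypothetical helper: writes rows as an HTML table that Excel can open.
public static class HtmlTableExport
{
    // Builds a minimal HTML document containing one table; each array is one row.
    public static string BuildHtmlTable(IEnumerable<string[]> rows)
    {
        var sb = new StringBuilder();
        sb.AppendLine("<html><body><table>");
        foreach (string[] row in rows)
        {
            sb.Append("<tr>");
            foreach (string cell in row)
            {
                // HTML-encode so characters like < or & in scraped text do not break the table.
                sb.Append("<td>").Append(WebUtility.HtmlEncode(cell)).Append("</td>");
            }
            sb.AppendLine("</tr>");
        }
        sb.AppendLine("</table></body></html>");
        return sb.ToString();
    }

    // Saves the table to the given path, for example a file with an .xls extension.
    public static void Save(string path, IEnumerable<string[]> rows)
    {
        File.WriteAllText(path, BuildHtmlTable(rows), Encoding.UTF8);
    }
}
```

A call such as `HtmlTableExport.Save(@"c:\temp\scrapped_data.xls", rows)` would reuse the path from the question. As the answer notes, Excel is expected to warn that the file's format and extension do not match when it is opened.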
