Create Your Own Web Scraper Using Node.js

Want to build your own scraper that pulls data from any website and returns it in JSON format, so you can use it anywhere you like? If so, you are in the right place.

In this article I will show you how to scrape a website for the data you want using Node.js and return that data as JSON, which you can then use, for example, to build an app that runs on live data from the internet.

I will be using Windows 10 x64 and Visual Studio 2015 for this article, and I will scrape a news website.

  • First, set up the IDE: go to the Node.js website and download the pre-built installer. For me, that is the 64-bit Windows installer.

  • After installing it, open Visual Studio and create a new project: Templates > JavaScript > Node.js > Basic Node.js Express 4 Application.

  • Now add two packages to the project’s npm folder: ‘request’ and ‘cheerio’.

  • Uninstall ‘jade’ by right-clicking it, since we don’t need it here. I host my JSON on an Azure cloud service, where jade throws an exception. If you consume the JSON directly in your application, or host it on another service, you don’t have to uninstall jade.

  • Now go to app.js and comment out lines 14 and 15, as we are not using ‘Views’.

  • Also comment out ‘app.use('/', routes);’

  • Change app.use('/users', users); to app.use('/', users);

  • Now go to users.js, where the main work happens. First, load the ‘cheerio’ and ‘request’ modules at the top of the file with require('cheerio') and require('request').

  • Create a variable to hold the URL of the page to scrape:

    var url = "http://www.thenews.com.pk/CitySubIndex.aspx?ID=14";

  • Modify the router.get() function as follows:
    router.get('/', function (req, res) {
        request(url, function (error, response, body) {
            // Report failures to the client instead of leaving the request hanging.
            if (error || response.statusCode !== 200) {
                console.log(error);
                return res.status(500).send({ error: 'Unable to fetch ' + url });
            }
            // Parse the page and send the extracted data back as JSON.
            var data = scrapeDataFromHtml(body);
            res.send(data);
        });
    });
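
The branching inside the callback can also be isolated as a small pure function, which makes it easier to see that every outcome produces exactly one response. This is only a sketch: `buildReply` and the 500 status are illustrative choices, not part of the original project, and the `scrape` parameter stands in for `scrapeDataFromHtml`.

```javascript
// Hypothetical helper (not in the original project): decides what the route
// should send back for each possible outcome of request(url, callback).
function buildReply(error, response, body, scrape) {
  // Network failure: the callback receives an Error and no usable response.
  if (error) {
    return { status: 500, payload: { error: 'Could not reach the source site' } };
  }
  // The site answered, but not with 200 OK.
  if (!response || response.statusCode !== 200) {
    return { status: 500, payload: { error: 'Unexpected status from the source site' } };
  }
  // Success: hand the HTML to the scraping function.
  return { status: 200, payload: scrape(body) };
}

// Possible usage inside the route (assumes Express 4's res.status().send()):
//   var reply = buildReply(error, response, body, scrapeDataFromHtml);
//   res.status(reply.status).send(reply.payload);
```

Keeping the decision separate from Express means it can be checked with plain objects, without starting a server or hitting the network.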

  • Here comes the main and most difficult part: writing the logic that scrapes the website. You have to customize this function for your website and the data you want to fetch. Let’s open the website in a browser and work out the logic.

  • I want to scrape the following data: the news headline, its description, and the link to the full story. This data changes dynamically, and I want to fetch the latest version.

  • To fetch this data I have to study the page’s DOM so I can write jQuery-style selectors for it.

  • I made a DOM tree so I can write the logic to traverse it easily.

  • The text in red marks the nodes I have to reach in a loop to access the data from the website.

  • I will write a function named scrapeDataFromHtml as follows:
    var scrapeDataFromHtml = function (html) {
        var data = {};
        // Load the HTML so it can be queried with jQuery-style selectors.
        var $ = cheerio.load(html);
        var j = 1;
        // Each matching div holds one news item.
        $('div.DetailPageIndexBelowContainer').each(function () {
            var a = $(this);
            // href of the first grandchild element (the link to the full story).
            var fullNewsLink = a.children().children().attr("href");
            // Text of the first child element, with surrounding whitespace removed.
            var headline = a.children().first().text().trim();
            // Text of the last great-grandchild element.
            var description = a.children().children().children().last().text();
            var metadata = {
                headline: headline,
                description: description,
                fullNewsLink: fullNewsLink
            };
            // Store the items under 1-based keys: data[1], data[2], ...
            data[j] = metadata;
            j++;
        });
        return data;
    };
  • This function finds each ‘div’ with the class ‘DetailPageIndexBelowContainer’ and traverses its DOM to fetch the ‘fullNewsLink’, ‘headline’ and ‘description’. It stores these values in an object called ‘metadata’, then adds that object to a second object called ‘data’ on each iteration, so that at the end I can return ‘data’ as JSON. If you only want a single value from a website, you don’t need the loop or the second object: traverse directly to the element and return the single object.
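
Because `data` is a plain object keyed 1, 2, 3 and so on, a consumer that expects a JSON array needs one extra step. The sketch below uses invented sample values (not real scraped output) and a hypothetical helper, `toArray`, to show the shape of the result and one way to turn it into an array.

```javascript
// Sample of the shape scrapeDataFromHtml returns. The values are invented
// placeholders, not real headlines from the site.
var sample = {
  1: { headline: 'First headline', description: 'First summary', fullNewsLink: '/news/1' },
  2: { headline: 'Second headline', description: 'Second summary', fullNewsLink: '/news/2' }
};

// Hypothetical helper: convert the 1-based keyed object into an array,
// keeping the items in the numeric order of their keys.
function toArray(data) {
  return Object.keys(data)
    .sort(function (a, b) { return Number(a) - Number(b); })
    .map(function (key) { return data[key]; });
}

var items = toArray(sample);
console.log(JSON.stringify(items, null, 2));
```

Alternatively, `data` could be declared as `[]` and filled with `data.push(metadata)` inside the loop, which would make the route return an array directly.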

  • Now run it and check the output.

  • And yes! It’s running perfectly and returning the required data in JSON format.
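
One thing worth checking in the output is whether the ‘fullNewsLink’ values are absolute URLs. If the page uses relative hrefs, they can be resolved against the page URL with Node's built-in `URL` class. The following is a minimal sketch: `absoluteLink` is a hypothetical helper, and the relative path in the example is made up for illustration.

```javascript
// Page the article scrapes; used as the base for resolving relative links.
var pageUrl = 'http://www.thenews.com.pk/CitySubIndex.aspx?ID=14';

// Hypothetical helper: resolve an href against the page URL.
// Absolute hrefs pass through unchanged; relative ones become absolute.
// Returns null when the element had no href at all.
function absoluteLink(href, base) {
  if (!href) {
    return null;
  }
  return new URL(href, base).href;
}

// The relative path below is an invented example, not a real page on the site.
console.log(absoluteLink('NewsDetail.aspx?ID=1', pageUrl));
// → http://www.thenews.com.pk/NewsDetail.aspx?ID=1
```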

  • Source Code: https://github.com/umerqureshi93/webscrapper.