Web Scraping Using Python

Introduction

 
Data is one of the most valuable assets for any organization. It helps organizations understand their operational activities, the needs of the market, and competitor information available on the internet, all of which helps them plan for the future. In this article, we are going to learn one of the most in-demand techniques on the internet, one that helps many institutions take their business to the next level: collecting data from a webpage/website, known as "web scraping", using one of the most popular programming languages, Python.
 
Definition
  • Web scraping is the process of extracting HTML data from a webpage/website.
  • It transforms unstructured HTML data into structured data, such as an Excel sheet or a dataset.
  • Let's study this concept with an example: extracting the names of the weblinks available on the home page of the www.c-sharpcorner.com website.
Step 1
 
To start with web scraping, we need two libraries: BeautifulSoup from the bs4 package and request from the urllib package. Import both of these Python packages.
#import packages (libraries)
from bs4 import BeautifulSoup
import urllib.request
Step 2
 
Select the URL to extract its HTML elements.
#target URL
url = "https://www.c-sharpcorner.com"
Step 3
 
Access the content of this webpage and save its HTML in "myUrl" using the urlopen() function from urllib.request.
#use urllib.request to open the URL
myUrl = urllib.request.urlopen(url)
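Note that urlopen() raises an exception when the URL cannot be reached, so it is good practice to wrap the call in a try/except block. The sketch below uses a "data:" URL (which embeds the document directly in the URL string) purely so the example runs without network access; the same try/except pattern applies to the http(s) URL used in this article.

```python
import urllib.request
import urllib.error

# a "data:" URL embeds the HTML in the URL itself, so this sketch needs
# no network; the article itself targets https://www.c-sharpcorner.com
url = "data:text/html,<html><title>demo</title></html>"

try:
    response = urllib.request.urlopen(url)
    # read() returns bytes, so decode them into a string
    html = response.read().decode("utf-8")
    print(html)
except urllib.error.URLError as exc:
    # raised when the URL cannot be reached or does not exist
    print("Could not open URL:", exc)
```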
Step 4
 
Create a BeautifulSoup object to extract the webpage's element data using its various built-in functions.
# soup is a BeautifulSoup object that lets us use all of its built-in functions to extract webpage element data
soup = BeautifulSoup(myUrl, 'html.parser')

# title tag of the page
print(soup.title)

# name of the title tag
print(soup.title.name)

# text inside the title tag
print(soup.title.string)

# navigating upwards: name of the title tag's parent
print(soup.title.parent.name)

# getting a specific tag: the first <p> tag on the page
print(soup.p)

# the prettify() function in BeautifulSoup lets us view how the tags are nested in the document
print(soup.prettify())
Step 5
 
Locate and scrape the target elements. The soup.find_all() function extracts every occurrence of a specific HTML tag, either from the entire webpage or from a specific portion of it.

We need to find the target HTML elements on this webpage, extract them, and store them. Elements on a webpage usually carry a unique HTML "id" or "class" attribute. To check an element's id or class, inspect the element in the browser (right-click it and choose Inspect).
# soup.find_all('div') extracts all the div tags from the given URL
div_list = soup.find_all('div')

# this prints every div tag, each stored as a single element of the div_list list
print(div_list)
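find_all() can also filter by the "id" or "class" attributes mentioned above, by passing a dictionary as its second argument. The sketch below parses a small inline HTML snippet (hypothetical markup, for illustration only) instead of a live webpage, so it is self-contained:

```python
from bs4 import BeautifulSoup

# hypothetical HTML snippet standing in for a real webpage
html = """
<div id="header">Site header</div>
<div class="menu-item">Home</div>
<div class="menu-item">About</div>
"""
soup = BeautifulSoup(html, "html.parser")

# filter div tags by their class attribute
items = soup.find_all("div", {"class": "menu-item"})
names = [d.get_text() for d in items]
print(names)  # ['Home', 'About']

# filter by the id attribute; find() returns only the first match
header = soup.find("div", {"id": "header"})
print(header.get_text())  # Site header
```

The same dictionary-based filter is what Step 6 uses to narrow the search down to the ul tag with class 'headerMenu'.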
Step 6
 
On inspecting the webpage to extract all the weblink names on the www.c-sharpcorner.com website, we located the ul tag with the class value 'headerMenu' as the parent node.
 
To extract all of its child nodes, which hold the weblink names we are after, we target the li tags inside it.
# weblinks[] is a list to store all the weblink names on https://www.c-sharpcorner.com
weblinks = []

# the outer loop extracts every ul tag with the class value 'headerMenu'
for i in soup.find_all('ul', {'class': 'headerMenu'}):
    # the inner loop extracts every li tag inside that ul
    for j in i.find_all('li'):
        # extract the a (anchor) tag inside each li
        per_link = j.find('a')
        # print the weblink name
        print(per_link.get_text())
        # append the weblink name to the weblinks list
        weblinks.append(per_link.get_text())
Output of the above code:
 
TECHNOLOGIES
ANSWERS
LEARN
NEWS
BLOGS
VIDEOS
INTERVIEW PREP
BOOKS
EVENTS
CAREER
MEMBERS
JOBS
 
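The Definition above mentioned transforming the scraped data into a structured dataset. As a final step, the collected names can be written to a CSV file, which Excel opens directly. The sketch below uses a hypothetical subset of the names; in practice the weblinks list built in Step 6 would be used:

```python
import csv

# hypothetical subset of the scraped names; in the article these come
# from the weblinks list built in Step 6
weblinks = ["TECHNOLOGIES", "ANSWERS", "LEARN"]

# write one link name per row, with a header row, to a CSV file
with open("weblinks.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["link_name"])
    for name in weblinks:
        writer.writerow([name])
```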

Summary

 
This article covered the basics of extracting HTML element data from a given URL.
 
Download the source code.