Extract an E-mail From a Webpage

Introduction

 
Web scraping is one of the most widely used techniques for collecting data from the Internet. It is the art of extracting valuable, required data from a webpage so that it can serve as input for further computational operations that generate useful information. In this article, we will learn how to collect email addresses published on any webpage. We are using Python, one of the most popular programming languages, to extract the data, as it has rich libraries that help perform the required activities.
 
The following steps will help you to learn how to find an email on any webpage.
 
Step 1
 
We need to import all the essential libraries for our program.
  • BeautifulSoup: It is a Python library for extracting data out of HTML and XML files.
  • requests: The requests library allows us to send HTTP requests using Python.
  • urllib.parse: This module provides functions for manipulating URLs and their component parts, to either break them down or build them up.
  • collections: It provides specialized container datatypes; here we use its deque (a double-ended queue).
  • re: A module that handles regular expressions.
#import packages
from bs4 import BeautifulSoup
import requests
import requests.exceptions
from urllib.parse import urlsplit
from collections import deque
import re
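Note that BeautifulSoup and requests are third-party packages; if they are not already available in your environment, they can typically be installed with pip, for example pip install beautifulsoup4 requests. The urllib.parse, collections, and re modules ship with the Python standard library.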
Step 2
 
Select the URL of the webpage from which we want to extract email addresses and seed the queue of URLs to be crawled.
# a queue of urls to be crawled
new_urls = deque(['https://www.gtu.ac.in/page.aspx?p=ContactUsA'])
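A deque is used because it supports cheap appends at the back and pops from the front, so pages are crawled in the order they are discovered. A tiny illustration of this behaviour (the example.com URLs below are placeholders, not part of the actual crawl):

from collections import deque

q = deque(['https://example.com/a'])   # placeholder URL, for illustration only
q.append('https://example.com/b')      # newly discovered links go to the back
print(q.popleft())                     # https://example.com/a -- crawled first (FIFO)
print(q.popleft())                     # https://example.com/b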
Step 3
 
Each URL needs to be processed only once, so keep track of the processed URLs.
# a set of urls that we have already crawled
processed_urls = set()
Step 4
 
While crawling the given URL, we may encounter more than one email address, so keep them in a set.
# a set of crawled emails
emails = set()
Step 5

It is time to start crawling. We need to process all the URLs in the queue, maintain the set of crawled URLs, and fetch each page's content. If any error is encountered, move on to the next page.
# process urls one by one until we exhaust the queue
while len(new_urls):
    # move next url from the queue to the set of processed urls
    url = new_urls.popleft()
    processed_urls.add(url)
    # get url's content
    print("Processing %s" % url)
    try:
        response = requests.get(url)
    except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
        # ignore pages with errors
        continue
Step 6

Now we need to extract some base parts of the current URL; these are essential for converting relative links found in the document into absolute ones:
# extract base url and path to resolve relative links
parts = urlsplit(url)
base_url = "{0.scheme}://{0.netloc}".format(parts)
path = url[:url.rfind('/') + 1] if '/' in parts.path else url
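As a quick sanity check, here is what these values look like for the example URL used above (a small sketch; the results follow from urlsplit's documented behaviour):

from urllib.parse import urlsplit

url = 'https://www.gtu.ac.in/page.aspx?p=ContactUsA'
parts = urlsplit(url)
# parts.scheme == 'https', parts.netloc == 'www.gtu.ac.in', parts.path == '/page.aspx'
base_url = "{0.scheme}://{0.netloc}".format(parts)              # 'https://www.gtu.ac.in'
path = url[:url.rfind('/') + 1] if '/' in parts.path else url   # 'https://www.gtu.ac.in/'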
Step 7

From the page content, extract the email address(es) and add them to the emails set.
# extract all email addresses and add them into the resulting set
new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
emails.update(new_emails)
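To see what the regular expression matches, here is a small, self-contained check against a made-up snippet of text (the addresses below are invented purely for illustration):

import re

sample = "Write to info@example.com or Registrar@Example.ac.in for details."
pattern = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+"
found = set(re.findall(pattern, sample, re.I))   # re.I makes the match case-insensitive
print(found)   # contains 'info@example.com' and 'Registrar@Example.ac.in' (set order may vary)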
Step 8

Once the current page is processed, it's time to search for links to other pages and add them to the URL queue (that's the magic of crawling). Get a BeautifulSoup object for parsing the HTML page.
# create a BeautifulSoup object for the html document ("html.parser" is Python's built-in parser)
soup = BeautifulSoup(response.text, "html.parser")
Step 9
 
The soup object contains the parsed HTML. Now find all the anchor tags, read their href attributes, resolve relative links into absolute ones, and enqueue any link that has not been seen or processed yet.
# find and process all the anchors in the document
for anchor in soup.find_all("a"):
    # extract link url from the anchor
    link = anchor.attrs["href"] if "href" in anchor.attrs else ''
    # resolve relative links
    if link.startswith('/'):
        link = base_url + link
    elif not link.startswith('http'):
        link = path + link
    # add the new url to the queue if it was not enqueued nor processed yet
    if link not in new_urls and link not in processed_urls:
        new_urls.append(link)
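To make the two resolution rules concrete, here is a quick sketch using the base_url and path values computed in Step 6 (the two links below are hypothetical examples, not taken from the actual page):

base_url = 'https://www.gtu.ac.in'
path = 'https://www.gtu.ac.in/'

link = '/page.aspx?p=AboutUs'      # hypothetical root-relative link
print(base_url + link)             # https://www.gtu.ac.in/page.aspx?p=AboutUs

link = 'contact.aspx'              # hypothetical document-relative link
print(path + link)                 # https://www.gtu.ac.in/contact.aspx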
Step 10

List all the email addresses extracted from the given URL.
for email in emails:
    print(email)
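Because the snippets above are shown step by step, the loop structure can be hard to follow in isolation. The following is a minimal consolidated sketch of the whole crawler, assembled from Steps 1-10; the max_pages guard is not part of the original steps and is added here only to keep the example crawl bounded.

#import packages
from bs4 import BeautifulSoup
import requests
import requests.exceptions
from urllib.parse import urlsplit
from collections import deque
import re

# a queue of urls to be crawled, a set of processed urls, and a set of crawled emails
new_urls = deque(['https://www.gtu.ac.in/page.aspx?p=ContactUsA'])
processed_urls = set()
emails = set()
max_pages = 50   # not in the original steps; added only to bound the demo crawl

# process urls one by one until we exhaust the queue
while len(new_urls) and len(processed_urls) < max_pages:
    # move next url from the queue to the set of processed urls
    url = new_urls.popleft()
    processed_urls.add(url)
    print("Processing %s" % url)
    try:
        response = requests.get(url)
    except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
        # ignore pages with errors
        continue

    # extract base url and path to resolve relative links
    parts = urlsplit(url)
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    path = url[:url.rfind('/') + 1] if '/' in parts.path else url

    # extract all email addresses and add them into the resulting set
    new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
    emails.update(new_emails)

    # find and process all the anchors in the document
    soup = BeautifulSoup(response.text, "html.parser")
    for anchor in soup.find_all("a"):
        link = anchor.attrs["href"] if "href" in anchor.attrs else ''
        if link.startswith('/'):
            link = base_url + link
        elif not link.startswith('http'):
            link = path + link
        if link not in new_urls and link not in processed_urls:
            new_urls.append(link)

# list all the email addresses extracted during the crawl
for email in emails:
    print(email)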

Summary

 
This article showed how to perform web scraping, in particular how to extract email addresses from an HTML page, using Python packages such as BeautifulSoup, requests, re, urllib.parse, and collections.

