WebScraping using BeautifulSoup in Python (i2tutorials)

Webscraping

Web scraping is the process of extracting data from websites or individual web pages. People generally use web scraping to build a marketing strategy, monitor it, and improve their business accordingly.

Web scraping provides a way to collect information such as prices, images, and other data for analysis. Also, many websites do not provide APIs, so web scraping is often the only way available to get their data.

Scraping is generally considered acceptable as long as you do it responsibly and do not cause any damage, for example by not overloading the site's servers.

In this article, we will discuss web scraping using Beautiful Soup.

Python Libraries Required For Web Scraping

Requests : used for fetching URLs

BeautifulSoup : used for pulling information out of a webpage

Beautiful Soup alone cannot fetch a webpage, which is why we will use Requests and Beautiful Soup together.

Installing The Required Libraries

One of the easiest ways to install packages written in Python is by using pip.

  • pip install requests
  • pip install html5lib
  • pip install bs4

For web scraping with BeautifulSoup you should know some basic HTML. We will start with an example: scraping data from the Worldometers website.

Fetching all the HTML Content

import requests
link = "https://www.worldometers.info/coronavirus#countries"
l=requests.get(link).text
print(l) 

In the above code we specify the URL of the webpage and send an HTTP GET request to it. The text of the response is stored in the variable l, so printing l shows the raw HTML code of the webpage.
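A slightly more defensive version of the fetch step might look like the sketch below. The function name fetch_html, the 10-second timeout, and the User-Agent string are illustrative assumptions and not part of the original example; the timeout and raise_for_status() calls are standard Requests features.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML of `url` as text, raising an error if the request fails."""
    # Assumption: a browser-like User-Agent; some sites reject the default one.
    headers = {"User-Agent": "Mozilla/5.0 (compatible; tutorial-scraper)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text


# Example usage (performs a real network request):
# html = fetch_html("https://www.worldometers.info/coronavirus#countries")
# print(html[:500])
```

Setting a timeout keeps the script from hanging if the server never answers, and checking the status avoids parsing an error page as if it were the real content.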

Parsing a page with BeautifulSoup

from bs4 import BeautifulSoup
import requests

link = "https://www.worldometers.info/coronavirus#countries"
l = requests.get(link).text
# Parse the raw HTML with the html5lib parser
soup = BeautifulSoup(l, 'html5lib')
# Print the parsed HTML with indentation
print(soup.prettify())

output : the HTML of the page, with nested tags indented.

We can use the BeautifulSoup library to parse an HTML document. Once it is parsed, we can print the HTML content of the page nicely formatted by calling the prettify method on the BeautifulSoup object.

Working with HTML tags

soup.title : returns the title tag of the webpage

output :  <title> Coronavirus Update (Live): 3,456,223 Cases and 243,024 Deaths from COVID-19 Virus Pandemic – Worldometer</title>

soup.title.string : returns only the string, without the tags

output : ‘Coronavirus Update (Live): 3,456,223 Cases and 243,024 Deaths from COVID-19 Virus Pandemic – Worldometer’

soup.find_all('a') : returns all the links (<a> tags) present in the webpage

output : a list of every <a> tag in the page
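The three calls above can be tried without any network access on a small, made-up HTML string. This is only a sketch: the sample HTML is invented, and it uses Python's built-in 'html.parser' instead of 'html5lib' so that nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page
html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <a href="/first">First link</a>
    <a href="/second">Second link</a>
  </body>
</html>
"""

# html.parser ships with Python; 'html5lib' would also work if installed
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.title          # the whole <title> tag
title_text = soup.title.string  # only the text inside the tag
links = soup.find_all("a")      # list of every <a> tag

print(title_tag)    # <title>Sample Page</title>
print(title_text)   # Sample Page
print(len(links))   # 2
```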

Since we are interested in scraping data from a table, we first get the HTML content of all the tables:

all_tables = soup.find_all('table')
print(all_tables)

output : It will return all the different table tags in the webpage

We are interested in the country-wise information table. To find the id or class of that table, right-click on the table in the webpage and click Inspect to see its HTML code.

table = soup.find('table',id='main_table_countries_today')

output : It will return the HTML content of the table we are interested in, as we have specified the id of the table from which we want to extract data.

Now that we have the HTML content of the required table, we can extract data from it.
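One common way to pull the data out of a table is to walk its rows and read the text of each cell. The sketch below does that on an invented table: only the id is taken from the example above, while the country names and figures are made up, and 'html.parser' is used so that nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup

# Made-up table; the id matches the one used above, the data is invented
html = """
<table id="main_table_countries_today">
  <tr><th>Country</th><th>Total Cases</th></tr>
  <tr><td><a href="country/a/">Country A</a></td><td>1,200</td></tr>
  <tr><td><a href="country/b/">Country B</a></td><td>350</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="main_table_countries_today")

rows = []
for tr in table.find_all("tr"):
    # Collect header (<th>) and data (<td>) cells, stripping extra whitespace
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

print(rows)
```

Each entry in rows is a list of cell texts, with the header row first.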

To get all the links that are within the table, you can use:

tab_links = table.find_all('a')
print(tab_links)

output : a list of all the <a> tags inside the table

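To get the link of each country from the table, you can read the href attribute of each tag (i.text gives the visible link text instead, which is typically the country name):

for i in tab_links:
    print(i.get('href'))

Output: the href value of each link in the table, one per line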
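The difference between a link's href and its visible text can be seen on a small invented table. In this sketch the country names and paths are made up, and names_and_links is a hypothetical variable name.

```python
from bs4 import BeautifulSoup

# Made-up table with one link per row
html = """
<table id="main_table_countries_today">
  <tr><td><a href="country/a/">Country A</a></td></tr>
  <tr><td><a href="country/b/">Country B</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="main_table_countries_today")

# Pair the visible text of each link with its href attribute
names_and_links = [
    (a.get_text(strip=True), a.get("href")) for a in table.find_all("a")
]

for name, href in names_and_links:
    print(name, href)
```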

You can extract further data from a website in the same way and store it in a CSV file.
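Storing scraped rows in CSV format can be done with Python's built-in csv module. The sketch below is one possible approach: the rows are invented, write_rows is a hypothetical helper name, and an in-memory buffer is used so the example leaves no file behind.

```python
import csv
import io

# Made-up rows in the shape produced by reading a table cell by cell
rows = [
    ["Country", "Total Cases"],
    ["Country A", "1,200"],
    ["Country B", "350"],
]


def write_rows(rows, file_obj):
    """Write a list of rows to an open text file object in CSV format."""
    writer = csv.writer(file_obj)
    writer.writerows(rows)


# Write to an in-memory buffer here; for a real file, use
# open("countries.csv", "w", newline="", encoding="utf-8") instead
buffer = io.StringIO()
write_rows(rows, buffer)
print(buffer.getvalue())
```

Values that contain a comma, such as "1,200", are quoted automatically by the csv writer so that they stay in a single column.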
