How to Scrape Web Pages Using Scrapy & Python
The process of extracting data from websites is called web scraping; it is also loosely referred to as web crawling. The Python program we will use to scrape the data is called a spider. In this tutorial we will walk through an introduction to Scrapy.
Scrapy is a framework written in Python that is used to extract data from websites. It provides a built-in mechanism for extracting data and is maintained by Scrapinghub Ltd.
Installation
We need to start by creating a virtual environment. A virtual environment is simply an isolated installation of Python and its dependencies; it makes dependency management easier and keeps the packages installed per project to a minimum.
Python ships with the built-in venv module for this. If you prefer the third-party virtualenv package, a virtual Python environment builder, install it first:
pip install virtualenv
We are now ready to create our virtual environment and install Scrapy. On Windows, follow the steps below:
py -m venv scrap
cd scrap\Scripts
activate
pip install scrapy
Creating a project
After successfully installing Scrapy in the virtual environment, we need to create a Scrapy project that will hold the code for running spiders. Before you start scraping, enter the directory where you'd like to store your code and run:
scrapy startproject quotes
Once the project is created, a folder with the same name as the project will appear in the current directory, with the following contents:
quotes/
    scrapy.cfg            # deploy configuration file
    quotes/               # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # folder in which all spiders will be kept
            __init__.py
scrapy.Spider
It provides a default start_requests() implementation which sends requests from the start_urls spider attribute and calls the spider's parse method for each of the resulting responses.
It has the following attributes & methods :
name : Name of the spider; it must be unique for each spider.
start_urls : URLs from which the spider will begin to crawl.
start_requests() : When no particular URLs are specified and the spider is opened for scraping, Scrapy calls the start_requests() method.
parse(response) : A method that will be called to handle the response downloaded for each of the requests made.
Write the code for our first spider and save it under the spiders/ directory of your project.
To put your spider to work, go to the project's top-level directory and run the following command in the command prompt:
scrapy crawl name
This command runs the spider with the specified name and sends requests to the site/domain. By default the scraped items are printed to the console; the Storing the Data section below shows how to save them to a file.
Example:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quote"
    start_urls = [
        'http://quotes.toscrape.com/',
    ]

    def parse(self, response):
        all_quotes = response.css('div.quote')
        for quote in all_quotes:
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
Extracting data
The best way to learn how to extract data with Scrapy is by trying selectors in the Scrapy shell, which Scrapy provides for exactly this kind of experimenting. To start it, type scrapy shell in your command line.
Using the shell, you can try selecting elements using CSS or Xpath with the response object.
Now let us try scraping some data from a sample HTML snippet using the shell.
Sample HTML code :
<html>
  <head>
    <base href='http://scrapy.com/' />
    <title>How to scrape</title>
  </head>
  <body>
    <div id='images'>
      <a href='image1.html'>Name : scrapy 1 <br/> <img src='image1.jpg' /></a>
      <a href='image2.html'>Name : scrapy 2 <br/> <img src='image2.jpg' /></a>
      <a href='image3.html'>Name : scrapy 3 <br/> <img src='image3.jpg' /></a>
    </div>
  </body>
</html>
Open the shell; once it is open there will be a response variable, with its attached selector in the response.selector attribute. By looking at the HTML code, we can construct our CSS or XPath selectors.
CSS Selector
A CSS selector picks out a particular HTML tag, or the elements matching particular CSS rules (such as classes and ids), inside the source code.
Xpath Selector
XPath is another selector language; it has a different syntax from CSS and addresses elements by their path in the document tree.
>>> response.css('title::text').get()
'How to scrape'
>>> response.xpath('//title/text()').getall()
['How to scrape']
>>> response.xpath('//title/text()').get()
'How to scrape'
.get() always returns a single result; if there are several matches, the content of the first match is returned, and if there are no matches, None is returned.
.getall() returns a list with all results.
If you want to extract only the first matched element, you can call .get() (or its older alias .extract_first()) on the selector.
For selecting nested data we can chain CSS & XPath selectors together.
>>> response.css('img').xpath('@src').getall()
['image1.jpg', 'image2.jpg', 'image3.jpg']
>>> response.xpath("//div[@id='images']/a/text()").getall()
['Name : scrapy 1', 'Name : scrapy 2', 'Name : scrapy 3']
Different ways to extract the base URL :
>>> response.css('base').attrib['href']
>>> response.xpath('//base/@href').get()
>>> response.css('base::attr(href)').get()
Output :
'http://scrapy.com/'
>>> response.xpath('//a[contains(@href, "image")]/@href').getall()
>>> response.css('a[href*=image]::attr(href)').getall()
Output :
['image1.html', 'image2.html', 'image3.html']
Storing the Data
Storing the scraped data in Scrapy is a simple task. You can store the data in formats such as CSV, JSON, JSON lines, or XML by using the following command :
scrapy crawl name -o filename.csv
This will generate a filename.csv file containing all the scraped data.
Similarly,
scrapy crawl name -o filename.json
It will generate a JSON file.
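Beyond the -o flag, Scrapy can also hand each scraped item to an item pipeline, enabled via the ITEM_PIPELINES setting in settings.py. A pipeline is just a plain class with a process_item() method; below is a minimal sketch of a JSON-lines writer, where the output file name items.jl is a hypothetical choice:

```python
import json

class JsonWriterPipeline:
    """Minimal item pipeline sketch: writes each item as one JSON line.

    Enable it in settings.py, e.g.:
        ITEM_PIPELINES = {'quotes.pipelines.JsonWriterPipeline': 300}
    """

    def open_spider(self, spider):
        # Called once when the spider starts; 'items.jl' is a hypothetical name.
        self.file = open('items.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # Write the item as one JSON line, then pass it on unchanged
        # so any later pipelines also receive it.
        self.file.write(json.dumps(dict(item)) + '\n')
        return item
```

Returning the item from process_item() is what lets several pipelines be chained, each doing one job (validation, deduplication, storage).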