
How to Scrape WebPages using Scrapy & Python

The process of extracting data from a website is called web scraping; it is also sometimes referred to as web crawling. The Python program we will use to scrape the data is called a spider. In this tutorial we will walk through an introduction to Scrapy.

Scrapy is a framework written in Python that is used to extract data from websites. Scrapy provides a built-in mechanism for extracting data. It is maintained by Scrapinghub Ltd.


Installation

We need to start by creating a virtual environment. A virtual environment is nothing but an isolated installation of Python and its dependencies; it makes dependencies easier to manage and keeps the set of installed packages to a minimum per project.

For that we need to install virtualenv, which is a tool for building isolated virtual Python environments.

pip install virtualenv


After installing virtualenv we are ready to create our virtual environment and install Scrapy. Follow the steps below:

virtualenv scrap
cd scrap\Scripts
activate
pip install scrapy
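
The steps above are for Windows; on Linux or macOS the activation script lives in a bin directory instead, so the equivalent (assuming the same environment name) would be:

virtualenv scrap
source scrap/bin/activate
pip install scrapy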


Creating a project

After Scrapy has been successfully installed in the virtual environment, we need to create a Scrapy project, which will hold the code for running our spiders. Before you start scraping, enter the directory where you'd like to store your code and run:

scrapy startproject quotes 


Once the project is created, a folder with the same name as the project will appear in the current directory, with the following contents:

quotes/
    scrapy.cfg            # deploy configuration file
    quotes/               # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # folder in which all spiders will be kept
            __init__.py
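
The items.py file, for instance, is where item classes can be declared to give the scraped data a fixed structure. A minimal sketch for this quotes project (the QuoteItem class and its fields are illustrative additions, not part of the generated file) might look like:

import scrapy

class QuoteItem(scrapy.Item):
    # one Field per piece of data the spider will yield
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()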


scrapy.Spider

It provides a default start_requests() implementation which sends requests from the start_urls spider attribute and calls the spider's parse() method for each of the resulting responses.


It has the following attributes & methods :


name : Name of the spider; it must be unique for each spider.

start_urls : The URLs from which the spider will begin to crawl.

start_requests() : When no particular URLs are specified and the spider is opened for scraping, Scrapy calls the start_requests() method (see the sketch after this list for how it can be overridden).

parse(response) : A method that will be called to handle the response downloaded for each of the requests made.
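
A rough sketch of a spider that overrides start_requests() instead of relying on start_urls (the spider name and URLs below are illustrative, reusing the quotes.toscrape.com site from the example further down):

import scrapy

class ManualQuotesSpider(scrapy.Spider):
    name = "quotes_manual"  # illustrative name

    def start_requests(self):
        # build the initial requests yourself instead of listing start_urls
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # handle each downloaded response here
        self.logger.info('Visited %s', response.url)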

Write the code for your first spider and save it under the spiders/ directory of your project.

To put your spider to work, go to the project's top-level directory and run the following command in the command prompt:

scrapy crawl name


This command runs the spider with the specified name and sends requests to the site/domain. By default the scraped items are printed to the log output; they can be saved to a file with the -o option, as shown later.


Example:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quote"
    start_urls = [
        'http://quotes.toscrape.com/',
    ]

    def parse(self, response):
        all_quotes = response.css('div.quote')
        for quote in all_quotes:
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
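
Save this spider as, for example, quotes_spider.py (the file name is up to you) inside the spiders/ directory, then run it from the project's top-level directory:

scrapy crawl quote

Note that the name passed to scrapy crawl is the spider's name attribute ("quote" here), not the file name.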


Extracting data

The best way to learn how to extract data with Scrapy is by trying selectors in the Scrapy shell. Scrapy provides an interactive shell of its own that you can use to experiment. To start it, type scrapy shell in your command line, optionally followed by a URL to fetch.
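
For instance, a quick session against the quotes site used earlier might look like the following (the output shown is simply whatever the live page's title happens to be, so treat it as illustrative):

scrapy shell 'http://quotes.toscrape.com/'
>>> response.css('title::text').get()
'Quotes to Scrape'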

Using the shell, you can try selecting elements using CSS or XPath on the response object.

Now let us try scraping some data from sample HTML code using shell.


Sample HTML code :

<html>
 <head>
  <base href='http://scrapy.com/' />
  <title>How to scrape</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name : scrapy 1 <br/><img src='image1.jpg' /></a>
   <a href='image2.html'>Name : scrapy 2 <br/><img src='image2.jpg' /></a>
   <a href='image3.html'>Name : scrapy 3 <br/><img src='image3.jpg' /></a>
  </div>
 </body>
</html>


Open the shell. Once it is open, there will be a response variable with its attached selector in the response.selector attribute. By looking at the HTML code, we can construct our CSS or XPath selectors.
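
Since this sample HTML is not served from a real website, one convenient way to experiment with it is to save it into a local file (sample.html below is just an illustrative name) and point the shell at that file; the Scrapy shell accepts local file paths as well as URLs:

scrapy shell ./sample.html

Once the shell starts, the response variable will hold the parsed file and the selectors below can be tried against it.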


CSS Selector

A CSS selector picks out a particular HTML tag, class, or id in the source code, using the same syntax you would use to style elements with CSS.
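
For example, with the sample page loaded in the shell, a CSS selector can pull out the link text inside the images div (the whitespace around each result may differ slightly from what is shown here):

>>> response.css('div#images a::text').getall()
['Name : scrapy 1', 'Name : scrapy 2', 'Name : scrapy 3']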


XPath Selector

XPath is a query language for selecting nodes in XML and HTML documents; Scrapy supports it as an alternative to CSS selectors, and in fact CSS selectors are converted to XPath under the hood.

>>> response.css('title::text').get()
'How to scrape'
>>> response.xpath('//title/text()').getall()
['How to scrape']
>>> response.xpath('//title/text()').get()
'How to scrape'


.get() always returns a single result; if there are several matches, the content of the first match is returned, and if there are no matches, None is returned.

.getall() returns a list with all the results.

If you want to extract only the first matched element, you can call .get() (or its older alias .extract_first()) on the selector.

For selecting nested data we can chain CSS and XPath selectors on the same selection.

>>> response.css('img').xpath('@src').getall()
['image1.jpg', 'image2.jpg', 'image3.jpg']

>>> response.xpath("//div[@id='images']/a/text()").getall()
['Name : scrapy 1', 'Name : scrapy 2', 'Name : scrapy 3']


Different ways to extract base URL :

>>> response.css('base').attrib['href']
>>> response.xpath('//base/@href').get()
>>> response.css('base::attr(href)').get()

Output :

'http://scrapy.com/'


>>> response.xpath('//a[contains(@href, "image")]/@href').getall()
>>> response.css('a[href*=image]::attr(href)').getall()

Output :

['image1.html', 'image2.html', 'image3.html']


Storing the Data

Storing the scraped data in Scrapy is a simple task. You can store the data in formats such as CSV, JSON, JSON Lines, or XML by using the following commands :

scrapy crawl name -o filename.csv

This will generate a filename.csv file containing all the scraped data.

Similarly,

scrapy crawl name -o filename.json

This will generate a JSON file.
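
As a side note (assuming a reasonably recent Scrapy version, 2.1 or later), the same feed exports can be configured once in settings.py instead of passing -o on every run; the file name quotes.json below is just an example:

FEEDS = {
    'quotes.json': {'format': 'json'},
}

With this setting in place, scrapy crawl name will write the items to quotes.json automatically. On the command line, the -O option (capital O) can also be used to overwrite an existing output file instead of appending to it.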
