Web Scraping with Python: All You Need to Get Started

An introductory guide to Python web scraping with a step-by-step tutorial.


Python is probably the most popular language for machine learning and data analysis. But it’s also a great choice for web data extraction. If you work with data, adding this skill to your portfolio makes a lot of sense, and it can open up profitable opportunities.

This guide gives you everything you need to start web scraping with Python. It explains why Python is worth your time, introduces the essential libraries, and points you to websites where you can practice. You’ll also find a step-by-step tutorial for building a web scraper that you can replicate on your own computer. Let’s begin!


What Is Web Scraping in Python?

Web scraping refers to the process of downloading data from web pages and structuring it for further analysis. You can scrape by hand, but it’s much faster to write an automated script that does it for you.

With this approach, you don’t exactly download a web page as people see it. Rather, you extract its underlying HTML skeleton and work from there. If you’re not sure what that is, try right-clicking on this page and selecting Inspect. You should now see it as a web scraper does:

The page we’re on, as web scrapers see it.

Where does Python come in? Python provides the libraries and frameworks you need to successfully locate, download, and structure data from the web – in other words, scrape it.

Why Choose Python for Web Scraping

If you don’t have much programming experience – or know another programming language – you may wonder if it’s worth learning Python over the alternatives. Here are a few reasons why you should consider it:

  • Simple language. Python’s syntax is relatively human-readable and easy to understand at a glance. What’s more, you don’t need to compile code, which makes it simple to debug and experiment.
  • Great tools for web scraping. Python has some of the staple libraries for data collection, such as Requests with over 200 million monthly downloads.
  • Strong community. You won’t have issues getting help or finding solutions on platforms like Stack Overflow.
  • Popular choice for data analysis. Python ties in beautifully with the broader ecosystem of data analysis (Pandas, Matplotlib) and machine learning (TensorFlow, PyTorch).

Is Python the best language for web scraping? I wouldn’t make such sweeping statements. Node.js also has a very strong ecosystem, and you could just as well scrape using Java, PHP, or even cURL. But if you have no strong reasons to do so, you won’t regret going with Python.

Steps to Build a Python Web Scraper

Suppose you want to write a Python web scraper. Where do you start? These three steps should get you on track.

Step 1: Pick Your Web Scraping Libraries

There’s no shortage of Python web scraping libraries. But if this is your first web scraping project, I strongly suggest starting with Requests and Beautiful Soup.

Requests is an HTTP client that lets you download pages. The basic configuration only requires a few lines of code, and you can customize the request to a great extent, adding headers, cookies, and other parameters as you move on to more complex targets.
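Here’s a minimal sketch of what that looks like in practice (the URL is just a placeholder):

import requests

r = requests.get("https://example.com", timeout=10)  # download the page
print(r.status_code)   # 200 means the request succeeded
print(r.text[:200])    # the first 200 characters of the HTML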

Beautiful Soup is a data parsing library – it extracts data from the HTML code you’ve downloaded and transforms it into a structured format. Beautiful Soup’s syntax is simple to grasp, while the tool itself is powerful, well documented, and lenient for beginners.
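To give you a feel for the syntax, here’s a minimal sketch that parses a hard-coded HTML snippet:

from bs4 import BeautifulSoup

html = "<html><body><h1>Books</h1><p class='intro'>Scrape me!</p></body></html>"
soup = BeautifulSoup(html, "html.parser")
print(soup.h1.string)                    # Books
print(soup.find(class_="intro").string)  # Scrape me!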

Once you get the hang of the first two, you should also learn to work with a headless browser library. These are becoming increasingly necessary as web developers adopt JavaScript frameworks and build dynamic single-page applications. While imperfect, Selenium is a good library to start with, thanks to its prevalence.
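For reference, here’s a minimal headless sketch (it assumes you have Chrome installed; recent Selenium versions download the matching driver automatically):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)
driver.get("https://example.com")       # a placeholder URL
html = driver.page_source               # the HTML after JavaScript has run
driver.quit()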

Finally, as you increase your project’s scope, you should look into proxies. They offer the easiest way to avoid blocks by giving you more IP addresses. At first, you may consider a free proxy list (because it’s free!), but I recommend investing in paid rotating proxies.
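With Requests, routing traffic through a proxy is a one-parameter change (a sketch – the credentials and endpoint below are hypothetical and should come from your provider):

import requests

proxies = {
    "http": "http://username:password@proxy.example.com:8080",
    "https": "http://username:password@proxy.example.com:8080",
}
r = requests.get("https://example.com", proxies=proxies, timeout=10)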

Step 2: Devise a Web Scraping Project

The second step is to decide on your web scraping target and project parameters.

If you have no business use case in mind, it can be hard to find worthwhile ideas. I recommend practicing with dummy websites. They’re specially designed to be scraped, so you’ll be able to try out various techniques in a safe environment. You can find several such websites in our list of websites to practice your web scraping skills.

If you’d rather scrape real targets, it’s a good idea to begin with something simple. Popular websites like Google and Amazon offer valuable information, but you’ll encounter serious web scraping challenges like CAPTCHAs as soon as you start to scale. They may be hard to tackle without experience.

In any case, there are guidelines you should follow to avoid trouble. Try not to overload the server. And be very cautious about scraping data behind a login – it’s gotten multiple companies sued. You can find more advice in our article on web scraping best practices.

In the end, your project should have at least these basic parameters: a target website, the list of URLs you want to scrape from it, and the data points you’re interested in.

Step 3: Write the Script

The third step is to build your web scraper. You can use any code editor you’re comfortable with, such as Visual Studio Code, or even your operating system’s plain text editor.

A blank Visual Studio Code window – your blank canvas.

From this point on, everything depends on your web scraping project. To prevent this guide from feeling like the infamous “how to draw an owl” tutorial, we’ll build a simple web scraper to help you understand the basic principles behind web data extraction.

Python Web Scraping Tutorial

Imagine that you run an online book business. Your main competitor sells books at books.toscrape.com, and you’d like to learn more about its catalogue.

The main page of books.toscrape.com.

There are 1,000 books in total, with 20 books per page. Copying all 50 pages by hand would be madness, so we’ll build a simple web scraper to do it for us. It’ll collect the title, price, rating, and availability of your competitor’s books and write this information into a CSV file.

Prerequisites

The tools we’ll need are:

  • Python 3, if it’s not already on your computer, and Pip to install the necessary libraries.
  • Beautiful Soup. You can add it by running pip install beautifulsoup4 in your operating system’s terminal.
  • Requests. You can add it by running pip install requests.

Importing the Libraries

The first thing we’ll need to do is to import Requests and Beautiful Soup:

from bs4 import BeautifulSoup
import requests

We’ll need one more module to write the scraped data to a CSV file – csv, which ships with Python:

import csv

Downloading the Page

Now, it’s time to download the page. The Requests library makes it easy – we just need to write two lines of code:

url = "http://books.toscrape.com/"
r = requests.get(url)

Some websites may not let you off so easily – you may need to send a different user agent and jump through other hoops. (A talk at the OxyCon conference called How to Scrape the Web of Tomorrow does a very entertaining job of presenting them.) But let’s keep it simple for now.
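For instance, sending a browser-like User-Agent is often the first hoop (a sketch – the header string is purely illustrative):

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
r = requests.get(url, headers=headers)
r.raise_for_status()  # stop early if the download failed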

Extracting the Relevant Data Points

Now, it’s time to extract the data we need with Beautiful Soup. We’ll first create a Beautiful Soup object for the page we scraped:

soup = BeautifulSoup(r.content, "html.parser")

Then, we’ll need to figure out where particular data points are located. We wanted four elements: the book title, price, rating, and availability. After inspecting the page, we can see they’re all under a class called product_pod:


We could try extracting the whole class:

element = soup.find_all(class_="product_pod")
print(element)

But that returns way too much (and way too messy!) data. So, we’ll need to be more specific. There are 20 books on this page; let’s create a loop over the relevant elements of each product_pod:
for element in soup.find_all(class_='product_pod'):

Now, let’s extract the book title. It’s nested in an <a> tag inside the h3 heading. We can simply access h3 and strip away unnecessary data (such as the URL) by specifying that we want a string:

book_title = element.h3.string 
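One caveat: on this site, long titles are truncated in the link text. If you inspect the <a> tag, you’ll see the full title stored in its title attribute, so you can grab that instead:

book_title = element.h3.a['title']  # the full, untruncated title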

The book price is under a class called price_color. We’ll extract it without the pound symbol:

book_price = element.find(class_='price_color').string.replace('£','')

The book ratings are under the p tag. They’re trickier to get because the rating is stored in the class name itself, and the element has two classes (star-rating and the rating word, e.g. One) when we only need the second. We can solve this by extracting the class attribute as a list and taking its second item:

book_rating = element.p['class'][1]
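If you’d rather work with numbers than words, a small lookup dictionary converts them (a sketch – book_rating_num is a name introduced here, not part of the tutorial’s output):

ratings = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}
book_rating_num = ratings.get(book_rating)  # e.g. 'Three' -> 3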

Finally, there’s the stock availability. It’s under a class called instock. We’ll extract the text and strip away unnecessary characters (mostly blank spaces):

book_stock = element.find(class_="instock").get_text().strip()

Here’s the full loop:

for element in soup.find_all(class_='product_pod'):
    book_title = element.h3.string
    # getting the price and ridding ourselves of the pound sign
    book_price = element.find(class_='price_color').string.replace('£', '')
    # the <p> tag has two classes: star-rating and the rating word
    # (e.g. 'One'), so we only take the second one
    book_rating = element.p['class'][1]
    # finding availability
    book_stock = element.find(class_="instock").get_text().strip()
    # we can also use:
    # book_stock = element.select_one(".instock").get_text().strip()

    # print out to double-check
    print(book_title)
    print(book_price)
    print(book_rating)
    print(book_stock)

Don’t be afraid to play around with the code. For example, try removing [1] from the book_rating line. Or delete .strip() and see what happens. Beautiful Soup has many features, so there’s rarely just one proper way to extract content. Make Beautiful Soup’s documentation your friend.

Exporting the Output to CSV

At this point, our script returns all the data points we wanted, but they’re not very easy to work with. Let’s export the output to a CSV file.

First, before the loop, we’ll create an empty list:

books = []

Then, at the end of the loop body, we’ll append our data points to the list:

books.append({
    'title': book_title,
    'price': book_price,
    'rating': book_rating,
    'stock': book_stock,
    # 'url': book_url – extracting the URL is left for you to figure out
})

Finally, we’ll write the data to a CSV file:

with open("books_output.csv", "a") as f:     
    for book in books:         
        f.write(f"{book['title']},{book['price']},{book['rating']},{book['stock']},{book['url']}\n")
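If you’d like a header row, csv.DictWriter offers a tidy alternative that maps the dictionary keys to columns (a sketch using the same keys as above):

with open("books_output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'price', 'rating', 'stock'])
    writer.writeheader()   # write the column names once
    writer.writerows(books)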

This gives us a nicely formatted CSV file (minus the URL – it’s on you to figure out how to extract it!).


Here’s the complete script:

from bs4 import BeautifulSoup
import requests
import csv

url = "http://books.toscrape.com/"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

# books will be a list of dicts
books = []

for element in soup.find_all(class_='product_pod'):
    book_title = element.h3.string
    # getting the price and ridding ourselves of the pound sign
    book_price = element.find(class_='price_color').string.replace('£', '')
    # the <p> tag has two classes: star-rating and the rating word
    # (e.g. 'One'), so we only take the second one
    book_rating = element.p['class'][1]
    # finding availability
    book_stock = element.find(class_="instock").get_text().strip()
    # we can also use:
    # book_stock = element.select_one(".instock").get_text().strip()

    books.append({
        'title': book_title,
        'price': book_price,
        'rating': book_rating,
        'stock': book_stock,
        # 'url': book_url – extracting the URL is left for you to figure out
    })

# write it to a CSV file
# "a" opens the file if it exists and appends to it;
# "w" would overwrite the file if it exists
# not writing a header row in this case
with open("books_output.csv", "a", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for book in books:
        writer.writerow([book['title'], book['price'], book['rating'], book['stock']])

Next Steps

Wait, that’s only one page. Weren’t we supposed to scrape all 50?

That’s right. But at this point, you should start getting the hang of it. So, why not try scraping the other pages yourself? You can use the following tutorial for guidance:

Scraping Multiple Pages with Beautiful Soup

A step-by-step tutorial showing how to get the URLs of multiple pages and scrape them.
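If you want a head start, the loop below sketches the idea (it assumes the catalogue/page-N.html URL pattern, which you should verify in your browser):

for page in range(1, 51):  # the catalogue spans 50 pages
    page_url = f"http://books.toscrape.com/catalogue/page-{page}.html"
    r = requests.get(page_url)
    soup = BeautifulSoup(r.content, "html.parser")
    # ...run the same product_pod loop as above on each page...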

It would also be useful to get the details of individual books. Why not try scraping them as well? It’ll require extracting data from a table; here’s another tutorial to help you. You can find the full list of guides in our knowledge base.

Extracting Data from a Table with Beautiful Soup

A step-by-step guide showing how to scrape data from a table.
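As a hint, each book’s detail page contains a product information table, and iterating over its rows looks roughly like this (a sketch – the URL points to one example book, and the table layout is worth confirming in your browser):

r = requests.get("http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html")
soup = BeautifulSoup(r.content, "html.parser")
for row in soup.find("table").find_all("tr"):
    print(row.th.string, row.td.string)  # e.g. the UPC, price, and availability rows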
