Web Scraping with Python: All You Need to Get Started
An introductory guide to Python web scraping with a step-by-step tutorial.

Python is probably the most popular language for machine learning and data analysis. But it’s also a great choice for web data extraction. Adding this skill to your portfolio makes a lot of sense if you work with data, and it can also open up profitable opportunities.
This guide will give you all you need to start web scraping with Python. It explains why Python is worth your time, introduces the main libraries, and points you to websites where you can practice. You’ll also find a step-by-step tutorial for building a web scraper that you can replicate on your own computer. Let’s begin!
Contents
- What Is Web Scraping in Python?
- Why Choose Python for Web Scraping
- Steps to Build a Python Web Scraper
- Python Web Scraping Tutorial
What Is Web Scraping in Python?
Web scraping refers to the process of downloading data from web pages and structuring it for further analysis. You can scrape by hand, but it’s much faster to write an automated script that does it for you.
With this approach, you don’t exactly download a web page as people see it. Rather, you extract its underlying HTML skeleton and work from there. If you’re not sure what that is, try right-clicking this page and selecting Inspect. You should now see it as a web scraper does.
Where does Python come in? Python provides the libraries and frameworks you need to successfully locate, download, and structure data from the web – in other words, scrape it.
Why Choose Python for Web Scraping
If you don’t have much programming experience – or know another programming language – you may wonder if it’s worth learning Python over the alternatives. Here are a few reasons why you should consider it:
- Simple language. Python’s syntax is relatively human-readable and easy to understand at a glance. What’s more, you don’t need to compile code, which makes it simple to debug and experiment.
- Great tools for web scraping. Python offers staple libraries for data collection, such as Requests, which sees over 200 million monthly downloads.
- Strong community. You won’t have issues getting help or finding solutions on platforms like Stack Overflow.
- Popular choice for data analysis. Python ties in beautifully with the broader ecosystem of data analysis (Pandas, Matplotlib) and machine learning (TensorFlow, PyTorch).
Is Python the best language for web scraping? I wouldn’t make such sweeping statements. Node.js also has a very strong ecosystem, and you could just as well scrape using Java, PHP, or even cURL. But if you have no strong reasons to do so, you won’t regret going with Python.
Steps to Build a Python Web Scraper
Suppose you want to write a Python web scraper. Where do you start? These three steps should get you on track.
Step 1: Pick Your Web Scraping Libraries
There’s no shortage of Python web scraping libraries. But if this is your first web scraping project, I strongly suggest starting with Requests and Beautiful Soup.
Requests is an HTTP client that lets you download pages. The basic configuration only requires a few lines of code, and you can customize the request to a great extent, adding headers, cookies, and other parameters as you move on to more complex targets.
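As a quick sketch of what that customization looks like (the header, cookie, and timeout values below are only illustrations, not anything a real site requires):

import requests

r = requests.get(
    "http://books.toscrape.com/",
    headers={"Accept-Language": "en-US"},  # an example custom header
    cookies={"session": "123"},            # an example cookie
    timeout=10,                            # give up after 10 seconds
)
print(r.status_code)  # 200 means the request succeeded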
Beautiful Soup is a data parsing library – it extracts data from the HTML code you’ve downloaded and transforms it into a structured format. Beautiful Soup’s syntax is simple to grasp, while the tool itself is powerful, well documented, and lenient for beginners.
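For a taste of that syntax, here’s a minimal example that parses a snippet of HTML:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p class='title'>Sapiens</p>", "html.parser")
print(soup.find(class_="title").string)  # prints: Sapiens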
Once you get the hang of the first two, you should also learn to work with a headless browser library. These are becoming increasingly necessary as web developers adopt JavaScript frameworks and build dynamic single-page applications. While imperfect, Selenium is a good library to start with, thanks to its prevalence.
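A minimal sketch of headless scraping, assuming Selenium 4 and a local Chrome installation (Selenium manages the driver itself these days):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)
driver.get("http://books.toscrape.com/")
html = driver.page_source  # the HTML after JavaScript has executed
driver.quit()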
Finally, as you increase your project’s scope, you should look into proxies. They offer the easiest way to avoid blocks by giving you more IP addresses. At first, you may consider a free proxy list (because it’s free!), but I recommend investing in paid rotating proxies.
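With Requests, routing traffic through a proxy takes one extra argument – the endpoint and credentials below are placeholders for your provider’s details:

import requests

proxies = {
    "http": "http://user:password@proxy.example.com:8080",
    "https": "http://user:password@proxy.example.com:8080",
}
r = requests.get("http://books.toscrape.com/", proxies=proxies)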
Step 2: Devise a Web Scraping Project
The second step is to decide on your web scraping target and project parameters.
If you have no business use case in mind, it can be hard to find worthwhile ideas. I recommend practicing with dummy websites. They’re specially designed to be scraped, so you’ll be able to try out various techniques in a safe environment. You can find several such websites in our list of websites to practice your web scraping skills.
If you’d rather scrape real targets, it’s a good idea to begin with something simple. Popular websites like Google and Amazon offer valuable information, but you’ll encounter serious web scraping challenges like CAPTCHAs as soon as you start to scale. They may be hard to tackle without experience.
In any case, there are guidelines you should follow to avoid trouble. Try not to overload the server. And be very cautious about scraping data behind a login – it’s gotten multiple companies sued. You can find more advice in our article on web scraping best practices.
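For example, a simple way to avoid overloading the server is to pause between requests – the URL list here is just a hypothetical stand-in for your targets:

import time
import requests

urls = ["http://books.toscrape.com/catalogue/page-1.html",
        "http://books.toscrape.com/catalogue/page-2.html"]

for url in urls:
    r = requests.get(url)
    time.sleep(1)  # wait a second between requests to go easy on the server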
In the end, your project should have at least these basic parameters: a target website, a list of URLs you want to scrape from it, and the data points you’re interested in.
Step 3: Write the Script
The third step is to build your web scraper. You can use any code editor you’re comfortable with, such as Visual Studio Code, or even your operating system’s text editor.
From this point on, everything depends on your web scraping project. To prevent this guide from feeling like the infamous owl drawing tutorial, we’ll build a simple web scraper to help you understand the basic principles behind web data extraction.
Python Web Scraping Tutorial
Imagine that you run an online book business. Your main competitor sells books at books.toscrape.com, and you’d like to learn more about its catalogue.
There are 1,000 books in total, with 20 books per page. Copying all 50 pages by hand would be madness, so we’ll build a simple web scraper to do it for us. It’ll collect the title, price, rating, and availability of your competitor’s books and write this information into a CSV file.
Prerequisites
The tools we’ll need are:
- Python 3, if it’s not already on your computer, and Pip to install the necessary libraries.
- Beautiful Soup. You can add it by running pip install beautifulsoup4 in your operating system’s terminal.
- Requests. You can add it by running pip install requests.
Importing the Libraries
The first thing we’ll need to do is to import Requests and Beautiful Soup:
from bs4 import BeautifulSoup
import requests
We’ll need one more library to add the scraped data to a CSV file:
import csv
Downloading the Page
Now, it’s time to download the page. The Requests library makes it easy – we just need to write two lines of code:
url = "http://books.toscrape.com/"
r = requests.get(url)
Some websites may not let you off so easily – you may need to send a different user agent and jump through other hoops. (A talk at the OxyCon conference called How to Scrape the Web of Tomorrow does a very entertaining job of presenting them.) But let’s keep it simple for now.
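For reference, sending a browser-like user agent with Requests is just another header – the string below is only an example:

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
r = requests.get(url, headers=headers)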
Extracting the Relevant Data Points
Now, it’s time to extract the data we need with Beautiful Soup. We’ll first create a Beautiful Soup object for the page we scraped:
soup = BeautifulSoup(r.content, "html.parser")
Then, we’ll need to figure out where particular data points are located. We want four elements: the book title, price, rating, and availability. After inspecting the page, we can see they’re all under a class called product_pod.
We could try extracting the whole class:
element = soup.find_all(class_="product_pod")
print(element)
But that would print one big, hard-to-read chunk of HTML. A better approach is to loop through each product_pod:
for element in soup.find_all(class_='product_pod'):
Now, let’s extract the book title. It’s nested in the h3 tag, under the <a> tag. We can simply use h3 and strip away unnecessary data (such as the URL) by specifying that we want a string:
book_title = element.h3.string
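One caveat: on this site, long titles are shortened with an ellipsis in the link text. If you inspect the <a> tag, you’ll notice the full title sits in its title attribute, so you could grab that instead:

book_title = element.h3.a['title']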
The book price is under a class called price_color. We’ll extract it without the pound symbol:
book_price = element.find(class_='price_color').string.replace('£','')
The book ratings are under the p tag. They’re trickier to get because the rating is encoded in the tag’s class name, which consists of two words (star-rating and the rating itself, e.g. Three) when we only need one. We can solve this by extracting the class as a list:
book_rating = element.p['class'][1]
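This gives us a word like One or Three. If you’d rather work with numbers, a small mapping of our own does the trick:

# our own lookup table, not something that exists on the page
rating_words = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}
book_rating_num = rating_words.get(book_rating, 0)  # 0 if the word is unknown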
Finally, there’s the stock availability. It’s under a class called instock. We’ll extract the text and strip away unnecessary elements (mostly blank spaces):
book_stock = element.find(class_="instock").get_text().strip()
Here’s the full loop:
for element in soup.find_all(class_='product_pod'):
    # the title is nested in the <a> tag inside <h3>
    book_title = element.h3.string
    # getting the price and ridding ourselves of the pound sign
    book_price = element.find(class_='price_color').string.replace('£', '')
    # getting the rating: the element's class has two parts, star-rating
    # and the rating word (e.g. 'One'), so we only take the second one
    book_rating = element.p['class'][1]
    # finding availability
    book_stock = element.find(class_="instock").get_text().strip()
    # we could also use:
    # book_stock = element.select_one(".instock").get_text().strip()
    # extracting the URL is left for you – an empty placeholder for now
    book_url = ""
    # print out to double-check
    print(book_title)
    print(book_url)
    print(book_price)
    print(book_rating)
    print(book_stock)
Don’t be afraid to play around with the code. For example, try removing [1] from the book_rating line. Or delete .strip() and see what happens. Beautiful Soup has many features, so there’s rarely just one proper way to extract content. Make Beautiful Soup’s documentation your friend.
Exporting the Output to CSV
At this point, our script returns all the data points we wanted, but they’re not very easy to work with. Let’s export the output to a CSV file.
First, we’ll create an empty list object before the loop:
books = []
Then, still inside the loop, we’ll append our data points to the list:
books.append({
    'title': book_title,
    'price': book_price,
    'rating': book_rating,
    'stock': book_stock,
    'url': book_url
})
Finally, we’ll write the data to a CSV file. This is where the csv module we imported earns its keep – if a title contains a comma, it handles the quoting for us instead of breaking the file:
with open("books_output.csv", "a", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'price', 'rating', 'stock', 'url'])
    for book in books:
        writer.writerow(book)
This gives us a nicely formatted CSV file (minus the URL – it’s on you to figure out how to extract it!).
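As a side note, if you’re already using Pandas (mentioned earlier), the same export is a one-liner – a sketch, not part of our script:

import pandas as pd

pd.DataFrame(books).to_csv("books_output.csv", index=False)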
Here’s the complete script:
from bs4 import BeautifulSoup
import requests
import csv

url = "http://books.toscrape.com/"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

# books will be a list of dicts
books = []

for element in soup.find_all(class_='product_pod'):
    # the title is nested in the <a> tag inside <h3>
    book_title = element.h3.string
    # getting the price and ridding ourselves of the pound sign
    book_price = element.find(class_='price_color').string.replace('£', '')
    # getting the rating: the element's class has two parts, star-rating
    # and the rating word (e.g. 'One'), so we only take the second one
    book_rating = element.p['class'][1]
    # finding availability
    book_stock = element.find(class_="instock").get_text().strip()
    # we could also use:
    # book_stock = element.select_one(".instock").get_text().strip()
    # extracting the URL is left for you – an empty placeholder for now
    book_url = ""
    books.append({
        'title': book_title,
        'price': book_price,
        'rating': book_rating,
        'stock': book_stock,
        'url': book_url
    })

# write it to a csv file
# "a" appends to the file if it exists and creates it otherwise;
# "w" would overwrite an existing file instead
# we're not writing a header row in this case
with open("books_output.csv", "a", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'price', 'rating', 'stock', 'url'])
    for book in books:
        writer.writerow(book)
Next Steps
Wait, that’s only one page. Weren’t we supposed to scrape all 50?
That’s right. But at this point, you should start getting the hang of it. So, why not try scraping the other pages yourself? You can use the following tutorial for guidance:
Scraping Multiple Pages with Beautiful Soup
A step-by-step tutorial showing how to get the URLs of multiple pages and scrape them.
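If you’d like a head start: the catalogue pages follow a predictable URL pattern (you can confirm it by clicking next on the site), so the outer loop might look like this sketch:

for page in range(1, 51):
    url = f"http://books.toscrape.com/catalogue/page-{page}.html"
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    # ...then reuse the product_pod loop from the tutorial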
It would also be useful to get the information on individual books. Why not try scraping them as well? It’ll require extracting data from a table; here’s another tutorial to help you. You can find the full list of guides in our knowledge base.
Extracting Data from a Table with Beautiful Soup
A step-by-step guide showing how to scrape data from a table.