Web Scraping with Python: All You Need to Get Started
An introductory guide to Python web scraping with a step-by-step tutorial.
Python is probably the most popular language for machine learning and data analysis. But it’s also a great choice for web data extraction. Adding this skill to your portfolio makes a lot of sense if you’re working with data, and it can also bring profitable opportunities.
This guide will give you all you need to start web scraping with Python. It explains why you should invest your time in Python and introduces the libraries and websites for practicing web scraping. You’ll also find a step-by-step tutorial for building a web scraper that you can replicate on your own computer. Let’s begin!
What Is Web Scraping in Python?
Web scraping refers to the process of downloading data from web pages and structuring it for further analysis. You can scrape by hand, but it’s much faster to write an automated script to do it for you.
With this approach, you don’t exactly download a web page as people see it. Rather, you extract its underlying HTML skeleton and work from there. If you’re not sure what that is, try right-clicking on this page and selecting Inspect. You should now see the page as a web scraper does.
Where does Python come in? Python provides the libraries and frameworks you need to successfully locate, download, and structure data from the web – in other words, scrape it.
Why Choose Python for Web Scraping
If you don’t have much programming experience – or know another programming language – you may wonder if it’s worth learning Python over the alternatives. Here are a few reasons why you should consider it:
- Simple language. Python’s syntax is relatively human-readable and easy to understand at a glance. What’s more, you don’t need to compile code, which makes it simple to debug and experiment.
- Great tools for web scraping. Python has some of the staple libraries for data collection, such as Requests with over 200 million monthly downloads.
- Strong community. You won’t have issues getting help or finding solutions on platforms like Stack Overflow.
- Popular choice for data analysis. Python ties in beautifully with the broader ecosystem of data analysis (Pandas, Matplotlib) and machine learning (TensorFlow, PyTorch).
Sending HTTP Requests with Python
Requests
If this is your first web scraping project, I strongly suggest starting with Requests. It’s an HTTP client that lets you download pages. The basic configuration only requires a few lines of code, and you can customize the request greatly, adding headers, cookies, and other parameters as you move on to more complex targets.
To start using Requests, you need to install it by running the following command: pip install requests. Then, create a new file and start setting up your scraper.
Here’s how you can send GET and POST requests, as well as custom headers:
- Sending GET request
import requests
# Send a GET request to a URL
response = requests.get('https://example.com')
# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Print the response content
    print(response.text)
else:
    print('Error:', response.status_code)
- Sending POST request
import requests
# Data to be sent with the POST request
data = {'key': 'value'}
# Send a POST request to a URL with the data
response = requests.post('https://example.com', data=data)
# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Print the response content
    print(response.text)
else:
    print('Error:', response.status_code)
- Sending Headers
import requests
# Define custom headers
headers = {'User-Agent': 'MyCustomUserAgent'}
# Send a GET request with custom headers
response = requests.get('https://example.com', headers=headers)
# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Print the response content
    print(response.text)
else:
    print('Error:', response.status_code)
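The snippets above cover the basics, but Requests can attach much more to a call. Here’s a minimal sketch of sending cookies and URL query parameters as well – the cookie and parameter values are made up purely for illustration:
import requests
# Hypothetical cookie and query-string values, just to show the mechanism
cookies = {'session_id': 'abc123'}
params = {'page': 2, 'sort': 'price'}
# Send a GET request with cookies and parameters (?page=2&sort=price)
response = requests.get('https://example.com', cookies=cookies, params=params, timeout=10)
# Print the final URL (with the query string attached) and the status code
print(response.url)
print(response.status_code)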
Python’s Requests isn’t the only HTTP client used for web scraping. There are other alternatives you can try out.
We compiled a list of the best Python HTTP clients for you to try.
Sending Python Asynchronous Requests with aiohttp
At some point you reach the scale where synchronous tools like Requests start limiting you. Asynchronous web scraping is a technique for sending multiple requests without waiting for each response before firing off the next one. Python’s aiohttp library is designed for this purpose.
The tool is built on top of asyncio, Python’s built-in asynchronous I/O framework. It can manage multiple requests without blocking the main program’s execution.
If you need to set up a custom API or endpoint for your web scraper, aiohttp allows you to build web applications and APIs for managing asynchronous connections.
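For instance, a bare-bones aiohttp web application looks like the sketch below. The /scrape endpoint and its response are placeholders – they only illustrate the shape of the API, not a finished scraper backend:
from aiohttp import web
async def handle_scrape(request):
    # Read a ?url=... query parameter and echo it back as JSON
    target = request.query.get('url', '')
    return web.json_response({'received': target})
app = web.Application()
app.add_routes([web.get('/scrape', handle_scrape)])
if __name__ == "__main__":
    web.run_app(app, port=8080)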
Here’s an example of web scraping with aiohttp:
import aiohttp
import asyncio
async def fetch_data(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()

async def main():
    urls = [
        'https://example.com/1',
        'https://example.com/2',
        'https://example.com/3',
        # Add more URLs as needed
    ]
    tasks = [fetch_data(url) for url in urls]
    results = await asyncio.gather(*tasks)
    for result in results:
        print(result)

if __name__ == "__main__":
    asyncio.run(main())
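Note that asyncio.gather() fires all the requests at once. If you need to cap how many run in parallel – for example, to avoid overloading the target server – asyncio’s Semaphore is a common pattern. Here’s a sketch building on the same idea (the limit of 5 is an arbitrary choice):
import aiohttp
import asyncio
async def fetch_limited(session, semaphore, url):
    # The semaphore lets at most N coroutines past this point at a time
    async with semaphore:
        async with session.get(url) as response:
            return await response.text()
async def main():
    semaphore = asyncio.Semaphore(5)
    urls = [f'https://example.com/{i}' for i in range(1, 21)]
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_limited(session, semaphore, url) for url in urls]
        results = await asyncio.gather(*tasks)
    print(len(results), 'pages downloaded')
if __name__ == "__main__":
    asyncio.run(main())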
We’ve prepared an extensive tutorial with a real-life example if you want to practice sending asynchronous requests. You’ll learn to build a script that goes through multiple pages and scrapes the necessary information.
A tutorial on using aiohttp for sending asynchronous requests and web scraping multiple pages in parallel.
Web Parsing with Python
Parsing with Python involves analyzing structured data, usually text, and extracting the information you need. With Python, you can parse different types of data, including XML, JSON, HTML, and more.
Python has various libraries for handling HTML, such as BeautifulSoup and lxml. These libraries allow you to navigate through HTML documents and extract target data. However, they can’t function as standalone tools, so you’ll also need to use an HTTP client to build a scraper.
Beautiful Soup
Beautiful Soup is a data parsing library that extracts data from the HTML code you’ve downloaded and transforms it into a structured format. Its syntax is simple to grasp, while the tool itself is powerful, well-documented, and forgiving for beginners.
The library is also flexible. It supports three parsers – Python’s built-in html.parser, plus lxml and html5lib – so you can pick whichever suits your task. It’s reasonably light on resources, too. Additionally, Beautiful Soup is one of the few libraries that can handle broken pages.
To start using Beautiful Soup, you need to install it by running the following command: pip install beautifulsoup4
Here’s an example of how to scrape with Beautiful Soup:
from bs4 import BeautifulSoup
html_data = """
<h1>Welcome to my website</h1>
<p>This is a paragraph.</p>
<a href="https://example.com">Link</a>
"""
soup = BeautifulSoup(html_data, 'html.parser')
print("Title:", soup.h1.text)
print("Paragraph:", soup.p.text)
print("Link URL:", soup.a['href'])
To learn more about web scraping with Beautiful Soup, you can go straight to our step-by-step tutorial. You’ll find out how to build a simple scraper with a real-life example. The guide will also teach you how to extract the data from a single static page and add it to a CSV file for further analysis.
A step-by-step guide to web scraping with Beautiful Soup.
lxml
lxml is a powerful parsing library. It wraps two C libraries – libxml2 and libxslt – combining their speed and feature set with the simplicity of a native Python API.
The key benefit of lxml is that it doesn’t use a lot of memory, so you can parse large amounts of data fast. The library also supports three schema languages and fully implements XPath, which is used to locate elements in XML and HTML documents.
Here’s an example of parsing with lxml:
from lxml import etree
html_data = """
<h1>Text</h1>
<p>This is a paragraph.</p>
<a href="https://example.com">Link</a>
"""
# Parse the HTML
root = etree.HTML(html_data)
# Extract information
title = root.find('.//h1').text
paragraph = root.find('.//p').text
link_url = root.find('.//a').get('href')
# Print the results
print("Title:", title)
print("Paragraph:", paragraph)
print("Link URL:", link_url)
To learn more about web scraping static pages, check out our tutorial with lxml and Requests. You’ll learn how to fetch HTML data, parse it, add a browser fingerprint, and extract data from a popular movie platform, IMDb.
A step-by-step guide to web scraping with lxml and Requests.
Building a Python Headless Browser with Selenium
Headless browsers become necessary when websites dynamically generate content through client-side scripting languages like JavaScript. Unlike static web pages, where the HTML content can be retrieved directly, dynamic web pages require the content to be rendered first.
Selenium is one of the oldest tools to control headless browsers programmatically. Although its primary use is browser automation, Selenium remains a popular choice for web scraping JavaScript-based websites. It can control most popular web browsers and supports a variety of programming languages.
The ecosystem around Selenium also includes packages that help hide your digital fingerprint. Properly configured, it can automate logins or get past reCAPTCHAs, which are often a major roadblock when scraping popular websites.
Here’s an example of web scraping with Selenium:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Set up the Selenium WebDriver (in Selenium 4, pass the driver path through a Service object;
# leave it out entirely and Selenium will download a matching driver for you)
driver = webdriver.Chrome(service=Service('/path/to/chromedriver'))

# Navigate to the webpage
driver.get('https://example.com')

try:
    # Wait for the element to be loaded (for demonstration purposes, let's wait for the page title)
    WebDriverWait(driver, 10).until(EC.title_contains("Example"))
    # Once loaded, find the element(s) you want to scrape
    elements = driver.find_elements(By.CSS_SELECTOR, 'p')  # Example: find all <p> elements
    # Extract data from the elements
    for element in elements:
        print(element.text)
finally:
    # Close the WebDriver session
    driver.quit()
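As for the logins mentioned above, here’s a minimal sketch of what that automation usually looks like. The URL and form field names below are placeholders – inspect the real login page to find the correct ones:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get('https://example.com/login')  # hypothetical login page
# Assumed field names; check the page's HTML for the real ones
driver.find_element(By.NAME, 'username').send_keys('my_user')
driver.find_element(By.NAME, 'password').send_keys('my_password')
driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()
# Check where the login redirected us
print(driver.current_url)
driver.quit()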
If you want to learn more about web scraping with Selenium, we’ve prepared a guide for beginners. In this tutorial, you’ll learn how to deal with JavaScript-generated content, scrape multiple pages, handle delayed rendering, and write the results into a CSV file.
A step-by-step guide to web scraping with Selenium.
Selenium is a good library to start with, thanks to its prevalence, but it has some drawbacks, such as speed and setup complexity. If this is a dealbreaker for you, Python has other frameworks that can handle JavaScript-based websites.
Which one is better for web scraping?
Which tool better fits your project needs?
Other Python Web Scraping Libraries
There’s no shortage of Python web scraping libraries. The tool you choose depends on your project parameters, including the type of the target website (static or dynamic), the amount of data you want to scrape, and other aspects.
Get acquainted with the main Python web scraping libraries and find the best fit for your scraping project.
Another way to go about web scraping with Python is by using the cURL command-line tool. It offers several advantages over Python libraries such as Requests – for example, it’s very fast and can send many requests over multiple connections.
This tutorial will show you the basics of using cURL with Python for gathering data.
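As a quick taste – the linked tutorial goes deeper and may take a different route, such as the pycurl bindings – one simple way to call cURL from Python is the standard subprocess module:
import subprocess
# -sS: run silently but still report errors
result = subprocess.run(
    ['curl', '-sS', 'https://example.com'],
    capture_output=True,
    text=True,
)
if result.returncode == 0:
    print(result.stdout)  # the downloaded HTML
else:
    print('curl failed:', result.stderr)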
Social Media Web Scraping with Python
Web scraping social media isn’t easy. Social platforms are probably the most strict about automated data collection and often look for ways to shut down web scraping companies. And while gathering publicly available information is legal, it’s not without roadblocks.
Many businesses and individual marketers use social media scraping to perform sentiment analysis, analyze market trends, monitor online branding, or find influencers.
If you want to try your skills in scraping social media, we have prepared Python guides covering the most popular platforms. Remember that all the scrapers are built for publicly available data. Also, social media platforms often make updates that can impact your scraper, so you might need to make some adjustments.
A step-by-step tutorial on how to scrape Reddit data using Python.
A step-by-step example of how to scrape Facebook using Python.
A step-by-step guide on how to scrape Instagram using Python.
Python Tips for Web Scraping
Devise a Web Scraping Project
If you have no business use case in mind, it can be hard to find worthwhile ideas. I recommend practicing with dummy websites. They are specifically designed to be scraped, so you’ll be able to try out various techniques in a safe environment. You can find several such websites in our list of websites to practice your web scraping skills.
If you’d rather scrape real targets, it’s a good idea to begin with something simple. Popular websites like Google and Amazon offer valuable information, but you’ll encounter serious web scraping challenges like CAPTCHAs as soon as you start to scale. They may be hard to tackle without experience.
In any case, there are guidelines you should follow to avoid trouble. Try not to overload the server, and be very cautious about scraping data behind a login – it’s gotten multiple companies sued. You can find more advice in our article on web scraping best practices.
In the end, your project should have at least these basic parameters: a target website, the list of URLs you want to scrape from it, and the data points you’re interested in. If you’re short on ideas for what to scrape, we’ve prepared a list of practical Python projects for you to try.
Ideas for beginners and advanced users.
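Whichever project you pick, it helps to write those basic parameters down before you start coding. A quick sketch, using books.toscrape.com as an example target (the URLs and data points are placeholders):
# Example project definition; adjust the URLs and data points to your target
project = {
    'target': 'http://books.toscrape.com/',
    'urls': [f'http://books.toscrape.com/catalogue/page-{n}.html' for n in range(1, 4)],
    'data_points': ['title', 'price', 'rating', 'availability'],
}
for url in project['urls']:
    print('Will scrape:', url)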
Use Proxies
As you increase your project’s scope, you should look into proxies. They offer the easiest way to avoid blocks by giving you more IP addresses. At first, you may consider a free proxy list (because it’s free!), but I recommend investing in paid rotating proxies.
A proxy server routes traffic through itself, masking your real location and giving you a different IP address in the process. This is especially important when you need to hide your identity while performing automated actions.
If you plan to scrape serious targets like Amazon, you should go with residential proxies. You’ll get a pool of IP addresses that come from real devices on Wi-Fi, which makes them very hard to detect.
For unprotected websites, you can save money by choosing datacenter proxies. They’re much faster than their residential counterparts but relatively easy to detect.
Setting up proxies with your scraper isn’t that difficult, especially since most proxy service providers include instructions for the most popular languages and tools.
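With Requests, for example, it boils down to passing a proxies dictionary – replace user, password, host, and PORT with the credentials your provider gives you:
import requests
# Route both HTTP and HTTPS traffic through the proxy
proxies = {
    'http': 'http://user:password@host:PORT',
    'https': 'http://user:password@host:PORT',
}
response = requests.get('https://example.com', proxies=proxies)
print(response.status_code)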
This is a step-by-step guide on how to set up and authenticate a proxy with Selenium using Python.
Practice Your Skills: a Quick Python Web Scraping Tutorial
Imagine that you run an online book business. Your main competitor sells books at books.toscrape.com, and you’d like to learn more about its catalogue.
There are 1,000 books in total, with 20 books per page. Copying all 50 pages by hand would be madness, so we’ll build a simple web scraper to do it for us. It’ll collect the title, URL, price, rating, and availability of your competitor’s books and write this information into a CSV file.
Prerequisites
- Python 3, if it’s not already on your computer, and pip to install the necessary libraries.
- Beautiful Soup. You can add it by running pip install beautifulsoup4 in your operating system’s terminal.
- Requests. You can add it by running pip install requests.
Importing the Libraries
The first thing we’ll need to do is to import Requests and Beautiful Soup:
from bs4 import BeautifulSoup
import requests
We’ll need one more library to add the scraped data to a CSV file:
import csv
Downloading the Page
Now, it’s time to download the page. The Requests library makes it easy – we just need to write two lines of code:
url = "http://books.toscrape.com/"
r = requests.get(url)
Extracting the Relevant Data Points
Now, it’s time to extract the data we need with Beautiful Soup. We’ll first create a Beautiful Soup object for the page we scraped:
soup = BeautifulSoup(r.content, "html.parser")
We could try extracting the whole class and printing it:
element = soup.find_all(class_="product_pod")
print(element)
However, that returns everything at once, which isn’t very useful. Instead, we’ll loop over each element with the class product_pod:
for element in soup.find_all(class_='product_pod'):
Now, let’s extract the book title. It’s nested in the <h3> tag, inside an <a> tag. We can simply use h3 and strip away unnecessary data (such as the URL) by specifying that we want a string:
book_title = element.h3.string
The book price is under a class called price_color. We’ll extract it without the pound symbol:
book_price = element.find(class_='price_color').string.replace('£','')
The book ratings are under the <p> tag. They’re more problematic to get because the rating is in the class name itself, and that class has two values (when we only need one!). It’s possible to solve this by extracting the class as a list:
book_rating = element.p['class'][1]
Finally, there’s the stock availability. It’s under a class called instock. We’ll extract the text and strip away unnecessary elements (mostly blank spaces):
book_stock = element.find(class_="instock").get_text().strip()
Here’s the full loop:
for element in soup.find_all(class_='product_pod'):
    book_title = element.h3.string
    #getting the book's relative URL from the <a> tag inside <h3>
    book_url = element.h3.a['href']
    #getting the price and ridding ourselves of the pound sign
    book_price = element.find(class_='price_color').string.replace('£','')
    #getting the rating
    #the element's class has two values: star-rating and the rating word, e.g. 'One'
    #so we're only getting the second one
    book_rating = element.p['class'][1]
    #finding availability
    book_stock = element.find(class_="instock").get_text().strip()
    #we can also use:
    #book_stock = element.select_one(".instock").get_text().strip()
    #print out to double check
    print(book_title)
    print(book_url)
    print(book_price)
    print(book_rating)
    print(book_stock)
Feel free to play around with the code: for example, try removing the [1] index from the book_rating object, or delete .strip() and see what happens. Beautiful Soup has many features, so there’s rarely just one proper way to extract content. Make Beautiful Soup’s documentation your friend.
Exporting the Output to CSV
At this point, our script returns all the data points we wanted, but they’re not very easy to work with. Let’s export the output to a CSV file.
First, we’ll create an empty list object:
books = []
Then, still inside the loop, we’ll append our data points to the list:
books.append({
    'title': book_title,
    'price': book_price,
    'rating': book_rating,
    'stock': book_stock,
    'url': book_url
})
Finally, we’ll write the data to a CSV file:
with open("books_output.csv", "a") as f:
for book in books:
f.write(f"{book['title']},{book['price']},{book['rating']},{book['stock']},{book['url']}\n")
Here’s the complete script:
from bs4 import BeautifulSoup
import requests
import csv

url = "http://books.toscrape.com/"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

#books will be a list of dicts
books = []

for element in soup.find_all(class_='product_pod'):
    book_title = element.h3.string
    #getting the book's relative URL from the <a> tag inside <h3>
    book_url = element.h3.a['href']
    #getting the price and ridding ourselves of the pound sign
    book_price = element.find(class_='price_color').string.replace('£','')
    #getting the rating
    #the element's class has two values: star-rating and the rating word, e.g. 'One'
    #so we're only getting the second one
    book_rating = element.p['class'][1]
    #finding availability
    book_stock = element.find(class_="instock").get_text().strip()
    #we can also use:
    #book_stock = element.select_one(".instock").get_text().strip()
    books.append({
        'title': book_title,
        'price': book_price,
        'rating': book_rating,
        'stock': book_stock,
        'url': book_url
    })

#write it to a csv file
#not writing a header row in this case
#"a" opens the file if it exists and appends to it
#"w" would overwrite the file if it exists
with open("books_output.csv", "a") as f:
    for book in books:
        f.write(f"{book['title']},{book['price']},{book['rating']},{book['stock']},{book['url']}\n")
Next Steps
Wait, that’s only one page. Weren’t we supposed to scrape all 50?
That’s right. But at this point, you should start getting the hang of it. So, why not try scraping the other pages yourself? You can use the following tutorial for guidance:
A step-by-step tutorial showing how to get the URLs of multiple pages and scrape them.
It would also be useful to get information on individual books. Why not try scraping them as well? It’ll require extracting data from a table; here’s another tutorial to help you. You can find the full list of guides in our knowledge base.
A step-by-step guide showing how to scrape data from a table.