How to scrape multiple pages using Beautiful Soup

A step-by-step guide on how to scrape multiple pages using Beautiful Soup.

Important: we’ll use a real-life example in this tutorial, so you’ll need the requests and Beautiful Soup libraries installed (for example, with pip install requests beautifulsoup4).

Step 1. Let’s start by importing the Beautiful Soup library.

from bs4 import BeautifulSoup

Step 2. Then, import the requests library.

import requests

Step 3. Get the source code of your target landing page. We will be using our Guides page in this example.

r = requests.get("https://proxyway.com/guides/")

A generic version might look like this:

r = requests.get("Your URL")
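Requests can fail or hang, so it may be worth adding a timeout and a status check before parsing anything. The following is a minimal sketch, not part of the original script: the helper name fetch_html and the 10-second timeout are assumptions made for illustration.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as bytes, or raise if the request fails.

    fetch_html and the 10-second default are illustrative choices.
    """
    r = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses
    # instead of silently parsing an error page.
    r.raise_for_status()
    return r.content
```

The returned bytes can be passed to BeautifulSoup in the same way as r.content in the steps that follow.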

Step 4. Get the link to the next page by finding the a tag with the class next. Then read the href attribute from this element and perform a new request with it.
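To see how this lookup behaves without making a request, you can try it on a small HTML string. The markup below is made up for illustration and is not copied from any real page:

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment standing in for a real page's pagination markup.
html = """
<div class="pagination">
  <a class="prev" href="/guides/">Previous</a>
  <a class="next" href="/guides/page/2/">Next</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None when nothing matches.
next_page_link = soup.find("a", class_="next")
href = next_page_link.get("href") if next_page_link is not None else None
print(href)  # prints /guides/page/2/
```

Note that class_="next" also matches tags that carry additional classes alongside next.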


Step 5. You can put the entire code into a single function:

def scrape_page(url):
    print("URL: " + url)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    get_data(soup)
    next_page_link = soup.find("a", class_="next")
    if next_page_link is not None:
        href = next_page_link.get("href")
        scrape_page(href)
    else:
        print("Done")

When you run the script, it prints the URL of each page as it scrapes it, followed by Done once no next link is found.

Let’s look at the code step by step:

1. Performing a request to get the page.

r = requests.get(url)

2. Parsing the page and turning it into a BeautifulSoup object.

soup = BeautifulSoup(r.content, "html.parser")

3. Passing the soup object into a different function where you could scrape page data before moving on to the next page.

get_data(soup)

4. Finding the next link element.

next_page_link = soup.find("a", class_="next")

5. If such an element exists, then there is another page you can scrape; if not, you’re done.

if next_page_link is not None:

6. Getting the href attribute from the link element. This is the URL of the next page you’re scraping.

href = next_page_link.get("href")

7. Calling the same function again and passing it the new URL to scrape the next page.

scrape_page(href)
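The recursive call works well for a handful of pages, but each page adds a stack frame, and Python's default recursion limit is around 1,000. A relative href (such as /guides/page/2/) would also fail if passed straight to requests.get. The sketch below shows one possible alternative that uses a loop and urljoin. The names find_next_url and scrape_all, and the max_pages cap, are assumptions introduced here and are not part of the original script.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def find_next_url(html, base_url):
    """Return the absolute URL of the next page, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    next_page_link = soup.find("a", class_="next")
    if next_page_link is None or not next_page_link.get("href"):
        return None
    # urljoin leaves absolute hrefs unchanged and resolves relative ones.
    return urljoin(base_url, next_page_link.get("href"))


def scrape_all(start_url, max_pages=50):
    """Follow "next" links in a loop and return the list of visited URLs."""
    visited = []
    url = start_url
    # Stop when there is no next link, a URL repeats, or the cap is reached.
    while url is not None and url not in visited and len(visited) < max_pages:
        r = requests.get(url, timeout=10)
        visited.append(url)
        url = find_next_url(r.content, url)
    return visited
```

Keeping the link lookup in its own function means it can be tried on plain HTML strings without touching the network.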

Results:
Congratulations, you’ve scraped multiple pages using Beautiful Soup. Here’s the full script:

from bs4 import BeautifulSoup
import requests

start_url = "https://proxyway.com/guides"

def scrape_page(url):
    print("URL: " + url)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    get_data(soup)
    next_page_link = soup.find("a", class_="next")
    if next_page_link is not None:
        href = next_page_link.get("href")
        scrape_page(href)
    else:
        print("Done")


def get_data(content):
    # We could do some scraping of web content here
    pass


def main():
    scrape_page(start_url)


if __name__ == "__main__":
    main()
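
The get_data placeholder in the script above can be filled in with whatever extraction you need. As a hedged illustration, the sketch below collects the text of every h2 element. That titles live in h2 tags is an assumption made for this example, not something confirmed about any particular page, so adjust the selector to your target's markup.

```python
from bs4 import BeautifulSoup


def get_data(content):
    """Collect the text of every <h2> element (an assumed location for titles)."""
    titles = [h2.get_text(strip=True) for h2 in content.find_all("h2")]
    for title in titles:
        print(title)
    return titles
```

Because the function receives the soup object, it can be tried on a small HTML string first, for example get_data(BeautifulSoup("<h2>First</h2>", "html.parser")).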