How to Scrape Google Search Results
We explore two methods with step-by-step instructions.
This article will give you a hands-on example of how you can scrape Google Search results. It showcases two methods for scraping content from Google Search: doing everything yourself and using a SERP API. Unless the search engine experiences major changes, you should be able to reproduce the code examples by only adjusting a few parameters. Let’s get started!
Wait – What About the Google Search API?
- The API is made for searching within one website or a small group of sites. You can configure it to search the whole web, but that requires tinkering.
- The API provides less information than either the regular search interface or a web scraper, and in a more limited form.
- The API costs a lot of money: 1,000 requests will leave you $5 poorer, which is daylight robbery. There are further limits on the number of daily requests you can make.
Overall, the Google Search API isn’t really something you’d want to use considering its limitations. Trust me, you’ll save both money and sanity by going the web scraping route.
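For completeness, here’s roughly what a call to the official Custom Search JSON API looks like, assuming you’ve already created a Programmable Search Engine (the cx value) and an API key; we won’t use it anywhere else in this tutorial:

import requests

response = requests.get(
    'https://www.googleapis.com/customsearch/v1',
    params={
        'key': 'YOUR_API_KEY',          # API key from the Google Cloud console.
        'cx': 'YOUR_SEARCH_ENGINE_ID',  # Programmable Search Engine ID.
        'q': 'car insurance',
    },
)
print(response.json().get('items', []))  # Organic results live under "items".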
Building a Basic Web Scraper
This part of the tutorial will show you how to build a very basic scraper that can extract results from Google Search. Its functions will include downloading the HTML code, parsing the page title, description, and URL, and saving the data in .JSON format.
First, we need to import the necessary modules:
import json
from urllib.parse import urlencode
import requests
from lxml import etree
We need json because that will be our output format. Urlencode will help us avoid encoding issues when working with Google’s URL parameters (more specifically, the + symbol). Requests is the de facto standard Python library for making HTTP requests. And lxml will be our parser; alternatively, you could use Beautiful Soup.
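To see what urlencode does for us, here’s a quick illustration using the same query and uule value as below; note how the space becomes a + and the + inside the uule string gets escaped:

from urllib.parse import urlencode

params = urlencode({
    'q': 'car insurance',
    'uule': 'w+CAIQICIHR2VybWFueQ',
})
print(params)  # q=car+insurance&uule=w%2BCAIQICIHR2VybWFueQ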
The second step is to create a function for our query:
def do_it_yourself_simple():
    """If you had to do it yourself"""
    domain = 'com'
    payload = {
        'q': 'car insurance',  # Query.
        # You'd need to figure out how to generate this.
        'uule': 'w+CAIQICIHR2VybWFueQ',  # Location.
    }
    results = []  # Parsed results.
    params = urlencode(payload)
    url = f'https://www.google.{domain}/search?{params}'
    # Scrape.
    response = requests.get(url=url)
Now, we’ll need to parse the data we scraped. First, let’s create an XML tree from the HTML:
    # Parse.
    parser = etree.HTMLParser()
    tree = etree.fromstring(
        response.text,
        parser,
    )
Then, let’s find the data we need. The standard way to do this is by inspecting the page. On Chrome, that’s right click -> Inspect. You’ll then want to click the visual inspection button; it will help you find the right elements.
The aim here is to find the element that covers the whole box for one search result. If you can’t do that, don’t worry: simply move up the tree until you locate it.
In our case, the element was a div with a class called ZINbbc. It might be different for you because Google generates class names dynamically. We selected it using XPath syntax:
    result_elements = tree.xpath(
        './/div['
        ' contains(@class, "ZINbbc") '
        ' and not(@style="display:none")'
        ']'
        '['
        ' descendant::div[@class="kCrYT"] '
        ' and not(descendant::*[@class="X7NTVe"])'  # Maps.
        ' and not(descendant::*[contains(@class, "deIvCb")])'  # Stories.
        ']',
    )
Notice that we did two more things here. The first was to remove elements that are hidden from regular users – we don’t need them. The second was to exclude elements whose classes contain data from Maps and Stories, so that we’d only scrape organic search results.
The next step is to loop through each relevant result and extract the data we need. Using XPath again, we specified that we want each result’s URL (the href attribute of its a element), page title (h3), and meta description (the BNeawe class):
    for element in result_elements:
        results.append(
            {
                'url': element.xpath('.//a/@href')[0],
                'title': element.xpath('.//h3//text()')[0],
                'description': element.xpath(
                    './/div[contains(@class, "BNeawe")]//text()',
                )[-1],
                # Other fields you would want.
            }
        )
Finally, we write the data we just scraped and parsed into a .JSON file:
    with open('diy_parsed_result_simple.json', 'w+') as f:
        f.write(json.dumps(results, indent=2))
That’s it! We’ve just scraped the first Google Search results page for the query car insurance.
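If you want to try it out, simply call the function. Below is the rough shape of the output you can expect; the values are placeholders rather than real results:

do_it_yourself_simple()

# diy_parsed_result_simple.json then contains a list of entries like:
# [
#   {
#     "url": "https://example.com/car-insurance",
#     "title": "Example result title",
#     "description": "Example meta description of the result..."
#   }
# ]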
Adding Advanced Features
Of course, our Google scraper is very basic, and there’s no way we could scale it without further improvements. So, let’s try pimping it with a few advanced features.
A word of caution: though we call these features advanced (and compared to the barebones script, they are), adding them alone won’t be enough to sustain any serious web scraping operation. It takes more than that to scrape Google seriously. Still, it’s a good starting point.
User Agent
A user agent shows the website which device and browser you’re using to connect. It’s important because without one, your web scraper will be painfully obvious, and no one likes to be scraped. So while it’s not enough by itself, a real user agent goes a long way towards avoiding unwanted blocks.
Adding a user agent to our code is actually pretty simple. Here’s the same function we made at the beginning, only this time with a headers object holding five different user agents, one of which is chosen at random:
def do_it_yourself_advanced():
    """If you had to do it yourself"""
    domain = 'com'
    payload = {
        'q': 'car insurance',  # Query.
        # You'd need to figure out how to generate this.
        'uule': 'w+CAIQICIHR2VybWFueQ',  # Location.
    }
    headers = {
        'User-Agent': random.choice(
            [
                'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36',
                'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
                'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0',
                'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36',
                'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36',
            ]
        )
    }
Note: to randomize the user agents, we’ll need to import an additional module called random.
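If you’d rather keep the user agents out of the function body, a small helper along these lines does the same job (the list and function names here are our own, not part of the script above):

import random

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0',
]

def random_headers():
    """Return a headers dict with a randomly chosen user agent."""
    return {'User-Agent': random.choice(USER_AGENTS)}

You could then pass random_headers() to every requests.get() call, so each request announces a different browser.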
Pagination
Of course, we wouldn’t want to scrape just one page of Google Search results. What’s the point? So, we’ll want to add pagination. It’s pretty simple to do, actually – we just need one more parameter called pages and a loop:
def do_it_yourself_advanced():
    """If you had to do it yourself"""
    domain = 'com'
    pages = 2
    payload = {
        'q': 'car insurance',  # Query.
        # You'd need to figure out how to generate this.
        'uule': 'w+CAIQICIHR2VybWFueQ',  # Location.
    }
    headers = {
        'User-Agent': random.choice(
            [
                'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36',
                'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
                'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0',
                'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36',
                'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36',
            ]
        )
    }
    results = []  # Parsed results.
    for count in range(0, pages * 10, 10):
        payload['start'] = count  # One page is equal to 10 Google results.
        params = urlencode(payload)
        url = f'https://www.google.{domain}/search?{params}'
Proxies
Another thing we might need for scaling up is proxies. Websites frown upon many connection requests from the same IP address, so it’s a good idea to rotate them once in a while. Including proxies is a two-line affair; we’ll nest them inside our pagination loop:
    for count in range(0, pages * 10, 10):
        payload['start'] = count  # One page is equal to 10 Google results.
        params = urlencode(payload)
        url = f'https://www.google.{domain}/search?{params}'
        proxies = {
            'https': 'http://:@:',  # Format: username:password@host:port.
        }
Be aware that this approach assumes you’re using a rotating proxy network, and you won’t need to handle the rotation logic by yourself.
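If your provider hands you a static list of proxy endpoints instead, you could rotate them yourself with something like the sketch below (the addresses are placeholders, not working proxies):

import random

PROXY_POOL = [
    'http://username:password@proxy1.example.com:8080',
    'http://username:password@proxy2.example.com:8080',
]

def random_proxies():
    """Pick a different proxy endpoint for each request."""
    proxy = random.choice(PROXY_POOL)
    return {'http': proxy, 'https': proxy}

You would then call requests.get(url, proxies=random_proxies(), headers=headers) inside the pagination loop.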
Here’s the full code sample:
import json
from urllib.parse import urlencode
import random

import requests
from lxml import etree


def do_it_yourself_advanced():
    """If you had to do it yourself"""
    domain = 'com'
    pages = 2
    payload = {
        'q': 'car insurance',  # Query.
        # You'd need to figure out how to generate this.
        'uule': 'w+CAIQICIHR2VybWFueQ',  # Location.
    }
    headers = {
        'User-Agent': random.choice(
            [
                'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36',
                'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
                'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0',
                'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36',
                'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36',
            ]
        )
    }
    results = []  # Parsed results.
    for count in range(0, pages * 10, 10):
        payload['start'] = count  # One page is equal to 10 Google results.
        params = urlencode(payload)
        url = f'https://www.google.{domain}/search?{params}'
        proxies = {
            'https': 'http://:@:',  # Format: username:password@host:port.
        }
        # Scrape.
        response = requests.get(url=url, proxies=proxies, headers=headers)
        # Parse.
        parser = etree.HTMLParser()
        tree = etree.fromstring(
            response.text,
            parser,
        )
        result_elements = tree.xpath(
            './/div['
            ' contains(@class, "ZINbbc") '
            ' and not(@style="display:none")'
            ']'
            '['
            ' descendant::div[@class="kCrYT"] '
            ' and not(descendant::*[@class="X7NTVe"])'  # Maps.
            ' and not(descendant::*[contains(@class, "deIvCb")])'  # Stories.
            ']',
        )
        for element in result_elements:
            results.append(
                {
                    'url': element.xpath('.//a/@href')[0],
                    'title': element.xpath('.//h3//text()')[0],
                    'description': element.xpath(
                        './/div[contains(@class, "BNeawe")]//text()',
                    )[-1],
                    # Other fields you would want.
                }
            )
    with open('diy_parsed_result_advanced.json', 'w+') as f:
        f.write(json.dumps(results, indent=2))
That’s it for the first part! Now, let’s try doing the same thing with a dedicated tool – SERP API.
Using a SERP API
Here’s how the same task looks when we hand the heavy lifting off to SERPMaster’s SERP API:
def serpmaster():
    """If you used SERPMaster"""
    payload = {
        'scraper': 'google_search',
        'q': 'car insurance',
        'domain': 'com',
        'geo': 'Germany',
        'parse': 'true',
    }
    pages = 2
    for i in range(1, pages + 1):
        payload['page'] = i
        response = requests.post(
            url='https://rt.serpmaster.com/',
            auth=('username', 'password'),
            json=payload,
        )
        with open(f'serpmaster_parsed_result_page_{i}.json', 'w+') as f:
            f.write(json.dumps(response.json(), indent=2))
Compared to our previous script, this one is much more efficient. This single function includes:
- A payload with some simple parameters,
- A pagination loop,
- A POST request to SERPMaster with login credentials,
- and a .JSON file for the output.
That’s it. We don’t need to worry about forming the right URL, our parsing logic is encapsulated in one line, we can select any geo-location we like, and the proxies are already built in. This approach will also scale well without much effort.
BUT: convenience costs money, which might hold you back from springing for a SERP API. That said, SERPMaster has a free trial; you can try out a few hundred requests to see if it’s worth it for you.
Any Other Methods?
Are there any other ways to scrape Google Search? Yes. The alternatives would be to use a visual web scraper, browser extension, or data collection service. Let’s briefly run through each.
Visual Web Scraper
Visual web scrapers are programs that let you extract data from Google without any coding experience. They give you a browser window where you simply point and click the data points you want to scrape, then download them in the format of your choice. The hardest part is building a proper workflow of pagination and action loops, but that’s still easy compared to writing the code yourself.
When should you get a visual web scraper? When you need a small to moderate amount of data and have no coding experience.
Which visual web scrapers to use? ParseHub and Octoparse are two great options. We’re partial to Octoparse because it has a lighter UI and premade templates for quick basic scraping.
Browser Extension
Browser extensions provide one of the simplest ways to start scraping Google Search. All you need to do is add them to your browser. Afterwards, the process is very similar to a visual web scraper: point and click a webpage’s elements and download them to your computer. Such extensions are surprisingly powerful; they can handle JS, pagination, and even perform actions like form filling.
When should you use a web scraping browser extension? When you need quick and not very elaborate data from Google Search.
Which browser extensions to use? We like Web Scraper. It’s a free extension for Chrome and Firefox that embeds itself into the developer tools. An alternative would be Data Miner for Chrome. The latter is a little easier to use and has thousands of public recipes (pre-built scrapers you can use).
Data Collection Service
A data collection service is the easiest method for getting data from Google Search. You specify your requirements and budget, then receive the results nicely formatted for further use. That’s about it. You don’t need to build or maintain a scraper, worry about the scraping logic, or even think about the legal aspects of your scraping project. Your only worry will be money.
When should you use a data collection service? This one’s pretty simple: when you’re running a mid to large-scale project, have the funds, and no one to build a web scraper for you.
Which data collection service to choose? There’s no shortage of companies that provide data collection services. Some examples would be ScrapingHub and Bright Data.
Frequently Asked Questions About How to Scrape Google Search Results
Is it legal to scrape Google Search results?
Google doesn’t like being scraped – and has mechanisms to limit it – but we haven’t heard of it suing anyone over scraping Google Search. If it did, many SEO companies would simply go out of business.
Is there an official Google Search API?
Yes, there is. But it’s limited and expensive (1,000 queries cost $5), so people turn to web scraping instead.
Which proxies work best for scraping Google?
Take a look at our list of the best proxies for Google scraping.