Web Scraping Best Practices: A Guide to Successful Web Scraping
We’ve prepared some tips and tricks that will come in handy when gathering data.
It’s no secret that websites keep an eye out for bots by using various anti-scraping techniques like rate throttling or IP address bans. These and other roadblocks can make or break your data-gathering efforts. But sometimes, all you need is the right knowledge and a few tips to avoid trouble along the way.
From IP address and user agent rotation to handling redirects and improving your digital fingerprint, even seasoned scrapers look for guidelines that work. We’ve put together the best web scraping practices to help you deal with IP blocks, request limits, and even technical issues like changes in website structure. Continue reading this guide and arm yourself with web scraping best practices worth following.
- Consider the Website’s Guidelines
- Scrape Politely
- Discover API Endpoints
- Rotate your IP Address
- Know When to Use a Headless Browser
- Improve Your Browser’s Fingerprint
- Maintain Your Web Scraper
- Act Natural
- Other Tricks to Improve Your Scraping Bot
If you took a look at how people browse, you’d see that the pattern is chaotic. Bots, by contrast, are predictable – monotonous and much faster than actual users. That’s a dead giveaway, since websites can monitor traffic by tracking your IP address and the number and pattern of connection requests you make within a specific timeframe. Any unusual activity raises a flag.
But that’s not all. Websites can also identify your device and software characteristics using various fingerprinting methods. For example, they can pinpoint a web scraper by the identifiers it sends in HTTP request headers like cookies or user agents. The most advanced fingerprinting techniques can even track mouse movements on the page to decide whether a user is a bot.
One way websites deal with unwanted visitors is by blocking their IP address. Some go further and ban the whole IP range – all 256 addresses that come from the same subnet. This mostly happens when you use datacenter proxies.
Other websites react by limiting your connection requests, meaning you won’t be able to gather data for some time – and the cooldown period differs depending on the target server. This will slow down your scraper, and if the unwanted behavior continues, it might escalate to an IP address ban.
There are more roadblocks scraping might throw your way. You can read more in our article on ways to overcome frequent web scraping challenges.
Imagine a website as somebody’s home – it has rules to follow. Most websites set up instructions for managing bot traffic called robots.txt. They outline which pages are okay to scrape, how often you can do it, and which pages are out of reach.
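Python’s standard library can check a URL against these rules before you request it. A minimal sketch, using a hypothetical robots.txt file for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; in practice you would fetch
# the real file from https://<target>/robots.txt.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

def allowed(url: str) -> bool:
    """Check whether a generic crawler ('*') may fetch the given URL."""
    return parser.can_fetch("*", url)
```

If the file defines a Crawl-delay, `parser.crawl_delay("*")` returns it, so you can pace your requests to match the site’s wishes.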
Another critical guideline: read the terms of service (ToS), which act as a contract between you and the target website. Some ToS include scraping policies that explicitly prohibit you from extracting any data from the domain. These rules are rarely legally binding, but they can still get you into trouble if you’re not careful.
If there’s one thing you should remember, it’s not to scrape data behind a login – especially on social media platforms. Doing so has already triggered multiple lawsuits and puts you at considerable risk.
Most web scraping tools can run hundreds of concurrent requests. The problem is, smaller websites don’t have the resources to handle that much load, so you might accidentally crash their servers by accessing them too frequently.
To avoid this, accommodate your target’s capabilities: add delays between requests, gather data during off-peak hours, and don’t be a burden in general. Doing so will make everyone’s experience better.
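A simple way to implement those delays is to wrap your fetching loop in a randomized pause. A sketch, where `fetch` stands in for whatever request function you actually use:

```python
import random
import time

def request_delay(base: float = 2.0, jitter: float = 1.0) -> float:
    """A randomized pause so the traffic doesn't look machine-timed."""
    return base + random.uniform(0, jitter)

def throttled_fetch(urls, fetch, base=2.0, jitter=1.0):
    """Fetch each URL in turn, sleeping between requests.

    `fetch` is any callable that takes a URL and returns a response
    (e.g. requests.get in a real scraper).
    """
    results = []
    for url in urls:
        results.append(fetch(url))
        time.sleep(request_delay(base, jitter))
    return results
```

Tune `base` and `jitter` to the target site – a small blog deserves far longer pauses than a large retailer with serious infrastructure.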
Web scraping requires making many connection requests in a short period of time. Hundreds of spiders overloading your servers are no fun, so websites impose request limits, use anti-scraping technology like CAPTCHAs, or even block IP addresses. But we have a solution called IP rotation.
One way to go about IP rotation is to use proxies. We’d recommend choosing a rotating proxy provider that automatically switches your proxy IP with every connection request. Try to avoid sticky sessions unless your workflow requires keeping the same identity for several requests in a row. Also, note that some websites block IPs that come from cloud hosting services (datacenter proxies), so you may need to use residential addresses instead.
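If your provider doesn’t rotate for you, you can cycle through a proxy pool yourself. A minimal sketch with made-up proxy endpoints, producing the mapping the Requests library expects for its `proxies` argument:

```python
from itertools import cycle

# Hypothetical proxy pool -- substitute your provider's real endpoints.
PROXIES = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]
proxy_pool = cycle(PROXIES)

def next_proxies() -> dict:
    """Return a Requests-style proxies mapping, advancing the rotation."""
    proxy = next(proxy_pool)
    return {"http": proxy, "https": proxy}

# Usage with Requests (not executed here):
# resp = requests.get(url, proxies=next_proxies(), timeout=10)
```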
A headless browser is like a regular web browser (Chrome or Firefox), only without a user interface. When it comes to web scraping, a headless browser is either essential or irrelevant to your project: you’ll need one if the target renders its content with JavaScript, but if the data is already in the raw HTML, plain HTTP requests are faster and far less resource-hungry.
Requests made from a web browser contain a collection of headers that reveal your preferences and software information. One header – the user-agent string – is particularly important: if it’s missing or malformed, the target may refuse to serve your web scraper. This applies to most HTTP clients like Requests, which send their own default user-agent header. Don’t forget to change it!
Furthermore, it might not be a good idea to always use the same user-agent string since websites monitor requests coming from the same browser. The way out is to rotate your user agent. You should collect the user agents of up-to-date web browsers and loop through them.
User-agent aside, there are more headers to consider. For example, some websites require cookies, and you’ll have a better chance of succeeding with others if you add the referer header.
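Putting the header advice together: a small sketch that picks a user agent from a pool on each request (one simple way to rotate) and attaches an optional referer. The strings below are illustrative placeholders – keep your own list current:

```python
import random

# Example desktop user-agent strings; refresh these as browsers update.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
]

def build_headers(referer=None) -> dict:
    """Headers for one request: a rotated user agent plus an optional referer."""
    headers = {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }
    if referer:
        headers["Referer"] = referer
    return headers

# Usage with Requests (not executed here):
# resp = requests.get(url, headers=build_headers("https://www.google.com/"))
```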
When you buy pre-made scraping tools, the service you subscribe to takes care of the maintenance. However, custom-built software requires your (or your peers’) constant supervision. There are two main reasons for that: 1) it’s a patchwork of tools, and 2) web developers make frequent structural changes to websites.
First, a self-built scraper is made from different components. Therefore, it’s realistic that sooner or later one or more elements may fail, and you’ll need to fix the issue. For example, your proxy servers may go down, or the web scraper can encounter a situation it won’t know how to handle.
Second, webmasters make frequent structural changes that can affect how your scraper functions. These can include new protection methods or simply a rearranged HTML structure that breaks your parsing code. Over time, you’ll need to add new features on top of old structures and run tests to check whether the scraper is still good to go. Also, keep tabs on changes like missing or modified field names – this will keep your data quality from degrading.
The main difference between human and bot behavior is that people are slow and unpredictable, while bots are very fast and programmed to follow a specific crawling pattern.
To look more human-like, you should reduce the crawling rate by changing time intervals between your requests or clicking specific elements on a website. If you’re using a headless browser, you can also add random activities like mouse movements. Unpredictable actions will make it harder for the server to identify you as a bot.
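Headless-browser libraries such as Playwright and Selenium let you move the mouse to arbitrary coordinates, so one approach is to generate a slightly wobbly path between two points and replay it step by step. A sketch of the path generator (the browser calls themselves are left out):

```python
import random

def human_mouse_path(start, end, steps=12, jitter=3.0):
    """Intermediate (x, y) points from start to end with small random wobble,
    suitable for feeding to a headless browser's mouse-move API one by one."""
    (x0, y0), (x1, y1) = start, end
    path = []
    for i in range(1, steps + 1):
        t = i / steps  # fraction of the way along the straight line
        x = x0 + (x1 - x0) * t + random.uniform(-jitter, jitter)
        y = y0 + (y1 - y0) * t + random.uniform(-jitter, jitter)
        path.append((x, y))
    return path
```

In Playwright, for instance, you could pass each point to `page.mouse.move(x, y)` with a short sleep in between, instead of teleporting the cursor straight to the target element.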
After your scraper is all set up and running, there are more ways to go about improving your script.
Cache HTTP requests. Tasks like price aggregation require scraping more than one page, meaning you’ll have to go through many website URLs. That’s where crawling comes into play – you build crawling logic to extract specific data from multiple pages. However, the process becomes a burden when you need to know which pages the crawler has already visited, or when you have to revisit those pages for more data later. By storing responses in a database, you’ll avoid requesting the same pages again in the future.
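A sketch of the idea with an in-memory cache – swap the dict for a database table if you need persistence across runs; `fetch` stands in for whatever request function you use:

```python
class CachedFetcher:
    """Memoize responses by URL so already-visited pages aren't requested twice.

    `fetch` is any callable (e.g. requests.get) that takes a URL.
    """

    def __init__(self, fetch):
        self.fetch = fetch
        self.cache = {}   # swap for SQLite/Redis to persist between runs
        self.hits = 0     # how many network requests the cache saved

    def get(self, url):
        if url in self.cache:
            self.hits += 1
            return self.cache[url]
        response = self.fetch(url)
        self.cache[url] = response
        return response
```

For Requests-based scrapers, third-party packages like requests-cache offer the same behavior with a persistent backend out of the box.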
Use canonical URLs. Some websites have several URLs that lead to the same content. It usually happens when they include both a desktop and a mobile version: for example, www.instagram.com and https://m.instagram.com. A canonical URL, or canonical tag, is an HTML snippet that defines the main version among duplicates (or near-duplicates). The rel="canonical" element helps detect and avoid duplicate pages. Frameworks like Scrapy deduplicate visited URLs by default.
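You can extract the canonical link with the standard library’s HTML parser and keep only that URL for deduplication. A minimal sketch:

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Pull the href out of a page's <link rel="canonical"> tag, if any."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag == "link":
            attrs = dict(attrs)
            if attrs.get("rel") == "canonical":
                self.canonical = attrs.get("href")

def canonical_url(html: str):
    """Return the canonical URL declared in the HTML, or None."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical
```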
Handle redirects. Redirection, or forwarding, is a technique that sends users from one URL to another. Unhandled redirects can confuse the scraper and cause slowdowns. Python scraping libraries like Requests usually follow redirects by default but offer an option not to. Web scraping frameworks like Scrapy have redirect middleware to handle them.
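When you disable automatic redirects in Requests (`requests.get(url, allow_redirects=False)`), you have to resolve the Location header yourself. A small helper, assuming you pass in the response’s status code and headers:

```python
from urllib.parse import urljoin

# HTTP status codes that signal a redirect.
REDIRECT_CODES = {301, 302, 303, 307, 308}

def resolve_redirect(status, headers, current_url):
    """If the response is a redirect, return the absolute target URL, else None.

    Handles relative Location values by joining them against the current URL.
    """
    if status in REDIRECT_CODES and "Location" in headers:
        return urljoin(current_url, headers["Location"])
    return None
```

This lets your scraper log where it’s being sent, cap the number of hops, or refuse redirects that leave the target domain.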