How to Prevent Web Scraping By Blocking Proxies Using IP Geolocation

A guest article by Razvan Popescu, Head of Marketing at Abstract API.

If you’re scraping websites, you might already use a proxy server to collect data reliably and anonymously. What about the other side of the scrape, though – what if you want to block proxies from scraping your site? This article will describe how web scraping and proxies work, and how an IP geolocation API can be used to prevent web scraping with proxies.

What Is Web Scraping?

Web scraping is the process of extracting unstructured data from web pages and converting it into a structured format. For example, you might use Python to scrape Google Search results. Another common use case is to scrape up-to-date stock data from a stock market website, structure that data into a CSV, and read values from that CSV to calculate your stock market returns in a Python program.
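As a rough illustration of the "unstructured page in, structured CSV out" idea, here is a minimal sketch that uses only the Python standard library. The HTML snippet, the tickers, and the names TableParser and html_table_to_csv are made up for this example; a real page would need its own parsing logic, and a library like Beautiful Soup would usually be more convenient.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical HTML standing in for a downloaded stock page.
SAMPLE_HTML = """
<table>
  <tr><th>Ticker</th><th>Price</th></tr>
  <tr><td>ABC</td><td>101.50</td></tr>
  <tr><td>XYZ</td><td>42.10</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_csv(html):
    """Turn the table rows found in `html` into CSV text."""
    parser = TableParser()
    parser.feed(html)
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()


print(html_table_to_csv(SAMPLE_HTML))
```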

There is nothing illegal about doing this, but when it begins burdening a company’s web servers, they may block your IP address. Always check a website’s robots.txt file for their expected scraping behavior and etiquette.
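Checking robots.txt can be automated. The sketch below uses Python's built-in urllib.robotparser module; the robots.txt content, the "my-scraper" user agent, and the example.com URLs are placeholders chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one lives at https://<site>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""


def build_parser(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = build_parser(ROBOTS_TXT)
print(rules.can_fetch("my-scraper", "https://example.com/prices"))
print(rules.can_fetch("my-scraper", "https://example.com/private/data"))
print(rules.crawl_delay("my-scraper"))
```

Against a live site, you would call set_url() with the robots.txt address and then read() instead of passing the text in by hand.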

What Is a Proxy?

When an IP address is blocked by a website, the scraper might work around the block by using a proxy server. So, what is a proxy? It’s a third-party server that routes your connections through a different IP address. Remember that IP addresses identify where a connection originates – for example, the router in your house. A proxy makes that connection appear to be coming from another device in another place.

You may have encountered proxies when bypassing your school’s Internet filters back in the day, or using a VPN to stream region-restricted Eurovision song contests. We aren’t condoning these activities, but they rely on the same idea of rerouting a connection through a third-party IP address.

Elements of Successful Web Scraping

A little Python code, some Python libraries (like Beautiful Soup), and an Internet connection are all you need to start basic web scraping. But there are important factors in making your scraping efficient, reliable, and anonymous – that is, successful.

One of the most important factors in web scraping is using a high-quality proxy, or even multiple proxies in a proxy pool, to scale up your scraping operation. A high-quality proxy can take your web scraping projects to the next level:

If you’re scraping without a proxy, when one site blocks your IP, you have to go find another site with the same information.
Proxies increase scraping reliability and volume.
Proxies allow you to view content as it appears if accessed from other places in the world. If you’re scraping location-dependent data, this is very important.
Proxies protect your identity by substituting one of their IPs for your own. Think of it as similar to how APIs allow authenticated users to exchange data through an interface while remaining anonymous to each other. That said, you can provide your contact info in a third-party proxy, if you want businesses you are scraping to be able to contact you.
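To make the idea of a proxy pool concrete, and to show what a site owner is up against, here is a minimal sketch of round-robin proxy rotation using the standard library. The proxy addresses are placeholders from a documentation-only IP range, and the ProxyRotator class is a name invented for this example. No request is actually sent.

```python
import itertools
import urllib.request

# Placeholder proxy addresses (documentation range, not real proxies).
PROXY_POOL = [
    "http://192.0.2.10:8080",
    "http://192.0.2.11:8080",
    "http://192.0.2.12:8080",
]


class ProxyRotator:
    """Hand out proxies from a pool in round-robin order."""

    def __init__(self, proxies):
        self._cycle = itertools.cycle(proxies)

    def next_proxy(self):
        return next(self._cycle)

    def build_opener(self):
        """Return (proxy, opener); the opener routes requests via that proxy."""
        proxy = self.next_proxy()
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        return proxy, urllib.request.build_opener(handler)


rotator = ProxyRotator(PROXY_POOL)
# Each call picks the next proxy, so consecutive requests leave from different IPs.
used = [rotator.build_opener()[0] for _ in range(4)]
print(used)
```

From the website's point of view, four requests made this way would appear to come from three different addresses, which is why blocking a single IP rarely stops a determined scraper.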

Why Blocking Proxies Is Key to Preventing Web Scraping

As stated above, scraping without proxies is inefficient, unsafe, and doesn’t scale. If someone is serious about web scraping, they’re surely using a high-quality proxy pool.

Proxy servers are a powerful tool. And while collecting public web data isn’t bad in itself, reckless web scraping can cause a lot of damage to websites.

So, if we look at the other end of the process, at the website that is being scraped, what’s the best way for us to protect our resources from bad traffic? We can use proxy detection and IP geolocation to root out users scraping with proxies and block them.

What Is Proxy Detection?

Proxy detection is – you guessed it – the set of techniques a website owner uses to identify a proxy connection. The website can check each incoming IP address against a list of flagged addresses and block matching traffic. If the scraper uses a limited number of IPs, proxy detectors learn to block them, but proxy services will simply switch to new IP ranges.
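A blocklist check of this kind might look like the sketch below, which uses Python's ipaddress module. The flagged ranges are documentation-only placeholders and the function name is_flagged is an assumption for this example; a real list would come from your own logs or a reputation feed and would need regular updates.

```python
import ipaddress

# Placeholder flagged ranges (documentation blocks, not real proxy ranges).
FLAGGED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_flagged(ip, networks=FLAGGED_NETWORKS):
    """Return True if `ip` falls inside any flagged network."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in networks)


print(is_flagged("192.0.2.44"))   # inside the first flagged range
print(is_flagged("203.0.113.9"))  # not in any flagged range
```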

You can also check the request headers for common proxy entries like X-Forwarded-For, but this only catches the most basic proxies, and we’re trying to block professionals.
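A header check can be sketched in a few lines. Only X-Forwarded-For is named in this article; the other header names in the tuple are commonly seen proxy-related headers added here as an assumption, and proxy_headers_present is a hypothetical helper. Note that if your own site sits behind a load balancer or CDN, that layer may add some of these headers legitimately.

```python
# Header names often added by forwarding proxies (lowercase for comparison).
# Only x-forwarded-for comes from the article; the rest are assumptions.
PROXY_HEADERS = ("x-forwarded-for", "via", "forwarded", "x-real-ip", "proxy-connection")


def proxy_headers_present(headers):
    """Return the sorted proxy-related header names found in `headers`."""
    lowered = {name.lower() for name in headers}
    return sorted(name for name in PROXY_HEADERS if name in lowered)


request_headers = {
    "User-Agent": "example-client/1.0",
    "X-Forwarded-For": "203.0.113.9",
    "Via": "1.1 example-proxy",
}
print(proxy_headers_present(request_headers))
```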

How to Block Proxies Using IP Geolocation

To detect a proxy using IP geolocation, remember that IP addresses carry location information with them, announcing where a connection takes place. A proxy server makes that connection appear to be coming from a different geographic location.

So, if we are trying to identify a proxy server, we could use the IP geolocation API from Abstract to test this. You can try it for free as soon as you sign up.

Let’s try testing a request in the browser:

				
https://ipgeolocation.abstractapi.com/v1/?api_key={YOUR API KEY}

It will return our IP, our geographic location, and a lot of other interesting data:

				
{
    "ip_address": "174.49.204.134",
    "city": "York",
    "city_geoname_id": 4562407,
    "region": "Pennsylvania",
    "region_iso_code": "PA",
    "region_geoname_id": 6254927,
    "postal_code": "17402",
    "country": "United States",
    "country_code": "US",
    "country_geoname_id": 6252001,
    "country_is_eu": false,
    "continent": "North America",
    "continent_code": "NA",
    "continent_geoname_id": 6255149,
    "longitude": -76.6653,
    "latitude": 39.9552,
    "security": {
        "is_vpn": false
    }
}

If we engage a VPN and try the same test request, we get different results. VPNs aren’t the same thing as proxies, but they provide a similar outcome.

				
{
    "ip_address": "23.105.165.55",
    "city": "Manassas",
    "city_geoname_id": 4771401,
    "region": "Virginia",
    "region_iso_code": "VA",
    "region_geoname_id": 6254928,
    "postal_code": "20110",
    "country": "United States",
    "country_code": "US",
    "country_geoname_id": 6252001,
    "country_is_eu": false,
    "continent": "North America",
    "continent_code": "NA",
    "continent_geoname_id": 6255149,
    "longitude": -77.4918,
    "latitude": 38.7493,
    "security": {
        "is_vpn": false
    }
}
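The same request can be made from code rather than the browser. The sketch below uses only the standard library and the endpoint shown earlier. The ip_address query parameter (for looking up a visitor's IP instead of your own) is an assumption that should be checked against Abstract's documentation, and the helper names build_lookup_url, lookup, and summarize are invented for this example. The demo at the bottom runs summarize on a trimmed copy of the response above, so it makes no network call.

```python
import json
import urllib.parse
import urllib.request

API_ENDPOINT = "https://ipgeolocation.abstractapi.com/v1/"


def build_lookup_url(api_key, ip_address=None):
    """Build the request URL; without ip_address the API reports the caller's IP."""
    params = {"api_key": api_key}
    if ip_address is not None:
        # Assumption: the API accepts an ip_address query parameter.
        params["ip_address"] = ip_address
    return API_ENDPOINT + "?" + urllib.parse.urlencode(params)


def lookup(api_key, ip_address=None, timeout=5):
    """Call the API and return the decoded JSON response (needs a real key)."""
    url = build_lookup_url(api_key, ip_address)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def summarize(record):
    """Keep only the fields a blocking decision is likely to need."""
    security = record.get("security") or {}
    return {
        "ip_address": record.get("ip_address"),
        "country_code": record.get("country_code"),
        "city": record.get("city"),
        "is_vpn": bool(security.get("is_vpn", False)),
    }


# Trimmed copy of the sample response shown in this article.
sample = {
    "ip_address": "23.105.165.55",
    "city": "Manassas",
    "country_code": "US",
    "security": {"is_vpn": False},
}
print(summarize(sample))
```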

Now, we can use this IP geolocation API to see where incoming traffic originates, and make blocking decisions based on that information. Some strategic considerations:

We might block IPs coming from countries with high fraud activity.
We might block requests geographically outside of our usual customer base.
We might take this data and find the proxy traffic isn’t doing anything suspicious or resource-consuming.
We might use this data to geo-target our ad campaigns. (This company in that city is disrupting everything!)
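The first three considerations above could be combined into a simple policy function like the sketch below. The country sets, the "XX" placeholder code, the three action names, and the decide function are all assumptions made for illustration; which countries to block and how to treat flagged traffic is a business decision, not something the API decides for you.

```python
# Hypothetical policy inputs; replace with values that fit your own traffic.
BLOCKED_COUNTRIES = {"XX"}       # placeholder for high-fraud country codes
SERVED_COUNTRIES = {"US", "CA"}  # placeholder for the usual customer base


def decide(record):
    """Map a geolocation record to 'block', 'monitor', or 'allow'."""
    security = record.get("security") or {}
    country = record.get("country_code")
    if country in BLOCKED_COUNTRIES:
        return "block"
    if security.get("is_vpn") or country not in SERVED_COUNTRIES:
        # Unusual but not necessarily harmful: watch before blocking.
        return "monitor"
    return "allow"


print(decide({"country_code": "US", "security": {"is_vpn": False}}))
print(decide({"country_code": "XX", "security": {"is_vpn": False}}))
print(decide({"country_code": "US", "security": {"is_vpn": True}}))
```

Returning "monitor" rather than blocking outright reflects the third point in the list: proxy or out-of-region traffic may turn out to be harmless.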

Can All Proxies Be Detected and Blocked?

The proxy cat-and-mouse game has been going on for a long time, and probably won’t stop. Proxies aren’t illegal, but a lot of the discussion around them makes them sound like only credit card scammers and Anonymous use them. They can be used to responsibly anonymize traffic online, but as with any tool, they sometimes fall into the hands of bad actors.

Considering that bad bot activity now accounts for 39% of internet traffic, it’s a good time to know who is accessing your hardware, and whether it’s impacting your customers. IP geolocation data is a great resource to collect and act upon.