In-Depth Look into Popular Proxy APIs (Web Unblockers)
This investigation takes an in-depth look at five popular proxy APIs, including Bright Data’s Web Unlocker and Oxylabs’ Web Unblocker. We compare their features, pricing models, and ability to unblock websites protected by major anti-bot systems such as DataDome or Shape.
- Proxy APIs integrate using the hostname:port format and remove the burden of managing proxy servers, running headless browsers, or dealing with blocks.
- All five participants managed to open protected websites over 90% of the time. Shape’s anti-bot system gave them the most trouble, followed by the photo-focused social media network.
- Proxy APIs allow modifying the request to an extent, such as sending custom headers. Several APIs offer structured data functionality, and proxy vendors like Bright Data and Oxylabs support precise filtering options. Their main drawback is limited interaction with dynamic pages.
With websites getting harder to scrape, proxy providers have developed a new type of service called proxy API. It integrates like a regular proxy server. But in the backend, the API combines multiple IP types, an intelligent proxy management layer, and website unblocking mechanisms.
If a request fails (encounters a block or CAPTCHA), the API adjusts its configuration and tries again until it succeeds. Some tools even have smart error handling to identify non-obvious failures like an empty 200 response.
That said, you shouldn’t treat a proxy API as just a CAPTCHA solver. Though these tools do remove the need to mark traffic lights and hold buttons, they usually achieve this by avoiding the challenge rather than brute-forcing through it. Pages configured to show a CAPTCHA every time may still need your attention.
How do you use a proxy API? The process differs little from any other rotating proxy server: there’s a hostname and port with authentication details. Then, you can append various parameters – such as location preferences – to the credentials or send them as a custom header. The proxy API intercepts the request and fetches the page based on your configuration.
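As a rough illustration, here’s how such an integration might look using Python’s standard library. The endpoint, port, credentials, and the `country-us` username parameter are all hypothetical – the exact syntax varies by provider:

```python
import urllib.request

# Hypothetical endpoint and credentials - the exact hostname, port, and
# parameter syntax (here, a country preference in the username) vary by provider.
USERNAME = "customer-user-country-us"
PASSWORD = "secret"
PROXY = f"http://{USERNAME}:{PASSWORD}@unblock.example.com:8000"

# Route all traffic through the proxy API; in the backend it picks the IP,
# adjusts the fingerprint, and retries on blocks before returning a response.
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({"http": PROXY, "https": PROXY})
)

def fetch(url: str, timeout: float = 60) -> bytes:
    """Fetch a page through the proxy API and return the raw body."""
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

From the client’s perspective, this is indistinguishable from any rotating proxy server – all the unblocking logic lives behind that single endpoint.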
In our previous research on web scraping APIs, we noticed that some had the ability to integrate as a proxy. But lately proxy APIs have been emerging as a separate category – whether for marketing purposes or based on real technical considerations. You’ll often find them called unblockers or smart proxy managers.
We approached multiple companies with a proposal to participate in the research. Five providers agreed. For transparency, we disclosed in advance which websites we’d test, so the providers could prepare accordingly.
These are the participants:
- Bright Data – one of the biggest companies in the field of data collection. We tested its Web Unlocker, together with SERP API for Google. While these are two separate tools, Bright Data has a unified system that lets customers use any product without getting a new subscription. Bright Data has offered Web Unlocker for several years now, and it was the company’s main growth driver in 2022.
- Crawlbase – a well-established provider of web scraping tools with over 45,000 paying customers. We tested Crawling API – primarily a web scraping API with a proxy mode. We chose it over Crawlbase’s designated proxy API, Smart Proxy, because Crawling API allows paying as you go and doesn’t lock features behind pricing tiers. Otherwise, there’s little difference between the two – they even share the same parameters.
- Oxylabs – Bright Data’s main competitor in the proxy server market with excellent infrastructure. We tried Web Unblocker – a not-exactly-new proxy API that looks new because it rebranded from Next-Gen Residential Proxies in early 2023.
- Smartproxy – another major proxy provider looking to branch out into web scraping tools. We tested Site Unblocker – a very recent proxy API launched in July 2023.
- Zyte API – a long-standing data collection company that focuses on e-commerce and maintains several popular open source tools like Scrapy. We wanted to use Smart Proxy Manager, which is one of the oldest proxy APIs. But it turns out Zyte is sunsetting the tool. So we ended up testing Zyte’s API after an unintentional bait-and-switch. As a silver lining, Zyte plans to add a proxy mode to Zyte API soon.
We used a custom asynchronous Python script to send requests. We ran multiple tests with the US versions of target websites, making 1 request per second. Our computer was located in Germany.
We made around 1,800 requests for each target using one proxy API.
To verify that a request was successful, we looked at its response code, size, and the page’s title. Some APIs returned their own status codes in addition to website responses; we didn’t use them.
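A minimal sketch of that verification step – the function name and the size threshold are our own illustration, not the exact criteria from our script:

```python
import re

def looks_successful(status: int, body: bytes, expected_title: str) -> bool:
    """Heuristic success check based on response code, size, and page title.
    The 5 kB threshold is illustrative, not the study's exact figure."""
    if status != 200:
        return False
    if len(body) < 5_000:  # catches empty or stub 200-coded responses
        return False
    match = re.search(rb"<title[^>]*>(.*?)</title>", body, re.I | re.S)
    if not match:
        return False
    title = match.group(1).decode("utf-8", "replace").strip()
    return expected_title.lower() in title.lower()
```

The size and title checks matter because a block page often comes back with a 200 code and a tiny or generic body.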
We wanted to see how good the APIs are at their main job – unblocking challenging websites. We selected seven targets protected by various anti-bot systems:
| Target | Anti-bot protection | |
|---|---|---|
| Amazon | In-house CAPTCHA, returns empty 200-coded responses | ❌ |
| Photo-focused social media network | In-house protection, asks for login if triggered | ✅ (we accessed the web interface, not the GraphQL endpoint) |
| Walmart | Akamai, FingerprintJS, PerimeterX, ThreatMetrix | ❌ |
Which Target Was the Hardest to Unblock?
| Target | Avg. success rate | Avg. response time |
|---|---|---|
| Photo-focused social media network | 92.19% | 19.74 s |
How Did the APIs Do?
All participants managed to succeed at least nine times out of ten, so the answer is – pretty well:
- Zyte had a particularly strong showing, both in terms of success rate and response time. However, the provider did have to fix its scraper for the photo-focused social media network mid-test to perform as well as it did.
- Oxylabs and Smartproxy prioritized success rate and would’ve aced the tests if not for Nordstrom. However, their headless implementations were among the slowest.
- Crawlbase was relatively fast, but it failed more requests across the board, even with non-problematic targets like Google.
Breakdown by Individual Target
Proxy APIs are designed to be a drop-in replacement for proxy servers. Accordingly, their main expected features are location targeting and the ability to establish sessions.
As the table below shows, all participants support both. Oxylabs, Smartproxy, and Bright Data are foremost proxy server providers, so they can afford to offer granular location settings that reach coordinate and ISP level.
| | | | | | |
|---|---|---|---|---|---|
| Localization | Countries (all), states, cities, coordinates | Countries (all), states, cities, ASNs | Countries (all), states, cities, coordinates | Countries (50) | Countries (26) |
| Proxy selection | Automated | Automated | Automated | Automated | Automated, with an option to route via Tor |
But proxy APIs aren’t just regular proxies. They mediate all communication with the target, which means you’re inevitably giving away some level of control. Let’s see how you can interact with the APIs and what’s out of their reach. We’ll omit Zyte for now, as it’s unclear which features will carry over to the proxy format.
| | Oxylabs | Bright Data | Smartproxy | Crawlbase |
|---|---|---|---|---|
| Request modifiers | Custom headers, cookies | Custom parameters for search engines | Custom headers, cookies | Custom headers, cookies |
| Page interactions | POST requests | ❌ | POST requests | POST requests, wait for load, scroll |
| Other | | Asynchronous requests, parsers for search engines | | CSS selectors, parsers for select websites |
Bright Data’s specialized SERP API is a different beast – it offers custom parameters for building the request, such as search query, pagination, and location. It can also parse various properties of Google and other search engines for structured data. Most of this is achieved by appending parameters to the URL.
Search engines aside, the proxy APIs of Oxylabs and Smartproxy are more versatile: they accept custom cookies and request headers. There’s also an option to send POST requests with form or other data. And, you can opt to receive a screenshot instead of the HTML source.
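For illustration, a POST with custom headers and cookies routed through a proxy API might be built like this. The proxy endpoint and credentials are hypothetical, and whether your headers override the API’s generated browser fingerprint depends on the provider:

```python
import urllib.parse
import urllib.request

# Hypothetical proxy API endpoint; real hostnames and ports vary by provider.
PROXY = "http://user:secret@unblock.example.com:8000"
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({"http": PROXY, "https": PROXY})
)

def build_form_post(url: str, form: dict) -> urllib.request.Request:
    """Build a POST request carrying custom headers and cookies, which the
    proxy API forwards to the target alongside its own unblocking logic."""
    return urllib.request.Request(
        url,
        data=urllib.parse.urlencode(form).encode(),
        headers={
            "Referer": "https://example.com/",  # custom header
            "Cookie": "session=abc123",         # custom cookie
        },
        method="POST",
    )

# To actually send it: opener.open(build_form_post(url, form), timeout=60)
```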
Crawlbase’s Smart Proxy inherits the parameters from the provider’s other APIs. So in addition to achieving everything mentioned above, you can also interact with the page by waiting for it to load or scrolling down. Like Bright Data, Crawlbase offers parsers for several search engines, social media, and e-commerce websites, with an option to extract particular CSS elements from any website.
Main Limitation of Proxy APIs
Their main limitation is interaction with dynamic pages: beyond the basic actions listed above, a proxy API simply returns the page it fetched. You’d think this could be solved by simply plugging the proxy API into Puppeteer or another headless library. But with a few exceptions, the APIs are incompatible by design.
Though some providers categorize proxy APIs as proxy servers (and not web scrapers), it doesn’t mean they follow the same conventions. Let’s have a look at the table below:
| | Oxylabs | Bright Data | Smartproxy | Crawlbase | Zyte API |
|---|---|---|---|---|---|
| Format | Traffic | Successful requests | Traffic | Successful requests | Successful requests |
| Modifiers | ❌ | Premium domains, city & ASN targeting | ❌ | JS rendering | Dynamically adjusted by target difficulty, JS rendering |
| Max price difference | x1 | x2 + $4/CPM | x1 | x2 | x30+ |
There are several things to take away. First, there’s no single pricing format. Three providers charge per successful request, which is the standard for web scraping tools. Oxylabs and Smartproxy, on the other hand, continue the proxy paradigm, where residential and mobile proxy networks usually meter traffic.
Zyte’s pricing is interesting in general. The provider adjusts request cost dynamically, based on the website’s difficulty and optional parameters. This means your rates may become more or less expensive over time, even when accessing the same website. There’s a calculator on the dashboard where you can check how much a request will cost.
Bright Data also segments pricing based on targets, but the approach here is simpler – you can enable a list of premium domains like Zillow or Nordstrom that cost an additional $4 per 1,000 requests.
Which Approach Makes More Sense?
It depends on the provider’s rates. But given that proxy APIs are positioned as an upsell to residential proxies, they tend to be pretty expensive when metered by traffic. In that case, we’re partial to request-based pricing, especially when the target’s pages are large.