Comparing Popular Web Scraping & Proxy APIs
Our report benchmarks nearly a dozen popular unblockers and web scraping APIs. These remote scrapers simplify web data collection by overcoming CAPTCHAs, JavaScript challenges, and other roadblocks erected by anti-bot systems.
Though growing in popularity every year, web scraping APIs have become especially relevant with the rise of AI models, the locking down of online platforms, and the commercialization of bot protection.
Our main goal is to see how well the APIs are able to unblock protected websites in late 2024 (for earlier reports refer to: comparison of web scraping APIs (2023), comparison of proxy APIs (2023)). We also take a look at their features and pricing strategies to get a well-rounded view of the market.
Summary
- Our list of participants included 11 API providers, which we tested on 10 protected websites at a rate of 10 requests per second.
- Five APIs managed to open all targets consistently, while the others failed to unblock between one and five websites. Oxylabs, Zyte, and Bright Data had the highest average success rate of around 98%.
- Zyte’s API worked extremely fast, managing to unblock all targets without headless browsers. Bright Data’s Web Unlocker, though second to last in speed, was the only API that didn’t fail a single run.
- G2 (Cloudflare) proved to be the hardest target by success rate, while the largest number of participants – five – failed to unblock Allegro (DataDome).
- Compared to proxy APIs (also called web unblockers), web scraping APIs have more features: asynchronous delivery, data parsing capabilities, and specialized endpoints. Some also include a proxy mode, making the distinction arbitrary.
- We’re seeing more providers release specialized endpoints for popular targets. In addition, several AI-based parsing approaches have appeared, ranging from models trained on page types to AI-generated parsing schemas.
- Credit-based pricing models are usually chosen by providers that service smaller customers; while extremely cheap for basic websites, they impose huge multipliers when accessing challenging targets.
- Out of the business-oriented providers, Zyte is very hard to beat on price for unblocking tasks. For targets that rely on JavaScript or need special functionality (e.g., localized Google or Amazon queries), Smartproxy and Oxylabs offer a compelling balance between performance and cost.
Participants
Our research includes 11 major providers of web scraping and proxy APIs (often called web unblockers). The tools technically form different product categories, but we decided not to separate them. Both tend to use the same tech, and web scraping APIs sometimes have a proxy mode as one of the integration formats, which further blurs the distinction.
Most participants are well known in the industry, though not necessarily for their scraping infrastructure. Here’s the full list:
Participant | Tested products | Target audience
---|---|---
Bright Data | Web Unlocker, SERP API | Companies & enterprise |
Infatica | Web, SERP, E-Commerce APIs | Individuals & small businesses |
NetNut | Web Unblocker | Companies & enterprise |
Nimble | Web, SERP, E-Commerce APIs | Companies & enterprise |
Oxylabs | Web Scraper API | Companies & enterprise |
Rayobyte | Scraping Robot | Individuals & small businesses |
ScraperAPI | Scraping API | Individuals & small businesses |
Scrapingdog | Web Scraping API | Individuals & small businesses |
Smartproxy | Web, SERP, E-Comm, Social APIs | Small to medium businesses |
SOAX | Web Unblocker | Companies & enterprise |
Zyte | Zyte API | Individuals to enterprise |
Methodology
We gave all participants the methodology doc in advance. Some actively monitored our progress, making adjustments to their scrapers on the fly. This is fine, as web scraping is a dynamic process, and it should be approached as such. Hopefully, we also helped to improve the success rate for actual customers.
We chose 10 targets based on their popularity and bot protection system. Our goal was to try the scrapers with all major anti-bot vendors.
Target | Bot protection
---|---
Allegro (products) | DataDome |
Amazon (products) | In-house |
Canadagoose (products) | Kasada |
G2 (product reviews) | Cloudflare |
Google (SERPs) | In-house |
Indeed (location directories) | Shape |
Instagram (HTML profiles) | In-house |
Lowe’s (products) | Akamai |
Safeway (products) | Imperva |
Walmart (products) | PerimeterX |
There are some caveats to consider:
- Anti-bot systems may have different levels of protection based on the website (or even categories of the same website).
- Some bot protection vendors focus on securing sensitive endpoints (such as internal APIs or login pages), so they may not show up in full force against simple collection of public content.
We ran at least three tests for each target throughout several weeks. We fetched ~6,000 unique URLs, navigating directly to the page. The rate was 10 requests per second, with a timeout of 600 seconds. This is enough to trigger bot protection systems – and, as we’ll see, seriously tax some of the scrapers.
We used a custom Python script – its function was to simply send the request to the scraper and receive the response, measuring the time it took to reach us. Our server was located in the US.
Participants were free to suggest the optimal parameters for the targets, and some did. Otherwise, we used our own discretion, starting with the simplest configuration and enabling optional features (such as premium proxies and headless browsers) if we couldn’t unblock or load the valuable content.
We verified a request’s success by checking the response code and HTML size. The latter was necessary, as some websites (such as Safeway) tend to return 200-coded responses without data.
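For illustration, here's a minimal sketch of what such a script might look like – the endpoint, API key, and size threshold below are hypothetical placeholders, not any participant's actual API:

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

# Hypothetical endpoint and key – each provider uses its own request format.
API_ENDPOINT = "https://api.example-scraper.com/v1/scrape"
API_KEY = "YOUR_API_KEY"
MIN_HTML_SIZE = 50_000  # bytes; guards against 200-coded responses without data


def fetch(url: str) -> dict:
    """Send one target URL through the scraping API, record timing and success."""
    started = time.monotonic()
    try:
        response = requests.post(
            API_ENDPOINT,
            json={"api_key": API_KEY, "url": url},
            timeout=600,  # the timeout we used in the benchmark
        )
        ok = response.status_code == 200 and len(response.content) >= MIN_HTML_SIZE
        return {"url": url, "ok": ok, "status": response.status_code,
                "seconds": time.monotonic() - started}
    except requests.RequestException as error:
        return {"url": url, "ok": False, "error": str(error),
                "seconds": time.monotonic() - started}


def run_benchmark(urls: list[str], workers: int = 300) -> list[dict]:
    """Fan URLs out across worker threads; pacing to exactly 10 req/s is omitted."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))
```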
Some providers imposed concurrency limits that could potentially affect our scraping rate:
- ScraperAPI’s largest public plan has a limit of 100 concurrent threads, which isn’t always enough for 10 requests/second: at that rate, responses averaging over 10 seconds already keep more than 100 requests in flight, especially when headless browsers are involved.
- Infatica and Scrapingdog have the same restriction.
- Zyte’s default rate limit is 500 requests per minute (~8.3/second).
ScraperAPI, Scrapingdog, and Zyte lifted their restrictions for us. Infatica wasn’t able to, forcing us to scrape most websites at 1 req/s. We also encountered SOAX’s internal limits and decided to stick with ~5 req/s for more complex targets.
Benchmark Results
The results show the best run each API had with each target. We’ll provide comments to give you more context where necessary.
Overall Performance
Provider | Avg. success rate
---|---
Oxylabs | 98.50%
Zyte | 98.38%
Bright Data | 97.90%
Smartproxy | 96.29%
Nimble | 95.48%
NetNut | 80.82%
SOAX | 69.30%
ScraperAPI | 67.72%
Scrapingdog | 43.84%
Infatica | 38.40%
Rayobyte | 37.65%
Provider | Avg. response time
---|---
Zyte | 6.61 s
NetNut | 9.71 s
Smartproxy | 10.91 s
Scrapingdog | 10.92 s
Nimble | 13.01 s
SOAX | 13.41 s
Oxylabs | 13.45 s
ScraperAPI | 15.39 s
Infatica | 17.15 s
Bright Data | 22.08 s
Rayobyte | 26.24 s
Five providers managed to open all targets more or less consistently, which is an excellent result. Oxylabs and Zyte stand out for their overall success rate; and though the averages don’t show it, Bright Data was amazingly dependable, never failing a single test.
The rest had at least one target that gave them trouble. But you shouldn’t discount these APIs based on a single aggregate number: NetNut, for example, performed flawlessly on most websites aside from Lowe’s and Safeway.
In terms of response time, Zyte’s API was a speed demon, beating others by up to four times. The provider made some adjustments during the process, and by the end, it could somehow open all targets without requiring JavaScript rendering.
Bright Data obviously prioritized unblocking success, and so it performed slower than expected. We believe that a better way to scale its APIs would be through more parallel requests, which our testing parameters didn’t fully exploit.
Hardest Targets to Unblock
We omitted Infatica’s and SOAX’s results from the success rate column, as they were tested under lower rate limits (one and five requests per second, respectively).
Target | Avg. success rate | Participants with over 80% failure rate
---|---|---
G2 | 60.39% | 4 |
Lowe’s | 67.17% | 3 |
Allegro | 68.32% | 5 |
Safeway | 68.93% | 4 |
Canadagoose | 75.73% | 4 |
Indeed | 81.40% | 1 |
Instagram | 89.82% | 0
Google | 93.77% | 0
Walmart | 94.54% | 1 |
Amazon | 96.12% | 0 |
G2 (Cloudflare) proved the hardest to unblock judging by the average success rate across providers. However, it was Allegro that the largest number of participants failed to open consistently – its average is lifted by the providers that did unblock it nearly perfectly.
On the other hand, most APIs managed to open Google and Amazon nearly perfectly. As major web scraping targets, they’re the baseline for any commercial data collection service.
Breakdown by Individual Target
Allegro (DataDome)

Provider | Success rate | Response time
---|---|---
Oxylabs | 100% | 1.96 s |
Smartproxy | 100% | 2.38 s |
Bright Data | 99.90% | 3.68 s |
Nimble | 99.80% | 12.06 s |
NetNut | 99.62% | 6.18 s |
Zyte | 99.13% | 4.80 s |
Scrapingdog | 7.88% | 10.85 s
ScraperAPI | 7.01% | 26.68 s |
SOAX | 2.12% | 6.30 s |
Rayobyte | 1.54% | 7.65 s |
Infatica | Failed to unblock |
As one participant put it, Allegro is very hard to unblock. The website uses DataDome and is even featured as a success case on the anti-bot vendor’s website.
In reality, we saw one of two extremes: an API either opened Allegro nearly perfectly, or it failed completely. The same pattern repeated across our tests. All in all, Oxylabs and Smartproxy performed particularly well here.
Amazon (in-house)

Provider | Success rate | Response time
---|---|---
ScraperAPI | 100% | 3.79 s |
Oxylabs | 100% | 5.08 s |
Bright Data | 99.85% | 5.88 s |
Smartproxy | 99.83% | 5.05 s |
Nimble | 99.82% | 6.39 s |
Zyte | 99.80% | 3.26 s |
NetNut | 99.73% | 6.21 s |
SOAX | 99.67% | 12.11 s |
Infatica | 94.66% | 8.85 s |
Rayobyte | 87.86% | 12.93 s |
Scrapingdog | 78.23% | 13.97 s
Amazon is the website for web scraping, so unblocking it is a must for any self-respecting service. As such, Amazon proved to be the least problematic target.
We saw consistent results with little deviation across runs. Though the difference was minimal, ScraperAPI had the best showing.
Canadagoose (Kasada)

Provider | Success rate | Response time
---|---|---
NetNut | 99.90% | 7.01 s |
Zyte | 99.88% | 15.26 s |
ScraperAPI | 99.79% | 3.58 s |
Bright Data | 99.60% | 4.45 s |
Oxylabs | 98.87% | 4.09 s |
Nimble | 90.73% | 11.85 s |
Smartproxy | 79.88% | 6.22 s |
Rayobyte | 12.95% | 56.83 s |
SOAX | Failed to unblock | |
Infatica | Failed to unblock | |
Scrapingdog | Failed to unblock |
The Canada Goose store lists only a few hundred products, but it’s visited by hundreds of thousands of people every month. The website uses Kasada, which proved a hard nut to crack: the participants either had a bypass method, or they failed to unblock this target at all.
Like in The Web Scraping Club’s benchmark, NetNut’s unblocker had the best success rate, though it wasn’t the fastest. Some APIs had significant variance between runs: Nimble, ScraperAPI, and Smartproxy failed several tests and then fixed their scrapers for others.
G2 (Cloudflare)

Provider | Success rate | Response time
---|---|---
NetNut | 99.80% | 4.79 s |
SOAX | 99.38% | 13.75 s |
Bright Data | 91.74% | 26.80 s |
Zyte | 90.12% | 6.71 s |
Oxylabs | 87.35% | 27.45 s |
Smartproxy | 83.95% | 6.92 s |
Nimble | 69.11% | 39.23 s |
Scrapingdog | 19.80% | 3.33 s
ScraperAPI | 1.36% | 22.29 s |
Rayobyte | 0.32% | 94.20 s |
Infatica | Failed to unblock |
G2, a major company review website, is protected by Cloudflare. We found it to be the most challenging target, giving even the most solid APIs a run for their money.
Again, NetNut showed the best performance, both in success rate and response time. As with Canada Goose, the results weren’t always consistent between runs for more than one participant.
Google (in-house)

Provider | Success rate | Response time
---|---|---
Zyte | 100% | 0.81 s |
NetNut | 100% | 2.10 s |
Nimble | 100% | 3.24 s |
Smartproxy | 100% | 5.37 s |
Oxylabs | 99.98% | 4.79 s |
Scrapingdog | 99.97% | 2.93 s
Bright Data | 99.86% | 10.12 s |
Infatica | 95.07% | 2.44 s |
SOAX | 94.13% | 8.70 s |
Rayobyte | 92.20% | 4.49 s |
ScraperAPI | 51.93% | 5.83 s |
Google is a must for any web scraping API. The search engine is protected by the infamous reCAPTCHA, which quickly rate limits suspicious visitors. Still, it proved no challenge for all but one API.
We’re particularly impressed with Zyte’s performance. Zyte API not only achieved a perfect success rate, but it returned requests in under one second – much faster than the rest.
Indeed (Shape)

Provider | Success rate | Response time
---|---|---
NetNut | 100% | 2.52 s |
Smartproxy | 100% | 3.38 s |
Bright Data | 100% | 4.67 s |
Oxylabs | 99.88% | 3.69 s |
Infatica | 99.84% | 3.12 s |
Nimble | 99.76% | 10.80 s |
Zyte | 99.53% | 10.85 s |
SOAX | 98.92% | 12.84 s |
ScraperAPI | 98.80% | 5.02 s |
Scrapingdog | 25.46% | 20.03 s |
Rayobyte | 9.19% | 21.51 s |
Contrary to our expectations, Indeed wasn’t a hard target for the scrapers. The website employs Shape, a notoriously hard anti-bot system, but we either failed to trigger it or Indeed is using a lenient configuration.
In any case, at least five providers had excellent results, making it hard to single out a winner. The outcomes were similar throughout all runs, with the exception of ScraperAPI.
Instagram (in-house)

Provider | Success rate | Response time
---|---|---
Nimble | 99.97% | 7.01 s |
SOAX | 99.73% | 8.96 s |
Oxylabs | 99.55% | 27.46 s |
Smartproxy | 99.48% | 23.46 s |
Zyte | 99.13% | 2.63 s |
Bright Data | 96.61% | 55.04 s |
NetNut | 96.21% | 25.31 s |
Infatica | 93.04% | 20.40 s |
ScraperAPI | 79.33% | 21.90 s |
Scrapingdog | 75.36% | 8.83 s
Rayobyte | 62.75% | 13.63 s |
Instagram is another major source of web data, though TikTok has probably started challenging it in popularity by now. The social media network uses its own bot protection system that redirects suspicious users to a login page. In our tests, however, Instagram didn’t cause big issues for most participants.
Overall, Nimble’s results look the best on paper. It’s also interesting that Zyte adjusted its scraper in the process, and our last test ran successfully without JavaScript rendering enabled. As a result, Zyte’s response time is mighty impressive.
Lowe’s (Akamai)

Provider | Success rate | Response time
---|---|---
Zyte | 100% | 17.78 s |
Smartproxy | 99.98% | 24.20 s |
SOAX | 99.83% | 14.16 s |
Nimble | 99.81% | 18.56 s |
Oxylabs | 99.75% | 29.58 s |
Bright Data | 99.14% | 75.61 s |
ScraperAPI | 63.13% | 34.45 s |
NetNut | 27.00% | 16.09 s |
Scrapingdog | 9.90% | 23.31 s
Rayobyte | 5.79% | 39.36 s |
Infatica | 1.40% | 50.93 s |
Lowe’s is a decently popular target that employs Akamai’s bot protection system. It brought down a third of the participants, including NetNut, which came out strong against other anti-bots.
On the other hand, six APIs succeeded over 99% of the time, which is a great result.
Safeway (Imperva)

Provider | Success rate | Response time
---|---|---
Zyte | 100% | 1.65 s |
Smartproxy | 99.81% | 28.36 s |
Oxylabs | 99.69% | 27.57 s |
Nimble | 95.95% | 9.82 s |
Bright Data | 92.33% | 29.33 s |
ScraperAPI | 75.84% | 25.32 s |
Scrapingdog | 50.07% | 2.61 s
Rayobyte | 6.61% | 2.09 s |
NetNut | 0.05% | 11.55 s |
SOAX | 0.04% | 27.31 s |
Infatica | Failed to unblock |
Safeway, the U.S. supermarket chain, is protected by Imperva and imposes aggressive geo-restrictions outside North America. The website isn’t a very popular target, so most participants found it tricky, requiring several runs to adjust.
All in all, Zyte’s performance looks amazing on paper, but it was Bright Data that ensured consistent results throughout all tests.
Walmart (PerimeterX)

Provider | Success rate | Response time
---|---|---
Smartproxy | 99.98% | 3.80 s |
ScraperAPI | 99.98% | 5.04 s |
Bright Data | 99.98% | 5.20 s |
Oxylabs | 99.88% | 2.84 s |
Nimble | 99.88% | 11.12 s |
SOAX | 99.25% | 16.58 s |
Rayobyte | 97.32% | 9.68 s |
Zyte | 96.22% | 2.31 s |
NetNut | 85.91% | 15.68 s |
Scrapingdog | 71.70% | 12.46 s
Infatica | Failed to unblock |
Though probably eclipsed by Amazon, Walmart is a major e-commerce data source. It tends to juggle anti-bot systems but is generally associated with PerimeterX.
Most participants didn’t find Walmart problematic. However, we did see Nimble’s and Rayobyte’s (Scraping Robot’s) success rates crash after PerimeterX’s update in late August.
Other Observations
- When accessing protected targets, commercial APIs can be brittle. Less popular websites may need individual attention even if the provider has a general bypass for the underlying bot protection system. Even popular targets like Walmart or G2 may temporarily break after major updates.
- Providers use different approaches for unblocking the same websites. Nimble relies on what it calls browserless drivers, which render JavaScript without invoking traditional headless browsers, and it leaned on them heavily throughout our tests. Zyte, on the other hand, was able to access all targets without browser-rendered HTML at all by the end of our tests.
- There’s a big difference between running tests at one request per second and at ten or more. At a lower rate, we wouldn’t have discovered that some providers have scaling issues; moreover, some websites don’t start seriously blocking until five requests per second or more.
Feature Overview
Let’s take a quick look at what you can do with web scraping and proxy APIs.
Proxy vs API Integration
The question of integration method is often decided before purchase: if your codebase is built around proxies, you’ll naturally gravitate towards the proxy format. But is there a real difference between the features the API and proxy integration methods offer? In a way, yes.
Feature | Proxy APIs (unblockers) | Web scraping APIs
---|---|---
Data delivery | Real-time | Real-time or on-demand, sometimes with batching & cloud storage |
Geo-location selection | Often country-wide, sometimes up to city & ASN | Usually at the country level |
Sessions | ✅ | ✅ |
Custom headers & cookies | ✅ | ✅ |
JavaScript rendering | A toggle | A toggle with optional instructions for scrolling, waiting, and more |
Specialized endpoints | Usually unavailable | For popular websites with tailored parameters (e.g., ASIN entry, ZIP selection for Amazon) |
Data parsing | Usually unavailable | Through specialized endpoints, manual selectors, or lately LLMs |
Output formats | HTML | HTML, JSON, sometimes CSV |
Proxy APIs:
Provider | Integration | Geolocation | Sessions | Custom headers | JS rendering | Specialized endpoints | Data parsing |
---|---|---|---|---|---|---|---|
Bright Data | Proxy, async API | 150+ countries with city & ASN targeting | ✅ | ✅ | Automated, toggle | Search engines | Specialized endpoints |
NetNut | Proxy | 150+ countries | ✅ | ✅ | Toggle | ❌ | ❌ |
Web scraping APIs:
Provider | Integration | Geolocation | Sessions | Custom headers | JS rendering | Specialized endpoints | Data parsing |
---|---|---|---|---|---|---|---|
Infatica | Real-time, async API | 150+ countries | ✅ | ✅ | Toggle | Search, e-commerce | Specialized endpoints |
Nimble | Real-time, async API (with batching, cloud storage) | 150+ countries with state & city targeting | ✅ | ✅ | Toggle, instructions | Search, e-commerce, social media | Manual, autoparser, special endpoints |
Oxylabs | Real-time, async API (with batching, cloud storage), proxy | 150+ countries with ZIP for Amazon, city & coordinates for Google | ✅ | ✅ | Toggle, instructions | Search, e-commerce | Manual, special endpoints, parser builder |
Rayobyte | Real-time, async API (with batching) | 150+ countries | ✅ | ✅ | Toggle, instructions | Search, e-commerce | Manual, special endpoints |
ScraperAPI | Real-time, async API (with batching), proxy | 12 countries with 50+ upon request, ZIP code for Amazon | ✅ | ✅ | Toggle, instructions | Search, e-commerce | Manual, special endpoints |
Scrapingdog | Real-time, async API, proxy | 15 countries | ✅ | ✅ | Toggle, instructions | Search, e-commerce, social media, more | Special endpoints
Smartproxy | Real-time, async API (with batching), proxy | 150+ countries with ZIP for Amazon, city & coordinates for Google | ✅ | ✅ | Toggle, instructions | Search, e-commerce, social media | Manual, special endpoints |
SOAX | Real-time | 150+ countries | ❌ | Cookies | Toggle | Search, e-commerce, social media | Special endpoints |
Zyte | Real-time API, proxy | 150+ countries | ✅ | ✅ | Toggle, instructions, scripting | ❌ | Manual, category based |
Proxy APIs are often meant as a direct upsell from proxy servers, offering a drop-in replacement. At the same time, because you’re effectively outsourcing the page-opening stage, proxy APIs need to go beyond regular proxy network features like geo-location to cover request manipulation and even JavaScript rendering. So they do.
Despite their wealth of features, proxy APIs can still be limited. For example, they rarely offer specialized endpoints, on-demand access to scraped output, or data structuring features. Another big drawback for complex scenarios is incompatibility with headless browser libraries, combined with no browser instructions of their own. This is where web scraping APIs offer more flexibility.
Exceptions exist. Bright Data’s SERP API integrates as a proxy, but in reality it’s a highly specialized scraper with data parsing and custom parameters. Funnily enough, some providers that sell web unblockers also offer web scraping APIs with a fully-featured proxy mode. In these scenarios, the difference hinges on the pricing method and, likely, marketing strategy.
How do you work with proxy and web scraping APIs? The main requirement is simply sending an HTTP request to the provider’s server. However, the way you configure that request can differ: it’s usually either a GET request with parameters in the URL or headers, or a POST request with a JSON payload.
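As a hedged illustration – the gateway address, endpoint, and parameter names below are invented, not any particular provider’s – the two integration styles look roughly like this:

```python
import requests

TARGET = "https://example.com/product/123"

# 1) Proxy mode: route the request through the provider's gateway like an
#    ordinary proxy. Many unblockers re-encrypt traffic, so you typically
#    disable certificate verification or install the provider's CA certificate.
proxies = {"https": "http://USERNAME:PASSWORD@unblock.example-provider.com:8000"}
html = requests.get(TARGET, proxies=proxies, verify=False, timeout=120).text

# 2) API mode: send the target URL and options to the provider's endpoint,
#    either as GET parameters or as a JSON payload in a POST request.
payload = {
    "url": TARGET,
    "render_js": True,   # hypothetical headless browser toggle
    "country": "us",     # hypothetical geo-targeting parameter
}
response = requests.post(
    "https://api.example-provider.com/v1/scrape",
    json=payload,
    auth=("USERNAME", "PASSWORD"),
    timeout=120,
)
html = response.text
```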
Exploring Individual Features
All modern proxy and web scraping APIs can render JavaScript. With pages becoming increasingly interactive, a question that arises more often every year is: what else can I do on the page? Proxy APIs tend to ignore it; web scraping APIs answer by exposing browser controls through special parameters.
Provider | Screenshot | Click | Input | Scroll | Wait
---|---|---|---|---|---
Nimble | ✅ | ✅ | ✅ | ✅ | ✅ |
Oxylabs | ✅ | ✅ | ✅ | ✅ | ✅ |
Rayobyte | ✅ | ✅ | ✅ | ✅ | ✅ |
ScraperAPI | ❌ | ✅ | ✅ | ✅ | ✅ |
Scrapingdog | ✅ | ❌ | ❌ | ❌ | ✅ |
Smartproxy | ✅ | ✅ | ✅ | ✅ | ✅ |
SOAX | ✅ | ❌ | ❌ | ❌ | ✅ |
Zyte | ✅ | ✅ | ✅ | ✅ | ✅ |
Bright Data, Infatica, NetNut – only basic rendering functionality available.
It’s possible to combine the instructions. For example, you can select a field, enter text, click on it, and wait for the response. Providers impose execution time limits, which often range between 60 and 120 seconds.
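To make the idea concrete, here’s what a chained set of instructions might look like in a request payload – the action and parameter names are illustrative only, as every provider defines its own:

```python
# Hypothetical JSON payload chaining browser actions on a rendered page.
payload = {
    "url": "https://example.com/search",
    "render_js": True,
    "browser_instructions": [
        {"action": "input", "selector": "#search-box", "value": "winter jacket"},
        {"action": "click", "selector": "#search-button"},
        {"action": "wait", "seconds": 5},        # let the results load
        {"action": "scroll", "to": "bottom"},    # trigger lazy-loaded items
        {"action": "screenshot"},                # capture the final state
    ],
}
```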
Zyte takes this a step further. Its clients get access to a cloud-hosted VS Code environment, where they can write their own interaction scripts.
The latter functionality isn’t common. Instead, we’re seeing new product categories emerge that aim to increase web scraping success while keeping standard compatibility with headless browser libraries. Some examples would be Undetect, Bright Data’s Scraping Browser, and major anti-detect browsers like Multilogin and Gologin.
Specialized endpoints are tailor-made for particular websites or their properties (such as Amazon product pages). They often have custom parameters and data parsing capabilities. For instance, a Google SERP endpoint may be able to fetch local results (city-wide or at specific coordinates), which would otherwise be unavailable through a general-purpose API.
Provider | Google | Amazon | Others
---|---|---|---
Bright Data | SERP, ads, search types, local search | ❌ (available in other products) | Bing, Yandex, DDG |
Infatica | SERP, ads | Search, product | Booking |
Nimble | SERP, ad optimization, local search | Search, product (incl. ZIP code) | Bing, Yandex, adding more fast |
Oxylabs | SERP, ads, search types, hyperlocal search | Product, search, sellers, reviews, more (incl. ZIP code) | Walmart, Bing, Etsy, BestBuy, Target |
Rayobyte | SERP | Product | ❌ |
ScraperAPI | SERP, several search types | Product, search, offers, reviews (incl. ZIP) | Walmart |
Scrapingdog | SERP, search types | Product, search (incl. ZIP) | LinkedIn, Twitter, Yelp, Indeed
Smartproxy | SERP, search types, hyperlocal search | Search, product, sellers, reviews, more (incl. ZIP) | ❌ |
SOAX | SERP, search types | Search, product, reviews, questions | Walmart, all major search engines & social media platforms |
NetNut, Zyte – no specialized endpoints available for the tested products.
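To illustrate why such endpoints are convenient (the field names here are hypothetical, not any vendor’s actual schema), a localized SERP request might hand geography, device, and parsing off to the provider in a single payload:

```python
# Hypothetical payload for a specialized Google SERP endpoint.
payload = {
    "endpoint": "google_search",   # which specialized scraper to use
    "query": "running shoes",
    "domain": "google.de",         # localized Google property
    "geo": "Berlin, Germany",      # city-level result localization
    "device": "mobile",
    "parse": True,                 # return structured JSON instead of raw HTML
}
```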
Compared to the year before, we’re seeing an interesting trend: scraper vendors have been introducing more specialized endpoints to their products. One example is ScraperAPI, which now offers scrapers for Amazon, Google, and Walmart. Another is Nimbleway – the provider has set out to build what it calls online pipelines for targets in various verticals.
The direction is interesting, considering that LLMs have lowered the barrier to entry, particularly for parsing, and that they tempt the market to consolidate around one all-encompassing tool. Maybe a single-purpose scraper reassures customers that it will be fit for the task?
Data parsing is an area where some of the most exciting developments are taking place. Of course, this is thanks to machine learning and large language models. But we’re also seeing changes in less sophisticated approaches: since our previous report, Oxylabs, ScraperAPI, and Smartproxy have all implemented selector support for building parsers by hand.
Provider | Manual parsing | Pre-made templates | Other
---|---|---|---
Bright Data | ❌ | Specialized endpoints | |
Infatica | ❌ | Specialized endpoints | |
Nimble | Selectors | Specialized endpoints | Autoparsing, AI parser schemas |
Oxylabs | Selectors | Specialized endpoints | AI parser schemas |
Rayobyte | Selectors | Specialized endpoints | |
ScraperAPI | Selectors | Specialized endpoints | |
Scrapingdog | ❌ | Specialized endpoints | |
Smartproxy | Selectors | Specialized endpoints | |
SOAX | ❌ | Specialized endpoints | |
Zyte | Selectors | Models trained on page types |
NetNut – no data parsing available for the tested products.
Let’s explore several different approaches to AI-based parsing that lie in the modest Other column.
#1. Custom machine learning models trained on specific page types.
Zyte has been playing around with machine learning for years now. Instead of parsing individual targets, Zyte trained multiple in-house models for whole page categories: products, news, directories, etc. The caveat was that they relied on AI vision, which required browsers. Still, during its conference roughly a year ago, Zyte bragged about being dozens of times cheaper and more accurate than ChatGPT.
Since then, Zyte has adapted the models to non-rendered requests, significantly cutting down the cost. It’s also experimenting with supplementary LLM features. They can make the schema more flexible by adding custom data points, and they can also transform data: translate, normalize, summarize, etc.
#2. A universal AI parser.
Similarly to Zyte, Nimble uses HTML-trained AI agents to extract data from various page types. Unlike Zyte, the provider automatically chooses the relevant agent depending on the page, keeping the decision process in the backend.
In a way, this makes the customer’s job easier. But it’s also way less predictable (Will this target work? What will the schema be?). During our tests, we found the functionality to be more miss than hit: it parsed Lowe’s but failed to structure Canadagoose or G2. We’re sure it’s bound to improve fast.
To make the agents more robust, Nimble is preparing to release the ability to generate custom schemas. This feature will accept simple, likely natural language instructions and translate them into parsers. According to Nimble’s documentation, these parsers will get reusable IDs and heal automatically after identifying a failure.
For now, Nimble’s stopgap solution combines the dynamic parser with manual selectors to build a parser for the page.
#3. LLM-assisted parsers generated upon request.
This is the approach Oxylabs announced during its recent web scraping conference. Basically, you send a URL with natural language instructions to an LLM, then it generates a schema and selectors for scraping the data points. You get a preview of the output and the ability to adjust the schema to your needs. Once you’re happy, the selectors get added to the API request code.
Oxylabs’ approach is highly pragmatic, as the language model is invoked only once and not with every page access. However, it has limitations, namely that once a parser breaks, you have to manually repeat the generation process.
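A minimal sketch of the general pattern – not Oxylabs’ actual implementation, and the selectors are made up: the language model is called once to produce a schema of selectors, which you review and then reuse on every scraped page without further LLM calls.

```python
from bs4 import BeautifulSoup

# Selectors generated once by an LLM from a sample page and a prompt such as
# "extract the product title, price, and rating", then reviewed and stored.
GENERATED_SCHEMA = {
    "title": "h1.product-title",
    "price": "span.price-value",
    "rating": "div.rating > span.score",
}


def parse_product(html: str) -> dict:
    """Apply the pre-generated selectors to a freshly scraped page."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for field, selector in GENERATED_SCHEMA.items():
        node = soup.select_one(selector)
        result[field] = node.get_text(strip=True) if node else None
    return result
```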
Pricing Approaches
We’ll overview the pricing models of the participants and how much our benchmarks would’ve cost.
Request, Credit-Based and Black Box Models
There are multiple ways to price proxy and web scraping APIs. Proxy APIs use traffic or requests as the main metric. Web scraping APIs charge for successful requests, either keeping the model simple (one page = one request) or building increasingly elaborate schemes based on credits.
Zyte’s model is closer to credits, for it includes variables that affect the final rate. But it’s also unique because the cost may change with time, depending on how hard Zyte finds the target to scrape. According to the provider, these revisions take place once per quarter and affect around 0.1% of websites. Still, such a pricing scheme works as a kind of black box.
Provider | Model | Structure | Price range | Trial
---|---|---|---|---
Bright Data | Requests | PAYG, subscription | $1-$2,000 | 7 days for companies |
Infatica | Credits | Subscription | $25-$240 | 5k req, 7 days |
NetNut | Requests | Subscription | Not public | 7 days for companies |
Nimble | Requests | PAYG, subscription | $3-$3,000 | Available |
Oxylabs | Requests | Subscription | $49-$2,000 | 5k req, 7 days |
Rayobyte | Requests | PAYG | $1.8 | 5k free req / month |
ScraperAPI | Credits | Subscription | $49-$299 | 1k free credits / month, 7-day trial |
Scrapingdog | Credits | Subscription | $40-$200 | 1k credits, 30 days
Smartproxy | Requests | Subscription | $30-$500 | 1k req, 7 days |
SOAX | Requests | Subscription | $2.5-$2,200 | Available |
Zyte | Dynamic | PAYG, subscription | $1-not specified | $5 credits for 30 days |
The table provides some interesting data points:
- Scraper vendors prefer trials over pay-as-you-go. Only Rayobyte’s Scraping Robot has PAYG as its sole pricing model, and Zyte starts requiring commitment after $100. Furthermore, some trials extend into free plans that refresh monthly.
- Credit-based pricing usually targets customers with smaller needs. This is evident from looking at the price ranges of public plans.
Price Modifiers
To understand how exactly request-based and credit-based pricing models compare, we’ll have to explore the base price and available modifiers.
The Base CPM at $100 column shows how much 1,000 requests would cost when spending $100 with each participant. It may be a little biased against enterprise-minded providers, as their prices only start to scale well at $1,000 and up.
Provider | Base CPM at $100 | Price modifiers
---|---|---
Bright Data | $3 | A list of premium websites (2x) |
Infatica | $0.09 | JS rendering (10x), E-comm & SERP (10x), JS + E-comm/SERP (20x), LinkedIn (130x) |
Oxylabs | $1.80 | – |
Nimble | $3 | – |
Rayobyte | $1.80 | – |
ScraperAPI | $0.49 | Amazon (5x), SERP (30x), Social (30x), JS rendering (10x), premium IPs (10x), premium IPs + JS (30x), ultra premium IPs (30x), ultra premium + JS (75x) |
Scrapingdog | $0.09 | Google (5x), JS rendering (5x), premium IPs (10x), premium + JS (25x), LinkedIn (200x)
Smartproxy | $1 | – |
SOAX | $2.50 | – |
Zyte | From $0.10 | Target (up to 10x), parsing (up to 3x), JS rendering (up to 15x), JS + parsing (up to 25x), screenshot (up to 25x) |
NetNut – no public pricing available.
Credit-based pricing models can have huge multipliers reaching tens or even hundreds of times. These variables interact with one another: for example, you can toggle both JavaScript rendering and better quality proxies. From the user standpoint, having these options exposed can feel burdensome, as you need to experiment with parameters and mind the credit cost.
Having said that, credits are very efficient for basic websites that require neither residential proxies nor JavaScript rendering. The low baseline price also makes these scrapers look really good in marketing materials. However, for hard targets like G2, you’re likely to overpay.
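To put the multipliers into perspective using the table above: at ScraperAPI’s $0.49 base CPM, 1,000 pages of a basic website cost about $0.49, but the same volume with ultra premium IPs and JavaScript rendering (75x) works out to roughly $36.75 – far above the flat $1–$3 per 1,000 requests that the request-based providers charge.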
Request-based models have the opposite problem: they’re comparatively expensive when the target puts up no resistance. But since these providers are often enterprise-oriented, different considerations kick in, such as scalability and unblocking success.
The Cost to Run Our Benchmarks
So, how much did we pay to complete the full 180,000 requests (10 targets × ~6,000 URLs × 3 runs)? The graph shows aggregate costs, taking the rate of the closest suitable plan.
Three participants use credit-based pricing and failed to unblock some targets consistently (at least 20% of the time). We didn’t want to speculate on the configuration they’d need, so we excluded the following targets from the graph:
- Infatica: Canadagoose, G2, Lowe’s, Safeway, Walmart.
- Scrapingdog: Allegro, Canadagoose, G2.
- ScraperAPI: Allegro, G2.
For general unblocking without JavaScript rendering or data parsing, Zyte delivered incredible value considering its performance results. The provider’s price was closer to entry-level APIs like Infatica and Scrapingdog than the premium competitors.
Smartproxy and Oxylabs also look compelling, more so if you need headless browsers or the bundled parsing features. And while ScraperAPI may not be the most efficient choice overall, its prices for Amazon and Walmart in particular are worth attention.
Conclusion
This concludes the report. Assuming that very few readers will reach this part, we moved the summary to the beginning. But since you’re here – thank you for getting to the end! If you have any questions, feel free to contact us through info at proxyway dot com or our Discord server.