Proxy Market News - Proxyway https://proxyway.com/news Your Trusted Guide to All Things Proxy Tue, 21 Jan 2025 09:20:31 +0000 en-US hourly 1 https://wordpress.org/?v=6.7.1 https://proxyway.com/wp-content/uploads/2023/04/favicon-150x150.png Proxy Market News - Proxyway https://proxyway.com/news 32 32 New Review: Massive https://proxyway.com/news/new-review-massive https://proxyway.com/news/new-review-massive#respond Tue, 21 Jan 2025 09:12:19 +0000 https://proxyway.com/?post_type=news&p=30612 Massive joins the ranks of our reviewed providers.

The post New Review: Massive appeared first on Proxyway.


News

Massive joins the ranks of our reviewed providers.

Adam Dubois

Massive is a US-based computer resource sharing platform that became a proxy vendor in 2024. The provider maintains a self-sourced pool of residential proxies, offering access primarily to business customers. We had a chance to try it out. 

How was our experience? Pretty good, actually. While not huge, Massive’s proxy network was respectably large in prime locations and had top-notch infrastructure performance. We also liked the detailed usage stats. Massive’s basic dashboard and pricing model, on the other hand, left something to be desired.

In the end, the pros outweighed the cons, and we decided to give Massive a solid score of 8.7. You can read the full review here: https://proxyway.com/reviews/massive-proxies.

Get proxy news and updates directly to your inbox.



Google Search Starts Requiring JavaScript Rendering https://proxyway.com/news/google-search-starts-requiring-javascript-rendering https://proxyway.com/news/google-search-starts-requiring-javascript-rendering#respond Fri, 17 Jan 2025 13:13:22 +0000 https://proxyway.com/?post_type=news&p=30572 Web scraping services scramble for a workaround.

The post Google Search Starts Requiring JavaScript Rendering appeared first on Proxyway.


News

Web scraping services scramble for a workaround.

Adam Dubois

Google, the largest search engine, has started requiring JavaScript rendering to display search results.

Without it, Google refuses to serve the query and instead redirects to instructions for enabling JavaScript.

The change took place sometime between January 15 and 16. It was noticed and discussed on Hacker News.

For now, the workaround is to run a headless browser or change the user agent to that of a JavaScript-free web browser, such as Lynx.
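A minimal sketch of the user-agent workaround, using only Python's standard library. The Lynx user-agent string and the helper name `build_search_request` are illustrative assumptions, and Google may stop honoring this approach at any time.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed, illustrative user-agent string for the text-only Lynx browser.
LYNX_USER_AGENT = "Lynx/2.8.9rel.1 libwww-FM/2.14 SSL-MM/1.4.1"


def build_search_request(query: str) -> Request:
    """Build (but do not send) a Google search request that presents itself as Lynx."""
    url = "https://www.google.com/search?" + urlencode({"q": query})
    return Request(url, headers={"User-Agent": LYNX_USER_AGENT})


if __name__ == "__main__":
    request = build_search_request("residential proxies")
    print(request.full_url)
    # urllib stores header names capitalized, hence "User-agent".
    print(request.get_header("User-agent"))
```

To actually send the request, the returned object can be passed to `urllib.request.urlopen`; the sketch stops short of that so it runs without network access.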

Google’s move has affected most commercial scraping APIs, either disrupting their services or notably increasing the latency due to the forced switch to browser-based crawlers.

The response time of SerpApi, a popular Google scraping service.

In addition, it has probably made scraping the website much more costly, at least until a better workaround is found.

Bright Data Changes Scraper Prices, Calls Web Unlocker an API https://proxyway.com/news/bright-data-scraper-prices-unlocker-api https://proxyway.com/news/bright-data-scraper-prices-unlocker-api#respond Thu, 09 Jan 2025 12:52:49 +0000 https://proxyway.com/?post_type=news&p=30151 The provider’s products now cost either 50% more or 50% less, depending on which you use.

The post Bright Data Changes Scraper Prices, Calls Web Unlocker an API appeared first on Proxyway.


News

The provider’s products now cost either 50% more or 50% less, depending on which you use.

Adam Dubois
Bright Data, the Israeli web data platform, has made significant changes to the pricing of its scrapers. The provider is also shifting the positioning of Web Unlocker from a proxy tool to an API.

Price Changes

The rates for Web Unlocker and SERP API, Bright Data’s proxy-like scrapers, have been cut by 50%. On the other hand, the beta-level Scraper APIs, which scrape and structure individual websites, will now cost 50% more. As a result, customers will get a unified line-up when it comes to price:

Plan | Web Unlocker, SERP API (old) | Web Scraper APIs (old) | New rates
PAYG | $3 | $1 | $1.5
$499 | $2.55 | $0.85 | $1.27
$999 | $2.25 | $0.75 | $1.12
$1,999 | $2.10 | $0.70 | $1.05
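The arithmetic behind the table above can be checked with a short sketch. It assumes the published new rates are truncated (not rounded) to whole cents, which is an inference from the figures rather than something Bright Data has stated; the function name `new_rates` is made up for the example.

```python
from decimal import Decimal, ROUND_DOWN

# Old per-plan rates from the table: (Web Unlocker / SERP API, Web Scraper APIs).
OLD_RATES = {
    "PAYG": (Decimal("3.00"), Decimal("1.00")),
    "$499": (Decimal("2.55"), Decimal("0.85")),
    "$999": (Decimal("2.25"), Decimal("0.75")),
    "$1,999": (Decimal("2.10"), Decimal("0.70")),
}

CENT = Decimal("0.01")


def new_rates(unlocker_old: Decimal, scraper_old: Decimal):
    """Apply the 50% cut and the 50% increase, truncating to cents (an assumption)."""
    cut = (unlocker_old * Decimal("0.5")).quantize(CENT, rounding=ROUND_DOWN)
    raised = (scraper_old * Decimal("1.5")).quantize(CENT, rounding=ROUND_DOWN)
    return cut, raised


if __name__ == "__main__":
    for plan, (unlocker_old, scraper_old) in OLD_RATES.items():
        cut, raised = new_rates(unlocker_old, scraper_old)
        print(f"{plan}: {cut} (cut) / {raised} (raised)")
```

Under that assumption, both adjustments land on the same figure for every plan, which is what produces the unified price line-up.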

The change will make Bright Data’s scrapers significantly more competitive in an already very cutthroat market. However, the pricing model remains rigid, which makes Bright Data economical primarily for harder-to-reach and JavaScript-dependent websites.

Web Unlocker as an API

For the past few years, Web Unlocker has been Bright Data’s poster-child scraper product. It was specifically made as an upsell to the provider’s proxy networks, taking the same proxy-like integration format and sharing most of the same conventions. Web Unlocker inspired a whole category of unblocker products that stood alongside web scraping APIs – sometimes with little technical justification. 

Lately, however, Bright Data has been making significant changes to its prized product. The provider started off by exposing more parameters, such as device selection. It then introduced an alternative integration method: a REST API. And as of 2025, Web Unlocker has turned into Web Unlocker API, with API integration as the recommended method for using it.

The two integration methods now available for Web Unlocker API. Source: Bright Data

For now, Bright Data’s implementation is still rather basic, with most parameters remaining in the username. That said, it’s already become the default option in the dashboard.

The REST API still mostly relies on old conventions. Source: Bright Data

We don’t know what exactly prompted this change, but the ground beneath proxy APIs has been shaky for a while. SOAX did away with proxy-like integration, and Nimble discontinued its Unblocker Proxy altogether last autumn.

Maybe proxy-like scrapers aren’t in as much demand as more standard API integration methods – at least not enough to form a separate product category? Quite a few APIs are able to integrate as proxies already.

New Review: LunaProxy https://proxyway.com/news/new-review-lunaproxy https://proxyway.com/news/new-review-lunaproxy#respond Mon, 06 Jan 2025 09:00:35 +0000 https://proxyway.com/?post_type=news&p=30078 LunaProxy joins the ranks of our reviewed providers.

The post New Review: LunaProxy appeared first on Proxyway.


News

LunaProxy joins the ranks of our reviewed providers.

Adam Dubois

New year, new review? That’s right! Our first addition in 2025 is LunaProxy – a provider from Hong Kong with multiple proxy networks under its belt. 

Like our recently covered PIA S5, LunaProxy offers affordable rates, potentially unlimited residential traffic, and a free proxy manager desktop app. The provider has a large proxy pool (particularly in Brazil) and decent infrastructure performance; however, its policies and user experience could use improvements. 

In the end, we gave LunaProxy a score of 8.3. You’ll find the full review here: https://proxyway.com/reviews/lunaproxy-proxies

New Review: Live Proxies https://proxyway.com/news/new-review-live-proxies https://proxyway.com/news/new-review-live-proxies#respond Wed, 18 Dec 2024 08:26:13 +0000 https://proxyway.com/?post_type=news&p=29799 Live Proxies joins the ranks of our reviewed providers.

The post New Review: Live Proxies appeared first on Proxyway.


News

Live Proxies joins the ranks of our reviewed providers.

Adam Dubois

Live Proxies started out as a sneaker proxy vendor. Now, this American provider is trying to make a name in the broader market. Its consumer line of products takes some unusual formats, with dedicated residential IPs and very long uptime. The business residential proxy pool strives to compete with the industry’s top guns. 

We tested both formats and found these IPs to be of great quality. Some aspects of the service, however, were less polished. But with big plans in mind, the provider already has a roadmap laid down to improve them. All in all, we gave the Live Proxies of today a score of 8.1.

You can read the full review here: https://proxyway.com/reviews/liveproxies-proxies

New Review: PIA S5 Proxies https://proxyway.com/news/new-review-pia-s5-proxies https://proxyway.com/news/new-review-pia-s5-proxies#respond Tue, 10 Dec 2024 10:36:19 +0000 https://proxyway.com/?post_type=news&p=29357 PIA S5 Proxies joins the ranks of our reviewed providers.

The post New Review: PIA S5 Proxies appeared first on Proxyway.


News

PIA S5 joins the ranks of our reviewed providers.

Adam Dubois

PIA S5 Proxies comes from the crop of Hong Kong-based providers that entered the market between 2022 and 2023. It has some attractive selling points, including a proxy manager app, IP-based residential proxy pricing, and below-market rates.

Did PIA S5 Proxies manage to meet and beat our expectations? Its value is certainly good, but the overall experience is still very imperfect. In the end, we decided to give the provider a round 8.0.

You can read the full review here: https://proxyway.com/reviews/pia-s5-proxies.

Smartproxy Launches Core Scraping APIs https://proxyway.com/news/smartproxy-launches-core-scraping-apis https://proxyway.com/news/smartproxy-launches-core-scraping-apis#respond Thu, 28 Nov 2024 07:50:50 +0000 https://proxyway.com/?post_type=news&p=28766 This product variation comes with fewer features for one fifth of the cost.

The post Smartproxy Launches Core Scraping APIs appeared first on Proxyway.


News

This product variation comes with fewer features for one fifth of the cost.

Adam Dubois

Smartproxy, the international provider of proxies and web scraping tools, has launched Core web scraping APIs. 

The new APIs are a stripped-down version of the provider’s web scrapers (now called Advanced), and they’re available at a significantly cheaper price. Compared to the full version, they:

  • Restrict geo-targeting to the US, Canada, Great Britain, Germany, France, the Netherlands, Japan, and Romania
  • Allow making fewer requests per second – 30+ rather than unlimited
  • Return only the raw HTML response
  • Don’t use Smartproxy’s premium proxy pool
  • Can’t render JavaScript
  • Lack ready-made templates for individual website properties, alongside scheduling functionality
The complete feature comparison.

In exchange, Core APIs offer significantly cheaper rates. It’s hard to map the plans onto each other, but the one directly comparable package costs nearly five times less:

Requests | 50,000 | 100,000 | 250,000 | 600,000 | 2 million | 6 million
Regular CPM* | $1.6 | $1.4 | $1.2 | Custom | Custom | Custom
Core CPM: $0.29, $0.17, $0.12, $0.10

* Price for 1,000 requests

The Core variation is currently available with Smartproxy’s E-Commerce Scraping API. But it looks like the provider plans to introduce it with other APIs, as well. 

For tasks that do not require JavaScript rendering or accessing the toughest websites, it may be among the most affordable options currently on the market.

Infatica Launches Dedicated Datacenter Proxies https://proxyway.com/news/infatica-launches-dedicated-datacenter-proxies https://proxyway.com/news/infatica-launches-dedicated-datacenter-proxies#respond Wed, 13 Nov 2024 13:46:53 +0000 https://proxyway.com/?post_type=news&p=27931 The product covers nearly 50 locations in six continents.

The post Infatica Launches Dedicated Datacenter Proxies appeared first on Proxyway.


News

The product covers nearly 50 locations on six continents.

Adam Dubois

Infatica, the UK-based provider of web scraping infrastructure and services, now sells dedicated datacenter proxies.

The product gives access to lists of server-based IPs reserved for the customer’s exclusive use. Infatica’s stock includes around 200,000 proxy servers spread throughout 47 countries on six continents.

All available locations for the dedicated datacenter proxies.

These proxy servers are static, support both HTTP(S) & SOCKS5 protocols, and impose no traffic restrictions. 

The pricing ranges from $4.12 to $1.1 per IP address, depending on the number of IPs bought and their location. Infatica goes out of its way to hide exactly how much you’ll need to pay – to reach concrete numbers, you’ll have to not only register but also undergo a KYC procedure.  

Infatica’s dedicated datacenter proxies should already be available for purchase.

Evomi Launches Core Residential Proxies https://proxyway.com/news/evomi-launches-core-residential-proxies https://proxyway.com/news/evomi-launches-core-residential-proxies#respond Wed, 13 Nov 2024 11:30:29 +0000 https://proxyway.com/?post_type=news&p=27925 At $0.49/GB, these may be the cheapest residential proxies out there.

The post Evomi Launches Core Residential Proxies appeared first on Proxyway.


News

At $0.49/GB, these may be the cheapest residential proxies out there. 

Adam Dubois

Evomi, the Swiss provider of rotating proxy networks, has launched a new product called Core residential proxies.

The product offers access to a pool of 5M IPs from home devices around the world. Its distinguishing characteristic is the price – the only plan thus far includes 100 GB of traffic for $0.49/GB. Alternatively, it’s possible to pay as you go at double the rate.

Core residential proxies support city-level filtering, HTTP(S) and SOCKS5 protocols, and impose no limits on concurrent threads. The gateway server can switch IPs with every connection request or establish sticky sessions.

In many ways, Core residential proxies resemble Evomi’s other residential product, now called Premium residential. The main differences are that this pool contains fewer IPs, lacks ASN targeting, and doesn’t include Evomi’s pool modes.

The announcement comes shortly after Geonode’s big move, where the provider dropped everything but one residential proxy plan at the same rate of $0.5/GB. Currently, both companies undercut all of their competitors by a big margin. We’ll be curious to see whether this model is sustainable, and at what cost.

New Review: Evomi https://proxyway.com/news/new-review-evomi https://proxyway.com/news/new-review-evomi#respond Mon, 04 Nov 2024 09:32:52 +0000 https://proxyway.com/?post_type=news&p=27755 Evomi joins the ranks of our reviewed providers.

The post New Review: Evomi appeared first on Proxyway.


News

Evomi joins the ranks of our reviewed providers.

Adam Dubois

Evomi’s young age is deceptive. Though founded less than a year ago, the company runs with an experienced team behind it. As such, this provider already has a lot going for it: a solid brand in the making and proxy networks – especially mobile – that perform very well. 

We had a chance to benchmark all three of Evomi’s current products: residential, mobile, and pool-based datacenter proxies. You can read the full review and discover why Evomi received a solid score of 8.7 here: https://proxyway.com/reviews/evomi-review.

Extract Summit 2024: A Recap https://proxyway.com/news/extract-summit-2024-recap https://proxyway.com/news/extract-summit-2024-recap#respond Fri, 18 Oct 2024 08:50:56 +0000 https://proxyway.com/?post_type=news&p=27459 Our virtual impressions from Zyte’s annual web scraping event.

The post Extract Summit 2024: A Recap appeared first on Proxyway.


News

Our virtual impressions from Zyte’s annual web scraping event.

Adam Dubois
Zyte’s Web Data Extract Summit has ended. The line-up this year was particularly strong, and we enjoyed watching the presentations. These are our impressions from the event.
 
Zyte has made the videos freely available on YouTube, so you can quickly get an idea of what they’re about before committing 30 or sometimes even 60 minutes of your time.

Organizational Matters

Like the two years before it, Zyte’s conference was held physically. For the first time ever, the venue was in Austin, Texas. This spelled great news for Americans, but we Europeans could no longer comfortably watch it – the event took place after usual business hours. But I guess there’s no making everyone happy.

2024’s Extract Summit took place over two days. October 9 was dedicated to live workshops, and the presentations were delivered on October 10. Live tickets for both days cost $330. Virtual attendance was free, but it only included the second day’s talks. 

Zyte used Eventbrite for ticket management and Airmeet as the streaming platform. The latter had all the bells and whistles like sections for comments, polls, and Q&A. I think you could also join virtual discussion tables in between talks, but I didn’t get the chance to try out this option. The presenters would take questions from the live audience, as well as Airmeet, with Zyte’s CEO Shane Evans moderating.

The main event included nine talks and two panel discussions. Due to time differences, I was only able to watch the recordings. Still, I got the impression that everything proceeded more or less smoothly. After all, Zyte’s been doing this since 2019, so they’ve long become pros.

This is what Zyte's platform looked like.

Main Themes

There was basically one theme explored through various lenses. Not hard to guess – it’s AI: machine learning, large language models, generative AI, all types and flavors. Again and again. 

I don’t mean to sound negative; after all, AI has been pushing the envelope in web scraping, and it’s top of mind for everyone as they try to implement it and keep up, all at once. Zyte did a good job composing the line-up, and there were plenty of outside speakers to bring their perspectives.

Something that caught my attention was how many vendors of web scraping tools Zyte accepted to its event. Apify, Browserless, Reworkd can all be considered competitors, yet they were still invited to talk.

The Talks

Talk 1. Harnessing the Power of Large Language Models for Advanced Data Engineering and Data Science

Neelabh Pant from Walmart spoke about his team’s use of LLMs for data cleaning. In an act of extreme generosity for the uninitiated, he decided to begin from the creation of the universe, introducing data processing and even LLMs. But it didn’t take long for things to pick up pace.

In brief, traditional rule-based methods require a lot of manual effort and can’t handle context or unstructured data well. Conversely, these are the areas where LLMs excel. After many experiments, Neelabh built a two-phase system that adds missing values (called improvement phase) and extracts facts from unstructured data (called feature enhancement phase). He provided the implementation details and compared four approaches based on price and effectiveness (spoiler: RAG + agents win).

If you’re in the field of data engineering and spend inordinate amounts of time on messy data, this is the talk for you.

Manual data preprocessing requires a lot of effort.

Talk 2. Web Data Extraction Mastery: Real-World Implementations and ROI-Driven Success Stories

John Fraser’s company Parts ASAP scrapes the agricultural product data of several dozen competitors several times a week. He outsources the process to Zyte and, by implementing the extracted insights in a timely manner, delivers a healthy but by no means shocking 20% annual growth to the happy board. Sounds… a bit mundane, doesn’t it?

Well yes, but also no. John is what I described to myself as a nonchalant badass – one hand in the pocket, giving a no-nonsense story of how he found a practical use of web scraping to grow his business. It doesn’t push any envelopes or promise you the world. And yet, I enjoyed it a lot.

John pities the fools who make their inventory levels public.

Talk 3. A Practical Demonstration of How to Responsibly Use Big Data to Train LLMs

Joachim Asare from Harvard University spoke about the ethical pitfalls looming in the LLM training process. These include leaking private information, introducing biases, and ingesting low-quality data, among others. The presenter explored the issues during different stages of training: data collection, fine-tuning, and deployment.

Joachim’s mantra throughout the talk was dump data, ‘dumb’ AI. He provided harrowing examples where a maltrained mental health AI model can advise people to kill themselves, or where Meta’s AR glasses were hacked with terrible privacy outcomes. I don’t dabble in LLM training, so the talk was harder to relate to, but it’s still very relevant for understanding how third-party AI can affect you as the user.

The issues with LLM training boil down to this one phrase.

Talk 4. How We Transformed Zyte's Data Business with Cutting-Edge AI Technology

Ian Lennon from Zyte spoke about the problem of horizontal scaling – in particular, the company’s approach to providing high-quality (read: structured) data from hundreds of websites. According to Ian, it’s a combinatorial problem, and AI has allowed Zyte to slash setup costs and onboard customers they couldn’t before. 

How exactly? First, by building supervised machine learning models that can parse various page categories. Then, by making them work without browser rendering. Zyte’s final iteration (at this point) allows users to customize the models, by either adding manual code or invoking privately-hosted LLMs. 

Zyte’s also betting big on scraping templates that cover all major stages of web scraping: crawling, unblocking, and parsing. I remember the provider introducing its no-code product page template last year – turns out, e-commerce data makes up nearly 60% of Zyte’s business. More templates are coming soon.

Overall, it’s an interesting watch to learn about Zyte’s approach, even if it takes a more salesy angle.

... unless you're using Zyte, of course!

Panel Discussion. The Future of Proxy Technology: Trends and Innovations in Residential, Mobile & Datacenter Proxies

Jason Grad from Massive, Neil Emeigh from Rayobyte, Ovidiu Dragusin from Serversfactory, and Vlad Harmanescu from Pubconcierge sat down for a discussion on proxy servers, moderated by Zyte’s Shane Evans. There was supposed to be one more participant – Tal Klinger from The Social Proxy – but he wasn’t able to attend.

The panelists touched upon many topics ranging from IP sourcing, effectiveness of different proxy types, and geolocation challenges to ethics and IP scoring. To my surprise, the latter received particular attention, as more and more clients are turning to services like IPQualityScore for evaluating proxy services. This can be a dangerous (and not always useful) practice, but it serves as an easy signal for IP quality.  

The panel had a good balance between providers focusing on residential and server-based proxies, highlighting their perspectives and challenges: for example, geolocation is a significant issue for ISP proxy vendors, less so for peer-to-peer networks. Considering that our website has the word proxy in it, this is a must-watch.

What do you call a group of proxy service providers? A pool, maybe?

Talk 5. Distributed Intelligence for Distributed Data

Matthew Bloomberg, co-founder of Charity Engine, spoke about the project and its future directions. We first encountered Charity Engine when testing Zyte’s now-defunct Crawlera tool several years ago; it then served as an IP network for the smart proxy management layer. 

Turns out, there’s more to the project than we thought. Charity Engine is a distributed computing platform – so, something like Folding@home. It’s able to mobilize not only network resources but also computing power and even full browsers from willing residential users. Matthew gave examples of how the network was used for academic purposes and shared upcoming updates, such as data processing layers on top of the basic API.

My favorite idea was that Charity Engine doesn’t just extract knowledge from the web but also creates new knowledge in the process. By the way, the network is open to any business interested in its capabilities.

Now this is sexy.

Panel Discussion: Navigating the Legal Landscape of Web Data Extraction

Sanaea Daruwalla from Zyte, Hope Skibitsky from Quinn Emanuel (the law firm that litigated the HiQ case), Stacey Brandenburg from Zwillgen, and Don D’Amico from Glacier Network discussed the legal topics relevant to web data extraction. There was a lot to talk about: the discussion lasted nearly an hour and nearly gave me carpal tunnel syndrome from all the notetaking. 

Without expanding too much on it, the current legal landscape is super volatile: we had the Bright Data lawsuits, and all the AI cases are buying lawyers their third seaside mansion. The panelists spoke about the applicability of different online agreements, collection of publicly available personal data, how to approach copyright in the context of AI, relevant regulations, and more. 

If you’re running a web scraping business or working with LLMs/Gen AI, you should definitely watch this.

Sanaea did a great job moderating the discussion.

Talk 6. Advanced Techniques and Innovations for Extracting Specific Data Attributes from Diverse Sources

Iván Sánchez, senior data engineer at Zyte, described his company’s use of LLMs for data parsing. His talk complements and drills down into Ian’s (Talk 4) high-level overview of Zyte’s AI capabilities.

Iván first introduced the reasoning behind using LLMs at all. He then went on to address the major challenges that arise in implementing the models, such as optimizing token use and devising evaluation metrics. I’ve learned a lot: that it takes relatively few samples to train a model, that you can save money by only selecting relevant regions of a page, and that models become funky way below their maximum token limit. Recommended.
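The idea of passing only the relevant regions of a page to a model can be illustrated with a rough sketch. This is not Zyte's method, which the talk covers in its own terms; it is a generic standard-library example, and the choice of tags to skip is an assumption made for illustration.

```python
from html.parser import HTMLParser

# Tags whose contents rarely carry the data being extracted (an assumption for this sketch).
SKIPPED_TAGS = {"script", "style", "nav", "header", "footer"}


class RelevantTextExtractor(HTMLParser):
    """Collect visible text while ignoring everything inside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside at least one skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def relevant_text(html: str) -> str:
    """Return the page text with boilerplate regions removed, ready to place in a prompt."""
    parser = RelevantTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    page = (
        "<html><head><style>p{color:red}</style></head><body>"
        "<nav>Home | About</nav><h1>Blue Widget</h1><p>Price: $10</p>"
        "<script>track()</script><footer>Contact</footer></body></html>"
    )
    print(relevant_text(page))
```

The shorter the text handed to the model, the fewer tokens each request consumes, which is where the savings come from.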

Zyte's brilliant hack for reducing token consumption.

Talk 7. Cache, Cookies, Reconnects: Accelerate Scrapes with Session Management

Joel Griffith from Browserless, a company that runs hardened headless browsers so you don’t have to, described the methods of session management. In particular, he covered caching, cookies, and browser processes, comparing the strengths and weaknesses of each.

It was a highly structured presentation that reminded me of university lectures. If you’re dealing with headless browsers in-house, you’ll learn when to use each method, backed by Joel’s personal experience and some rough implementation examples (which he elegantly called sketches). The process approach received the most attention in Q&A, and from me as well.

Think you won't need to watch the talk now? There's more where that came from.

Talk 8. How to Feed Large Language Models (LLMs) with Data from the Web

Another web scraping company took the stage, this time Apify headed by Jan Čurn. If anything, the presentation was a product demo, but that doesn’t mean there was nothing to learn.

Jan spoke a lot about retrieval-augmented generation – its basic mechanisms and importance as the killer LLM application. A bold claim, but one that’s hard to disagree with. He then blazed through some web scraping challenges, setting the stage for the demo and introducing neat third-party utilities in the process. Finally, Jan showed Apify’s new actors that are made for RAG and include integrations with Pinecone, LangChain, and the like.
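For readers new to retrieval-augmented generation, the basic mechanism can be sketched in a few lines: rank stored documents by similarity to the question, then place the best match in the prompt. The sketch below swaps in a simple bag-of-words cosine similarity for the embedding models and vector databases used in practice, and it has nothing to do with Apify's actors; all function names are made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, documents: list, top_k: int = 1) -> list:
    """Return the top_k documents most similar to the question."""
    query = Counter(tokenize(question))
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query, Counter(tokenize(doc))),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question: str, documents: list) -> str:
    """Augment the question with retrieved context before it goes to an LLM."""
    context = "\n".join(retrieve(question, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    docs = [
        "Residential proxies route traffic through home devices.",
        "Datacenter proxies run on servers in data centers.",
        "Headless browsers render JavaScript without a visible window.",
    ]
    print(build_prompt("What do headless browsers render?", docs))
```

The web scraping part of the pipeline is what fills `documents` in the first place, which is why scraping vendors have taken such an interest in RAG.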

Jan has something rad for your RAG.

Talk 9. Enabling Large Language Models (LLMs) Agents to Understand the Web

One more web scraping company. Asim Shrestha, CEO of Reworkd AI, represents the new generation of data extraction tools that arose together with LLMs. From what I read in their TechCrunch interview, Reworkd’s aim is to capture the long tail of customer needs which competitors like Bright Data currently may not cover very well.

In the talk, Asim described his company’s problem space. It includes finding the right interface to feed data to AI agents, crafting useful prompts, and evaluating the output with real websites. Through constant experimenting, Asim’s team has found unconventional solutions, such as rendering a webpage into a spatial 2D structure with labels for links and other elements. This, and another tool for running evaluations, has been open sourced for everyone to use.

Unfortunately, the audience was tired by this point and didn’t ask a single question. But that doesn’t reflect the quality of the talk – I found it stimulating. Knowing that Reworkd is backed by venture capital, we’re bound to see more innovation come from it.

Reworkd's annotated spatial 2D mapping of a webpage.

Bottom Line

That was Zyte’s Web Data Extract Summit – the last web scraping-related conference of 2024. If any of the summaries tickled your fancy, the full recordings are available on YouTube. Thanks for reading!

OxyCon 2024: A Recap https://proxyway.com/news/oxycon-2024-recap https://proxyway.com/news/oxycon-2024-recap#respond Wed, 02 Oct 2024 11:18:02 +0000 https://proxyway.com/?post_type=news&p=26539 Our impressions from Oxylabs’ fifth annual web scraping conference.

The post OxyCon 2024: A Recap appeared first on Proxyway.


News

Our impressions from Oxylabs’ fifth annual web scraping conference. 

Adam Dubois

The anniversary edition of OxyCon is behind us. If you didn’t have the chance to participate, or simply want to read our detailed summary, these are Proxyway’s impressions from the event. The presentations are available on demand, so you can always watch the ones that caught your eye. 

General Information about 2024’s OxyCon

Like last year (and, as far as I remember, most years before), OxyCon took place online. However, there was one big change: all presentations were delivered in real time. There was also a live audience in the background, I suppose primarily the company’s employees, who would sometimes react or cheer. 

Some of the presenters were obviously dealing with nerves, and hiccups did occur. But this setup made the conference more human and less like Apple’s eerily robotic keynotes. 

Otherwise, the logistics changed little compared to previous iterations: you registered (for free), received an invitation email, and logged in to Oxylabs’ platform with an embedded video player and a Slido widget for questions. Those who wanted to discuss more, or more in-depth, could visit the provider’s Discord server.

Introduction aside, 2024’s OxyCon featured six talks and a panel discussion to conclude it. All in all, the event proceeded smoothly and according to schedule. 

The Talks

These are 2024’s presentations and the panel discussion. You can jump to the talks you’re interested in using the quick links below.

  1. Introduction: Web Scraping Trends
  2. Ensuring Scalability in Data Collection: Key Components, Challenges, and Advancements
  3. Human-Centered Approach to Streamlined Data Gathering
  4. Imitating Real User Behavior With Mouse Movements
  5. Harnessing Gen AI for Data-Driven Answers
  6. AI-Powered Public Web Data Collection at Scale
  7. Legal Compliance in the Age of AI
  8. Panel Discussion: Advanced Unblocking Strategies

Introduction: Web Scraping Trends

A fast one. Gabriele Montvile, CCO at Oxylabs, outlined three major trends impacting web data collection. They’re well known to industry insiders, so there’s no harm in spoiling them for you: AI, ethics, and advancing anti-bots. The interesting part was the supporting material, which included survey data, AI use cases and challenges. Ten minutes well spent.

oxycon introductory talk
The three major trends in today’s web data collection.

Talk 1: Ensuring Scalability in Data Collection: Key Components, Challenges, and Advancements

Zydrunas Tamasauskas, another C-level face at Oxylabs, spoke about web scraping pipelines, implementation strategies of proxy servers, headless scraping, and beyond. The title doesn’t make it clear, but this presentation is primarily about proxies. You’ll learn how to choose the appropriate type and implement several load balancing approaches. Some takeaways: desktop residential IPs are the best, and managing sessions between proxies and headless browsers is its own circle of hell. 
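To make the load balancing part less abstract, here is a minimal Python sketch of two common approaches: round-robin rotation and sticky sessions. It isn't taken from the talk. The `ProxyPool` class, its method names, and the example proxy URLs are all hypothetical.

```python
import hashlib
import itertools


class ProxyPool:
    """A hypothetical pool that hands out proxies in two ways."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("need at least one proxy")
        self._proxies = list(proxies)
        self._cycle = itertools.cycle(self._proxies)

    def next_proxy(self):
        """Round-robin: each call returns the next proxy in the list."""
        return next(self._cycle)

    def sticky_proxy(self, session_id):
        """Sticky session: the same session ID always maps to the same proxy,
        as long as the pool itself does not change."""
        digest = hashlib.sha256(session_id.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self._proxies)
        return self._proxies[index]


pool = ProxyPool([
    "http://proxy-a.example:8000",  # placeholder addresses
    "http://proxy-b.example:8000",
    "http://proxy-c.example:8000",
])
```

Round-robin suits stateless requests, while the sticky variant is closer to what a headless browser needs, since a logged-in session that hops between IPs tends to look suspicious.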

All in all, a useful talk. We were mentioned as well, so of course you have to watch it now!

the first talk of oxycon
No particular reason why we chose this slide. Truly.

Talk 2: Human-Centered Approach to Streamlined Data Gathering

Vilius Visockas from CityNow, a Lithuanian real estate intelligence website, disclosed (let’s sensationalize this a little) how he’s able to scrape nearly a thousand local sources with a small team of 3-4 people. In a wonderful synergy of capitalism and engineering, Vilius chose the only reasonable approach: he built a pipeline management platform, implemented some fail-safes, and hired code school grads to gain experience and earn some cash. 

Vilius talks about the challenges of keeping the system abuzz. Among other things, this involves maintaining and optimizing schemas, as well as ensuring satisfactory results from contributors with different backgrounds and generally little programming experience. But to me, the real beauty lies in the idea itself and the self-interested social value it provides.

the second talk of oxycon
It’s free affordable real estate.

Talk 3: Imitating Real User Behavior With Mouse Movements

A practical boots-on-the-ground presentation that, according to feedback, made at least several viewers’ days. Tadas Gedgaudas from Oxylabs shared his know-how on dealing with mouse-based detection methods. 

The presenter dedicated the first part to establishing whether websites actually track mouse movements. (Examples from the wild and his own personal weeks-long goose chase to unblock a website prove that they do.) He then showed how to verify this with the browser’s dev tools and went through the pros and cons of three major mouse algorithms: Bezier, Gaussian, and Perlin. Finally, Tadas introduced an open source library made by Oxylabs that can implement any algorithm with a few lines of code.  
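For the curious, here is roughly what a Bezier-based mouse path looks like in Python. This is a minimal sketch and not Oxylabs' library: the function name, its parameters, and the way the control points are randomized are my own assumptions.

```python
import random


def bezier_mouse_path(start, end, steps=50, spread=100.0, rng=None):
    """Return steps + 1 points along a cubic Bezier curve from start to end.

    Two control points are placed a third and two thirds of the way along
    the straight line, then shifted randomly by up to `spread` pixels so
    that every generated path bends a little differently.
    """
    if rng is None:
        rng = random.Random()
    (x0, y0), (x3, y3) = start, end

    def control(t):
        return (
            x0 + (x3 - x0) * t + rng.uniform(-spread, spread),
            y0 + (y3 - y0) * t + rng.uniform(-spread, spread),
        )

    (x1, y1), (x2, y2) = control(1 / 3), control(2 / 3)

    path = []
    for i in range(steps + 1):
        t = i / steps
        u = 1 - t
        # Standard cubic Bezier formula with Bernstein weights.
        x = u**3 * x0 + 3 * u**2 * t * x1 + 3 * u * t**2 * x2 + t**3 * x3
        y = u**3 * y0 + 3 * u**2 * t * y1 + 3 * u * t**2 * y2 + t**3 * y3
        path.append((x, y))
    return path
```

You would then feed these points one by one to your automation tool's mouse move call, ideally with small, uneven delays in between. The sketch only shapes the path. It does nothing about speed or timing, which detection scripts may look at as well.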

My biggest gripe is that due to time constraints we were all left hanging: why use anything other than Perlin? But that was probably answered on Discord…

the third talk of oxycon
This Python library is actually open source.

Talk 4: Harnessing Gen AI for Data-Driven Answers

Brace yourselves – we’re entering the AI zone. When Paul Felby (Adthena) started by demonstrating a chatbot, my first thought was, “Oh no!..”. But it turned out there was more to the talk than met the eye: in particular, how to ensure accurate answers and not make LLMs implode when working with a database that ingests hundreds of millions of SERPs per day.

Paul had multiple tricks up his sleeve. Some involved getting the LLM to generate proper queries, either directly in SQL or by adding a semantic layer. Others dealt with creating a team of agents, each performing their own task – even QA. There were layers and layers of AI, which all somehow worked together. The result: a chatbot, but not quite what we’ve grown to expect. Everyone’s working on AI now, so I’m sure you’ll find something to take away from this.

the fourth talk of oxycon
The elaborate multi-agent backend behind Adthena’s chatbot.

Talk 5: AI-Powered Public Web Data Collection at Scale

The commercial break of the day. Aleksandras Sulzenko from Oxylabs laid out the web data acquisition pipeline, then proceeded to talk about the challenges of each step and how Oxylabs’ tools can make it hurt less. That would be pretty much it, but Aleksandras also made a product announcement: the web scraping API was getting AI functionality called Copilot (how original). 

Alright, so it’s something to work with. And the implementation really did prove fascinating: the feature generates API queries from natural language instructions. The real utility here is that Copilot can also create custom parsers with a modifiable schema and visual interface for fine-tuning. Many competitors use AI to directly interact with the page, so this approach is highly practical, albeit somewhat more manual and less resilient to changes.

In brief, watch the talk if you’re in the market for scrapers – or you’re trying to create a competing service of your own.

the fifth talk of oxycon
Oxylabs’ AI parser in the flesh (or rather dashboard).

Talk 6: Legal Compliance in the Age of AI

Nerijus Sveistys, senior legal counsel at Oxylabs, went through the risks, regulations, and relevant lawsuits pertaining to AI. It was more of an overview than directly applicable guidelines. Sorry, AI startup founder – you’ll still have to hire a lawyer. 

Not keeping track of the legal environment too closely, I learned that the EU already has a regulatory framework, China enacts laws for particular issues, and the US lacks a uniform approach for now. I also saw how many lawsuits are taking place, mostly over copyright issues. My favorite example was the bathroom-invading Roomba surveillance system. A solid talk overall.

the sixth talk of oxycon
Beware of bathroom-invading Roombas.

Panel Discussion: Advanced Unblocking Strategies

The discussion included Hocine Amrane from Dataimpact, Paulius Gerve from Oxylabs, Jonny Smyth from Ceartas, Brecht Stamper from Lighthouse Intelligence, and Carl Erkof from Wiser Solutions. The host was Juras Jursenas, COO at Oxylabs. Quite a crowd.

It took over 40 minutes, so I’m not sure I’ll be able to recount everything. I suggest you just go and watch the discussion – it’s worth it. Some of my notes to give you a taste:

  • One of the biggest concerns of the participants was the commercialization of anti-bot software. Specialist tools are more robust, and these companies have marketing people to help them proliferate.
  • We’re finally starting to see detection methods like Canvas fingerprinting put to use. There are more techniques waiting to be exploited, such as local storage.
  • Anti-bot research, and much of unblocking, is still done manually, and success relies on human error (which is surprisingly frequent). 
  • To be successful in this game, you need patience and willingness to butt your head against the wall until it gives.


Fascinating stuff.

the panelists of the discussion
The panelists on stage.

Conclusion

That was it for this year’s OxyCon. Did you find anything interesting? The videos are available on demand. And now, we’ll be waiting for another major industry event which is right around the corner – Zyte’s Extract Summit. 

The post OxyCon 2024: A Recap appeared first on Proxyway.

]]>
https://proxyway.com/news/oxycon-2024-recap/feed 0