Zyte Archives - Proxyway

Zyte Publishes 2026 Web Scraping Industry Report

Adam Dubois — Fri, 23 Jan 2026 11:40:30 +0000

Zyte

The report outlines six trends shaping web data collection.

Adam Dubois

Published: January 23, 2026

Zyte, the Ireland-based provider of web scraping tools and services, has published its 2026 Web Scraping Industry Report.

Zyte’s whitepaper describes six key trends shaping web data collection; it overviews their impact and provides actionable recommendations for staying apace with the striding industry.

To briefly summarize them:

The web scraping stack is being increasingly bundled into unified tools – APIs, making individual components like proxy servers less capable or rational to manage in-house.
AI now sits across the entire scraping lifecycle, whether as LLM-based data parsers, machine learning unblocking algorithms, or code generators.
End-to-end automation will become the default for web scraping pipelines: we’ll have an agent orchestrating specialized sub-agents, with humans designing rather than implementing the process.
Manual access strategies will become unsustainable at scale, giving way to machine learning on both sides of the equation: bots and bot detection tools.
Web traffic will divide into access lanes, establishing a hostile, negotiated, or invited relationship with websites. New standards and authentication protocols will grant preferential access to select entities.
Legal clarity is arriving with compliance demands in the shape of California’s Assembly Bill 2013, the EU AI Act, and other legislation. Enterprises are treating data provenance and compliance systems as increasingly important.

Frankly, if you’re building your own scrapers, the overarching message is concerning: AI productivity gains are being overshadowed by access restrictions and growing complexity.

At the same time, commercial tools have become better than ever. Coupled with more legal clarity, they may actually make web scraping easier for companies that have budgets to outsource their web scraping operations.

The report is free to access after entering your email address. We recommend reading it.

Adam Dubois

Proxy geek and developer.

More by this author

The post Zyte Publishes 2026 Web Scraping Industry Report appeared first on Proxyway.

Zyte Extract Summit 2025 (Dublin): A Recap

Adam Dubois — Mon, 17 Nov 2025 12:52:10 +0000

Zyte

Our virtual impressions from the second edition of Zyte’s annual web scraping conference.

Adam Dubois

Published: November 17, 2025

Six weeks after running Extract Summit in Austin (we cover it here), Zyte brought the web scraping conference to its home turf of Ireland.

It would be wrong to call the Dublin edition a reprise of the first event; for the most part, it had a new line-up and focused on different areas of web data collection. As such, both events complement each other.

As always, our aim is to give you a brief (and very human) summary of the conference’s talks. Zyte has made them available on demand, so you can do some opinionated window shopping before committing to the full videos.

Organizational Matters

Zyte didn’t change much in the way it organized the event, so there’s no need to waste metaphorical ink describing it. The conference spanned two days (the first being dedicated to workshops), and there was an option to watch everything online. You had a Slido form for questions, and that’s about it.

The execution wasn’t flawless: for the better part of the event, online viewers had to choose between mono audio on their headphones or cranking up the volume to the max on speakers. But other than that, Zyte’s organizing committee did a solid job.

Main Themes

AI? AI. It’s unavoidable, really. But if Austin was largely about LLM parsing and agent-assisted code generation, Dublin gave much more attention to the unblocking side of things. We had a panel discussion with none other than Antoine Vastel, and Kieron Spearing gave a structured drill-down into how websites construct their requests. We loved that.

The panel of lawyers focused on intellectual property in particular, which is the hot-button topic of the day. And, of course, team Zyte once again tried to sell their internal-project-made-public (VS Code extension), which is proper etiquette for the host in these kinds of events.

The last brief presentation, which for some reason escaped the official agenda, tried to straight out dissuade viewers from building their own scrapers, claiming that outsourcing was the more rational choice. Though structured as a personal story, the talk was nonetheless delivered by Zyte’s developer advocate, so it was hard to take it at face value.

The Talks

Here are the presentations delivered in Dublin. Hit those blue hyperlinks to jump straight to the talk you’re interested in:

How to Make AI Coding Work for Enterprise Web Scraping
Scraping a Synthetic Web: Dead Internet Theory Meets Web Data Extraction
(Panel discussion) Antiban Panel
AI and the Web: What 2025 Changed and What Comes Next
The Anatomy of a Request: Bypassing Protections and Scaling Data Extraction
(Panel discussion) The Future of Data Laws: AI, Web Data, and Intellectual Property
The New Era of AI Data Collection: A Deep Dive into Modern Web Scraping
IPv6-Powered Web Scraping: Design Patterns, Pitfalls & Practical Checklists

Talk 1. How to Make AI Coding Work for Enterprise Web Scraping

This is the one presentation that does repeat Austin. Zytans Ian Lennon (CPO) and John Rooney (Dev Engagement Manager) introduced their new VS Code agentic spider builder to the audience in Dublin.

John first gave a tech demo where he wrote a quick spider to scrape and structure some e-commerce pages. Ian then took over and addressed the bigger picture from the business point of view. The extension is free for now, so we recommend giving it a go to see if it works for you. We were told that it’s already saved a lot of dev resources in Zyte’s internal use.

As for the talk, we suggest watching the Austin version. John executed the demonstration live, which unfortunately resulted in the LLM executing itself mid-process. But even companies like Meta don’t always get live demos 100% right, so we respect John for his bravery.

Talk 2. Scraping a Synthetic Web: Dead Internet Theory Meets Web Data Extraction

If you thought the dead internet theory was fringe – or you didn’t know about it at all – this is the talk for you. Domagoj Maric, AI Customer Delivery Manager at Pontis Tech, described the many ways bots have infiltrated into our browsing lives, manipulating facts and impacting our decisions.

It’s a sprawling talk filled with examples, personal experiences, and even an overview of relevant legislation. Domagoj went as far as to build his own social media bot, proving how cheap and fast this process is. To spoil it a little, 10k comments cost just $2, and this is with current token prices.

While there was less to do with web scraping than the title led us to believe, this truly was a fascinating presentation that we recommend without reservations.

Panel 1. Antiban Panel

This is probably the only panel we’ve seen that brought bot makers and bot breakers to the same stage. It was hosted by Zyte’s CEO Shane Evans and comprised Antoine Vastel (Head of Research at Castle), Fabien Vauchelles (Scrapoxy), and Kenny Aires (Team Lead at Zyte). Antoine is a bit of a mythical figure in our niche, and he was able to participate because his current role doesn’t deal with web scraping that much.

The panel addressed a range of topics, such as how anti-bot companies distinguish between good and bad bots, or how the busy month of November impacts the data extraction and protection industries. However, it mostly dealt with change: in detection techniques, the role of proxies, and the cost of web scraping in general.

We learned a lot. One of the main findings for us was that proxies are becoming less important in the big picture, to the point where they’re now considered a weak signal. Even the consistency of a fingerprint is no longer the ultimate giveaway due to improving botting tools and edge cases from regular users.

Anti-bots face the constraint of retaining a good user experience, bots are constrained by scraping costs, and no one knows what exactly to do with AI agents yet. A great discussion overall.

Talk 3. AI and the Web: What 2025 Changed and What Comes Next

Zyte’s Senior Data Scientist Ivan Sanchez returned to talk about LLMs. Compared to Austin, this presentation gave a more high-level outlook; it overviewed the prevailing trends and allowed itself to speculate a little.

Ivan spent a lot of time talking about reasoning models. He believes that GPT-4o and beyond caused a revolution of sorts that not only improved answers but unlocked new capabilities. The paradigm shifted from guessing the next word to solving problems. Reasoning models become even more powerful when made into AI agents, which is where we currently stand.

The next part dealt with broader market movements, such as more foundational models (including Google’s turnaround with Gemini and Meta’s setbacks), China leading the open source, concerns about a potential bubble, and agents as the new consumers of web data. The presentation is worth watching, especially if you’re not well acquainted with the developments in AI.

Talk 4. The Anatomy of a Request: Bypassing Protections and Scaling Data Extraction

An ex Michelin-star chef, Kieron Spearing from CentricSoftware, now runs 5,000 scrapers that make 130M requests per day. It’s a pretty huge scale, if you ask us! Kieran shared his process for scaling web scraping operations and not going insane with maintenance. It was a practical and highly actionable talk.

According to the speaker, building resilient scrapers starts with the methodology. This requires experimenting with the request through cookies, headers, proxies, and other identifiers, until you’re left with the leanest working configuration.

As a chef, Kieron is a big proponent of preparation. If there’s one thing we took away it’s that every minute spent in investigation will save ten in implementation. But there was much more: for example, that the browser’s dev tools may not honor the original header order, or that going through a website’s API is always worth it, even if it requires much more upfront unblocking.

Panel 2. The Future of Data Laws: AI, Web Data, and Intellectual Property

The inimitable Sanaea Daruwalla, Zyte’s Chief Legal Officer, invited three more lawyers to talk about intellectual property in the age of AI. Its panelist Nikos Callum came from the F500 company Wesco, Dr Bernd Justin Jutte of University College Dublin represented academia, while Callum Henry works alongside Sanaea for Zyte.

The discussion revolved around relevant legislation and legal concepts. It explored the EU’s AI act with its concept of risk tiers. We found it baffling that the level of risk should be self-assessed, and that this doesn’t apply to personal AI use. According to the panelists, the EU’s opt out requirement may also cause challenges, as there’s no set format for this procedure.

We also had the chance to learn about US law, in particular its concept of fair use. Finally, the participants discussed some recent high-profile cases, namely the Anthropic book lawsuit and Getty vs Stability AI. It seems like so far judges have tended to favor AI companies in interpreting transformative use, but nothing has been set in stone yet.

The panel discussion ended on a funny note: when it comes to giving legal advice on web scraping, large language models are much more cautious than even lawyers! Go figure. All in all, this one is highly recommended.

Talk 5. The New Era of AI Data Collection: A Deep Dive into Modern Web Scraping

Fabien Vauchelles, the man behind Scrapoxy, brought his famed slides to talk about the race between bots and anti-bots. Together with his collection of monochrome ducks, Fabien covered the main developments in bot protection. Then, he demonstrated how to build a self-healing scraper.

Fabien’s anti-bot part talked about several threats. The network fingerprint, for one, is something that’s hard to create and easy to detect. The browser scene gave little relief, too, as our current champion Camoufox is open source and thus has been studied to death, and serious scraping requires expensive custom solutions. The presenter further identified new signals, such as the audio fingerprint. At least CAPTCHAs seem to be reaching a dead end for anti-bot tech.

In the second part, Fabien showed several ways to maintain scrapers with large language models. He wrote an MCP server that injects middleware into Scrapy scrapers. Upon failure, an LLM generates new code until the spider works again. All a human needs to do is verify the pull request.

Fabien’s conclusions weren’t very inspiring. In-house scraping is becoming too resource demanding for many new players; and at the same time, the internet is closing off. But hey: we’re still here, so it’s not all doom and gloom.

Talk 6. IPv6-Powered Web Scraping: Design Patterns, Pitfalls & Practical Checklists

Yuli Azarch, CEO of Rapidseedbox, explained why IPv6 proxies should be used in web scraping and how to do that effectively. The why part basically boiled down to IPv6 adoption and the costs associated with getting IPv4 IPs; the how part had few slides but made the meat of the presentation.

It turns out that websites don’t take IPv6 addresses as individual IPs – rather, they evaluate them in blocks of /48 (or septillion addresses). That’s why it’s best to have multiple /48 subnets or, in serious web scraping jobs, go as far as /29. Yuli found that setting up reverse DNS delegation also works to prevent blocks.

Frankly, we had such big expectations from this talk. Can you use IPv6 to scrape Google? Amazon? How many requests can you realistically make per a /48 subnet? What about IPv6-only residential proxy pools which are now emerging as a new product? Alas! But even if we ended up a little disappointed, we didn’t feel like our time was wasted watching the talk. 1.5x speed and skimming through the first half can give you a good bang for your buck.

Ending Remarks

Thanks to Zyte for organizing yet another great conference. If you’re human and managed to get this far down the page – you have our sincerest admiration and respect. Otherwise, please give us your best recipe of cupcakes in the comments!

Adam Dubois

Proxy geek and developer.

More by this author

The post Zyte Extract Summit 2025 (Dublin): A Recap appeared first on Proxyway.

Zyte Extract Summit 2025 (Austin): A Recap

Adam Dubois — Thu, 02 Oct 2025 05:16:31 +0000

Zyte

Our virtual impressions from the first edition of Zyte’s annual web scraping conference.

Adam Dubois

Published: October 2, 2025

Extract Summit is one of the two yearly events dedicated to web scraping, the other being OxyCon. For the first time, the conference spanned two continents: North America and Europe.

This recap covers the US part which took place in Austin at the end of September. Zyte has made the talks freely available on YouTube, so you can use this article to quickly learn about them before committing.

The Dublin edition is set for early November. We plan to cover it, as well.

Organizational Matters

After flip flopping between Dublin and Austin (in 2024, the venue was in Austin), Zyte decided to simply cover both locations. This spelled great news for audiences that invariably suffered due to time differences. Being located in Europe, we know this pain all too well.

The Austin edition concluded over two days. The first day had five technical workshops run by Zyte. Day 2, dubbed the Main Event, featured ten presentations. Virtual attendance was free, but it only included day two. Live tickets cost several hundred dollars for both days; the sum was meant to cover access to the workshops, the venue – and, of course, tacos.

Once again, being geographically challenged, we were unable to watch the talks live. But Zyte was gracious enough to give us access to the recordings shortly after. Live viewers had Vimeo for the stream and Slido right beside it to ask any questions that arose.

Curiously, there were no panel discussions this year – usually, organizers try to include at least one. And, maybe owing to time constraints, the presenters took very few questions after their talks, often just one or two.

The third thing we noticed was how many industry insiders there were. Aside from Zyte’s staff, we counted five web scraping infrastructure providers and only one company that offers a service based on the data they process (without even scraping it!).

Main Themes

Very much expectedly, the conference revolved around large language models. However, the topic didn’t feel overwhelming, as Zyte struck a good balance sprinkling in flavor presentations. By flavor, we mean case studies or know-how specific to the speaker’s line of business, such as Ovidius’ war stories from working in an IP sourcing company.

The talks didn’t single out data processing which, in addition to natural language input, is arguably AI’s main strength in our niche. We also learned about generating spiders through the use of LLMs and AI agents.

Little attention was given to unblocking – come to think of it, aside from Julien’s woes with scraping Google, the topic was omitted altogether. Maybe companies are less willing to share their secret sauce as the stakes grow, which is a broader trend we’ve noticed over the past year.

The overarching vibe (excuse our Gen-Z) was that many exciting things are coming along, but nothing’s been decided yet – and that there are plenty of opportunities to capitalize on. Pretty inspiring, if you ask us!

The Talks

These are the presentations delivered in Austin. Feel free to use the hyperlinks below if any catches your eye in particular:

How to Make AI Coding Work for Enterprise Web Scraping
Why AI Agents Struggle with Web Scraping (and How to Help Them)
The Technical Reality of Processing 10% of Google’s Global Search Volume
You Might Want to Reconsider Scraping with LLMs
Do You Really Need a Browser? Rethinking Web Scraping at Scale
Web Scraping as Social Practice: Balancing Ethics and Efficiency in a Data-Hungry World
Balancing Innovation and Regulation in Data Scraping
Building Blocks of a Web‑Scraping Business
99 Problems but a /24 Ain’t One (Except When It Is)
Data-Quality Framework for User-Submitted Financial Documents

Talk 1. How to Make AI Coding Work for Enterprise Web Scraping

A product demo from the get-go! Zyte brought two heavy hitters, Ian Lennon (CPO) and John Rooney (Dev Engagement Manager) on stage to showcase what the company has been cooking this year.

Without beating around the bush, it’s a VS Code extension called Web Scraping Copilot. The tool’s main purpose is to help developers build Scrapy spiders faster by writing objects, fixtures, and other code needed to scrape websites. It achieves this by coupling GitHub’s Copilot and Zyte’s MCP server.

The presentation had two parts. First, John fired up VS Code and promptly built a spider on stage, demonstrating how to fetch and structure several product pages. Ian then took over and gave a broader perspective from the business point of view.

The gist was that instead of making solutions, Zyte aims to create components to help engineers do web scraping well. This is all done with enterprise requirements in mind, in particular determinism, modularity, and ownership of code.

What’s interesting is that you don’t even need to buy Zyte’s API for the extension to work – it accepts any proxy or unblocking tool. The extension itself is free for now, but you may want to get a paid version of GitHub’s Copilot to avoid restrictions.

Talk 2. How to Make AI Coding Work for Enterprise Web Scraping

In the first presentation, Ian mentioned an autonomy scale where AI tools move from assistance towards agency as they progress. Zyte’s Senior Data Scientist Ivan Sanchez took this idea and fleshed it out in the context of AI agents for web scraping.

The first part covered various types of AI agents, drumming up hype with quotes about their adoption. Ian then took viewers back to reality: in their current shape, AI agents kind of suck for web scraping. He gave three slides with challenges and potential solutions before introducing Zyte’s attempt at overcoming the shortcomings.

Wait a minute, are we talking about Web Scraping Copilot all over again? As it turns out, yes. Ian shared more context about the origins of the tool (internal project) and its innards: Copilot relies on mini-agents and MCP sampling to achieve what insular agents can’t. In the end, he teased viewers with a testimonial that claimed to have cut spider building time from eight hours to just two. Impressive!

Talk 3. The Technical Reality of Processing 10% of Google’s Global Search Volume

In the third talk Julien Khaleghy, CEO of a major SERP API called… SerpApi, shared the trials and tribulations of scraping Google data in 2025. The takeaway is that despite spending ten times the resources, Google is now twice slower to scrape. Ouch.

What makes this search engine such a naughty target? Besides the infamous move to JavaScript dependency in February and the deprecation of more than 10 results per scrape, Julien’s team encounters: more CAPTCHAs, more diverse CAPTCHAs, more and sometimes permanent (!) IP bans, and JS challenges, among other things.

The presentation gives a fascinating opportunity to learn how a tech giant behaves when it starts taking web scrapers seriously. As a bonus, Julien throws in a performant open source Ruby parsing library – because we’re in this together.

Talk 4. You Might Want to Reconsider Scraping with LLMs

The fourth talk really subverted our expectations. Delivered by Jerome Choo, Director of Growth at Diffbot, it spoke about the performance of large language models in data extraction.

Why did we find the talk so subversive? Well, that’s because Diffbot has been an early adopter and major proponent of machine learning that’s not based on gen-AI. We expected Jerome to demolish LLMs, prying open their weaknesses for all to see. What we witnessed was actually an honest confirmation that AI is pretty darn good at putting data into structures.

Throughout the talk, Jerome walked us through multiple data transformation scenarios, such as extracting news signals about M&As or getting the required information from data processing agreements. The presenter compared various language models and gave useful tips which culminated in this nugget of wisdom: write schemas, not rules.

Talk 5. Do You Really Need a Browser? Rethinking Web Scraping at Scale

Another contrarian presentation – but this time, without a twist. Sarah McKenna from Sequentum, a serial presenter at Zyte’s events, challenged the prevailing tendency to run everything through a web browser.

Sarah’s response was mainly prompted by the rise of AI agents and their reliance on browsers. We have Perplexity’s Comet browser, as well as investments into cloud infrastructure like Browserbase and Browser-Use. However, hype is one thing, and reality is another. Sarah cited works revealing the limitations of LLMs and reminded everyone just how costly and brittle browser-based scraping is.

In-house, Sequentum behaves like any sane (read: bootstrapped) web scraper does: it fires up browsers only when forced to, otherwise extracting necessary identifiers and turning to a lightweight HTTP library. Sarah also spoke about Cloudflare’s gatekeeping efforts, battles over standards, and more, concluding that “the browser opportunity” is still wide open for grabs.

Unfortunately, the slides weren’t formatted properly. But it was still an interesting talk to follow.

Talk 6. Web Scraping as Social Practice: Balancing Ethics and Efficiency in a Data-Hungry World

Rodrigo Silva Ferreira, QA Engineer at Posit, gave a presentation about collecting data responsibly.

Rodrigo Silva isn’t a professional or even habitual web scraper, so his talk was naive at times and often sounded more like a school project. However, the speaker’s sincerity and description of his socially-oriented personal projects left us all the better for having watched it.

The most valuable takeaway for us was that scraping is never just technical, which we sometimes tend to forget. It can have a big impact not only for those doing the scraping, but also the destination, and the people or communities whose data we collect.

Talk 7. Balancing Innovation and Regulation in Data Scraping

Another serial speaker at extract summits, Zyte’s Chief Legal Officer Sanaea Daruwalla, brought viewers up to date with the latest legal developments in web scraping and artificial intelligence. Considering that all we do is scrape data and talk about AI, this one is a must.

To keep this sprawling and complex topic digestible, Sanaea took a brilliant concept of scales, putting innovation on one end and regulation on the other. She then tackled four pertinent topics: public web data, copyright in AI, and the use of personal data.

Compared to 2024, the scales tipped strongly toward innovation, but only when it came to scraping public data. The other cases are much less straightforward. Some of the takeaways were that you shouldn’t collect pirated content, and that the EU takes personal information very seriously.

Talk 8. Building Blocks of a Web‑Scraping Business

Victor Bolu is responsible for ensuring the profitability of his business, Webautomation, and he came on stage to talk about it. To be more precise, he brought a generalized plan for small web scraping businesses, together with ideas for bringing margins closer to a typical SaaS business.

Victor whipped charts and numbers; he broke down the costs of goods, spoke about LTVs, CACs, and other terms found in the books of business management. He gave two case studies, showing why more revenue may not result in profit.

Victor even concocted a three-step margin improvement strategy that revolved around cutting proxy costs, automating support, and pushing upsells with AI. Some of the advice was a little hand-wavy (such as building models that auto-adjust to bot changes), but the talk was delivered from a business and not a technical point of view. This one’s optional.

Talk 9. 99 Problems but a /24 Ain’t One (Except When It Is)

That’s one brain twister of a title. Ovidiu Dragusin from Servers Factory described the daily challenges of an IP broker – or, as he cheekily called them, war stories. We saw Ovidiu last year as part of a panel; however, he really shone having the stage all to himself.

Compared to some other proxy-oriented talks we’ve seen, this one wasn’t heavy on content. (In fact, we probably learned more during the brief QA session.) The speaker opted to share three anecdotes concerning SLAs, disappearing suppliers, and miscommunication with new IP sources. The overarching message was that chaos is the status quo, and that these crazy people wouldn’t have it any other way.

Ovidiu came to entertain and maybe make viewers emphatize with IP brokers. He succeeded.

Talk 10. Data-Quality Framework for User-Submitted Financial Documents

Egor Panlov from Truv closed the conference by delivering a talk about extracting information from financial documents. It’s interesting that his company doesn’t even scrape the web; regardless, data parsing is one of the major problem areas in our field.

Egor began by introducing income verification documents (like tax statements or pay stubs) and the challenges they bring. These are usually missing or inconsistent records and different document formats. He then walked us through the company’s verification system, showing how they normalize fields, validate data, and make sure that nothing is inaccurate or tempered with. We’re talking about people’s money, after all!

Large language models played a role here as well, naturally within strict guardrails. In fact, they’ve replaced OCR models for something like photos. Egor’s presentation actually received the most questions out of all, maybe due to fewer time constraints. However, we counted over 40 slides, many filled with tables and formulas; so, the talk was more suitable for watching on demand than live. We recommend doing so.

Bottom Line

That was the first edition of Zyte’s 2025 Web Data Extract Summit. If any of the summaries tickled your fancy, the full recordings are available on YouTube. Thanks for reading!

Adam Dubois

Proxy geek and developer.

More by this author

The post Zyte Extract Summit 2025 (Austin): A Recap appeared first on Proxyway.

Kill Your Product – Why Sacrificing Your Cash Cow Can Be the Path to Growth

Shane Evans — Wed, 23 Jul 2025 09:15:05 +0000

Zyte

An article by Shane Evans, CEO of Zyte.

Shane Evans

Published: July 23, 2025

In the tech industry these days, funerals for software are all too familiar and the graveyard of discontinued products is ever-growing. Whether it arises through company failure, M&A or market shifts, the decision to sunset software can evoke sadness, embarrassment, fear and resentment.

But killing your software can be a path to success. In fact, sunsetting your biggest product at its peak could be the move that unlocks a brighter future.

That’s what I did when I shocked my team by announcing we would deprecate the product accounting for 60% of our revenue. Here is what I learned, and why I think this bold move is sometimes necessary.

Piece-by-Piece Proliferation

All software is the story of laddering waves of new capability at the intersection of problem and opportunity.

My web scraping journey started in 2007 when I wrote Scrapy, a web scraping framework, to support extracting data from e-commerce websites. Within two years, my team used it to gather data from 4,000 websites.

However, additional challenges arose over time. When websites started blocking access, I wrote Smart Proxy Manager to route requests through a large list of IPs, manage them, and avoid getting blocked. Further capabilities were added as separate products, such as a residential proxy offering and the Smart Browser for large-scale browser rendering needs.

But the trouble with incrementalism is that one day, you wake up and realise your offering is really a smorgasbord of cumbersome point solutions.

Complexity Creeps Up

Servicing a suite of tools drains an increasing amount of time. As the task of modern web scraping grew in complexity, demanding several different approaches, we shipped products for each. But our stack became so complex that even our expert users lost time deciding on the optimal solution or responding to website changes.

Smart people, skilled at assembling pieces of a tech stack from disparate sources, won’t always complain to you about this sort of friction, because solving puzzles with competence is their job; technical challenges are business-as-usual.

Moreover, this proliferation of products was considered good practice at the time – every other vendor was rapidly adding more products, often with overlapping use cases.

However, when providing customers with a collection of isolated tools, many remain oblivious to the full range of options available. They often don’t realise when they have made a sub-optimal choice, and can fail to recognise possibilities beyond their immediate needs.

Rip It Up

The answer to our problem lay in combining our offerings in a single API that could address the whole web scraping stack, making optimal use of the infrastructure and avoiding the need for users to manage all that complexity.

But sometimes people become accustomed to the status quo. Product managers assumed the new API would be an add-on to our primary product, Smart Proxy Manager, because they could only perceive iteration through our existing product offering.

So, when I said, “Guys, we’re killing these products,” people were shocked. I announced that, in a couple of years, we wouldn’t be selling the standalone products anymore – instead, we would build a single, brand-new product, an all-in-one web scraping API, called Zyte API.

I don’t mind admitting, the team thought I’d gone crazy; people were unhappy. A year after the switch, however, we have seen a 15% increase in revenue from migrated users. Even though the new product is cheaper on average for the same workload, usage is up considerably as it can be used on a broader range of tasks.

Sunsetting Is Success

So, don’t mourn for deprecated software. Sunsetting a product can indicate a mature software category experiencing strong growth, momentum that has driven a creative explosion of diverse solutions, which now need to be rationalised.

A company’s willingness to kill a product shows that it is evolving fast enough to outpace its previous offerings, transforming standalone features into a larger, more ambitious vision.

You can expect to see a lot more software being sacrificed in the near future. AI is such a step change that it will prompt a fundamental rethink of many products, including in the web scraping field, where large language models will transform the ability to parse unstructured data.

A Funeral for Your Flagship

If you are coming around to the value of bidding goodbye to your main product, what are the main considerations?

1. Build Internal Buy-In

Your team may provide the most resistance. After all, staff are wedded to and care deeply about what’s on their plate right now.

Your job is to build confidence in your vision for the future. Build a coalition of internal support by showing how the medium and long-term benefits represent a bigger prize. I had to communicate the vision clearly, demonstrating how an integrated API solution would ultimately save us time, reduce costs, and improve the customer experience.

Unfortunately, you won’t always convince everybody, and you must still proceed despite some opposition.

2. Reallocate Resources Meaningfully

Stopping doing something frees up the resources to do something else. This is the fuel that gives your future room to grow. Embracing that opportunity means deciding to stop actively developing the outgoing product.

Had we not actively stopped new feature development on Smart Proxy Manager, staff would not have taken Zyte API seriously. This goes for sales as well as product teams – we had to stop selling our older product.

3. Take Users on the Journey

Explain the rationale behind the change and highlight the new product’s value proposition. You need to get customers to see the benefits on the other side of the hill. Although in our case, the new product was cheaper on average, customers will often have concerns and questions about pricing.

4. Build a Bridge to the Future

But it’s not just about communication. Technical customers’ anxiety about product deprecation is real and understandable because no one wants to be forced to write new code for something that works perfectly well. Offering backwards compatibility, as Zyte API did, can massively minimise disruption to users. A commitment to continuing to support critical enterprise customers will always go a long way to guaranteeing continuity.

Kill or Be Killed

Letting go of the past is the best way to embrace the future. Retiring a flagship product isn’t a sign of failure; it’s a commitment to innovation.

As we enter this new era of disruption, I wonder if companies will be willing to disrupt themselves before it’s too late.

Shane Evans

CEO of Zyte.

More by this author

The post Kill Your Product – Why Sacrificing Your Cash Cow Can Be the Path to Growth appeared first on Proxyway.

Extract Summit 2024: A Recap

Adam Dubois — Fri, 18 Oct 2024 08:50:56 +0000

Zyte

Our virtual impressions from Zyte’s annual web scraping event.

Adam Dubois

Published: October 18, 2024

Zyte’s Web Data Extract Summit has ended. The line-up this year was particularly strong, and we enjoyed watching the presentations. These are our impressions from the event.

Zyte has made the videos freely available on YouTube, so you can quickly get an idea of what they’re about before committing 30 or sometimes even 60 minutes of your time.

Organizational Matters

Like the last two years before it, Zyte’s conference was held physically. For the first time ever, the venue was in Austin, Texas. This spelled great news for Americans, but us Europeans could no longer comfortably watch it – the event took place after usual business hours. But I guess there’s no making everyone happy.

2024’s Extract Summit took place over two days. October 9 was dedicated to live workshops, and the presentations were delivered on October 10. Live tickets for both days cost $330. Virtual attendance was free, but it only included the second day’s talks.

Zyte used Eventbrite for ticket management and Airmeet as the streaming platform. The latter had all the bells and whistles like sections for comments, polls, and QA. I think you could also join virtual discussion tables in-between talks, but I didn’t get the chance to try out this option. The presenters would take questions from the live audience, as well as Airmeet, with Zyte’s CEO Shane Evans moderating.

The main event included nine talks and two panel discussions. Due to time differences, I was only able to watch the recordings. Still, I got the impression that everything proceeded more or less smoothly. After all, Zyte’s been doing this since 2019, so they’ve long become pros.

Main Themes

There was basically one theme explored through various lenses. Not hard to guess – it’s AI: machine learning, large language models, generative AI, all types and flavors. Again and again.

I don’t mean to sound negative; after all, AI has been pushing the envelope in web scraping, and it’s on the top of everyone’s minds while they’re trying to implement it and keep up, all at once. Zyte did a good job composing the line-up, and there were plenty of outside speakers to bring their perspectives.

Something that caught my attention was how many vendors of web scraping tools Zyte accepted to its event. Apify, Browserless, Reworkd can all be considered competitors, yet they were still invited to talk.

The Talks

These are 2024’s presentations. Feel free to jump to the ones that caught your eye using the quick links below.

Harnessing the Power of Large Language Models for Advanced Data Engineering and Data Science
Web Data Extraction Mastery: Real-World Implementations and ROI-Driven Success Stories
A Practical Demonstration of How to Responsibly Use Big Data to Train LLMs
How We Transformed Zyte’s Data Business with Cutting-Edge AI Technology
Panel Discussion: The Future of Proxy Technology: Trends and Innovations in Residential, Mobile & Datacenter Proxies
Distributed Intelligence for Distributed Data
Panel Discussion: Navigating the Legal Landscape of Web Data Extraction
Advanced Techniques and Innovations for Extracting Specific Data Attributes from Diverse Sources
Cache, Cookies, Reconnects: Accelerate Scrapes with Session Management
How to Feed Large Language Models (LLMs) with Data from the Web
Enabling Large Language Models (LLMs) Agents to Understand the Web

Talk 1. Harnessing the Power of Large Language Models for Advanced Data Engineering and Data Science

Neelabh Pant from Walmart spoke about his team’s use of LLMs for data cleaning. In an act of extreme generosity for the uninitiated, he decided to begin from the creation of the universe, introducing data processing and even LLMs. But it didn’t take long for things to pick up pace.

In brief, traditional rule-based methods require a lot of manual effort, can’t handle context and unstructured data well. Conversely, these are the areas where LLMs excel. After many experiments, Neelabh built a two-phase system that adds missing values (called improvement phase) and extracts facts from unstructured data (called feature enhancement phase). He provided the implementation details and compared four approaches based on price and effectiveness (spoiler: RAG + agents win).

If you’re in the field of data engineering and spend inordinate amounts of time on messy data, this is the talk for you.

Talk 2. Web Data Extraction Mastery: Real-World Implementations and ROI-Driven Success Stories

John Fraser’s company Parts ASAP scrapes the agricultural product data of several dozen competitors several times a week. He outsources the process to Zyte and, by timely implementing the extracted insights, ensures a healthy but by no means shocking 20% annual growth to the happy board. Sounds… a bit mundane, doesn’t it?

Well yes, but also no. John is what I described to myself as a nonchalant badass – one hand in the pocket, giving a no-nonsense story of how he found a practical use of web scraping to grow his business. It doesn’t push any envelopes or promise you the world. And yet, I enjoyed it a lot.

Talk 3. A Practical Demonstration of How to Responsibly Use Big Data to Train LLMs

Joachim Asare from Harvard University spoke about the ethical pitfalls looming in the LLM training process. These include leaking private information, introducing biases, and ingesting low-quality data, among others. The presenter explored the issues during different stages of training: data collection, fine-tuning, and deployment.

Joachim’s mantra throughout the talk was dump data, ‘dumb’ AI. He provided harrowing examples where a maltrained mental health AI model can advise people to kill themselves, or where Meta’s AR glasses were hacked with terrible privacy outcomes. I don’t dabble in LLM training, so the talk was harder to relate to, but it’s still very relevant for understanding how third-party AI can affect you as the user.

Talk 4. How We Transformed Zyte's Data Business with Cutting-Edge AI Technology

Ian Lennon from Zyte spoke about the problem of horizontal scaling – in particular, the company’s approach to providing high-quality (read: structured) data from hundreds of websites. According to Ian, it’s a combinatorial problem, and AI has allowed Zyte to slash setup costs and onboard customers they couldn’t before.

How exactly? First, by building supervised machine learning models that can parse various page categories. Then, by making them work without browser rendering. Zyte’s final iteration (at this point) allows users to customize the models, by either adding manual code or invoking privately-hosted LLMs.

Zyte’s also betting big on scraping templates that cover all major stages of web scraping: crawling, unblocking, and parsing. I remember the provider introducing its no-code product page template last year – turns out, e-commerce data makes up nearly 60% of Zyte’s business. More templates are coming soon.

Overall, it’s an interesting watch to learn about Zyte’s approach, even if it takes a more salesy angle.

Panel Discussion. The Future of Proxy Technology: Trends and Innovations in Residential, Mobile & Datacenter Proxies

Jason Grad from Massive, Neil Emeigh from Rayobyte, Ovidiu Dragusin from Serversfactory, and Vlad Harmanescu from Pubconcierge sat down for a discussion on proxy servers, managed by Zyte’s Shane Evans. There was supposed to be one more participant – Tal Klinger from The Social Proxy – but he wasn’t able to attend.

The panelists touched upon many topics ranging from IP sourcing, effectiveness of different proxy types, and geolocation challenges to ethics and IP scoring. To my surprise, the latter received particular attention, as more and more clients are turning to services like IPQualityScore for evaluating proxy services. This can be a dangerous (and not always useful) practice, but it serves as an easy signal for IP quality.

The panel had a good balance between providers focusing on residential and server-based proxies, highlighting their perspectives and challenges: for example, geolocation is a significant issue for ISP proxy vendors, less so for peer-to-peer networks. Considering that our website has the word proxy in it, this is a must.

Talk 5. Distributed Intelligence for Distributed Data

Matthew Bloomberg, co-founder of Charity Engine, spoke about the project and its future directions. We first encountered Charity Engine when testing Zyte’s now-defunct Crawlera tool several years ago; it then served as an IP network for the smart proxy management layer.

Turns out, there’s more to the project than we thought. Charity Engine is a distributed computing platform – so, something like Folding @ Home. It’s able to mobilize not only network resources but also computing power and even full browsers from willing residential users. Matthew gave examples of how the network was used for academic purposes and shared upcoming updates, such as data processing layers on top of the basic API.

My favorite idea was that Charity Engine doesn’t just extract knowledge from the web but also creates new knowledge in the process. By the way, the network is open to any business interested in its capabilities.

Panel Discussion: Navigating the Legal Landscape of Web Data Extraction

Sanaea Daruwalla from Zyte, Hope Skibitsky from Quinn Emanuel (the law firm that litigated the HiQ case), Stacey Brandenburg from Zwillgen, and Don D’Amico from Glacier Network discussed the legal topics relevant to web data extraction. There was a lot to talk about: the discussion lasted nearly an hour and nearly gave me carpal tunnel syndrome from all the notetaking.

Without expanding too much on it, the current legal landscape is super volatile: we had the Bright Data lawsuits, and all the AI cases are buying lawyers their third seaside mansion. The panelists spoke about the applicability of different online agreements, collection of publicly available personal data, how to approach copyright in the context of AI, relevant regulations, and more.

If you’re running a web scraping business or working with LLMs/Gen AI, you should definitely watch this.

Talk 6. Advanced Techniques and Innovations for Extracting Specific Data Attributes from Diverse Sources

Iván Sánchez, senior data engineer at Zyte, described his company’s use of LLMs for data parsing. It complements and narrows down on Ian’s (Talk 4) high-level overview of Zyte’s AI capabilities.

Iván first introduced the reasoning behind using LLMs at all. He then went on to address the major challenges that arise in implementing the models, such as optimizing token use and devising evaluation metrics. I’ve learned a lot: that it takes relatively few samples to train a model, that you can save money by only selecting relevant regions of a page, and that models become funky way below their maximum token limit. Recommended.

Talk 7. Cache, Cookies, Reconnects: Accelerate Scrapes with Session Management

Joel Griffith from Browserless, a company that runs hardened headless browsers so you wouldn’t need to, described the methods of session management. In particular, he covered caching, cookies, and browser processes, comparing the strengths and weaknesses of each.

It was a highly structured presentation that reminded me of university lectures. If you’re dealing with headless browsers in-house, you’ll learn when to use each method, backed by Joel’s personal experience and some rough implementation examples (which he elegantly called sketches). The process approach received the most attention in QA, and from me as well.

Talk 8. How to Feed Large Language Models (LLMs) with Data from the Web

Another web scraping company took the stage, this time Apify headed by Jan Čurn. If anything, the presentation was a product demo, but that doesn’t mean we got nothing to learn.

Jan spoke a lot about retrieval-augmented generation – its basic mechanisms and importance as the killer LLM application. A bold claim, but one that’s hard to disagree with. He then blazed through some web scraping challenges, setting up the stage for the demo and introducing neat third-party utilities in the process. Finally, Jan showed Apify’s new actors that are made for RAG and include integrations with Pinecone, Langchain, and the like.

Talk 9. Enabling Large Language Models (LLMs) Agents to Understand the Web

One more web scraping company. Asim Shrestha, CEO of Reworkd AI, represents the new generation of data extraction tools that arose together with LLMs. From what I read in their Techcrunch interview, Reworkd’s aim is to capture the long tail of customer needs which competitors like Bright Data currently may not cover very well.

In the talk, Asim described his company’s problem space. It includes finding the right interface to feed data to AI agents, crafting useful prompts, and evaluating the output with real websites. Through constant experimenting, Akim’s team has found unconventional solutions, such as rendering a webpage into a spatial 2D structure with labels for links and other elements. This, and another tool for running evaluations, has been open sourced for everyone to use.

Unfortunately, the audience was tired by this point and didn’t ask a single question. But that doesn’t reflect the quality of the talk – I found it stimulating. Knowing that Reworkd is backed by venture capital, we’re bound to see more innovation come from it.

Bottom Line

That was Zyte’s Web Data Extract Summit – the last web scraping-related conference of 2024. If any of the summaries tickled your fancy, the full recordings are available on YouTube. Thanks for reading!

Adam Dubois

Proxy geek and developer.

More by this author

The post Extract Summit 2024: A Recap appeared first on Proxyway.

Zyte Adds AI Scraping Functionality to Its API

Adam Dubois — Wed, 06 Mar 2024 11:20:08 +0000

Zyte

The tool can now crawl, unblock, and parse websites using AI and an optional no-code interface.

Adam Dubois

Published: March 6, 2024

Zyte, the Ireland-based web data extraction platform, has announced the addition of new AI-based functionality to Zyte API. With it, the tool has become a “complete solution for web scraping”, allowing developers to “build new spiders and add data sources in minutes”. The solution combines Zyte API and pre-made spider templates accessible through Scrapy Cloud on Zyte’s dashboard. It allows extracting data from websites without creating your own crawling logic, dealing with anti-bot systems, or specifying selectors for data extraction. The data extraction step relies on Zyte’s proprietary machine learning model that the company claims to be up to “50x cheaper and 56% more accurate” than large language models like ChatGPT. For now, the only available template covers e-commerce product pages. Using Scrapy Cloud’s no-code interface, you can quickly create a web scraper by specifying a few parameters: starting URL, geolocation, request count, crawling strategy, and browser rendering. The scraper then automatically crawls the website and returns structured data. Developers who need more functionality can make quick customizations on the dashboard or use the provided Python library. This way, it’s also possible to build completely new scrapers that make use of Zyte’s AI tech.

The company first introduced its e-commerce template and no-code interface during Extract Summit, Zyte’s annual web scraping conference (you can find our recap here). We’ve tried it several times for small tasks like crawling e-commerce pages and received satisfactory results.

All in all, Zyte’s AI Scraping is a fascinating project that tries to combine the interests of both business folks and seasoned Scrapy developers. For now, the experience is still clunky: it requires two subscriptions (to Zyte API and Scrapy Cloud), parses limited data types, and the user interface can be intimidating. In addition, the API’s pricing isn’t always easy to estimate, as it changes dynamically based on the target.

Having said that, the underlying tech is, without a doubt, solid, and we’re eager to see how the provider will streamline its product in the future.

You can try the AI e-commerce template for free by claiming Zyte API’s $5 credit and using Scrapy Cloud’s free tier.

Adam Dubois

Proxy geek and developer.

More by this author

The post Zyte Adds AI Scraping Functionality to Its API appeared first on Proxyway.

Extract Summit 2023: A Recap

Adam Dubois — Mon, 20 Nov 2023 13:18:00 +0000

Zyte

Our (virtual) impressions from Zyte’s annual web scraping event.

Adam Dubois

Published: November 20, 2023

Zyte’s Web Data Extraction Summit has concluded. It was a treat – and, in my opinion, a must for anyone in our industry, whether to network or learn. In this article, I’ll share our impressions and briefly recount the conference’s 13 talks. They’re available on demand after filling in a lead capture form, so you’ll be able to watch any presentation that catches your eye. Let’s get started!

Organizational Matters

Same as last year, 2023’s event had a physical venue. This time it was hosted in Dublin, Zyte’s hometown, but you could also join online. Aside from networking opportunities, live attendees had the chance to participate in four workshops that took place a day before the talks.

An early bird’s ticket cost €159 (plus an extra €80 for access to the workshops). Virtual attendance was actually free this year, which marked a welcome change for those who couldn’t (or wouldn’t) make it live. Unfortunately, that included us. Zyte used Eventbrite for all tickets, so those who found out about the conference once it had started were out of luck, as registration had closed.

Zyte streamed the event over YouTube, used Slido setup for asking questions, and had a Slack channel for less immediate discussions.

Main Themes

We knew that the conference would spend time on AI, but it really took the center stage this year. ChatGPT, other LLMs, and machine learning in general permeated nearly every talk, even the flavor presentations that usually introduce some niche use case. This is understandable given how hyped and compatible with our industry the new tech is.

Secondly, Zyte really made an effort to make the conference hands-on, or at least applicable in practice. And we’re not talking about the workshop. During the talks, participants were given access to Zyte’s new no-code interface, a POC tool that (of course) integrates ChatGPT, a form to calculate their web scraping expenses, and so on. Even the presentation on legal matters provided an actionable checklist for four relevant use cases (two of which, once again, involved AI).

Zyte’s web scraping API has matured a lot in a year. Knowing that, the host dedicated three presentations to promote the tool. This was definitely noticeable; but it didn’t feel pushy or tacky watching online.

The Talks

Let’s run through the conference’s 13 talks. They cover various topics from an exploration of ChatGPT for the purposes of data collection to a deep dive into current bot protection systems.

Here’s the line-up. Clicking on a title will take you there:

Talk 1. Introduction by Zyte’s CEO: Why I Replaced My Most Popular Product

Intriguing title, right? Zyte’s CEO Shane Evans kicked off the presentations with a walk down memory lane. He recounted the story of how Crawlera (Zyte’s proxy management layer) came to be, how it evolved, and why it had to give way to Zyte API. It was both a feature run through (look what we can do now!) and a deprecation notice for the old tool.

The talk was kind of promotional (it even included our benchmarks!) but still interesting, as it described the problem space. Overall, it was a good way to start the day.

Talk 2. Innovate or Die: The State of the Proxy Industry in 2023

Isaac Coleman, VP of marketing at Rayobyte, gave an obituary for its main product – datacenter proxies. I don’t know what public speaking classes they take, but Rayobyte’s speakers all have the mannerisms of American preachers. Truth be told, it was very fitting for the occasion.

Isaac spoke about the three shifts that decimated datacenter proxies for three major use cases, sometimes overnight. The market for this proxy type has reportedly shrunk and even the remaining major vertical is at risk. Scary, right? Isaac then broke down the cost of web scraping operations and provided a worksheet for calculating your costs. Handy? Yes. Worth watching? If you use proxies, then definitely yes.

Talk 3. Can ChatGPT Solve Web Data Extraction?

Konstantin Lopuhkin, head of data science at Zyte, tried to answer what many have on our minds: can I use ChatGPT for scraping, to what extent, and is it worth it? Konstantin went over the price considerations of using OpenAI’s APIs, different scraping approaches (generating code vs extracting directly with LLM), and compared commercial models with open source alternatives. Finally, he demoed an internal tool that is no longer available.

The presentation was born from experience, so it provided concrete numbers and reasoned arguments. It may not age very well due to how fast the tech evolves, but at this moment, I consider the talk super relevant. The audience’s questions were also interesting, as they touched upon the considerations many of us have with LLMs.

Talk 4. Enterprise-Grade Scraping with AI

Another talk by Zyte’s crew that builds upon the previous presentation, so the two may be worth watching back to back. In particular, Ian from Zyte talked about the problem of horizontal scaling (parsing many pages) and how AI can be used to address it.

Curiously, the hero of this story wasn’t LLMs: they were briefly mentioned and shoved away as an immature tech. Rather, it was Zyte’s own supervised ML model that the provider has been perfecting for over four years. It runs on every page, is reportedly more accurate, and up to 50 times cheaper than ChatGPT 3.5. Ian dove into the model’s innards, while his colleague Adrian demonstrated a no-code wrapper that crawled and parsed an e-commerce page.

Talk 5. Detect, Analyze & Respond. Harnessing Data to Combat Propaganda and Disinformation

Nesin Veli from Identrics gave a flavor presentation on the methods and prevention of cognitive warfare. Doesn’t ring a bell? The term defines techniques used to manipulate public perception for various ends.

Nesin introduced his company’s web scraping stack and showed how they trained an ML model to recognize hate speech in a news site’s dataset. But to us, the fascinating part was how Identrics applies the tools to combat cognitive warfare. The range of activity is very broad and includes things like narrative tracking across media channels and outlet credibility checks. Considering how prevalent and insidious information warfare has become, this was definitely educational.

Talk 6. Spidermatch: Harnessing Machine Learning and OpenStreetMap to Validate and Enrich Scraped Location Data

Jimbo Freedman’s company Huq Industries provides popularity, visiting time, and other data points related to geographic areas or objects. To do this, they first need to precisely map the points of interest. You wouldn’t think that’d be a challenge, but Jimbo proved us otherwise. Fun fact: some of Huq’s competitors still mark objects by physically visiting most of them!

In brief, the company’s problem space involves scraping thousands of relevant stores (via forked AllThePlaces spiders) and cross-referencing the information with OpenStreetMaps to validate accuracy. This raises multiple problems related to metadata and store co-ordinates. Jimbo described his four-step process and how the involvement of LLMs affects the output. To give you a hint: significantly, but hallucinations remain a problem.

Talk 7. Anatomy of Anti-Bot Protection

This one’s a treat. Fabien Vauchelles, the anti-detect expert at Wiremind, dissected the main bot protection methods. This would be interesting on its own, but Fabien’s French accent, enthusiastic delivery, and custom, presumably AI-made illustrations, really turned the talk into an experience.

Fabien went through the four web scraping layers – IP address, protocol, browser, and behavior – and how precisely anti-bot systems use them to identify bots. There are so many data points they can track… sometimes too many for their own good! The presenter pinpointed the main ones and listed eight steps for tackling increasingly difficult targets. Recommended.

Talk 8. Taming the World Wide Web

American accent, slides from the early ‘00s, and a vague title promising the world… This is what Eric Platow from LexisNexis greets you with. But that’s just the first impression. In reality, Eric walked the audience through a project he had to complete: 1) a million biographical records to scrape each month from thousands of websites, 2) a deadline of six months, and 3) minimal human resources.

The websites were related to lawyers, so they posed some peculiar challenges: old (like really old) page structures, repurposed or squatted domains, and irrelevant pages. Another big challenge was extracting and normalizing the right data; this required fuzzy matching, NLPs, and LLMs. In the end, Eric’s efforts managed to save $3.7M on manual labor by 400 people. Watch the talk to learn the specifics.

Talk 9. Soaring Highs and Deep Dives of Web Data Extraction in Finance

Alex Lokhov from Hatched Analytics delivered another flavor presentation on productizing alternative data for the finance vertical. It had two related but also somewhat separate parts.

The first part illustrated the relevance of alternative data and listed the requirements for productizing it. For example, we learned that financial services always require context and that datasets suffer from something called alpha decay. The second part was more technical, focusing on data storage and especially visual monitoring – the presenter’s strong suit. So, it’s possible to learn something even if you’re not particularly interested in this use case.

Talk 10. A Step-by-Step Guide to Assessing Your Web Scraping Compliance

Law time! The presentation was delivered by Sanaea Daruwalla, Chief Legal Officer at Zyte. We’ve seen Sanaea multiple times before; based on prior feedback, this year she chose a very specific theme. It covered four popular web scraping use cases, with a checklist of possible legal risks and their mitigation strategies.

To us, this combination really hit the mark – especially given that two of the situations dealt with AI models, a topic that’s extremely pertinent today. Sanaea gave actionable advice and outlined the upcoming regulations likely to affect web scraping operations, such as the EU’s AI Act. All in all, one of the must-watches of the conference.

Talk 11. Using Web Data to Visualise and Analyse EPC ratings

Another tech demo delivered by Neha Setia Nagpal and Daniel Cave. It demonstrated Zyte API’s no-code wrapper for e-commerce product pages, together with its flexible Scrapy cloud underpinnings.

Basically, Daniel played a data scientist who had a quick project in mind. He used the wrapper to quickly collect the energy efficiency ratings of home appliances and visualize them in Tableau. Neha took the role of an engineer. The scraper’s stock functionality didn’t fully meet Daniel’s needs, so she opened the hood and fixed this by adding a few parameters. Overall, it was an interesting but completely optional presentation.

Talk 12. Dynamic Crawling of Heavily Trafficked Complex Web Spaces at Scale

A talk by Andrew Harris from Zoominfo. Complex spaces are platforms that have many users interacting with them at once: social media, search engines, and similar. Andrew’s challenge was designing a low-code platform that users with complex needs could use simultaneously. The solution, and the main focus of this presentation, was a sophisticated scheduling system with weighted queueing and other elements.

Maybe because the talk came so late in the conference and we watched it in one sitting, it was a bit of a slog. The presenter used academic language, with slides that were full of information and didn’t necessarily match what was spoken. If you care about the topic and decide to watch the talk, be prepared to pause it multiple times.

Talk 13. The Future of Data: Web-Scraped Data Marketplaces and the Surge of Demand from the AI Revolution

The final talk of the day featured Andrea Squatrito from Data Boutique, a data marketplace. It consisted of two halves. The first tried to substantiate data marketplaces using arguments that apply to most platforms: mostly economies of scale and easier distribution. The second half was more interesting, as it addressed the challenges of trust and quality assurance. AI was only mentioned in passing.

Conclusion

That was 2023’s Extract Summit! If you found any of the talks interesting, you can watch them on the event’s website. And now, we’ll be waiting for the final conference of the year – Bright Data’s ScrapeCon (which unfortunately had to be delayed due to events in Israel).

Adam Dubois

Proxy geek and developer.

More by this author

The post Extract Summit 2023: A Recap appeared first on Proxyway.

Zyte Announces 2023’s Web Data Extraction Summit

Adam Dubois — Mon, 10 Jul 2023 00:00:00 +0000

Zyte

The annual conference will take place in Dublin on October 25-26.

Adam Dubois

Published: July 10, 2023

Zyte, the provider of web data extraction infrastructure and services, has announced 2023’s Web Data Extraction Summit, its annual conference on web scraping. Also called Extract Summit, this year’s conference will take place in Dublin on October 25 and 26. Its first day will be dedicated to developers with four workshops, a coding contest, and networking with peers. The second day will include 12 talks on web data, AI, the future of web scraping, as well as a keynote by Zyte’s founder Shane Evans. Overall, Extract Summit will explore three main topics:

AI in web scraping and how it’s impacting the industry,
Web scraping APIs, or the evolution and future of web scraping technology,
Scaling your data extraction with the latest advancements in technology, expertise, compliance, and industry benchmarks.

Registration is already open. Early bird seats to the live event cost €159.00, with an extra €80.00 if you wish to attend the developer workshop. Alternatively, it’s possible to watch the event online free of charge. Last year’s Extract Summit attracted over 230 attendees. We covered the talks on our blog. Zyte has also made their videos available on the event’s website. Web Data Extraction Summit is one of the two major conferences on web scraping. The second event, OxyCon, will take place online on September 13.

Adam Dubois

Proxy geek and developer.

More by this author

The post Zyte Announces 2023’s Web Data Extraction Summit appeared first on Proxyway.

Zyte Officially Launches Zyte API

Adam Dubois — Fri, 27 Jan 2023 00:00:00 +0000

Zyte

The new tool is envisioned to become a single solution for all web data extraction needs.

Adam Dubois

Published: January 27, 2023

Zyte, the provider of web scraping infrastructure and data extraction services, has launched Zyte API – a customizable tool for collecting data from any website.

Zyte’s new product falls under the category of web scraping APIs. They take care of proxy management, block avoidance, and headless browsers to extract web pages with a single API call. This greatly simplifies the process of data collection, freeing up developers to focus on the more productive tasks of data transformation and analysis.

The competition in this area is getting tough, but Zyte doesn’t come empty handed. Its API includes interesting features, such as the ability to automatically select the appropriate proxy type and location based on the URL. In addition, Zyte’s enterprise clients can use its TypeScript API to write custom page interactions in a cloud development environment.

Another worthy mention is Zyte’s pricing model. Instead of having a fixed rate, the API dynamically calculates request cost based on the website’s difficulty, use of a headless browser, and other factors. To make estimations easier, Zyte has built a visual dashboard tool where you can enter any URL and see how much it costs to scrape.

Zyte envisions its API as the ultimate data collection tool for developers. For now, its capabilities extend to scraping pages, but the roadmap promises data parsing, crawling, and cloud storage capabilities. Once they’re out, Zyte will be able to streamline its sprawling product line-up to the benefit of both the provider’s marketing team and the user.

We’ve had the opportunity to try out Zyte API first-hand. You can find our impressions and performance tests in a broader context here.

If you’d rather test the tool yourself, the provider offers a $5 credit. You can use it without commitment to determine whether Zyte API meets your needs.

Adam Dubois

Proxy geek and developer.

More by this author

The post Zyte Officially Launches Zyte API appeared first on Proxyway.

Zyte’s 2022 Extract Summit: A Recap

Adam Dubois — Thu, 06 Oct 2022 12:00:00 +0000

Zyte

Our (virtual) impressions from Zyte’s annual web scraping conference.

Adam Dubois

Published: October 6, 2022

Web Data Extraction Summit (or simply Extract Summit) is Zyte’s annual conference on data collection. This year, it took place in London, with an option to watch the whole thing online. The conference featured 12 talks which the organizers somehow managed to fit into one day. This article presents our impressions from the conference. Zyte has made its videos available on demand, so you can get a quick impression of what they’re about before watching the full recording.

Organizational Matters

We weren’t able to attend the event live, so we can only comment on its virtual aspect. We’d like to thank Zyte for graciously providing us with a free ticket.

Zyte ran ticket sales through EventBrite, which had a decently streamlined experience. The only confusing time was after completing the purchase flow, when it wasn’t clear what would happen next. Thankfully, Zyte sent an email shortly with a link to the streaming page.

The livestream itself was hosted on YouTube. You could easily open it on the platform, so I imagine those £25 tickets went a long way in some companies. There was also a Slido widget for asking questions.

The virtual event started with serious technical issues, making online viewers miss the first two presentations. But once the livestream was brought online, things went smoothly.

The Talks

Extract Summit had a whopping 12 talks in one day – the same number as OxyCon’s two days combined. They covered a variety of topics ranging from data collection trends and developments to the more practical questions of running and scaling web scraping infrastructure.

Here’s the full list. Clicking on a talk’s name will take you to its description:

Talk 1: State of the Web Data Industry in 2022

As it has become tradition, the conference kicked off with a presentation by Shane Evans, Zyte’s CEO. Shane briefly ran through the data collection trends he had identified since the last Extract Summit. The 12 minutes gave quite a few interesting (though sometimes predictable) insights.

To very briefly summarize them, companies have started treating data as an increasingly strategic priority. Some are even implementing organizational changes to build dedicated data teams. Furthermore, web scraping has become clearer from the legal standpoint, which helps with adoption. According to a study, spending on web data should grow at a mid-double digit rate year over year.

The presentation featured a breakdown of Zyte’s clients by industry and use case. Unsurprisingly, e-commerce took the first place, but Zyte also has quite a few customers in finance. Finally, Shane pondered on the build-buy dilemma, highlighting the challenges of modern data collection.

Talk 2: Practical machine learning to accelerate data intelligence

A talk by Peter Bray, CEO at Versionista. His service monitors website changes for clients in pharmaceutics and other industries. Peter gave a utilitarian demonstration how off-the-shelf machine learning tools can create value at scale and limited cost.

As a change monitoring service, Versionista needed a way to categorize content, so that it could provide insights into the changes. Peter showed how his team applied Google Vertex, ElasticSearch, and Named Entity Recognition to create models for various page and content types. He also gave advice for simplifying data labeling and overcoming cases that lack context for natural language processing.

We feel like this presentation goes very well with Allen O’Neill’s crash course on machine learning. Together, they introduce the machine learning techniques relevant for web scraping and give a concrete example of how companies apply them in real life. Consider watching both.

Talk 3: How to ensure high quality data while scaling from 100 to 100M requests/day

A walkthrough of how a company’s web scraping infrastructure evolved throughout 10 years in business. It was an interesting presentation that moved step by step, providing the reasoning behind each, together with increasingly complex diagrams of the company’s infrastructure. Worth a watch.

The speaker, Glenn de Cauwsemaecker from OTA Insight, gave the same talk in OxyCon, so you can read our more detailed impressions here.

Talk 4: Sneak peek at the new innovations at Zyte

The (understandably) longest talk, where Zyte’s CPO Iain Lennon and Head of Development Akshay Philar introduced the company’s new developments.

In brief, Zyte spent the year working on three issues:

Solving the site ban problem. According to Zyte, the tech is there, but the real pain is balancing out effectiveness and cost.
Creating a browser that’s designed for web data extraction, and
Solving scaling challenges to faster growth.

To address the first two, the company is releasing Zyte API. It automatically selects proxies, solves CAPTCHAs, and runs headless browsers if needed to ensure scraping success. In this, it strongly resembles Bright Data’s Web Unlocker or Oxylabs’ APIs. However, Zyte also brings some innovations to differentiate it from the competition:

Different targets will have dynamic pricing based on how much it costs to extract data from them. You’ll be able to see this information in the dashboard and play around with various parameters to reduce the expenses.
The API will allow you to make various page actions for JavaScript-dependent pages, such as scrolling or clicking on buttons.
Zyte will provide a cloud-based IDE for scripting browser actions.

For the third issue, Zyte is introducing crawling functionality exposed via the Zyte API. The company will maintain custom spiders for high-volume sites and use machine learning for the long tail. This functionality will require no contracts or minimum commitment, which should make it great for quick prototyping.

Zyte API is set for release on October 27, with web crawlers coming in early 2023.

Talk 5: How the data maturity model can help your business upscale

One more by Zyte. James Kehoe, a product manager, presented a data maturity model the company created after interviewing 40 industry representatives. The model aims to help businesses identify where they stand in their data collection operation and what they can expect going forward. It was a business-oriented talk with very fast delivery.

James outlined a grid: its columns listed the five sequential steps of a data collection operation and the rows their maturity level. James gradually went through each step, explaining what it looks like at different maturity levels. He then showed where the interviewees placed themselves on the grid.

In a nutshell, the model looked pretty useful, even if a little theoretical. The talk should translate well into a blog post if Zyte ever decides to publish one.

Talk 6: Architecting a scalable web scraping project

Another presentation from Zyte given by developer advocate Neha Setia Nagpal. It resembled the previous talk in that the presenter introduced a framework. But where the data maturity model was meant more for evaluating a data scraping operation, this one aims to help design one.

Neha outlined eight steps (and also best practices) that should help developers architect a scalable solution:

Clarify the goal.
Analyze the website.
Prioritize the project attributes like scalability and extensibility.
Highlight the constraints.
Design the crawl.
Ensure data quality.
Choose the tech stack.
Brace for impact.

Overall, it’s a handy list of things to consider, especially if you’re moving from ad-hoc scraping to a sustained operation or preparing to collect data in a company setting.

Talk 7: Ethics and compliance in web data extraction

Zyte’s head legal counsel Sanaea Daruwalla gave a speech on the intersections between legality and ethics in web scraping. In other words, she showed how running a business in an ethical manner protects you from violations of most relevant laws. Sanaea’s talk covered multiple important areas, such as the use of personal data, applicability of website terms of use, copyright law, sourcing and use of residential proxies, and more. For example, did you know that the GDPR covers public personal data as well, or while company details aren’t considered personal data, there are exceptions? Earlier on, Sanaea participated in OxyCon’s legal panel discussion. But here she had the whole stage for herself, and she’s a good speaker. It’s a presentation we definitely recommend watching. Add in OxyCon’s talks and the FISD guidelines (recommendations by industry experts in the legal and alternative financial data space that Saneae recommended), and you’ll have a pretty good understanding of the topic.

Talk 8: Crawling like a search engine

A talk by Guillaume Pitel, CTO of Babbar. His company crawls over 1 billion pages daily to help SEO marketers with their backlinking efforts. Similarly to Glenn’s presentation, this one also recounted Babbar’s journey to achieve its current scale. It should be interesting to companies doing heavy web crawling; maybe less so to regular web scrapers.

According to Guillaume, much of the web is crap. So, he and the team had to figure out how to continuously crawl only the most interesting parts of the web, compute Page-Rank like metrics on a graph, and then analyze the context to create a semantically-oriented index.

A good part of the talk covered technical implementation. It described how Guillaume’s team adapted the BUbiNG crawler for their needs, managed the WWW graph, and architectured the whole system to successfully handle billions of URLs per day. Curiously, the infrastructure runs on Java, uses just 16 IPs, and doesn’t process dynamic content for now.

Talk 9: Challenges in extracting web data for academic research

A refreshing perspective that we rarely see in similar conferences. Dr. Hannes Datta from Tilburg university spoke about the whys and hows of data collection in a university setting. It’s worth watching to expand your horizons and understand how different the constraints can be for others.

It turns out that web data is getting increasingly popular in fields like marketing research. In 2020, it played a part in 15% of all studies there. Scientists use this data to investigate new phenomena, improve methodology, and for various other tasks. Hannes himself studied the impact of Spotify’s playlist algorithms when the platform was still emerging.

Dr. Datta (a satisfyingly fitting name) brought forth some of the peculiar challenges that academics experience. For example, they care deeply about the validity of data, which can be impacted by website changes or even personalization algorithms when accessing content from a residential IP. Scientists also have to worry about legal and ethical questions. All in all, there are many considerations to take into account, and Hannes described a good deal of them.

Talk 10: Data mining from a bomb shelter in Ukraine

A touching presentation by Zyte’s Ukrainian software engineer Alexander Lebedev. He got stranded in the country when the war started. Being an engineer he is, Alex decided to use a data-driven approach to organize his sleep and other activities with the fewest interruptions.

In essence, Alex wrote a Telegram scraper that collected air alerts from two channels. He then mapped the frequency of those air alerts during different times of day on a graph. Once a siren gets sounded, people rush to bomb shelters, so Alex reasoned that finding out the patterns could help his family organize their life around them.

Alex was in a relatively calm region, so we’re not sure how useful this data actually was in his case. But the project definitely helped him keep himself occupied and gain control over the uncertainty. Alex also managed to extract some broader insights about the bombing frequency of different regions and their change over time.

It was definitely one of the more unique demonstrations of the uses of web scraping.

Talk 11: How to source proxy IPs for data scraping

A talk by Neil Emeigh, CEO of Rayobyte. Recently rebranded from Blazing SEO, Rayobyte controls hundreds of thousands of datacenter proxies. Neil shared insights into how his company sourced those IPs and what customers should look out for. The delivery was over the top at times (throwing money at the audience? Come on.), but it made the talk entertaining to watch.

Neil spoke about renting versus buying addresses, the importance of IP diversity and ASN quality. He gave some strategies for working with datacenter proxies (the control vs diversity option), together with interesting tidbits of knowledge. Did you know that Google can ban a subnet for as little as 200 requests per hour? That you should never get AFRINIC IPs? Or that IPv6 proxies suck (for now)? Well, there are good reasons why.

To top it all off, Neil told a story of how the FBI came to his house and started interrogating him about proxies. Turns out, the IP address industry is pretty controversial, especially if you’re from Africa. But you’ll learn more by watching the talk.

Talk 12: The future of no-code web scraping

The final speaker, Victor Bolu, runs a no-code data collection tool called Web Automation. He overviewed the types, potential, and limitations of no- and low-code web scraping tools, all the while trying to persuade an audience of web scraping professionals that they’re the future.

That didn’t exactly work out: in a poll, something like 85% voted that no-code tools won’t replace code-based web scrapers anytime soon. But then again, maybe it wasn’t the right question to ask. Victor himself spent a lot of time speaking about expanding the market, which no-code has a fair chance to achieve.

Whatever your stance is, the talk provided plenty of material for understanding the landscape and selling points of no-code data collection, especially if you’re thinking of introducing a similar tool of your own.

Conclusion

Despite the initial technical issues, we believe that Zyte held a successful conference that was well worthy of the entry fee. We’ll be waiting for the next Extract Summit with excitement – maybe even attend it live? See you there!

Adam Dubois

Proxy geek and developer.

More by this author

The post Zyte’s 2022 Extract Summit: A Recap appeared first on Proxyway.

Zyte to Hold Its Annual Web Scraping Conference

Adam Dubois — Wed, 24 Aug 2022 00:00:00 +0000

Zyte

The fourth Web Data Extraction Summit will take place on September 29 in London.

Adam Dubois

Published: August 24, 2022

The data collection specialists at Zyte have announced Web Data Extraction Summit – an annual event bringing together web scraping professionals and data lovers. The one-day conference is set to take place on September 29. Unlike the previous online-only event, it will once again have a physical venue – London’s County Hall. Those unable to attend live will be able to watch the event online. This year’s Web Data Extraction Summit will have 13 talks covering four major topics:

Scaling web scraping, which will focus on overcoming the challenges involved in scaling data collection operations.
Ethical web data extraction, which will talk about web scraping best practices and ethical use cases.
Innovation in web scraping, which will share the innovative ways companies use web data.
The future of the web data industry, which will discuss how web scraping will change moving forward.

The list of speakers includes Zyte’s CEO Shane Evans, Rayobyte’s CEO Neil Emeigh, Professor Hannes Data from Tilburg university, and more. This year, attendance is paid. A ticket to the live event costs £200, while an online seat is £25. You can find more information about the agenda and buy tickets on Web Data Extraction Summit’s website.

Adam Dubois

Proxy geek and developer.

More by this author

The post Zyte to Hold Its Annual Web Scraping Conference appeared first on Proxyway.

Zyte to Hold a Virtual Web Scraping Conference

Adam Dubois — Wed, 04 Aug 2021 12:00:00 +0000

Zyte

Web Data Extraction Summit, a one-day event for data extraction professionals, will take place on September 30.

Adam Dubois

Published: August 4, 2021

The data extraction experts at Zyte are organizing Web Data Extraction Summit – an annual event bringing together web scraping professionals and data lovers. Held on a new virtual event platform, Extract Summit promises a day “jam-packed with talks and workshops” that will include everything from web scraping tools and techniques to the legal aspects of data collection. This year, the provisional line-up includes 20 speakers from Zyte and other data driven companies. They will cover topics like data quality, headless browsers, alternative data in finance, legal matters, and more. Attendees will also be able to take part in a range of panels and workshops where they’ll have the chance to talk with data collection professionals and do some hands-on coding. This will be Web Data Extraction Summit’s third year in the running. Last year’s event had over 3,000 people sign up, making it one of the largest conferences on web data extraction. You can register now free of charge on https://www.extractsummit.io/.

Adam Dubois

Proxy geek and developer.

More by this author

The post Zyte to Hold a Virtual Web Scraping Conference appeared first on Proxyway.