OxyCon, one of the largest virtual conferences on web scraping, has concluded. We were there to witness it all, and now we want to share our impressions with you.
This recap briefly describes the event’s eight talks. They cover a variety of relevant topics, ranging from large-scale web scraping and the latest legal developments to the potential of video data extraction and, of course, AI.
The videos will be accessible on demand, so you can use our recap to decide which of them, if any, to watch. Let’s begin!
General Information about This Year’s OxyCon
Like the year before, OxyCon took place online, with external speakers presenting remotely. You could access the video stream through a dedicated web page, after giving your contact information. Oxylabs sent a sign-in code by email, and you were in.
In total, there were six solo presentations, two panel discussions, and 15 speakers. The event went on without breaks, so it was quite a challenge to follow all the talks. At the end of each, viewers were able to ask questions via Slido, and there was also a Discord channel for discussion (as opposed to Slack the year before).
All in all, the conference proceeded smoothly, with no major interruptions or connection difficulties. With four years of experience, Oxylabs has a good grasp of how to run virtual events.
Here are this year’s OxyCon presentations and panel discussions:
- Overcoming Blocks in Large-Scale Web Scraping
- Cybercriminal Footprint Erasure: Response Strategies
- Leveraging Machine Learning for Web Scraping
- Open-Source Technology for Extracting High-Quality Data at Scale
- Web Scraping, AI, and Evolving Legal Landscapes
- Accelerating Data-on-Demand Services with Async Python and AWS
- Unlocking Insights from Video Data: Challenges and Solutions
- Web Scraping in 2023 and Beyond
Denis Zyk from Oxylabs kicked off the conference with a high-level overview of what it takes to run a web scraping operation at scale. He went over the main challenges (such as dynamic content), block avoidance strategies, and ways to manage scalability. The last part was the weakest, but covering it in detail would have gone well beyond the 30 minutes allotted to the talk. Finally, Denis promoted Oxylabs’ Web Unblocker, which has solved the above issues and is available as a service.
Overall, while experienced web scraping professionals were unlikely to hear anything novel, it was a solid summary of the status quo for everyone else.
Javier Velandia from Appgate, a zero-trust security service, introduced the challenges his company experiences when fighting cybercriminals and shared ways to overcome them. It’s always fascinating to learn how web scraping is used in specific niches, and this presentation was no exception.
Javier explained the tactics that cybercriminals use, such as implementing hidden redirects, typosquatting, or hiding malware behind URL shorteners. Some of the challenges hit close to home, like dealing with dynamic websites or IP blocks. Also, did you know that ChatGPT has an evil twin? Well, watch this presentation to learn about it and more.
Another Oxylaber, Andrius Kuksta, spoke about the roles that machine learning plays at his company and in the broader context of web scraping. It was both amusing – and slightly absurd – to learn how web scrapers and anti-bot companies employ the same tools for opposite goals. At Oxylabs, ML helps to automatically parse websites, avoid blocks, and manage proxies. Nothing wild, but it does sound useful.
At the end of his presentation, Andrius introduced multiple untapped avenues for machine learning. Maybe you can take them as inspiration for your own project?
Glen De Cauwsemaecker from OTA Insight returned for the second year. Previously, he recounted his company’s growth to 100 million requests per day. This time, Glen introduced some open-source tools that he believed could be useful for web scraping.
In reality, much of the presentation revolved around the dilemma of headless versus non-headless. The presenter shared some useful browser automation resources, but it looked like he preferred the no-browser approach through reverse engineering and unconventional tools like distorting proxies.
Sometimes, the talk sounded like the musings of someone who’s been in the industry for too long. But we recommend watching this one, maybe even more so if you’re a professional, as it strays further away from the cookie-cutter path.
Four lawyers entering a (virtual) room sounds like the start of a good joke. But in this case, it led to an interesting discussion on the legal aspects of web scraping. The panel included Denas Grybauskas from Oxylabs, Alex Reese from Farella Braun + Martel, Kieran McCarthy from McCarthy Law Group, and Hope Skibitsky from Quinn Emanuel’s New York office.
Hope first recounted the infamous hiQ v. LinkedIn case, which ended CFAA-based lawsuits and turned online contracts into the main battleground. The participants discussed when data can no longer be considered public, and how far Terms of Service actually reach (Is scraping an exposed endpoint legal? Can you scrape Twitter if your executives have a social media account?).
Finally, they touched upon the copyright problem with AI models and the relevant cases to follow there. If you’re web scraping as a business, this one is a must-watch.
At Zyte’s conference one year ago, the Ukrainian Alexander Lebedev (Hotjar) spoke about mapping air alerts from a bomb shelter! We’re glad to see that things have more or less normalized for him since then.
This time, Alexander gave advice on how to build scalable data extraction services – more precisely, services that scrape tens of thousands of pages per minute on demand. The presentation covered the benefits of AWS Fargate, the best options for web scraper architecture, efficient proxy use, optimal request batching, and more. Alexander provided plenty of examples, making this one of the most practical talks of the conference. Strongly recommended.
Allen O’Neill, a long-standing OxyCon participant, whetted viewers’ appetite with the opportunities of video data extraction. Because the topic is so niche, the presentation has limited practical relevance today. But we still enjoyed watching it, if only for its aspirational value.
Allen spoke about China’s live shopping events, Gen-Z, and the $2.5T to be generated by video commerce by 2028. The gist: video is important! At the same time, it’s a hard nut to crack if you go beyond metadata: a mid-tier influencer can generate 162 million images and 63 days’ worth of audio for analysis. Allen’s team at SocialVoice made it work, and he shared some tips from their experience.
The second panel brought together four business heads – Ali Chaudhry (Veracious AI), Sash Sarangi (EMAlpha), Neil Emeigh (Rayobyte), and David Cohen (Datasembly) – to discuss the future trends of data extraction. It was hosted by Juras Jursenas, COO of Oxylabs.
The speakers explored multiple topics, mostly focusing on large language models and the increasing difficulty of scraping the web. We won’t cover all the details here, but, for example, Sash spoke about the value of LLMs in homogenizing data, while Neil disclosed Amazon’s recent shadowbans on datacenter IPs.
You’ll find value in this discussion if you want to keep pace with industry trends – in particular, if you’re looking for business ideas involving AI.
That was 2023’s OxyCon. If any talk caught your eye, go ahead and watch the video on the event’s webpage. All in all, we enjoyed the conference. And now, we’ll be waiting for the second major event of the year – Zyte’s Extract Summit!