A Short History of Ticketing Proxies

Ticketing proxies are used by ticket scalpers to buy tickets en masse. They have always been an important part of their business.


For most of history, a ticket proxy would have been a guy you asked to wait outside the cinema, theater, or stadium to buy tickets for the hottest events on the day they went on sale. Today, he has been replaced by a computer. But how did this state of affairs come to be?

The Rise of Online Ticketing

While the exact origins of paperless tickets are debated, Ticketmaster is definitely one of the most influential companies in the field. Even before the ‘90s, it was working on primitive versions of online tickets. Namely, the company put machines in physical stores where people could buy and print tickets instead of going to the location itself. This made use of the growing network infrastructure while also working around the issue of customers not having access to computers, the internet, or printers at home. 

But by the mid ‘90s, home computing was widespread enough and the internet accessible enough to facilitate buying tickets entirely online. Ticketmaster launched ticketmaster.com in 1995, while tickets.com was founded around the same time. Most people in 1996 didn’t yet have cellphones, let alone smartphones, but the foundation was there.

By the time the millennium rolled around and the world failed to end, the adoption of online ticket sales was spreading rapidly, as this enthusiastic 2001 PR piece on theater ticket sales in the UK attests. Meanwhile, Ticketmaster was buying competitors left and right, diversifying its offerings. For example, the TicketWeb acquisition was meant to expand its reach to New York clubs and the San Diego Zoo.

So, there was obviously money in selling tickets. But it’s equally true that there was always money in reselling tickets, especially at a markup, especially for hotly desired events…

The Rise of Online Ticket Scalping

With online ticket sales taking off, secondary businesses started springing up to feed on that. If you had tickets you wanted to sell, you needed a platform for that, and StubHub – launched in 2000 – was built on that premise. Of course, not everyone wanted to use an official platform, and thus ticket resellers found business on Craigslist and Facebook (once those became a thing, anyway). When eBay purchased StubHub, Ticketmaster countered by buying the competing TicketsNow to claim a piece of the ticket resale pie. 

Wiseguy Tickets in the indictment.

At the same time, online ticket scalping was not far behind official online ticket sales. One of the earliest “successes” was Wiseguy Tickets (also operating as Seats of San Francisco), which manipulated fan club memberships to buy up tickets for U2’s Vertigo tour in 2005, earning $2.5 million in profit in the process. When the law finally came down on them in probably the most famous case targeting ticket scalpers, the prosecution alleged that Wiseguys made $25 million during their 2002-2009 run.

The group used a variety of methods to overcome security measures put in place by ticket vendors – including beating multiple generations of CAPTCHA – all in order to help their employees and bots secure the most lucrative tickets. 

Bots were one of the key components of their illicit success. These automated systems could spot sales and reserve tickets faster than any human could, and could be scaled without compromising security or the bottom line. After all, bots don’t talk or earn wages. But the bot tactics and innovations are well-documented elsewhere – what’s most important for us is what they did with proxies.

Wiseguys’ Use of Ticketing Proxies

Your ticketing bot may be smart, but it can’t do anything if Ticketmaster has banned its IP. To get around the issue, Wiseguys created their own network of proxies. 

According to the indictment, Wiseguys started building out covert IP infrastructure around 2007, using shell companies Smaug and Platinum Technologies. Wiseguys registered 100,000 IPs to impersonate legitimate customers. Furthermore, they aimed to rent non-consecutive IPs to hide the synthetic nature of their network. 

Wiseguys rented these IPs from companies providing colocation services by claiming the addresses were for testing internet protocol services or brokering hotel room bookings. As such, they effectively built an infrastructure of what we now call datacenter proxies.

A product screenshot from a defunct website selling ticketing bots.

The first line of proxies was meant for Watchers, bots programmed to monitor ticket vendors for new events. To operate the Watchers, Wiseguys leased Amazon servers. The moment a new sale was spotted, the server lease was terminated. This hid the connection between Watchers (that were constantly refreshing the website to spot ticket sales) and the actual ticketing bots that would attack in the next wave. 

Granted, that whole infrastructure didn’t spring up at once, and neither did the technical adaptations. Moreover, 100,000 IPs wasn’t the end goal, as email correspondence showed Wiseguys’ intent to acquire up to 500,000 addresses. 

Others would have reasons to follow in their footsteps. While it’s hard to evaluate the size of the scalping market, some estimates valued the US ticket resale market in the early 2010s at around $4 billion.

The Technical Adaptations of Ticketing Proxies

It’s difficult to pinpoint when the proxy seller industry as we know it emerged – only that it definitely started with datacenter proxies. Providers merge and rebrand, so research involves turning to internet archives and hunting for snapshots of websites.

Wiseguys made do without any proxy provider – for them, it all started with sourcing datacenter proxies from colocation services. But those IPs are fairly simple to detect: either by getting data from IP geolocation services or just seeing many similar IPs connect at once. They’re also then easy to block, as the ticket seller doesn’t risk blocking actual customers. After all, people don’t live in datacenters and, as such, don’t get datacenter IPs. This made it clear scalpers needed something harder to detect – which led to the rise of residential proxies.

Residential proxies were the natural next step: hosted by real users, their IPs were identified as coming from residential areas. They would be harder to block, too – you may be blocking a paying customer! 

According to the scarce historical data, Luminati – that’s Bright Data before the rebrand – was marketing itself as a peer-to-peer VPN provider until the end of 2015. From 2016, it started positioning itself as a proxy network with residential IPs. And if we go over Oxylabs’ archives, residential proxies appeared as a distinct product in May 2018.

The ol' Luminati frontpage on archive.today.

There was also (and probably still is) a shadier undercurrent of residential proxy vendors. 911 S5 was a massive supplier of proxies that started operations in 2014 before it was shut down by the FBI in 2024. It used six free VPNs to turn 19 million devices into residential proxies and reap around $100 million in profits. The existence of malicious actors like these certainly siphoned off some of the demand.

It’s unclear when the untraceability of residential proxies became a large enough selling point for them to be legally marketed as a specialized product for ticket scalping. But we do know that the sneaker scalping craze was taking off in 2018, spurring a niche market that was looking for alternatives to datacenter proxies.

While sneakers weren’t directly tied to ticketing, the two markets developed in parallel and pushed proxy suppliers to adapt. For a long time, both sneaker releases and ticket sales worked on the first-come, first-served principle. As such, scalpers needed more speed, and bots could only work as fast as their internet connections allowed. This is where proxy suppliers had to adapt – speed was essential, and ISP proxies offered it.

ISP proxies combined the speed and reliability of the datacenter proxies with the untraceability of the residential ones. But this solution didn’t work forever. Eventually, sneaker sales moved to a raffle system (as for tickets, various artists had tried doing that even in the Wiseguys days) and speed lost prominence as a selling point. Still, ISP proxies remain a staple of proxy suppliers to this day.

An example of a primitive CAPTCHA from an ancient paper on CAPTCHAs.

For all the evolution of proxies, bots are still the most important part of the technological arms race. CAPTCHAs never stopped changing; security measures to detect bot-like behavior demanded new types of bots that would act sufficiently human-like, and so on. There are far more vectors for bot detection and obfuscation than there are for proxies.

But the fight doesn’t end there: tackling scalpers solely via technological effort would mean playing catch-up with a decentralized group of heavily financially incentivized and inventive people. That’s why ticket scalping has long been combated on another front: the law.

The Legal Backlash to Ticketing

Web data scraping has been around for almost as long as the internet itself, but it was rapidly gaining prominence around the same time as sneaker copping. This was great news for proxy providers, as they could increasingly diversify their markets. Meanwhile, legislators were slowly catching up to the idea that automated ticket scalping is potentially harmful to consumers.

For example, in 2016, the US passed the Better Online Ticket Sales (BOTS) Act. In the Federal Trade Commission’s own words, “the law outlaws the use of computer software like bots that game the ticket system.” More than that, it also outlawed the sale of tickets that were knowingly obtained via such methods. 

FTC's sassy introduction into the explanation of the BOTS Act.

Other countries have also been working on similar legislation. The UK passed a law in 2018 that allows potentially unlimited fines for “ticket touts” (that’s what scalpers are called in the UK) who use bots. The Canadian province of Ontario implemented a similar rule in 2017. In Taiwan, both scalping and using proxies to get tickets are against the law.

The effectiveness of the BOTS Act was, however, dubious. There was one case in 2021 when three ticket brokers were ordered to pay $3.7 million in damages (as they were determined to be unable to pay the full $31 million sum set earlier). It remains the most prominent case brought before the public. This in part prompted President Trump to issue an executive order on March 31st, 2025, directing the FTC to enforce the BOTS Act more rigorously.

The enforcement of such laws remains fairly weak outside the US as well, especially since the act of scalping itself often remains unregulated and thus very lucrative. For example, the parts of the Ontario law targeting scalping specifically were rolled back in 2019 after a change of government. Taiwanese officials are currently considering tying ticket purchases to buyers’ real names as a way to impede scalpers.

The serious implementation of such laws is also impeded by scalpers’ (alleged) secret ally: the ticketing agencies themselves. You may remember that Ticketmaster purchased a ticket reselling company to get a cut of both sales and resales. However, recent lawsuits against Ticketmaster and Live Nation by the US Department of Justice and, later, the FTC claim that the companies knowingly allow scalpers to purchase tickets beyond their set limits (via multiple accounts) – among other shady practices.

FTC bringing down the heat on Ticketmaster.

Between weak, uneven enforcement of anti-scalping and ticketing laws and conflicted actors on the anti-scalping side, scalpers have room to survive and thrive. The profits of servicing this market don’t seem to be large enough for large companies to risk it, but the risk-reward calculation seems good enough for smaller businesses. This is also reflected in how providers market this use case today.

The Life of Modern Ticket Proxies

Today, no proxy providers market themselves as selling exclusively ticketing proxies – at least not publicly. Besides, that would be somewhat limiting when you consider the many use cases proxies have today. However, providers’ overall attitudes towards this specific niche vary.

As of September 2025, some prominent proxy providers either directly marketed proxies for ticketing or at least endorsed the use of their product for reselling:

Other major proxy providers forbid the use of their products for ticket scalping purposes:

But while reputable large enterprises might not be too hot on ticketing, smaller providers are seizing on the opportunity:

ISP proxies are now marketed for the ticketing crowd by smaller and more specialized proxy providers:

In Conclusion

Proxies have been an inextricable part of digital ticketing for almost as long as it has existed. However, while they’re vital in enabling the process, they’re not nearly as crucial as ticketing bots. We can already see major proxy providers ditching or outright banning ticketing as a use case for their products.

As web scraping becomes an increasingly important feature of e-commerce, proxies have other reasons to proliferate and develop. And all those developments are likely to be, one way or another, useful for ticketing. Therefore, the history of ticketing proxies is the history of commercial proxies in general. And that is a lot less criminally tantalizing!


What Is an AI Data Parser?

The Oxford Dictionary of English describes parsing as – just kidding! Parsing means turning an abstract jumbled ball of information into a nice and structured collection of data. Of course, you can do it yourself by manually entering the details of your lunch receipts – or 69,000 pages of laptops on sale on Amazon – into a spreadsheet. But AI parsing is much more powerful – and a lot better suited for scraping the web.

AI Data Parsing in Short

AI parsing is the method of turning unstructured information – like prices on a bunch of web pages – into nice and orderly data fit for a database by using LLMs (Large Language Models). Traditional methods already offer accuracy and speed, but the added flexibility of LLMs greatly reduces maintenance requirements and makes scaling easier.

But to really explain the benefits of AI-assisted data parsing, we have to first look into the ways data was structured before AI/LLMs entered the field.

Traditional Methods of Data Parsing in Web Scraping Explained

Pre-Machine Learning Parsing

The basic model of parsing a website means taking a programmer, sitting them down in front of the HTML structure of a web page, and making them write a CSS, XPath, or Regex-based algorithm for extracting data out of that page. Ideally, once written, the algorithm will be able to reliably parse all the necessary data from any page under the same category of a domain.

The parsing algorithm you get is both static and deterministic:

  • Static: it doesn’t change unless you change it manually.
  • Deterministic: run it on the same web page a thousand times, and it will always get the same output; if the listed laptop price is $850, then the database entry for the price will always be $850.
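To make that concrete, here’s a minimal sketch of such a hand-written parser in Python using lxml – the URL and selectors are made up for illustration:

    import requests
    from lxml import html

    # Hypothetical product listing page and selectors – adjust for the real site.
    response = requests.get("https://example-shop.com/laptops?page=1")
    tree = html.fromstring(response.text)

    products = []
    for card in tree.xpath('//div[@class="product-card"]'):
        products.append({
            "title": card.xpath('.//h2[@class="title"]/text()')[0].strip(),
            "price": card.xpath('.//span[@class="price"]/text()')[0].strip(),
        })

    print(products)  # same page in, same data out – static and deterministic

Run it against the same page a thousand times and the output never changes; change the page’s HTML and it quietly breaks – which is exactly the maintenance problem described below.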

There are two downsides to this method:

  • Maintenance: a static algorithm can’t handle any changes to the web page – just like you, but with less drama. So, someone needs to keep an eye on the web design and then rewrite the algorithm to adapt to any changes. 
  • Non-scalable: let’s say it takes one developer one day to write a parser for a single domain. That’s not bad if you’re only scraping/parsing data from one domain. What if you want to hit 10,000 different domains? Then you’ll need either 10,000 developers, 10,000 days – or, more realistically, a combination of the two. Oh, and don’t forget the maintenance.

Classic Machine Learning Parsing

When machine learning (ML) became more commonplace, a new method was employed:

  1. You sit down with a web page, look at the HTML code, and split it into elements. 
  • You label the elements: this is the price field, this is the product photo, etc.
  3. You train ML models on all this data before letting them loose to parse websites. 

After the training is done, you get a model that is mostly domain agnostic – so, you don’t need to retrain it for every new domain. 

The downsides are thus:

  • Intensive training: before your ML can start parsing websites, you need to train the model. And to train the model, you need to process and label thousands of websites. That’s a lot of manual labor.
  • Data drift: websites change over time, but the ML doing the parsing can’t account for that, so you will have to invest in the maintenance of the model as well.

Visual Parsing

Visual parsing is a novel take on ML parsing, and it made Diffbot famous. Instead of rooting through the code to identify elements your ML model needs to seek out, visual parsing renders the page in the browser. The model then parses the page via computer vision and returns structured contents. It’s kind of like what you do as a human when viewing a website. 

  • The big upside of the Diffbot approach is that you don’t need to know how to code to train the model: you mark all the segments on a website as you visually understand them, and then the ML model will learn from that. 
  • Since it doesn’t look into the code of the web page, just the visual output, it’s less sensitive to any changes that may happen in the background that are invisible to the eye.
  • On the other hand, it still needs a lot of human work to prepare the training materials, and the maintenance requirement isn’t going anywhere either. 

With that in mind, we can consider AI web parsing.

Using AI for Web Parsing

AI web parsing involves large language models. There are currently two main methods at play: LLM-based instruction generation and an LLM-based JSON parser.

LLM-Based Instruction Generation

This method may also be called LLM-based parser generation – it’s what Oxylabs’ OxyCopilot runs on. You take the HTML of a target page and feed it into an LLM together with instructions to generate a parser (which would include what things you want to parse). The LLM will then write a parser – XPaths and all – for you.

In this situation, it replaces the programmer who would have to write that algorithm manually. You do it for a single page on the domain, and you now have a static and deterministic parser that will be able to snag data from any page on the same website. 
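A rough sketch of the idea in Python, with call_llm() standing in for whichever LLM client you use – the prompt format and field names are illustrative, not OxyCopilot’s actual interface:

    from lxml import html

    def call_llm(prompt: str) -> str:
        # Stand-in for your LLM client of choice (hosted API or local model).
        raise NotImplementedError

    def generate_parser(sample_html: str, fields: list[str]) -> dict[str, str]:
        # Ask the LLM once per domain to write one XPath per field.
        prompt = (
            "Given this HTML, answer with one line per field in the form 'field: xpath'.\n"
            f"Fields: {', '.join(fields)}\n\nHTML:\n{sample_html}"
        )
        answer = call_llm(prompt)  # e.g. "price: //span[@class='price']/text()"
        return dict(line.split(": ", 1) for line in answer.splitlines())

    def parse_page(page_html: str, xpaths: dict[str, str]) -> dict:
        # Reuse the generated parser – static and deterministic – on any page of that domain.
        tree = html.fromstring(page_html)
        return {field: tree.xpath(xpath) for field, xpath in xpaths.items()}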

So, this approach:

  • Saves labor and time: you don’t need a specialist to painstakingly code the parser for every domain you want to scrape. 
  • Has a measure of self-healing: if you set up an alert for when changes to the pages you scrape are detected, the LLM can be instructed to rewrite the parser, making maintenance that much faster. 

The downsides:

  • You need a new parser for each domain, just like with the write-the-parser-yourself methods. However, this is alleviated by the fact that you can just make the AI write more parsers. 
  • Human-written algorithms still remain superior when it comes to accuracy. To bring an AI parser up to par (at least somewhat), you’ll need to implement validation strategies, which increase complexity and cost.

LLM-Based JSON Parser

But what about skipping the middle-man – or the middle code, to be precise? Method two, LLM-based JSON parser, cuts out the whole “having to build a parser” part. What you do is take the HTML of the page, define your scraping requirements in JSON, and feed them both into a cheap LLM.

AI is much better at following rules than writing them. Once it’s done parsing, it can present the output as the structured data you need. You can use your own LLM for this! And with the wide variety of MCPs available these days, all that data can then be sent to your database without you having to do anything.

Plus, unlike a static parser, which will break when it encounters any changes to the website, an LLM will keep parsing the website no matter what happens – with no alterations to the JSON instructions.
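Here’s a minimal sketch of that flow, again with call_llm() as a stand-in for a cheap hosted or local model and a made-up requirements schema:

    import json

    def call_llm(prompt: str) -> str:
        # Stand-in for a cheap hosted or local model.
        raise NotImplementedError

    # Hypothetical scraping requirements, defined once in JSON.
    requirements = {
        "title": "product name as a string",
        "price": "numeric price without currency symbol",
        "in_stock": "true or false",
    }

    def parse_with_llm(page_html: str) -> dict:
        # One LLM call per page: no parser to build, nothing to rewrite when the layout changes.
        prompt = (
            "Extract these fields from the HTML below and reply with JSON only.\n"
            f"Fields: {json.dumps(requirements)}\n\nHTML:\n{page_html}"
        )
        return json.loads(call_llm(prompt))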

A couple of downsides, however:

  • It is non-deterministic: you may ask the LLM to scrape the price, but the results aren’t guaranteed to always be the same even when scraping the same page twice. 
  • It’s also a little expensive: you’re making an AI query per HTML page parsed, and those aren’t cheap. Also, a single LLM request can take 5-8 seconds to process, while a parser does the same job in about a second. 
  • Local models require expensive infrastructure: you’re not running a million requests on a MacBook. You have to consider at which point it becomes more economical to have a home scraping setup vs. just buying more tokens.

Still, this method is employed by Crawl4AI, SpiderScrape, Firecrawl, AI Studio and many others. That’s because there are scenarios where it is actually more efficient. 

Imagine scenario #1: you have a single domain and one million parsing requests to make:

  1. Method one runs the AI once, gets the parser, and the parser then scrapes those 1 million pages on the cheap. 
  2. Method two would make one million AI queries – you pay for each one (and remember: queries take more time than scraping).

But what about scenario #2: 100,000 domains and 10 requests per domain?

  1. Method one creates 100,000 algorithms that you then have to match with their specific domains and then run one million scraping requests. And if you don’t have the self-healing algorithm set up, you now have to manage your scrapers. 
  2. Method two runs that single JSON request on every page, at which point the price issue comes down to whether you’re using a local model or not, how much you paid for the infrastructure, and the alternative costs of following method one.
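To put rough numbers on those two scenarios, here’s a back-of-the-envelope comparison – every price below is made up, so substitute your own LLM and scraping costs:

    # Hypothetical unit costs in dollars – swap in your real numbers.
    LLM_CALL = 0.002    # one LLM request (HTML in, parser or JSON out)
    SCRAPE = 0.0001     # one scraping request with a ready-made parser

    # Scenario 1: one domain, 1,000,000 pages
    method_one_s1 = 1 * LLM_CALL + 1_000_000 * SCRAPE        # ≈ $100: one generated parser, cheap scraping
    method_two_s1 = 1_000_000 * LLM_CALL                     # ≈ $2,000: one LLM call per page

    # Scenario 2: 100,000 domains, 10 pages each
    method_one_s2 = 100_000 * LLM_CALL + 1_000_000 * SCRAPE  # ≈ $300, plus managing 100,000 parsers
    method_two_s2 = 1_000_000 * LLM_CALL                     # ≈ $2,000, but nothing to manage

The raw per-request math often still favors method one; what tips the scale in scenario #2 is the hidden cost of matching, running, and maintaining 100,000 separate parsers – or the savings of a local model.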

In Conclusion

AI web parsing is the logical next step in the evolution of web parsing. The previous methods were already good at parsing. The introduction of LLMs solves the issues of scaling and maintenance, making it easier to increase the scope of web scraping operations and to keep them going in the face of constant change.

What Is a Residential VPN?

Many users don’t have a working understanding of what a VPN is, yet they still get assaulted with terms like residential VPN. As with a lot of networking technology, it’s not that straightforward to understand or explain. But by gum, we have it in us to do so! Read our short explanation of residential VPNs, how they work, whether you need them, and their possible alternatives.

What Is a Residential VPN?

A residential VPN is a specific type of virtual private network. Like all VPNs, it encrypts your data and routes it via an intermediary device – a server. That way, all of your data is labeled with the IP address of the intermediary device, hiding your real IP and location. 

But here’s the big difference: while a regular commercial VPN uses datacenter servers as their intermediaries, residential VPNs send your requests through a computer or a laptop that belongs to a regular person. This usually takes the form of some sort of bandwidth-sharing agreement.

There’s also the possibility that a residential VPN is residential in the same way that a static residential proxy – also known as an ISP proxy – is residential. That is, the ISP hosts proxy servers in a datacenter but marks their IPs as residential.

The main benefit of a residential VPN is that you get the IP of another real internet user. This is great for various use cases: services and businesses are less likely to block a residential IP since it represents a potential customer. Datacenter IPs, on the other hand, are typically associated with anonymization services and bots.

The downside is that residential IPs are exposed to technical limitations you can expect from using a random guy’s laptop. A VPN server at a datacenter is a machine that’s optimized for handling huge volumes of traffic, served by high-grade internet connections. The random guy’s laptop – less so, so the connection may be slower and less reliable. 

Plus, just because it’s a residential IP, it doesn’t mean that the guy the IP belongs to hasn’t gone and gotten himself banned on a bunch of online services.

How Does a Residential VPN Work?

Here’s how a residential VPN works. 

  1. You either set up a VPN to route data through your buddy’s PC/laptop/etc., or subscribe to a commercial VPN that pays users to use their devices as VPN servers. 
  2. You connect to the residential VPN server – this creates a VPN tunnel: any data that travels between your device and the residential VPN server is encrypted (in addition to any encryption it may naturally have, like HTTPS).
  3. The VPN server decrypts the data (removing the VPN encryption – it can’t remove any pre-existing encryption like HTTPS) and forwards it to the website or service you wanted to reach – the data now bears the VPN server’s IP address. 
  4. The website or service sends the reply back to the residential VPN device.
  5. The VPN app on the server forwards the data to your device (via the encrypted VPN tunnel mentioned in #2). 
  6. The VPN app on your device removes the VPN encryption.


This is how you get to use websites and services without revealing your true IP.

Why Use a Residential VPN?

The main reason to use a residential VPN is to bypass geoblocks on services that are eager to block VPNs. This includes streaming services, online stores, even banking. They put in a lot of effort to sniff out likely fake (automated or scam) users. But it’s a lot harder to detect a VPN connection when it presents a residential IP address.

The rest of the use cases are identical to those of a regular VPN:

  • Overcoming geoblocking: connect to a server in the right country, get a local IP, gain access to local content. 
  • Maintaining your privacy from your ISP: it can only see that you’re connecting to a VPN.
  • Overcoming local firewalls: your employer/school/library Wi-Fi can’t block YouTube if it doesn’t see you connect to YouTube. 

What Are the Differences Between Residential VPNs and Proxies?

VPNs and proxies are closely related technologies, with one crucial difference: proxies don’t have to encrypt the data traveling between your device and the proxy. This is a matter of privacy, as you may not want your ISP to be able to tell what websites you’re visiting or when. Without this encryption, a VPN would be no different from a proxy. 

So why not use a VPN all day, every day? Encryption has a cost, that’s why. There’s a concept called “encryption overhead”, which is the additional information you need to transmit for the other device to be able to decrypt your data. This incurs a constant drain on your bandwidth, usually nearly imperceptible. However, the drain can become increasingly noticeable when you undertake tasks that are data-intensive (scraping) or speed-reliant (gaming, sneaker copping, etc.). 

That right there is the use case difference: VPNs are favored for manual tasks – as in, something the user might do themselves. This includes everyday online activities, streaming video and so on. Proxies, on the other hand, are employed for large scale automated tasks like web scraping.
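That per-app nature is visible in how proxies are typically configured – for example, pointing a single Python script at a proxy takes one dictionary (the gateway address and credentials below are placeholders):

    import requests

    # Placeholder gateway and credentials – use your provider's actual values.
    proxies = {
        "http": "http://username:password@gate.example-provider.com:7777",
        "https": "http://username:password@gate.example-provider.com:7777",
    }

    response = requests.get("https://api.ipify.org", proxies=proxies, timeout=10)
    print(response.text)  # prints the proxy's IP address, not yours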

Pros and Cons of Residential VPNs

So, with all these explanations of what a residential VPN is, here are the pros and cons summed up:

Residential VPN pros:

  • Hides your IP just like any VPN
  • Gives you a likely-not-banned residential IP
  • Lets you enjoy VPN benefits with a lower likelihood of being detected as a VPN user

Residential VPN cons:

  • The connection is less reliable
  • The IP may still get banned
  • Residential VPNs are more expensive

Residential VPN Alternatives

There are three main residential VPN alternatives: residential proxies, mobile proxies, and dedicated IP on VPNs.

  • Residential proxies: literally the same as residential VPN, but without the encryption overhead, which makes residential proxies the faster option. Also, residential proxy subscriptions charge by traffic (not great for regular browsing) and, like all proxies, usually cover a single app that can be configured with proxies (while VPN coverage is system-wide).
  • Mobile proxies: like residential proxies or residential VPN, but the devices in question are on mobile carrier connections. This makes their IPs even less likely to be detected and blocked, but the connection can be shakier than with regular residential proxies. Plus, there may be more IP rotation as mobile devices move between networks.
  • Dedicated IP on VPN: this is what VPN developers that don’t offer residential VPNs will try to market as their “residential VPN-like” service. Simply put, this means that you get to use a single, unchanging VPN IP address. This ensures that you’re the only user of that address, freeing up bandwidth and lowering the likelihood of blocks… but you’re still using a datacenter IP.

In Conclusion

A residential VPN is a good choice for someone who cares less about speed than the ability to access websites and services. If you want to bypass geo-blocking for the content from a specific region, a residential VPN is hard to beat.

However, if you require volume and power, a residential proxy will suit your needs a lot better. So if you’re an enterprise user who needs to scrape data and to scrape a lot of it, go for a residential proxy.

What Is an MCP Server? Explaining The Important AI Enabler

MCP servers are a crucial tool for AI development and the future of the agentic internet. They’re an important enabler, providing AIs with tools that allow them to not just talk, but act. This is how large language models can easily access databases, interact with text-to-speech services and 3D modeling applications, and yes, scrape websites. But what exactly is an MCP server?

What Is MCP?

The MCP (Model Context Protocol) server is a major component of MCP: an open standard for giving LLMs access to tools. The protocol was launched by Claude creator Anthropic on November 26, 2024.

LLMs can talk all day long based on their training data – but that’s it. By default, the AI doesn’t have access to real-time data and can’t manipulate anything. You can give it such capabilities with specialized APIs, but that is time-consuming and labor-intensive. So every time you want to add a capability like looking up the time or interfacing with Slack, you have to do custom work for the specific model-app/service combination. 

But the MCP framework has introduced a new standard for creating a translator that sits between the LLM (or, to be precise, the AI application) and the tools you want it to use. Whatever weird “language” a tool speaks, its MCP server will translate into something any AI model – Claude, ChatGPT, etc. – can understand.

The MCP system contains these major components:

  • The MCP host: that’s the AI application you’re working with. 
  • The MCP client: that’s what the AI uses to create a secure connection to the MCP server.
  • The MCP server: does the translation between what the AI wants and what the service in question puts out.

What Is an MCP Server?

An MCP server is the majestic translator that allows models to interact with systems and data. While an API would have to be created for a specific combination of service and LLM, an MCP server only has to be specific to a service.

So, for example, the Oxylabs MCP server will provide web scraping functionality for whatever AI model you have. 

MCP servers can contain three types of primitives that can be exposed for AI to use:

  • Resources: this is context in its rawest form: documents, files, databases. It enables the AI to look up data in, say, Apache Doris databases. This way, the AI can access more than just the data it was given when the model was developed. 
  • Tools: where resources enable passive consumption, tools allow the AI to do things without human involvement. Tools are the way AI enters new entries, deletes data and otherwise manages databases – or creates memes on ImgFlip. This puts AI beyond a sophisticated chatbot and turns it into an agent.
  • Prompts: probably the most AI-specific type of MCP server content, prompts are specialized AI instructions that allow it to execute a task in a pre-set, standardised manner. If you tell the model to “plan a holiday”, the prompt template may enable the AI to then ask about your desired location, duration, budget, and interests.

As a concrete example, consider an MCP server that provides context about a database. It can expose tools for querying the database, a resource that contains the schema of the database, and a prompt that includes few-shot examples for interacting with the tools.

The protocol is built around communicating in JSON-RPC 2.0 – the RPC part refers to “remote procedure calls,” a concept that closely maps to how MCP clients may need to call MCP servers on the same device or somewhere else online. 
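To give a feel for the format, here’s roughly what a tool call could look like on the wire, written out as Python dictionaries – the tool name and its arguments are invented for illustration, so check the MCP specification for the exact method and field names:

    # What an MCP client might send to a server to invoke a tool (illustrative only).
    request = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "tools/call",
        "params": {
            "name": "scrape_url",                        # hypothetical tool exposed by the server
            "arguments": {"url": "https://example.com"},
        },
    }

    # And a matching response carrying the tool's output back to the AI application.
    response = {
        "jsonrpc": "2.0",
        "id": 1,
        "result": {"content": [{"type": "text", "text": "<html>...</html>"}]},
    }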

But that’s not all – MCP servers can also ask the clients to provide data – or, in more technical parlance, there are primitives that clients can expose: 

  • Sampling: allows servers to request language model completions from the client’s AI application to access a language model without having their own language model SDK. 
  • Elicitation: for the times when the server creators want to get either more information from the user or prompt a confirmation for an action. 
  • Logging: the simple act of submitting logs for debugging and monitoring purposes.

What's the Difference Between MCP Servers and APIs?

The key difference between MCP servers and APIs is that MCP servers are made to serve AI/LLMs. Sure, both of them allow software to interact with external services, but that’s where the similarities end:

  • We already mentioned standardization. A classical API will output the data in whatever format the developers felt was best. But since MCP is a standard, no matter what the input from the service is, the MCP server’s output will be something any AI model can easily use. 
  • APIs are generally created by developers to allow third-party software to interact with their apps and services. For example, the Reddit API allowed for the existence of different Reddit clients, but it wasn’t made with them in mind. That same API allows AIs to be trained on Reddit data, too. In contrast, an MCP server exists to provide standardised data, tools, and prompts for AIs.
  • APIs don’t tailor their inputs and outputs for models to easily understand and use. But MCP handles specifically that hard task of calling the API, reading the response, and turning it into usable context. The AI itself doesn’t have to be programmed to “understand” any of the processes happening under the hood.
  • APIs usually leave security to the end user. MCP servers, however, have been developed with security already in place, like the authentication procedures embedded in its transport layer.

What’s the Use of MCP Servers in Web Scraping?

Web scraping has already adopted related technologies: web scraping APIs and AI scraping. Web scraping APIs are services that access the website and carry out the scraping for you. They do the heavy lifting for the user. AI web scraping is more advanced, since it employs machine learning and whatnot to adapt to fancy website design complexities, anti-scraping tech, and such. 

What MCP does is allow your AI/LLM to make use of those ready-made services. Now you yourself don’t even need to interact with them. You tell the AI what needs to be done, it boots up the MCP clients to reach out to the MCP servers, and they provide the tools (in the general, not MCP-server-primitives sense) to do so.

At the same time, an LLM can be running MCP clients for multiple services, so it can access a web scraper MCP server, get the web scraping data you want, and then feed it into a database MCP server for storage, processing, and retrieval. Et voilà. 

In Conclusion

MCP servers are a key part of the new MCP architecture powering AI agents. Without it, we’d be reduced to a bunch of patchwork solutions that have to be custom fitted for every new circumstance. But now, MCP servers are what makes AI and other services sing in harmony – or scrape the web efficiently.

IPv6 Proxy Guide: What You Need to Know

The internet today runs on the IPv4 protocol – but the protocol is wildly out of date. IPv6 is the future – it’s just unclear how near or far that future is. However, IPv6 will eventually replace IPv4, along with the pile of patches and workarounds needed to keep it going. And with that, IPv6 proxies will become the dominant type of proxy on the market. Futureproof your plans by learning about them now. 

What Are IPv6 Proxies?

IPv6 proxies are proxy servers that support online communication over the IPv6 protocol. IPv6 is meant to replace the current IPv4 standard. This is a must: IPv4 addresses – necessary for online data exchange – are 32 bits long (and look like this: 104.21.55.78). This allows for about 4 billion unique addresses. As of 2025, there were 5.5 billion internet users. Since there are a lot more devices than there are users, unique IPv4 addresses ran out a long time ago. 

An IPv6 address looks like 2001:0db8:85a3:0000:0000:8a2e:0370:7334 – longer and made up of numbers and letters. This gives us 340 undecillion unique IP addresses, enough to make every sock in the world Wi-Fi capable. IPv6 proxies are configured to use this longer address as well as other new features, like a simplified header (think labels for data packages). 
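The size difference is easy to verify with Python’s standard ipaddress module:

    import ipaddress

    print(2 ** 32)    # 4294967296 – roughly 4.3 billion possible IPv4 addresses
    print(2 ** 128)   # 340282366920938463463374607431768211456 – about 340 undecillion IPv6 addresses

    # Both address families parse out of the box:
    print(ipaddress.ip_address("104.21.55.78").version)                             # 4
    print(ipaddress.ip_address("2001:0db8:85a3:0000:0000:8a2e:0370:7334").version)  # 6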

On a semi-related note, some businesses call their IPv6 gateways – which translate IPv6 traffic into IPv4 and back again – “IPv6 proxies.” The differences between those are murky – they’re both intermediaries for your data – but a regular IPv6 proxy won’t necessarily be able to handle IPv4 traffic.  

What’s the Difference Between IPv4 and IPv6 Proxies?

The crucial difference between IPv4 proxies and IPv6 proxies is the kind of protocol they use: IPv4 for the former, and IPv6 for the latter. As the two formats aren’t interoperable, online infrastructure has to be built to be able to use IPv6.

Here lies the problem: building new infrastructure is expensive. So while IPv4 address exhaustion has been a known problem since the 1980s, the protocol soldiers on thanks to all sorts of smart tricks pulled to make it work. And since IPv6 adoption is slow – important websites like Amazon, Twitter, and GitHub still don’t support it – internet providers don’t feel the pressure to adopt it either.

Regional Internet registries like the European RIPE NCC are working hard to promote IPv6 adoption. Source: ripe.net

This is not a universal constant across the globe. China sees IPv6 adoption as a national goal, and India leads IPv6 adoption on a global scale. Part of this is, reportedly, because Asian nations got slim IPv4 address allocations. Meanwhile, companies in the West had plenty of IPv4 addresses to go around and consequently invested in the tricks that keep the old protocol going. 

One such trick is Network Address Translation (NAT). These services stand between their own networks and the wider internet. They work as a post forwarding service for the data coming from their own networks, meaning that only the NAT has to have a unique address. At the smallest scale, NAT can exist on your router, so devices using Wi-Fi wouldn’t need unique IPs. At large scales, CG (carrier-grade) NATs exist for ISP networks. 

What does that mean for proxies? On the technical side, IPv6 proxies could be faster because they have simplified headers and sort data in more advanced ways. But on the practical side, IPv4 proxies are both less likely to get banned and more useful in the immediate term. More on that in the next section.

What Are the Benefits/Drawbacks of IPv6 Proxies?

IPv6 proxies have several things going for them, but a few downsides as well.

IPv6 pros:

  • Virgin proxies: due to both slow adoption and the potentially endless variety of proxy addresses, you can find IPs that have never been used before.
  • Security: IPv6 is inherently more secure than IPv4, with the IPSec protocol for authentication and encryption applied by default.
  • Speed: IPv6 doesn’t have to deal with NAT (Network Address Translation) and has simpler datagram (data package) headers, so it should work faster.

IPv6 cons:

  • Low adoption: while large websites are increasingly adopting IPv6, not all of them are. At the time of writing, Twitter, Amazon, and GitHub are still IPv4-only.
  • Easy bans: as IPv6 isn’t yet widespread, any suspicious (bot-like) connections are unlikely to come from residential addresses – as such, websites and services are more likely to ban them without the fear of affecting actual customers.

Can I Get IPv6 Proxies? Can I Get Residential IPv6 Proxies?

You can already get IPv6 proxies – the providers are slowly ramping up the supply. Outside of countless small suppliers, you can see companies like Oxylabs and IPRoyal advertising their wares. What’s more, Oxylabs claims theirs are drawn from their 175M+ pool. 

However, considering that the total advertised pool of Oxylabs is 175 million, it’s doubtful that they would have a large separate supply of addresses just for the IPv6 demand. 

So finding genuine IPv6 residential proxies is still difficult – the vast majority will be datacenter ones. But providers are stepping up their game. Several big-name proxy companies now boast IPv6 proxies, including residential ones: 

  • Bright Data
  • Rayobyte
  • IPRoyal

Moreover, some offer additional services to increase usability: Bright Data supports failover, which switches to IPv4 if you’re trying to access a service that doesn’t support IPv6.

Why Are IPv6 Proxies Generally So Cheap?

IPv6 proxies are generally cheaper than IPv4: for example, at the time of writing, a dedicated IPv6 IP on Rayobyte costs $0.20 while a dedicated IPv4 IP is $2.50. That’s because the supply still outstrips the demand:

  1. IPv6 proxies are mainly datacenter: data centers may provide powerful and stable connections, but they are also very likely to end up blocked. 
  2. IPv6 is less useful: a large chunk of major websites outright don’t support IPv6 connections, making them very limited in deployment.

What’s the Future of IPv6 Proxies?

The future will run on IPv6 – it’s just hard to tell how long the transition will take. There is progress in adopting the new standard, but it’s slow. Hopefully, the process will speed up before the internet is paralyzed by IPv4’s workarounds finally breaking under the strain. 

Conclusion

Today, IPv6 proxies lack the universality of IPv4. It’s not the fault of the technology itself, but of the inertia of the wider tech world. But with adoption inexorably coming, proxy suppliers are starting to adapt. Before long, IPv6 offerings are going to be as good and prominent as IPv4 ones. 

What Is a UDP Proxy? A Simple Guide

A UDP proxy is a proxy that uses the UDP protocol. This protocol handles various speed-sensitive tasks that the more reliable TCP protocol is unsuitable for – in turn, UDP proxies are more versatile than the ones relying on TCP alone. Sometimes, the target may outright refuse TCP connections, making UDP proxies even more important. But that’s just the abstract explanation – for how it works and what it’s best at, read on.  

Image: a server labeled UDP holds a gushing fire hose, spraying water – a metaphor for how casually UDP transmits data.

What Is UDP?

UDP stands for User Datagram Protocol, one of the basic technologies of the internet; it sets the rules for how data is transmitted.  

As a connectionless protocol, UDP relies on two assumptions:

  1. The recipient is ready to receive the data – there’s no need to check whether they actually are. Skipping this “handshake” is the major contributor to UDP’s speed in the modern day.
  2. The data packages will arrive in the order they were sent – therefore, there’s no need to check how they actually arrived. The recipient will correctly rebuild the messages because the packages came in one after the other in the correct order. However, packages can get lost or mis-ordered – a risk deemed acceptable.

With UDP, datagrams (the blocks data is broken down into) have much shorter headers (think package labels), so the data takes less bandwidth to transmit than it would with TCP. However, there is some minimal error checking, and UDP can end up sending duplicate packages, thus potentially increasing bandwidth use. 

As UDP is one of the basic protocols of the internet, a lot of higher-level protocols (and apps, and so on) are built around it.
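To see how little ceremony is involved, here’s a minimal UDP exchange using Python’s standard socket module – no handshake, no delivery confirmation:

    import socket

    # Receiver: just bind a port and wait – no accept(), no handshake.
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 9999))

    # Sender: fire the datagram and move on – no connection setup, no acknowledgement.
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender.sendto(b"hello over UDP", ("127.0.0.1", 9999))

    data, addr = receiver.recvfrom(4096)  # whatever arrives, arrives; lost datagrams are simply gone
    print(data, addr)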

What’s the Difference Between TCP and UDP?

The benefits and downsides of UDP become clearer when the protocol is compared to its main “rival” TCP (Transmission Control Protocol). In contrast to UDP, TCP is a connection-oriented protocol – it doesn’t assume anything. Accordingly, a handshake is carried out to ensure that the recipient is ready to receive data. Once the transmission is out, there are error checks to see whether all of the data arrived in the correct order. 

All the confirmations and the longer datagram headers necessary for error checking make TCP slower to operate than UDP.

To explain it in less technical terms, imagine mail delivery via cannon. TCP would aim the cannon at the delivery point and then check via spyglass that the recipient is waiting to receive every time before firing. The recipient would have to acknowledge that he received each package by waving a jaunty little flag or something. 

Meanwhile, UDP would just aim the cannon and fire all the parcels as fast as it can load them. It doesn’t check whether anyone is waiting for them or how they land. Therefore, it goes through the same pile of packages as TCP a lot faster.

What Is UDP Used For?

So the obvious use case for UDP as a protocol is situations where speed matters more than anything else. That’s why it’s used for: 

  • Improvement to HTTP: HTTP/2 is the higher-level protocol running the internet, but it has issues. For example, its reliance on TCP makes it vulnerable to congestion: if it detects that data arrived incorrectly, the transmission channel is blocked until the data is resent. HTTP/3 aims to solve this with a transport protocol called QUIC. What makes QUIC quick is using multiple UDP channels instead. If the protocol detects errors in transmission, it blocks only the affected channel, making connections smoother and faster.  
  • VoIP (Voice over IP) communications: your Discord voice chats, WhatsApp calls, and so on. Users prefer to hear the caller in real time rather than wait for a clear message to arrive. The choppiness and loss of quality you’ve invariably experienced if you’ve ever had a single VoIP (or video) interaction is just UDP packages getting lost. 
  • Online gaming: ping is unavoidable – it will take time for player data to physically reach the server and vice versa. And slowing it down would be worse than losing some of the data. That’s why, say, War Thunder has both ping and packet loss indicators right there on the screen. 
  • Gaming automation: statistically, everyone loves either RuneScape or Growtopia. But if you want to run multiple accounts at the same time (or even bots), you’ll quickly need to turn to proxies for their numerous IPs. 
  • DNS lookup: DNS – the Domain Name System – is the phonebook of the internet; it turns human-readable addresses (https://proxyway.com/) into IP addresses that computers can use (172.67.170.192). So when you enter a website address into a browser, the DNS query is sent via UDP to make this initial step that much faster. 
  • Multicasting: if broadcasting just blasts signals everywhere, multicasting only reaches devices that are, well, interested. So multicasting allows a sender to, say, broadcast a stream that will reach apps tuned to that stream without having to directly connect to each one of them. 

What Is a UDP Proxy?

A UDP proxy is thus a proxy that uses UDP to transmit data. Since it doesn’t establish connections or do much error checking, it is one of the fastest proxy types around. If you’re doing data-intensive activities like streaming, UDP is the way to go. 

When it comes to specific applications, UDP proxies are used for:

  • Gaming automation: multiplayer games use UDP, and so do bots; 
  • Torrenting: the Micro Transport Protocol (µTP) found in modern torrent clients is UDP-based;
  • QUIC-based tasks: more of a futureproofing thing, once QUIC becomes standard, so will UDP proxies.

What Is a SOCKS5 UDP Proxy?

SOCKS5 is the newest version of the widely adopted SOCKS internet protocol, which enables sharing data via proxy. Previously, SOCKS only ran on TCP. But with SOCKS5, it can now use UDP for transferring data via proxies. 

As a higher-level protocol that builds upon UDP, SOCKS can provide advanced benefits like connection authentication and data encryption. The big takeaway is that a SOCKS5 UDP proxy is probably how you’ll end up using your UDP proxy of choice. 

Notably, not all SOCKS5 proxy providers offer the UDP functionality. Many of them disable UDP support out of risk-avoidance.  

If you want a quick rundown of SOCKS5 proxy providers, including those that support UDP, read our list of the best SOCKS5 proxies.

Conclusion

A UDP proxy is one of the fastest – if not the fastest – proxies around. It can’t be beaten for speed or for its specialized use cases. 

The Best Free Datasets to Use in Python Skill Practice

Python is one of the most popular programming languages used for data analysis. Despite being relatively easy to pick up, it still requires practice to learn. And a great way to improve the skill is by analyzing datasets.

Datasets in Python Data Analysis Skill Practice

Python is an open-source language used for a variety of cases, from web scraping to software development. By itself, it has limited functions that could be useful for scraping or data analysis, but you can find dozens of Python libraries to increase its flexibility and usability.

However, practicing Python can be tricky if you don’t have a project to work on. If you’re looking to improve your data analysis skills with Python, you should look no further than datasets. 

Using Python to examine datasets can help you learn data cleaning, manipulation, handling various types of information (numeric, textual, etc.), and more. Let’s dive into the best datasets you can use to develop your proficiency with Python.

What Is a Dataset?

Datasets are pre-collected records on a specific topic, be it the inventory stock of an e-commerce website or the most popular baby names of this decade. 

They’re static organized compilations of important data points prepared for further analysis. Datasets can be used for a variety of cases, including research and business management purposes, as well as personal use, such as finding relevant job postings or product reviews.

Datasets vary not only in size, but also in type – you can encounter numeric, textual, multimedia, mixed, and other types. They also differ in structure – the way a dataset is organized usually depends on the data type it holds.

Learn all you need to know about datasets, and how they differ from web scrapers.

What to Look for in a Practice Dataset?

When choosing a dataset to practice your Python skills, consider its size, complexity, and structure. 

If you’re new to Python, opt for smaller, organized datasets with clear labels and fewer data points – it’ll be easier to navigate Python functions with less data to handle. If you already have some familiarity with Python, you can try exploring larger, unstructured datasets that require cleaning and preprocessing.

In general, a good rule of thumb is to look for datasets that match your learning goals. If you want to practice data visualization, choose datasets with diverse numerical and categorical data. On the other hand, if you’re interested in advanced level problem-solving, opt for datasets with missing values, inconsistencies, or unstructured text.  

Lastly, consider availability and documentation. Well-documented datasets, like those from government open data portals, provide descriptions, column explanations, and sample analyses, making them easier to work with. A good dataset challenges your skills while keeping the learning process manageable.

Datasets for Python Learning
Consideration points before choosing a practice dataset

Where to Find Good Datasets for Analysis?

There are a few ways to find datasets to practice Python skills: you can pick free datasets, purchase them from dataset vendors, or make a dataset yourself.

Free Dataset Providers

If you opt for free datasets, there are multiple websites you can get them from. Free providers often have quite large collections of datasets that are used by professionals and individuals alike. 

The key disadvantage of free datasets is their maintenance – since they are provided courtesy of others, the data might not always be relevant or fresh enough for your project. Nevertheless, they should do the job if you’re just practicing.

  • Kaggle. Kaggle is probably one of the most popular dataset providers on the market. It has over 400K datasets for all kinds of projects.
  • Google Dataset Search. Google has a specific dataset search engine that will find you relevant datasets from all over the web based on your keyword. Keep in mind that Google Dataset Search will include results with paid datasets, too.
  • GitHub. This developer code sharing platform is great for storing, managing, and publicly sharing code, but can be a great place to find free, pre-collected practice datasets, too. 
  • Public government data websites. Websites like Data.gov or Data.gov.uk are great places to find public datasets on various country-specific topics. They are also often updated.

Paid Dataset Providers

You can also purchase datasets on your topic of interest. These datasets will contain fresh data and will be renewed at your selected frequency. Unfortunately, they don’t come cheap, so they might not be the best choice if you’re just learning, but they’re perfect for business analysis.

  • Bright Data. The provider offers over 190 structured datasets on various business niches. The datasets can be refreshed at a chosen frequency, too. Bright Data also offers a few free datasets as well as custom datasets based on your needs.
  • Oxylabs. This provider offers ready-to-use business- and development-related datasets, such as job postings, e-commerce, or product review data. Oxylabs can also provide custom datasets on your specific interest.
  • Coresignal. The provider has a large collection of datasets on companies, employees, and job postings. It’s a great choice for analyses related to business growth.

Making Your Own Dataset

If you’d like to practice Python for web scraping in addition to data analysis, you can try creating your own dataset by extracting data from relevant websites, structuring, and exporting it in a preferred format. 

We have a useful guide on how to start web scraping with Python. It will help you build a scraper and extract web data which you’ll be able to use for building a dataset later on.

An introductory guide to Python web scraping with a step-by-step tutorial.
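To give a rough idea of that workflow, here's a minimal sketch that scrapes a hypothetical product listing page and saves the results as a small CSV dataset. The URL and the CSS selectors are placeholders – swap them for the structure of the page you actually want to scrape.

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Placeholder URL - replace with the page you want to scrape
url = "https://example.com/products"
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Placeholder selectors - adjust them to the page's real HTML structure
records = []
for product in soup.select("div.product"):
    records.append({
        "name": product.select_one("h2").get_text(strip=True),
        "price": product.select_one("span.price").get_text(strip=True),
    })

# Export the structured records as a CSV dataset for later analysis
pd.DataFrame(records).to_csv("products.csv", index=False)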

Python Libraries for Working With Datasets

Being a general-purpose programming language, Python can be used for various projects, but it’s especially popular for web scraping and data analysis tasks due to helpful packages – libraries. 

Libraries extend Python’s functionality with features for data cleaning, filtering, clustering, and more. Here are some of the common Python packages you’ll find helpful for practicing data analysis in Python (a short example combining a few of them follows the list):

  • Pandas. The pandas library can be used for data manipulation and analysis. It makes it easy to clean, filter, and reshape data: it can handle missing values and formatting issues, as well as group and sort data points.
  • NumPy. This library is excellent for working with numerical datasets as it supports fast mathematical operations, such as algebra equations or random number generation. 
  • Matplotlib. The Matplotlib library can be used for data visualization. It’s very useful for analyzing distributions, correlations, and categorical data, and can assist in creating statistical graphics.
  • Scikit-learn. The library is useful for data preprocessing – it has tools to help with data classification, regression, and clustering, and is often used for machine learning tasks. Scikit-learn can be easily used alongside pandas and NumPy.
  • BeautifulSoup. The BeautifulSoup library can be useful if you need to extract structured information from a website (i.e., product reviews). Combined with the requests library or a headless browser for dynamic websites, it can scrape and process data.
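For illustration, here's a minimal sketch that combines pandas, NumPy, and Matplotlib on a practice dataset. The file name and the "price" column are placeholders – substitute whatever dataset and numeric column you're working with.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder file - any practice CSV with a numeric "price" column will do
df = pd.read_csv("practice_dataset.csv")

# pandas: inspect the data and handle missing values
print(df.head())
print(df.isna().sum())   # missing values per column
df = df.dropna()         # drop incomplete rows for simplicity

# NumPy: quick numerical summary of one column
print("Average price:", np.mean(df["price"]))

# Matplotlib: visualize the distribution of the same column
plt.hist(df["price"], bins=20)
plt.xlabel("Price")
plt.ylabel("Count")
plt.show()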

Free Datasets to Try in Python Skill Training

Using datasets for Python training is one of the simplest ways to learn the language, but it comes with its own set of challenges. You might encounter incomplete, inconsistent, or poorly formatted data, so your task is to use Python to solve these issues before extracting the necessary data.

Wine Quality Dataset (Kaggle)

The Wine Quality Dataset on Kaggle is a relatively small dataset (around 15K data points), containing information about the amount of various chemical ingredients in the wine and their effect on its quality. 

Based on the given data, your main task would be to use Python to explore the dataset, perform data cleanup where needed, and build classification models to predict wine quality.

Wine quality dataset
Wine quality dataset on Kaggle
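As a rough starting point, here's a minimal sketch of that task with pandas and scikit-learn. It assumes the Kaggle file is saved locally as winequality.csv and has a quality column – adjust the names to match your download.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumes the Kaggle dataset was downloaded as winequality.csv
df = pd.read_csv("winequality.csv").dropna()

# Keep numeric columns only and use the quality score as the target label
X = df.drop(columns=["quality"]).select_dtypes("number")
y = df["quality"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))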

Electric Vehicle Population Data (Data.gov)

The Electric Vehicle Population Data on Data.gov is a public dataset providing information on various types of electric vehicles currently registered in the State of Washington. This dataset is often updated and has multiple download formats available. 

There, you’ll find counties and cities, car models, electric ranges, and more data points to work with. This dataset can be used to learn data clustering, find the average electric car range, discover the most popular vehicle models, and more.

Electric vehicle population dataset
Electric vehicle population dataset on Data.gov
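For example, a short pandas sketch can answer a few of those questions. The column names below are assumptions based on the dataset's description, so check them against the file you download.

import pandas as pd

# Assumes the CSV export was downloaded from Data.gov
df = pd.read_csv("Electric_Vehicle_Population_Data.csv")

# Average electric range across all registered vehicles
print("Average electric range:", df["Electric Range"].mean())

# Ten most popular vehicle models
print(df["Model"].value_counts().head(10))

# Registrations per county, largest first
print(df.groupby("County").size().sort_values(ascending=False).head(10))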

IMDb Movie Reviews Dataset (Kaggle)

The IMDb Movie Reviews Dataset on Kaggle has approximately 50K movie reviews that you can use to learn natural language processing or text analytics. It contains two essential data points – a full written review and its sentiment (positive or negative). 

This dataset can be used in Python practice to learn how to perform text analysis and predict a review’s sentiment.

IMDb movie review dataset
IMDb movie review dataset on Kaggle
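A minimal sketch of such a text-classification exercise could look like the code below. It assumes the Kaggle CSV was saved as imdb_reviews.csv with review and sentiment columns – rename them to match your download.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumes the Kaggle dataset was downloaded as imdb_reviews.csv
df = pd.read_csv("imdb_reviews.csv")

X_train, X_test, y_train, y_test = train_test_split(
    df["review"], df["sentiment"], test_size=0.2, random_state=42
)

# Turn raw review text into TF-IDF features, then fit a simple classifier
vectorizer = TfidfVectorizer(stop_words="english", max_features=20000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_vec, y_train)
print("Test accuracy:", clf.score(X_test_vec, y_test))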

Forest Covertype Dataset (UCI Machine Learning Repository)

The Forest Covertype Dataset on the UCI Machine Learning Repository is a large, well-structured dataset covering four wilderness areas located in the Roosevelt National Forest of northern Colorado. It’s excellent for predicting forest cover type from cartographic variables only.  

The dataset has multiple variables, like soil type, wilderness areas, and hillshades, to work with. What’s great is that there are no missing values, so you won’t need to worry about filling them in manually.

Forest covertype dataset
Forest covertype dataset on UCI Machine Learning Repository
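If you'd rather skip the manual download, scikit-learn also ships a loader for this same UCI dataset. Here's a minimal sketch of the prediction task – note that training on the full ~580K rows can take a few minutes.

from sklearn.datasets import fetch_covtype
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Downloads and caches the Covertype dataset on first use
data = fetch_covtype()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Gradient boosting copes well with the mix of numeric and binary columns
clf = HistGradientBoostingClassifier()
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))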

Surface Water Quality Dataset (Open Baltimore)

The Surface Water Quality Dataset on Open Baltimore is a large dataset covering surface water quality in the City of Baltimore from 1995 to 2024. Available in a CSV file, this dataset contains data values like coordinates, tested parameters, and timestamps. 

You can use Python to predict surface water quality by analyzing the given parameters and their results at specific locations in the city.

Surface water quality dataset
Surface water quality dataset on Open Baltimore
Adam Dubois
Proxy geek and developer.

The post The Best Free Datasets to Use in Python Skill Practice appeared first on Proxyway.

]]>
https://proxyway.com/guides/datasets-in-python/feed 0
Web Scraping Python vs. PHP: Which One to Pick? https://proxyway.com/guides/web-scraping-python-vs-php https://proxyway.com/guides/web-scraping-python-vs-php#respond Fri, 21 Feb 2025 09:28:36 +0000 https://proxyway.com/?post_type=guides&p=31289 Let's see how two popular languages compare in web scraping tasks.

The post Web Scraping Python vs. PHP: Which One to Pick? appeared first on Proxyway.

]]>

Guides

When building a custom web scraper, you might find yourself wondering which programming language is the most suitable for your project. Let’s see whether Python or PHP is better for your use case.

Web scraping with Python vs PHP

Web scraping is widely used in many industries – business professionals, researchers, and even individuals collect data for price comparison and market analysis, as well as research and lead generation. While there are quite a few programming languages that can handle web scraping, Python and PHP stand out as two popular choices. 

Python is known for its simplicity and multiple helpful libraries, while PHP, primarily used for web development, also offers powerful scraping capabilities and easy integration with other web applications. 

In this guide, we’ll compare Python and PHP for web scraping, breaking down their strengths, weaknesses, and use cases to help you make the right choice for your project.

What Is Python?

Python is a high-level, versatile, general-purpose programming language developed in the early ’90s and still widely used today. 

It’s known for code readability, simplicity, and a large number of supplementary libraries. Python can be used in various fields, including web development and data analysis, as well as artificial intelligence. With its easy-to-read syntax, Python is often a preferred choice for both beginners and experienced developers.  

The language is particularly useful for web scraping due to its powerful libraries. For example, BeautifulSoup is excellent for data parsing, Requests handles sending HTTP requests to websites, and Selenium automates browsers, making it easy to scrape data from dynamic elements. Together, these tools cover the entire scraping process.
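As a quick illustration of how those pieces fit together, here's a minimal requests + BeautifulSoup sketch that fetches a static page and pulls out its title and links:

import requests
from bs4 import BeautifulSoup

# Fetch a static page and parse its HTML
response = requests.get("https://example.com", timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

print(soup.title.string)        # the page title
for link in soup.find_all("a"):
    print(link.get("href"))     # every link on the page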

What Is PHP?

PHP is a server-side scripting language primarily used for web development. Millions of websites are powered by PHP because of its ability to generate dynamic web pages and interact with databases.

PHP is commonly used for content management systems, e-commerce platforms, and various API integrations. However, it can also be used for web scraping, especially when data extraction needs to be integrated directly into a website. For example, a web application that scrapes airline websites and immediately displays the results to the user would benefit from a PHP-based scraper.

With built-in tools like cURL and DOMDocument, PHP allows you to extract and sort data retrieved from the web.

Web Scraping Python vs. PHP: Feature Overview

Python and PHP are both viable options for data extraction, but they differ in syntax, use cases, popularity, and performance. Let’s review in depth how the two languages compare.

Python is ideal for both small and large scraping projects, making it great for scraping basic HTML as well as dynamic, JavaScript-heavy sites. It’s fast, handles extracted data really well, and has tons of resources for learning.

PHP, on the other hand, relies on built-in functions to support scraping, so it is rather limited. It may be a slightly unorthodox choice for scraping, but it still has its use cases, especially when you need a scraper integrated within a web application.

 | Python | PHP
Ease of use | Very easy to learn | Medium difficulty for learning
Popular libraries and features | BeautifulSoup, Selenium, Requests | cURL, DOMDocument, SimpleHTMLDOM
Performance | Fast and efficient for large-scale scraping | Typically very fast, slower for complex scraping tasks
JavaScript handling | Yes, with Selenium library | Limited support
Community support | Large community, great documentation | Small scraping community, great documentation
Typical use cases | Data analysis, large-scale scraping | Web-based applications, basic scraping tasks

Popularity

Python is no doubt the more popular of the two languages. Being an easy-to-use, multi-purpose language, it offers flexibility, making it a perfect choice for a broad range of tasks.

PHP, on the other hand, is most commonly used for backend development – it powers over 70% of modern websites and web applications, and is the leading language for server-side development.

In terms of web scraping, Python is a more common choice, too. That’s mainly due to its extensive scraping library collection, simplicity, and large scraping enthusiast community. Nevertheless, PHP is often a preferred choice for light scraping tasks, especially for people already familiar with the language.

Most popular programming languages (GitHub data)
Most popular programming languages in 2022. Source: GitHub

Prerequisites and Installation

Getting both Python and PHP is relatively simple: all you have to do is download the packages from their respective websites (download Python; download PHP) and follow the installation steps. Though, the process might differ based on the operating system you use.

Getting Python

To get Python for Windows, download the Python package, and open the .exe file. Follow the installation wizard. Then, check if it was successfully installed by running python --version in Command Prompt. It should print the current version of Python on your device.

To get Python for macOS, download the Python package from the official website, open the .pkg file, and follow the installation instructions. Check if it was installed by running python3 --version in Terminal. If you see a version number printed, Python was installed successfully.

Getting PHP

Install PHP on Windows by downloading the package and extracting the ZIP file into a folder of your choice. Once you do so, add PHP to System PATH – go to Control Panel -> System -> Advanced -> Advanced system settings -> Environment variables. Under System variables, find Path, click Edit, and add C:\yourfolder.

Note: use the exact name of the folder you extracted PHP in.

To check if it was installed successfully, open Command Prompt, and run php -v. It should show the PHP version installed on your computer.

To install PHP on macOS, you’ll need a third-party package manager like Homebrew. Install Homebrew by running the following command in Terminal:

				
					/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
				
			

It will download and install Homebrew. Then, follow the installation instructions. After the installation, you can run brew --version to confirm (it should print the installed Homebrew version). 

Once you have the package manager, you can easily install PHP by running brew install php in the Terminal.

Performance

Python is a relatively fast language on its own, but it can be further optimized with libraries like asyncio and aiohttp (for sending asynchronous requests concurrently instead of one by one). However, complex operations might take longer due to overhead. Nevertheless, Python is better suited for large scraping tasks: even though it might take slightly longer to complete them, it works through large amounts of data more efficiently thanks to its performant libraries. 
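For illustration, here's a minimal sketch of that asynchronous approach with asyncio and aiohttp – the URLs are placeholders:

import asyncio
import aiohttp

URLS = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3",
]

async def fetch(session, url):
    # Awaiting the response lets other requests run in the meantime
    async with session.get(url) as response:
        return await response.text()

async def main():
    # Send all requests concurrently instead of one by one
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch(session, url) for url in URLS))
    for url, html in zip(URLS, pages):
        print(url, len(html))

asyncio.run(main())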

PHP is generally faster than Python because it runs natively on the server. It’s also lighter on resources (i.e., CPU, memory) and performs better with basic scraping tasks, like collecting comments from a simple, HTML-based forum. Unfortunately, the speed drops significantly and resource usage increases once you start scaling up.

Best Use Cases

Both Python and PHP have their own set of strengths and thus, should be used in different scenarios.

Python has various helpful libraries to expand its capabilities, so it’s excellent for handling complex scraping tasks, especially where JavaScript-based websites are involved. With Selenium or Playwright installed, Python-based scrapers can interact with the web page and extract data from dynamic elements. 

Additionally, a Python-based web scraper is well-suited for large-scale data collection because it supports asynchronous operations (performing multiple operations at the same time instead of one at a time). If you’re also planning to analyze scraped data, Python should be your preferred choice – with libraries like BeautifulSoup, you can parse the information easily. Lastly, it’s very easy to start scraping with Python due to its simple syntax.

PHP, on the other hand, is extremely useful if you’re planning to integrate scraped data directly into a web application (i.e., update product prices in real-time). In addition, PHP is great for lightweight scraping – cURL and DOMDocument packages make it quite easy to scrape data from websites like basic e-commerce sites or online forums. Unfortunately, PHP has very limited support for dynamic webpages.

If you’re a developer primarily working with PHP, you don’t need to learn another language just for scraping. That can make PHP very cost- and resource-effective.

Community Support and Documentation

Being one of the most popular programming languages, Python has extensive documentation and a community of developers and enthusiasts behind it. You can find beginner’s guides, books, series of podcasts and other resources directly on Python’s website. 

It also has large dedicated scraping communities on websites like Reddit, GitHub, or StackOverflow that will gladly help you if you find yourself stuck.

PHP, however, is lacking in terms of scraping-focused community and documentation – it has some resources for learning, but you won’t find much material. Its scraping community is active but also significantly smaller.

Choosing Between Python and PHP

It might not be easy to pick a language for your web scraping project because both PHP and Python have their own unique strengths. Therefore, when deciding which language to use, consider the following:

  • Pick Python if you’re planning to scrape large amounts of web data, work with dynamic (JavaScript-heavy) web pages, or need to process, clean, and analyze data efficiently. Python is also ideal for automation and machine learning applications.
  • Choose PHP if you’re working within a PHP-based web environment, or need simple scraping within a web application without additional dependencies. Also useful if you’re already somewhat familiar with the language.

Ultimately, we would say Python is the better choice for most web scraping tasks due to its readability, ease of use, and rich ecosystem. However, PHP can be a suitable option for people who are already familiar with the programming language and need to perform lightweight scraping tasks.

Alternatives to Python and PHP

If you want to try a completely different language for web scraping, you could pick Node.js. It’s a popular JavaScript runtime often used for scraping. While it can be slightly more difficult to learn, it’s very scalable, has a huge scraping community, and is probably the best option for extracting data from dynamic websites.

Everything you need to know about web scraping with Node.js and JavaScript in one place.

Alternatively, we compiled a list of other programming languages you can use for web scraping. Keep in mind that each language has its own pros and cons, varying performance, community support, and ideal use case.

We compare seven popular programming languages for web scraping.

The post Web Scraping Python vs. PHP: Which One to Pick? appeared first on Proxyway.

]]>
https://proxyway.com/guides/web-scraping-python-vs-php/feed 0
How to Use Wget with a Proxy: A Tutorial https://proxyway.com/guides/wget-with-a-proxy https://proxyway.com/guides/wget-with-a-proxy#respond Mon, 10 Feb 2025 09:01:59 +0000 https://proxyway.com/?post_type=guides&p=30665 Learn all about command-line utility Wget, and how to use it with a proxy.

The post How to Use Wget with a Proxy: A Tutorial appeared first on Proxyway.

]]>

Guides

Wget is a great tool for quickly downloading web content. It also offers the flexibility to route your requests through a proxy server. Here you’ll learn how to use Wget with a proxy.

How to use Wget with a proxy

There are many command-line tools for downloading web content, such as cURL. However, if you want to handle recursive downloads and resume tasks when your connection is unstable, Wget is your best option.

What Is Wget?

Wget is a GNU Project command-line utility built to download HTTP(S) and FTP(S) files. It’s a non-interactive tool, making it especially useful for downloading content in the background while completing other tasks. 

Wget was specifically designed to handle content downloads on unstable networks: if you lose internet access, the tool will automatically try to resume the job once the connection is restored.

Wget is typically used on Unix-like operating systems such as Linux and macOS. However, it’s also available on Windows.

Key Wget Features

Even though Wget was first introduced in the 90s, it’s still widely used due to its simplicity and reliability. Here are some key features of Wget:

  • Resuming interrupted downloads. If a download is interrupted because of connectivity issues or system shutdown, Wget will automatically retry the task once the connection is restored – no manual input is needed. 
  • Automated file download. Wget can batch process downloads or schedule them for repetitive tasks.
  • Recursive download support. You can create a local copy of a website with Wget to view it offline or archive the website’s snapshot for future reference.
  • High control over downloads. You can script Wget to limit bandwidth, change request headers, as well as adjust retries for downloads.
  • Proxy support. Wget supports HTTP and HTTPS proxies if you need to download geo-restricted or otherwise protected content. 

Wget vs. cURL: the Differences

Both Wget and cURL are command-line tools used for data transferring. However, their functionality and niches slightly differ.

Wget is primarily used to download content from the web. On the other hand, cURL is used for data transfer (upload and download), as well as working with APIs. Therefore, cURL is more versatile but also more complex.

A comparison between Wget and cURL functionality.

How to Install Wget

Wget’s installation process is straightforward, but may differ based on your operating system.

Being a command-line utility, Wget runs in a command-line interface. In other words, if you have a Mac or Linux computer, that will be the Terminal. The default for Windows is CMD (Command Prompt).

  • Windows users will need to download and install the Wget package first. Once that’s done, copy and paste the wget.exe file to the system32 folder. Finally, run wget in Command Prompt (CMD) to check if it works.
  • For those on macOS, you’ll need the Homebrew package manager (running xcode-select --install first installs Apple’s Command Line Tools, which Homebrew requires). Then, you can install Wget by running brew install wget and check the installation with wget --version.

Once you have Wget installed, it’s also important to have the configuration file – .wgetrc. It will be useful when you need to add proxy settings to Wget.

To create the file on Windows, run notepad C:\Users\YourUsername\.wgetrc in CMD. macOS users can run nano ~/.wgetrc in Terminal. If the file doesn’t exist on your system yet, the editor will create it for you when you save. 

How to Use Wget

Let’s take a look at how to download files and retrieve links from webpages using Wget.

Downloading a Single File with Wget

Retrieving a single file using Wget is simple – open your command-line interface and run wget with the URL of the file you want to retrieve:

				
					wget https://example.com/new-file.txt

				
			

Downloading Multiple Files with Wget

There are a couple of ways to download multiple files with Wget. The first method is to send all URLs separated by a space. Here’s an example with three files:

				
					~$ wget https://example.com/file1.txt https://example.com/file2.txt https://example.com/file3.txt

				
			

This method is ideal when you have a limited number of URLs. However, if you want to download dozens of files, it becomes much more complex.

The second method relies on writing down all URLs in a .txt file, and using the -i or --input-file option. In this case, Wget will read the URLs from the file and download them. 

Let’s say you named the file myurls.txt. You can use the --input-file argument:

				
					~$ wget --input-file=myurls.txt

				
			

Getting Links from a Webpage with Wget

You can also use Wget to extract links directly from a webpage. 

If you want Wget to crawl a page, find all the links, and list them without downloading, you can run this command:

				
					wget --spider --force-html -r -l1 https://example.com 2>&1 | grep -oE 'http[s]?://[^ ]+'

				
			

If you’d like Wget to find the URLs and download them for you, simply remove the --spider and --force-html options that crawl and parse the HTML pages. Instead, your command should look something like this:

				
					wget -r -l1 https://example.com

				
			

Changing the User-Agent with Wget

If you’re planning to use Wget for downloads often, you should modify your user-agent string to avoid rate limits. You can change your user-agent for all future uses by editing the .wgetrc file, or write a command for one-time use.

Modifying the User-Agent for a Single Download

Whether you’re on Windows or macOS, the syntax for changing the user agent is the same. Make sure to use a user-agent string from a recent browser version.  

				
					wget --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36" https://example.com
				
			

Modifying the User-Agent Permanently

If you’d like to consistently use a different user-agent, you can change the Wget configuration in the .wgetrc file. The custom user-agent string you’ll put there will be used for all future jobs until you change it.

Simply locate the .wgetrc file and add user_agent = "CustomUserAgent"

It should look something like this:

				
					user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
				
			

How to Use Wget with a Proxy

You can either set up proxy settings in the configuration file or pass proxy settings in the command line for one-time downloads.

Wget only supports HTTP and HTTPS proxies, so make sure you’re not using SOCKS5 proxy servers.

If you’re looking for a proxy server, free proxies may work with simple websites. For anything more – or larger scale – we recommend choosing one of the paid proxy server providers. You’ll find our recommendations here:

Discover top proxy service providers – thoroughly tested and ranked to help you choose.

Using Wget with a Proxy for a Single Download

If you plan to use proxies for multiple downloads, we recommend setting the proxy configuration in the .wgetrc file, as described in the next section. However, if you only need to use Wget with a proxy once, you can pass the proxy settings directly in a command in Terminal or CMD instead of modifying the .wgetrc file.

It should look like this:

				
					wget -e use_proxy=yes -e http_proxy=https://username:password@proxyserver:port https://example.com/file.zip
				
			

Note: the example uses http_proxy, but Wget supports HTTPS proxies too, so you can use https_proxy for your proxy settings.

Checking Your Current IP Address

It may be useful to check if your IP address has indeed changed to the proxy server’s. You can do that by sending a request to the HTTPBin IP endpoint with Wget:

				
					wget -qO- https://httpbin.io/ip

				
			

You should receive an output similar to the one below:

				
{
"origin": "123.45.67.89:000"
}

				
			

Note: this is not a real IP address, rather an example to familiarize you with the format.

Set Up a Proxy for Wget for Multiple Uses

To set up a proxy for Wget, you’ll first have to get the proxy server’s details. Then, set the proxy variables for HTTP and HTTPS in the .wgetrc file that holds the configuration content for Wget.

Add proxy settings to the file:

				
					use_proxy = on
http_proxy = http://proxyserver:port
https_proxy = https://proxyserver:port

				
			

Note: use actual proxy server address and a correct port number when editing the file. These will be given to you by your proxy service provider.

Once you write down proxy settings, you can send a request to HTTPBin to check if the IP address has changed.

Wget Proxy Authentication

Most reputable proxy server providers will require authentication to access the proxy server. Typically, you’ll need to specify your username and password.

You can do that by adding a couple of lines to the .wgetrc file.

				
					proxy_user = YOUR_USERNAME
proxy_password = YOUR_PASSWORD

				
			

So, the entire addition to the file should look like this:

				
					use_proxy = on
http_proxy = http://proxyserver:port
https_proxy = https://proxyserver:port
proxy_user = YOUR_USERNAME
proxy_password = YOUR_PASSWORD

				
			
Adam Dubois
Proxy geek and developer.

The post How to Use Wget with a Proxy: A Tutorial appeared first on Proxyway.

]]>
https://proxyway.com/guides/wget-with-a-proxy/feed 0
What Is a Dataset? Comparing Scraping APIs and Pre-Collected Datasets https://proxyway.com/guides/what-is-a-dataset https://proxyway.com/guides/what-is-a-dataset#respond Wed, 08 Jan 2025 10:50:54 +0000 https://proxyway.com/?post_type=guides&p=30084 Learn all you need to know about datasets, and how they differ from web scrapers

The post What Is a Dataset? Comparing Scraping APIs and Pre-Collected Datasets appeared first on Proxyway.

]]>

Guides

The world runs on data, but it’s not always easy to find it. However, datasets offer an easy way to access large volumes of structured data on essentially any topic.

what is a dataset

Web scraping tools allow you to gather vast volumes of data in seconds. But with more companies offering data-as-a-service (DaaS), you don’t even have to collect information yourself. Instead, you can get pre-collected datasets from basically any website, and jump straight to analysis. 

But what exactly are datasets, and why do they matter? Essentially, a dataset is a collection of structured records on a specific topic for further processing. It allows easy access to information about various fields, topics, and subjects. Since datasets typically are huge collections of information, they make research more accessible and fast. In this article, let’s dig deeper into what datasets are, how they are made, and where to use them.

What Are Datasets?

Datasets are collections of records about a specific topic – static compilations of important data points that can range from weather forecasts to product prices. The key attribute of a dataset is its structure: it is organized (often arranged in a table) and prepared for further analysis.

There are numerous ways to use datasets, both for research and business management purposes, such as marketing and social media management, or tracking and analyzing e-commerce data. Datasets can also be valuable for recruitment purposes.

Types of Datasets

There are many types, forms, and structures of datasets. The type of dataset you should get depends on what sort of analysis you’re planning to perform (i.e., qualitative, quantitative).

Firstly, datasets can be broken down into several types:

  • Numerical datasets consist of numbers only. They’re mostly used for quantitative analysis for statistics or mathematics. For example, such data includes stock prices, temperature records, or order values. 
Date | Temperature (°C) | Wind speed (km/h)
2025-01-01 | 7.3 | 8
2025-01-02 | 8.1 | 12
2025-01-03 | 6.9 | 11
  • Textual datasets are composed of written information, and they’re ideal for qualitative analysis. For example, textual datasets can be a collection of X posts (previously known as tweets), press releases, customer feedback, or research papers.
				
					[
  "Great quality and fast shipping!",
  "The product broke after a week. Very disappointed.",
  "Affordable and works as described. Will buy again."
]

				
			
  • Multimedia datasets include audio, video, and image data. They can be used for both quantitative and qualitative analysis.
Image file | Label
[image] | Monitor
[image] | Server
[image] | Sneakers
  • Time-series datasets contain data collected periodically. For example, price changes on a monthly basis or daily weather reports.
Timestamp | Stock price ($) | Volume
2025-01-01 09:00 | 150.25 | 500,000
2025-01-01 09:15 | 155.30 | 525,000
2025-01-01 09:30 | 151.75 | 510,000
  • Mixed datasets combine different types of data – textual, numerical, multimedia. They are especially useful for multi-faceted reports, like customer sentiment or customer behavior analyses.
Image ID | Description | Image file | Author
101 | “Red proxy server icon” | [image: proxy server] | Isabel
102 | “Yellow globe icon” | [image] | Adam
103 | “Blue scraper icon” | [image: blue spider robot] | Chris

Secondly, datasets can have varying organization structures:

  • Structured datasets have organized rows and columns containing specific data points. For example, a structured dataset can be an Excel sheet or a CSV file containing data.
  • Unstructured datasets don’t have a predefined format due to the type of data they contain (audio, images, text). They might be more difficult to analyze due to their unorganized nature.

However, if you’re looking to purchase a dataset, you’ll most likely encounter mixed datasets as they allow for various potential analyses.

Dataset Examples

Now that you know the different types of datasets, let’s take a closer look at what they can look like.

Below is an example of a mixed dataset in a structured table. The data points vary – you can see text and numbers, yet they are neatly organized within the table. Each element includes several data points and is arranged in ascending order.

Product ID | Name | Price | Category
101 | Scraping robot | $49 | Scrapers
102 | Computer monitor | $139 | Electronics
103 | Proxy server | $2000 | Hardware
104 | Mobile phone | $250 | Electronics

Now let’s analyze the table below. It might look like an ordered time-series dataset – an organized table with numeric data points about the weather. However, if you take a closer look, you’ll notice the timestamps don’t follow any logical order. This makes it an unstructured time-series dataset.

Timestamp | Temperature (°C) | Humidity (%)
2024-12-26 14:00:00 | 13.0 | 45
2024-12-27 12:00:00 | 7.4 | 79
2024-12-25 14:00:00 | 10.2 | 56

Both of these datasets can be used for making analyses or training AI, but they will have different applications.

Why Use Datasets?

Datasets are an invaluable tool for various niches, ranging from business to research. For example, companies can adjust pricing strategies in response to competitors’ price changes, improve services by uncovering customer behavior patterns, make future plans by monitoring trends, and more. 

In academia, datasets can help save time in collecting and structuring data. A pre-made dataset reduces the time needed for manually collecting specific data points, and thus allows for more focus on data analysis and drawing conclusions. Additionally, having more data points allows for data validation by improving statistical significance and capturing data variability. 

Finally, datasets can also be used to train AI. Large language models (LLMs) rely on vast volumes of data so they can provide you with detailed answers in a conversational tone. However, if you’ve ever used AI-based tools like OpenAI’s ChatGPT or Google’s Gemini, you might have noticed that the answers are not always correct. Providing AI with a collection of fresh data can help the LLM improve accuracy.

Where are datasets used
Practical applications of datasets

Dataset vs Database

While we covered what a dataset is, you might’ve encountered another term – database – when talking about a collection of information. So, how do these terms differ?

A database is a dynamic collection of stored data. It’s a digital library where information is stored, can be quickly found, managed, reorganized, or completely changed. Maintaining a database requires specific software and hardware. 

We can think of a database as being similar to the Contacts app on your phone. The app holds names, phone numbers, and other information about people in your life. You can adjust this data immediately if someone’s name or phone number changes. The app is a specific software that lets you access and manage phone numbers, and your phone’s processor, memory, and storage allow the app to run smoothly.

However, if you decide to print the phone numbers from your Contacts app on a sheet of paper, it becomes a dataset – a static snapshot of data. You can analyze it (e.g., check how many people named John you know), but it cannot be edited, deleted, or otherwise manipulated. It simply reflects the data from the app at a specific point in time.

Both datasets and databases hold information, but as you can see in the example, the database (the Contacts app) is dynamic – information can be accessed, managed, and changed. On the other hand, datasets are static (the printed contacts) – they reflect the current information that exists. If the information in the database is updated, you’ll have to create a new dataset to reflect these changes.

How are Datasets Created?

In order to understand datasets better, it’s important to know how they are made. There are a few ways to collect information for datasets:

  • Web scraping. It’s a more modern way to extract relevant data from online sources using custom-built or third-party web scraping tools.
  • Using existing databases. Use existing public or private (with permission) databases, like government data portals, IMDb, or weather forecast websites to collect structured data.
  • Recording data manually. Manually write down observations, like writing down numbers or descriptions, and conduct surveys.
  • Combining sources. Merge all your data to create a well-rounded dataset on a specific topic. The more sources you use, the more reliable and accurate your dataset will be.

Depending on the type of dataset you need for your research project, you can either create it yourself or purchase a pre-made one from dataset vendors. Some providers that offer web scraping tools also have pre-collected datasets that are regularly updated to minimize the need for manual data collection.

Web Scraping vs. Pre-built Datasets

It would be very difficult to create modern, up-to-date datasets without scraping the web. Manual data collection takes a lot of time, especially when collecting information online since there’s so much of it. 

Instead, web scrapers offer an option to collect, clean, and structure web data automatically. However, choosing between datasets and web scrapers depends on the nature of your project.

When to Choose Web Scraping?

Web scraping is a method of automatically collecting data from the web using specific software. Web scraping tools – self-made or third-party scraping APIs – can help gather large volumes of data from selected sites much quicker compared to manual collection, but that’s not the only benefit they offer. They also often parse (clean) and structure data for better readability, so there’s less need to process the information yourself.

However, customizing a web scraper and extracting data can be a hassle. If you’re planning to do it often, you’ll need to run the tool each time you need fresh information, and adjust it every time something in the website’s structure changes. If you use a self-made scraper, you’ll also have to invest in its maintenance. 

Alternatively, you can purchase pre-made web scrapers to avoid taking care of the tool’s infrastructure, but they can get expensive, especially with larger projects.

Web scraping is ideal for time-sensitive use cases, such as tracking e-commerce statistics (pricing, product availability, etc.), extracting social media, travel, real estate data, or collecting the latest news.

When to Choose Datasets?

While datasets are an incredibly valuable and time-saving tool, they come with their own set of limitations – notably, their freshness and their relevance to your project.

Firstly, pre-built datasets might not have the specific information you’re looking for. It’s rare for dataset vendors to give customers a peek into what information such datasets contain, so there’s a risk that the data will be only partially usable, or not usable at all, for your specific case. Additionally, datasets can become stale, especially if you need time-sensitive data.

Additionally, you can’t always customize a dataset. By purchasing a pre-made one, you can’t ask for specific information to be included as the datasets are made for the general audience. In this case, choosing a scraping API is much better.

Therefore, where data freshness isn’t the highest priority – analyzing historical e-commerce data, AI training, researching the market demographic, sales, & customer behavior – use datasets.

Datasets and Scraping APIs: Data Delivery Methods

Datasets are static, though periodically updated collections of data. Typically, they are downloaded and stored for offline use. Most often, you’ll find datasets in formats like CSV, JSON, or Excel, so they provide a clear, organized snapshot of information.

This makes datasets ideal for tasks like data analysis, machine learning model training, or accessing archival information where real-time updates are not critical. 

Scraping APIs, on the other hand, deliver data on-demand, providing real-time access to information. Unlike datasets, APIs offer the ability to fetch specific pieces of data. They are ideal for cases requiring up-to-date information, such as stock prices, weather updates, or social media feeds.

 

 | Datasets | Scraping APIs
Data access | Provides a snapshot of data from a specific time | On-demand access to specific data
Delivery frequency | One-time download, can be updated at a selected frequency (weekly, monthly, quarterly) | Real-time or on-demand
Data format | JSON, CSV, Excel, SQL, and other structured formats | Raw HTML, CSV, JSON
Performance | Not affected by network; works offline | Depends on server uptime, network latency
Cost | One-time payment | Subscription- or API credit-based; depends on traffic or requests

Conclusion

Datasets, especially pre-made ones, are becoming an integral part of data-driven decision-making. Valuable for dozens of fields, up-to-date datasets are essential for businesses as well as academia, as they help access loads of data in a readable, structured way.

Adam Dubois
Proxy geek and developer.

The post What Is a Dataset? Comparing Scraping APIs and Pre-Collected Datasets appeared first on Proxyway.

]]>
https://proxyway.com/guides/what-is-a-dataset/feed 0
The Best Python HTML Parsers https://proxyway.com/guides/the-best-python-html-parsers https://proxyway.com/guides/the-best-python-html-parsers#respond Tue, 31 Dec 2024 15:57:56 +0000 https://proxyway.com/?post_type=guides&p=29971 Find the best HTML parsers for Python.

The post The Best Python HTML Parsers appeared first on Proxyway.

]]>

Guides

Scraped web data is of little use to people if they can’t read and analyze it. That’s where HTML parsers play a vital role – they extract only the meaningful pieces from the raw downloaded data, and clean it for better readability.

the best Python HTML parsers

Python is one of the easiest programming languages to learn, and on top of that, it’s great for web scraping and has many libraries to expand its capabilities. For example, there are multiple HTML parser libraries available, so it can be tricky to choose the one best suited for your scraping project. In this article, you’ll find everything you need to know about Python HTML parsers: what they are, how they work, and which ones are the easiest to set up and use.

What is HTML Parsing?​

HTML parsing refers to extracting only the relevant information from HTML code. This means that raw HTML data – which includes markup tags, bugs, or other irrelevant pieces of information – is cleaned, structured, and modified into meaningful data points or content.

For example, let’s say you really like this article and want to extract the list of the best parsers for offline reading. While you could download the site as an HTML file, it would be tricky to read because of all the HTML tags. Instead, by using a web scraper to extract the list below and an HTML parser to process it, you would get only the relevant content in a clean format. 

Why Parse HTML Data?

Parsing increases the readability of HTML data by removing all unnecessary or broken information. To illustrate what HTML parsing does, let’s compare raw HTML with parsed data. 

Below is the code for a simple HTML website:

				
					<html>
  <head>
    <title>My Website</title>
  </head>
  <body>
    <h1>Welcome to My Website</h1>
    <p>This is an example paragraph.</p>
    <a href="https://example.com">Click here</a>
    <ul>
      <li>Item 1</li>
      <li>Item 2</li>
      <li>Item 3</li>
    </ul>
  </body>
</html>

				
			

Your browser uses the code and “translates” it into something that is more visually appealing and functional for humans. Here’s how your browser would render this code visually.

example of browser--rendered website
Note: the website looks different. This snapshot is for illustration purposes.

As you can see, the code contains HTML elements such as <html>, <body>. While this data is relevant for browsers to display the website correctly, it’s not particularly useful for humans. What we’re interested in is the website’s name, the link, and the data in bullet points.

By using a Python HTML parser like BeautifulSoup, we can remove irrelevant information pieces and convert the raw HTML into structured, readable data like this:

				
					Title: My Website
H1 Heading: Welcome to My Website
Paragraph: This is an example paragraph.
Link: https://example.com
List Items: ['Item 1', 'Item 2', 'Item 3']

				
			

In this case, the parser removed HTML elements and structured the most important data points. The result includes fewer lines of code, neatly ordered list items, and the retained link, though the ‘Click here’ text was removed. Importantly, no relevant information was lost. This structured data is much easier to read for us and can be further manipulated or analyzed.

Now, let’s take a look at the best HTML parsers to use with your Python scraper.

The Best Python HTML Parsers of 2025

1. BeautifulSoup

The most popular Python parsing library.​

BeautifulSoup is one of the most popular Python libraries used for parsing. It’s lightweight, versatile, and relatively easy to learn.

BeautifulSoup is a powerful HTML and XML parser that converts raw HTML documents into Python parse trees (a hierarchical tree model that breaks down structures and syntax based on Python’s rules), and then extracts relevant information from them. You can also navigate, search, and modify these trees as you see fit. BeautifulSoup is also excellent for handling poorly formatted or broken HTML – it can recognize errors, interpret the malformed HTML correctly, and fix it.

Since it’s a library for HTML manipulation, BeautifulSoup doesn’t work alone. To fetch static content, you’ll need an HTTP client like requests to retrieve the web pages for parsing. The same applies to dynamic content – you’ll have to use a headless browser like Selenium or Playwright.

The library is very popular and well-maintained, so you’ll find an active community and extensive documentation to help you out.

To install BeautifulSoup, all you have to do is run pip install beautifulsoup4 in your terminal. 

Let’s see how to use BeautifulSoup to parse our simple HTML website.

				
					from bs4 import BeautifulSoup

html_code = """
<html>
  <head>
    <title>My Website</title>
  </head>
  <body>
    <h1>Welcome to My Website</h1>
    <p>This is an example paragraph.</p>
    <a href="https://example.com">Click here</a>
    <ul>
      <li>Item 1</li>
      <li>Item 2</li>
      <li>Item 3</li>
    </ul>
  </body>
</html>
"""

soup = BeautifulSoup(html_code, 'html.parser')

title = soup.title.string
h1 = soup.h1.string
paragraph = soup.p.string
link_text = soup.a.string
link_href = soup.a['href']
list_items = [li.string for li in soup.find_all('li')]

print("Title:", title)
print("Heading (h1):", h1)
print("Paragraph:", paragraph)
print("Link Text:", link_text)
print("Link Href:", link_href)
print("List Items:", list_items)

				
			

Here’s what the final parsed results would look like:

				
Title: My Website
Heading (h1): Welcome to My Website
Paragraph: This is an example paragraph.
Link Text: Click here
Link Href: https://example.com
List Items: ['Item 1', 'Item 2', 'Item 3']

				
			

2. lxml

An efficient parsing library for HTML and XML documents.​

The lxml library is probably one of the most efficient options for parsing raw HTML and XML data. It’s fast and performant, so it’s great for handling large HTML documents.

The lxml library connects Python with powerful C libraries for processing HTML and XML. It turns raw data into objects you can navigate using XPath or CSS selectors. However, since it’s a static parser, you’ll need a headless browser for dynamic content. While lxml is very fast, it can be harder to learn if you’re not familiar with XPath queries.

Install lxml by running pip install lxml in your terminal, and adding from lxml import html in your scraping project.

Here’s how lxml would parse a simple website:

				
					from lxml import html

html_code = """
<html>
  <head>
    <title>My Website</title>
  </head>
  <body>
    <h1>Welcome to My Website</h1>
    <p>This is an example paragraph.</p>
    <a href="https://example.com">Click here</a>
    <ul>
      <li>Item 1</li>
      <li>Item 2</li>
      <li>Item 3</li>
    </ul>
  </body>
</html>
"""
tree = html.fromstring(html_code)

title = tree.xpath('//title/text()')[0]
h1 = tree.xpath('//h1/text()')[0]
paragraph = tree.xpath('//p/text()')[0]
link_text = tree.xpath('//a/text()')[0]
link_href = tree.xpath('//a/@href')[0]
list_items = tree.xpath('//ul/li/text()')

print("Title:", title)
print("Heading (h1):", h1)
print("Paragraph:", paragraph)
print("Link Text:", link_text)
print("Link Href:", link_href)
print("List Items:", list_items)

				
			

Here’s how the parsed results would look:

				
					Title: My Website
Heading (h1): Welcome to My Website
Paragraph: This is an example paragraph.
Link Text: Click here
Link Href: https://example.com
List Items: ['Item 1', 'Item 2', 'Item 3']

				
			

3. PyQuery

Library for parsing HTML and XML documents with jQuery syntax.​

PyQuery is another Python library for parsing and manipulating HTML and XML documents. Its syntax is similar to jQuery, so it’s a good choice if you’re already familiar with the library. 

PyQuery is quite intuitive – CSS-style selectors make it easy to navigate the document and extract or modify HTML and XML content. PyQuery also allows you to create document trees for easier data extraction. It works similarly to BeautifulSoup and lxml: you can load an HTML or XML document into a Python object and use jQuery-style commands to interact with it, so the key difference is the syntax. PyQuery also has many helper functions, so you won’t have to write that much code yourself.

The library is efficient for static content, but it does not natively handle dynamic content – it needs headless browsers to render JavaScript-driven pages before parsing the content.

To install PyQuery, run pip install pyquery in your terminal, and add from pyquery import PyQuery as pq in your project to use it.

Here’s an example of how to use PyQuery to parse a simple HTML document:

				
					from pyquery import PyQuery as pq

html_code = """
<html>
  <head>
    <title>My Website</title>
  </head>
  <body>
    <h1>Welcome to My Website</h1>
    <p>This is an example paragraph.</p>
    <a href="https://example.com">Click here</a>
    <ul>
      <li>Item 1</li>
      <li>Item 2</li>
      <li>Item 3</li>
    </ul>
  </body>
</html>
"""

doc = pq(html_code)

title = doc("title").text()
h1 = doc("h1").text()
paragraph = doc("p").text()
link_text = doc("a").text()
link_href = doc("a").attr("href")
list_items = [li.text() for li in doc("ul li").items()]

print("Title:", title)
print("Heading (h1):", h1)
print("Paragraph:", paragraph)
print("Link Text:", link_text)
print("Link Href:", link_href)
print("List Items:", list_items)

				
			

And here’s how PyQuery would print the results:

				
					Title: My Website
Heading (h1): Welcome to My Website
Paragraph: This is an example paragraph.
Link Text: Click here
Link Href: https://example.com
List Items: ['Item 1', 'Item 2', 'Item 3']

				
			

4. requests-html

Parsing library that supports static and dynamic content.​

requests-html is a Python HTML parsing library that supports both static and dynamic content. It combines the convenience of the requests library (an HTTP client for fetching web pages) with the JavaScript rendering abilities of a headless browser, so there are fewer libraries for you to use.

With requests-html, you can easily send HTTP requests to a webpage and receive the fully rendered HTML. requests-html is great for static pages as you can send requests and parse raw data with one package. However, the library stands out because it can scrape JavaScript-based web pages, too – it relies on a Chromium web browser for handling dynamic content natively. Additionally, it has multiple parsing strategies, including CSS selectors and XPath, so it’s very convenient.

requests-html also supports multi-threaded requests, so you can interact with several web pages at once. However, this makes it much harder to learn, and it’s significantly slower than traditional parsers due to requiring additional processing power to render the JavaScript.

To install requests-html, run pip install requests-html in your terminal. Once installed, add from requests_html import HTMLSession to your scraping project.

Here’s how to use requests-html to parse a simple website:

				
from requests_html import HTML

html_code = """
<html>
  <head>
    <title>My Website</title>
  </head>
  <body>
    <h1>Welcome to My Website</h1>
    <p>This is an example paragraph.</p>
    <a href="https://example.com">Click here</a>
    <ul>
      <li>Item 1</li>
      <li>Item 2</li>
      <li>Item 3</li>
    </ul>
  </body>
</html>
"""

# Parse the same sample page as in the previous examples; for live pages,
# create an HTMLSession, call session.get(url), and use response.html instead
doc = HTML(html=html_code)

title = doc.find('title', first=True).text
h1 = doc.find('h1', first=True).text
paragraph = doc.find('p', first=True).text
link_text = doc.find('a', first=True).text
link_href = doc.find('a', first=True).attrs['href']
list_items = [li.text for li in doc.find('ul li')]

print("Title:", title)
print("Heading (h1):", h1)
print("Paragraph:", paragraph)
print("Link Text:", link_text)
print("Link Href:", link_href)
print("List Items:", list_items)

				
			

The parsed results will look like this:

				
					Title: My Website
Heading (h1): Welcome to My Website
Paragraph: This is an example paragraph.
Link Text: Click here
Link Href: https://example.com
List Items: ['Item 1', 'Item 2', 'Item 3']
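The example above parses the sample HTML directly. For a live, JavaScript-heavy page, a minimal sketch of the fetch-and-render step could look like this – the URL is a placeholder, and the first render() call downloads a Chromium build:

from requests_html import HTMLSession

session = HTMLSession()
response = session.get("https://example.com")

# Executes the page's JavaScript in headless Chromium so that
# dynamically inserted elements become part of response.html
response.html.render(timeout=20)

print(response.html.find("title", first=True).text)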

				
			

The Differences Between Python HTML Parsers

The choice of HTML parser boils down to what your project needs – while some projects might require native JavaScript rendering, some can do without that. Also, check if speed and efficiency are up to your expectations. Here’s how the libraries compare:

Library | Speed | Ease of use | Native dynamic content handling | Ideal use case
BeautifulSoup | Fast | Very easy | No | Simple HTML parsing
lxml | Very fast | Moderate | No | Fast parsing
PyQuery | Fast | Easy | No | Scraping with CSS selectors
requests-html | Fast (static content); moderate (dynamic content) | Easy | Yes | Scraping and parsing dynamic web pages

In short, use BeautifulSoup or lxml for static HTML content. They are efficient and relatively easy to learn. If you want to handle dynamic content, use requests-html which integrates a headless browser. If you’re planning to scrape with CSS selectors, use PyQuery for easy navigation and data manipulation. 

The post The Best Python HTML Parsers appeared first on Proxyway.

]]>
https://proxyway.com/guides/the-best-python-html-parsers/feed 0
How to Scrape Google Flights With Python: A Step-by-Step Tutorial https://proxyway.com/guides/scrape-google-flights https://proxyway.com/guides/scrape-google-flights#respond Mon, 25 Nov 2024 13:50:26 +0000 https://proxyway.com/?post_type=guides&p=27774 This is a step-by-step tutorial on how to build a Google Flights Scraper with Python

The post How to Scrape Google Flights With Python: A Step-by-Step Tutorial appeared first on Proxyway.

]]>

Guides

Instead of having multiple browser tabs open to check every destination, you can scrape Google Flights with a Python-based scraper, and get structured flight data in minutes.

How to scrape Google Flights

Planning trips online has become significantly more convenient, but there are still roadblocks – booking flights can still be time-consuming due to the sheer amount of data. While platforms like Google Flights offer a neat way to check all necessary information and compare it across different airlines, manually looking through each date and destination can be daunting. By automating this process with a Google Flights scraper, gathering large volumes of data and comparing it becomes less of a hassle. 

Whether you’re a person looking for a bargain on flight tickets, a business analyst, or a scraping enthusiast searching for a new challenge, this guide will help you build a scraper that collects Google Flights data from scratch. 

Why Scrape Google Flights?

Google Flights offers a vast amount of valuable data – from flight times and prices to the environmental impact of the flight. By scraping flight pages you can extract prices, schedules, and availability, as well as plan trips and stay updated when changes are made. 

Platforms like Google Flights show flight information based on your requirements (departure and arrival location, dates, number of passengers), but it’s not always easy to compare – you need to expand each result to see all the relevant details, such as layovers, and several expanded results quickly become hard to read. Scraping real-time data can help you find the best deals and plan itineraries better. Or, if you’re a business owner, it can help you gather market intelligence and analyze customer behavior.

What Google Flights Data Can You Scrape?

There are dozens of reasons to scrape Google Flights data. While the intention might vary based on what you’re trying to accomplish, both travelers and businesses can benefit from it.

If you’re simply planning a trip, scraping Google Flights data might help you to:

  • Compare prices. Getting information about pricing is one of the key reasons why people choose to scrape Google Flights. Structured scraped results can help to evaluate ticket prices, and compare them across different airlines.
  • Check flight times. Another major reason to extract Google Flights data is flight times. You can collect departure and arrival times and dates, compare them, and select the option that fits your itinerary best.
  • Find out about stops. Most people prefer direct flights. Google Flights has data that allows you to check if there will be any layovers until you reach your destination.
  • Review duration. Knowing how long a flight between specific locations takes will help you plan the trip better and see how it fits into your schedule.
  • Learn about emissions. Scraped data from Google Flights can help you to evaluate carbon emissions of the flights, and make more sustainable choices.


If you’re looking to scrape Google Flights for business purposes, you can:

  • Analyze user behavior patterns. There are specific times when people tend to travel to certain destinations, such as during winter holidays, summer vacations, and more. By reviewing these behavior patterns, companies can segment user bases and target advertisements better.
  • Improve pricing strategies. Flight information is relevant to more businesses than just airports and airlines. Hotels, taxi services, car rental companies, and travel insurance providers can track rising or falling demand for specific locations and adjust their pricing accordingly.
  • Create bundle deals. Accurate flight data can help travel agencies create better travel deals by bundling tickets, hotels, transportation, and activities for customers.
  • Improve risk management. Travel insurance companies can leverage flight data to identify popular destinations, and adjust policies and pricing to better align with customer demand.
Benefits of scraping Google Flights Data for travelers and businesses

Is Scraping Google Flights Legal?

Data on Google Flights is public, and there are no laws that prohibit collecting publicly available information. However, there are several things to keep in mind to avoid legal implications.

Here are several tips on how to scrape Google Flights data ethically:

  • Comply with Google’s terms of use. Take the time to go over Google’s terms of service to make sure you don’t violate any of their guidelines.
  • Read the robots.txt file. The file tells robots (such as scrapers) which areas of a site they can and cannot access (e.g., admin panels or password-protected pages). Be respectful and follow those rules – a minimal way to check them is sketched right after this list.
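Here’s a small sketch (not part of the original tutorial) of checking robots.txt with Python’s standard library before sending any requests:

from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt rules
rp = RobotFileParser('https://www.google.com/robots.txt')
rp.read()

# Check whether a generic user agent may fetch the target path
print(rp.can_fetch('*', 'https://www.google.com/travel/flights'))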

How to Scrape Google Flights with Python: Step-by-Step Guide

If you’re looking to build your own Google Flights scraper, here’s a comprehensive guide on how to do so from scratch.

In this example, we’ll use Python with Selenium to build the scraper. Python is a great choice due to its straightforward syntax – it’s relatively easy to write, maintain, and understand. Additionally, since Google Flights is a highly dynamic website, we’ll use Selenium to handle dynamic content and interactive elements, such as buttons.

Below is a table containing all information about the scraper we’re going to build.

Programming language: Python
Libraries: Selenium
Target URL: https://www.google.com/travel/flights/
Data to scrape:

1. Departure date from the origin location
2. Return date from the destination
3. Operating airline
4. Departure time
5. Arrival time
6. Flight duration
7. Departure airport
8. Arrival airport
9. Layovers
10. Cost of the trip
11. Best offer

How to save data: CSV file

Prerequisites

Before the actual scraping begins, you’ll need to install the prerequisites. 

  1. Install Python. You can download the latest version from Python’s official website. If you’re not sure whether Python is already installed, check by running python --version in your terminal (Terminal on macOS or Command Prompt on Windows).
  2. Install Selenium. To use Selenium with Python for this scraper, install it by running pip install selenium in the terminal.
  3. Install Chrome WebDriver. Selenium drives browsers such as Chrome (including in headless mode) through a driver. Download the Chrome WebDriver that corresponds to your Chrome browser version – if the driver isn’t on your PATH, see the sketch after this list.
  4. Get a text editor. You’ll need a text editor to write and execute your code. There’s one preinstalled on your computer (TextEdit on macOS or Notepad on Windows), but you can opt for a third-party editor, like Visual Studio Code, if you prefer.
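As a quick sanity check – and in case the downloaded chromedriver isn’t on your PATH – here’s a minimal sketch (the driver path is a placeholder, not something from the original tutorial; note that Selenium 4.6+ can also download a matching driver for you automatically):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import selenium

print(selenium.__version__)  # confirm Selenium is installed

# Point Selenium at an explicit chromedriver binary (placeholder path)
service = Service('/path/to/chromedriver')
driver = webdriver.Chrome(service=service)
driver.quit()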

Importing the Libraries

Once all your tools are installed, it’s time to import the necessary libraries. Since we’ll be driving Chrome from Python, we start by importing Selenium’s WebDriver module (and make sure the Chrome WebDriver itself is on your system PATH, or pointed to explicitly, so the browser can work with Selenium).

Step 1. Import WebDriver from Selenium module.

				
					from selenium import webdriver

				
			

Step 2. Then, import the By selector module from Selenium to simplify element selection.

				
					from selenium.webdriver.common.by import By

				
			

Step 3. Import all necessary Selenium modules before moving on to the next steps.

				
					from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from selenium.webdriver import Keys, ActionChains

				
			

Step 4. We want to save our results into a CSV file, so let’s import the CSV module, too.

				
					import csv

				
			

Setting Up Global Variables and Parameters

After importing all the necessary libraries, we need to set up global variables to store key values. These include the target URL, a timeout (to accommodate page loading time), and any specific search parameters.

Step 5. So, let’s set up global variables.

				
					start_url = "https://www.google.com/travel/flights"
timeout = 10 #seconds

				
			

Step 6. Next, set up the parameters for the scraper – specifically, the criteria you’re looking for in the flights. These include departure and arrival locations, as well as travel dates.

				
					my_params = {
    'from': 'LAX',
    'to': 'Atlanta',
    'departure': '2024-11-27',
    'return': '2024-12-03'
}
				
			

Note: You can also define parameters for one-way flights.

				
					my_params = {
    'from': 'LAX',
    'to': 'Atlanta',
    'departure': '2024-11-27',
    'return': None
}

				
			

When browsing Google Flights, you don’t need to specify the exact airport for departure or arrival – you can simply enter a city (or even a country) instead, because the auto-complete feature suggests relevant options as you type. For example, typing Los will display suggestions that match the input – LOS airport in Nigeria, Los Angeles in the U.S., or Los Cabos in Mexico.

You can edit these values as you see fit – your ‘from’ value can be set to ‘Los Angeles’, and the scraper will target any airport in Los Angeles for departure. You can also specify a different airport, like ‘JFK’ or change the dates completely. But, for the sake of this example, let’s use LAX for departure and any airport in Atlanta for arrival.

Setting Up the Browser

Step 7. Before we start scraping with Selenium, you need to prepare the browser. As mentioned earlier, we’ll be using Chrome in this example.

				
def prepare_browser() -> webdriver:
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--disable-gpu')
    driver = webdriver.Chrome(options=chrome_options)
    return driver

				
			

Note: This browser setup will allow you to see the scraping in action. However, you can add an additional chrome_options line to run Chrome in headless mode.

				
def prepare_browser() -> webdriver:
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--disable-gpu')
    chrome_options.add_argument('--headless=new')
    driver = webdriver.Chrome(options=chrome_options)
    return driver

				
			

Step 8. It’s also important to set up the main() function. It calls the prepare_browser function, which returns a Chrome driver. Additionally, we need to instruct the driver to execute the scraping, and close when it’s finished.

				
def main() -> None:
    driver = prepare_browser()
    scrape(driver)
    driver.quit()

if __name__ == '__main__':
    main()

				
			

Scraping Google Flights

When the browser is prepared, we can actually start scraping the results from the Google Flights page. 

Handling Cookies on Google Flights with Python

While the start_url is the Google Flights main page, the scraper might bump into a cookie consent page first. Hence, we need to instruct our scraper to handle it.

Step 9. Let’s provide the scraper with some extra information to handle the cookie consent page. Namely, find and copy the CSS selectors for the “Accept” button. We can do this by using Inspect Element.

Cookie consent button on Google Flights page

If the scraper successfully clicks the “Accept” button on the cookie consent page, we’ll still need to wait until the actual Flights page loads. In this example, we’re using the “Search” button’s appearance as an indication that our target page has loaded. 

Step 10. Using the search button’s CSS selector, instruct the scraper to wait for it to appear before moving on to the next step. So, let’s write a function that will print “Search button found, continuing.” if everything went well, and “Something went wrong.” if the scraper couldn’t locate said button.

Search button in Google Flights

Here’s how the function for accepting cookies and locating the “Search” button looks like:

				
def scrape(driver: webdriver) -> None:
    driver.get(start_url)
    if driver.current_url.startswith("https://consent.google.com"):
        print("Hit the consent page, handling it.")
        btn_consent_allow = driver.find_element(
            By.CSS_SELECTOR, 'button.VfPpkd-LgbsSe[jsname="b3VHJd"]')
        btn_consent_allow.click()
        try:
            WebDriverWait(driver, timeout).until(
                expected_conditions.presence_of_element_located((
                    By.CSS_SELECTOR, 'button[jsname="vLv7Lb"]'))
            )
            print("Search button found, continuing.")
            prepare_query(driver)
        except Exception as e:
            print(f"Something went wrong: {e}")

				
			

Continuing inside the scrape() function, let’s add some code instructing the scraper to locate and click the “Search” button, and print “Got the results back.” once the results have loaded.

				
search_btn = driver.find_element(By.CSS_SELECTOR, 'button[jsname="vLv7Lb"]')
search_btn.click()
try:
    WebDriverWait(driver, timeout).until(
        expected_conditions.presence_of_element_located((
            By.CSS_SELECTOR, 'ul.Rk10dc'))
    )
    print("Got the results back.")
    # The rest of this try block is completed later, once find_lists()
    # and write_to_csv() are defined (see the full script below).

				
			

Preparing the Search Query

At the beginning of our script, we defined our parameters: origin location (‘from’), destination (‘to’), a date for departure (‘departure’), and a date for return (‘return’). These parameters will help the scraper fill in the query fields. To allow def scrape to function properly, we need to instruct the scraper about how it should prepare the search query. 

Step 11. While we have our values ready, the scraper needs to know where to use them. For that, we’ll need to find and copy another set of CSS selectors for “Where from?”, “Where to?”, and date fields.

How to find CSS selector for Google Flights "Where to?" field

However, we need to prepare our scraper for two potential date_to scenarios: either the return date is defined in my_params, or it isn’t. If the return date is set to None, we’ll also need to change the trip-type selection from Round trip to One-way in the dropdown menu. Thus, we’ll need a CSS selector for that menu as well.

Finding the CSS selector for the dropdown menu

Step 12. Instruct the scraper about how it should fill in the “Where from?”, “Where to?”, and date fields.

				
def prepare_query(driver) -> None:
    field_from = driver.find_element(By.CSS_SELECTOR, 'div.BGeFcf div[jsname="pT3pqd"] input')
    field_to = driver.find_element(By.CSS_SELECTOR, 'div.vxNK6d div[jsname="pT3pqd"] input')
    date_from = driver.find_element(By.CSS_SELECTOR, 'div[jsname="I3Yihd"] input')
    date_to = None
    if my_params['return'] is None or my_params['return'] == '':
        dropdown = driver.find_element(By.CSS_SELECTOR, 'div.VfPpkd-aPP78e')
        dropdown.click()
        ActionChains(driver)\
            .move_to_element(dropdown)\
            .send_keys(Keys.ARROW_DOWN)\
            .pause(1)\
            .send_keys(Keys.ENTER)\
            .perform()



				
			

If my_params contains a pre-defined return date, we instead need to find a CSS selector for the return date field rather than changing the value in the dropdown menu. Either way, the scraper will fill in the form using data from my_params.

				
    else:
        date_to = driver.find_element(By.CSS_SELECTOR,
            'div.K2bCpe div[jsname="CpWD9d"] input')
    field_input(driver, field_from, my_params['from'])
    field_input(driver, field_to, my_params['to'])
    field_input(driver, date_from, my_params['departure'])
    if date_to is not None:
        field_input(driver, date_to, my_params['return'])
    ActionChains(driver)\
        .send_keys(Keys.ENTER)\
        .perform()
    print("Done preparing the search query")

				
			

Step 13. Once all the fields we need to fill in are defined, instruct the scraper to enter the information into the selected fields.

We’ll use ActionChains to send the text that needs to be typed in. Additionally, let’s instruct the scraper to press Enter, so that the first suggested option for departure and arrival dates is selected from the dropdown menu.

				
def field_input(driver, element, text) -> None:
    element.click()
    ActionChains(driver)\
    .move_to_element(element)\
    .send_keys(text)\
    .pause(1)\
    .send_keys(Keys.ENTER)\
    .perform()

				
			

Note: In Step 10, we instructed the scraper to click on the “Search” button to run this search query.

Returning the Results

If you check the Google Flights page source, you’ll notice that the results come in an unordered list, where one list item contains all the information about a single trip – the dates, times, price, layovers, and more. When browsing the page, each list item should look something like this:

One flight result is one list item

Step 14. If we want these results to sit neatly in a table when we save them, we need to store them in a dictionary. To do this, we need to collect the CSS selectors for each element in a result.

				
					def get_flight_info(element, best) -> dict:

				
			

Let’s begin with flight times. The departure time (times[0]) will be stored as time_leave, and the arrival time (times[1]) as time_arrive.

Finding CSS selectors for flight times on Google Flights results
				
					times = element.find_elements(By.CSS_SELECTOR, 
        'div.Ir0Voe span[role="text"]')

				
			

Let’s do the same thing with airports.

				
					airports = element.find_elements(By.CSS_SELECTOR, 
        'div.QylvBf span span[jscontroller="cNtv4b"]')

				
			

And the rest of the provided information – airlines, layovers, cost, and suggested best result.

				
					flight_info = {
        'airline': element.find_element(By.CSS_SELECTOR, 
            'div.Ir0Voe > div.sSHqwe.tPgKwe.ogfYpf > span:last-child').text,
        'date_leave': my_params['departure'], #This will be filled in from my_params
        'date_arrive': my_params['return'], #This will also be filled from my_params
        'time_leave': times[0].text,
        'time_arrive': times[1].text,
        'duration_string': element.find_element(By.CSS_SELECTOR, 
            'div.Ak5kof > div.gvkrdb.AdWm1c.tPgKwe.ogfYpf').text,
        'airport_leave': airports[0].text,
        'airport_arrive': airports[1].text,
        'layovers': element.find_element(By.CSS_SELECTOR, 
            'div.BbR8Ec > div.EfT7Ae.AdWm1c.tPgKwe > span').text,
        'cost': element.find_element(By.CSS_SELECTOR,
            'div.U3gSDe div div.YMlIz.FpEdX span').text,
        'best': best #True for flights from the suggested best list, or False for everything else
    }
    return flight_info

				
			

Extracting and Parsing the Page Data

Google Flights has a neat feature that highlights the best results (shortest flight duration, fewest layovers, the cheapest flight) alongside all other results that match your query. Since you may not like the suggested best options, let’s save both the best results and all the remaining ones, collected into list_elems.

Step 15. Let’s join these two lists and return them as a single item under one name – list_of_flights.

				
def find_lists(driver):
    list_elems = driver.find_elements(By.CSS_SELECTOR, 'ul.Rk10dc')
    list_of_flights = parse(list_elems[0], True) + parse(list_elems[1], False)
    return list_of_flights
				
			

It’s important to parse the downloaded page to collect only the necessary information – in this case, the flight lists. As mentioned before, we have two of them – the best results list and the rest. But we don’t want them to be separated in our final saved list of all flights. 

Step 16. Let’s parse our page data. The list_of_flights will contain the final results. 

				
def parse(list_elem: list, best: bool) -> list:
    list_items = list_elem.find_elements(By.CSS_SELECTOR, 'li.pIav2d')
    list_of_flights = []
    for list_item in list_items:
        list_of_flights.append(get_flight_info(list_item, best))
    return list_of_flights


				
			

Saving the Output to CSV

At the very beginning, we imported the CSV library to save our data. 

Step 17. Let’s add a few extra lines of code so that all flight information we previously defined in our dictionary and scraped results are saved.

				
					def write_to_csv(flights):
    field_names = ['airline','date_leave','date_arrive','time_leave',
                   'time_arrive','duration_string','airport_leave',
                   'airport_arrive','layovers','cost','best']
    output_filename = 'flights.csv'
    with open (output_filename, 'w', newline='', encoding='utf-8') as f_out:
        writer = csv.DictWriter(f_out, fieldnames = field_names)
        writer.writeheader()
        writer.writerows(flights)

				
			
Parsed results saved in a CSV file opened with Numbers (Mac)

Here’s the entire script for this Google Flights scraper:

				
					from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from selenium.webdriver import Keys, ActionChains
import csv

start_url = "https://www.google.com/travel/flights"
timeout = 10

my_params = {
    'from': 'LAX',
    'to': 'Atlanta',
    'departure': '2024-11-27',
    'return': '2024-12-03'
}
my_params2 = {
    'from': 'LAX',
    'to': 'Atlanta',
    'departure': '2024-11-27',
    'return': None
}

def prepare_browser() -> webdriver:
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--disable-gpu')
    driver = webdriver.Chrome(options=chrome_options)
    return driver

def field_input(driver, element, text) -> None:
    element.click()
    ActionChains(driver)\
    .move_to_element(element)\
    .send_keys(text)\
    .pause(1)\
    .send_keys(Keys.ENTER)\
    .perform()

def prepare_query(driver) -> None:
    field_from = driver.find_element(By.CSS_SELECTOR, 'div.BGeFcf div[jsname="pT3pqd"] input')
    field_to = driver.find_element(By.CSS_SELECTOR, 'div.vxNK6d div[jsname="pT3pqd"] input')
    date_from = driver.find_element(By.CSS_SELECTOR, 'div[jsname="I3Yihd"] input')
    date_to = None
    if my_params['return'] is None or my_params['return'] == '':
        dropdown = driver.find_element(By.CSS_SELECTOR, 'div.VfPpkd-aPP78e')
        dropdown.click()
        ActionChains(driver)\
            .move_to_element(dropdown)\
            .send_keys(Keys.ARROW_DOWN)\
            .pause(1)\
            .send_keys(Keys.ENTER)\
            .perform()
    else:
        date_to = driver.find_element(By.CSS_SELECTOR, 'div.K2bCpe div[jsname="CpWD9d"] input')
    field_input(driver, field_from, my_params['from'])
    field_input(driver, field_to, my_params['to'])
    field_input(driver, date_from, my_params['departure'])
    if date_to is not None:
        field_input(driver, date_to, my_params['return'])
    ActionChains(driver)\
        .send_keys(Keys.ENTER)\
        .perform()
    print("Done preparing the search query")

def get_flight_info(element, best) -> dict:
    times = element.find_elements(By.CSS_SELECTOR, 'div.Ir0Voe span[role="text"]')
    airports = element.find_elements(By.CSS_SELECTOR, 'div.QylvBf span span[jscontroller="cNtv4b"]')
    flight_info = {
        'airline': element.find_element(By.CSS_SELECTOR, 'div.Ir0Voe > div.sSHqwe.tPgKwe.ogfYpf > span:last-child').text,
        'date_leave': my_params['departure'],
        'date_arrive': my_params['return'],
        'time_leave': times[0].text,
        'time_arrive': times[1].text,
        'duration_string': element.find_element(By.CSS_SELECTOR, 'div.Ak5kof > div.gvkrdb.AdWm1c.tPgKwe.ogfYpf').text,
        'airport_leave': airports[0].text,
        'airport_arrive': airports[1].text,
        'layovers': element.find_element(By.CSS_SELECTOR, 'div.BbR8Ec > div.EfT7Ae.AdWm1c.tPgKwe > span').text,
        'cost': element.find_element(By.CSS_SELECTOR, 'div.U3gSDe div div.YMlIz.FpEdX span').text,
        'best': best
    }
    return flight_info

def parse(list_elem: list, best: bool) -> list:
    list_items = list_elem.find_elements(By.CSS_SELECTOR, 'li.pIav2d')
    list_of_flights = []
    for list_item in list_items:
        list_of_flights.append(get_flight_info(list_item, best))
    return list_of_flights

def find_lists(driver):
    list_elems = driver.find_elements(By.CSS_SELECTOR, 'ul.Rk10dc')
    list_of_flights = parse(list_elems[0], True) + parse(list_elems[1], False)
    return list_of_flights

def write_to_csv(flights):
    field_names = ['airline', 'date_leave', 'date_arrive', 'time_leave',
                   'time_arrive', 'duration_string', 'airport_leave',
                   'airport_arrive', 'layovers', 'cost', 'best']
    output_filename = 'flights.csv'
    with open(output_filename, 'w', newline='', encoding='utf-8') as f_out:
        writer = csv.DictWriter(f_out, fieldnames=field_names)
        writer.writeheader()
        writer.writerows(flights)

def scrape(driver: webdriver) -> None:
    driver.get(start_url)
    if driver.current_url.startswith("https://consent.google.com"):
        print("Hit the consent page, dealing with it.")
        btn_consent_allow = driver.find_element(By.CSS_SELECTOR, 'button.VfPpkd-LgbsSe[jsname="b3VHJd"]')
        btn_consent_allow.click()
        try:
            WebDriverWait(driver, timeout).until(
                expected_conditions.presence_of_element_located((
                    By.CSS_SELECTOR, 'button[jsname="vLv7Lb"]'))
            )
            print("Search button found, continuing.")
            prepare_query(driver)
        except Exception as e:
            print("Something went wrong: {e}")
        search_btn = driver.find_element(By.CSS_SELECTOR, 'button[jsname="vLv7Lb"]')
        search_btn.click()
        try:
            WebDriverWait(driver, timeout).until(
                expected_conditions.presence_of_element_located((
                    By.CSS_SELECTOR, 'ul.Rk10dc'))
            )
            print("Got the results back.")
            flights = find_lists(driver)
            write_to_csv(flights)
        except Exception as e:
            print(f"Something went wrong: {e}")

def main() -> None:
    driver = prepare_browser()
    scrape(driver)
    driver.quit()

if __name__ == '__main__':
    main()

				
			

Avoiding the Roadblocks When Scraping Google Flights

Building a Google Flights scraper can be a pretty daunting task, especially if you’re new to scraping, and it becomes even more challenging if you plan to scrape at scale. While we’ve already solved issues like the cookie consent page, other problems can arise when you scale up.

Use Proxies to Mask Your IP

Websites don’t like bot traffic, so they try to prevent it by using tools like Cloudflare. While scraping the Google Flights page once or twice probably won’t get you rate-limited or banned, it can happen if you try to scale up. 

You can use proxy services to prevent that. Proxies will mask your original IP by routing the requests through different IP addresses, making them blend in with regular human traffic. Typically, human traffic comes from residential IPs, so this type of proxy is the least likely to be detected and blocked.

This is a step-by-step guide on how to set up and authenticate a proxy with Selenium using Python.
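As a rough illustration (the proxy address below is just a placeholder, and authenticated proxies usually need a browser extension or a tool like Selenium Wire rather than a plain flag), routing this scraper’s Chrome traffic through a proxy could look like this:

from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
# Placeholder proxy endpoint – replace with your provider's host and port
chrome_options.add_argument('--proxy-server=http://203.0.113.10:8080')
driver = webdriver.Chrome(options=chrome_options)
driver.get('https://www.google.com/travel/flights')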

Use the Headless Browser Mode

The Google Flights page is a dynamic website that heavily relies on JavaScript – not only for storing data, but also for anti-bot protection. Running your scraper in headless Chrome mode allows it to render JavaScript like a regular user would and even modify the browser fingerprint.

A browser fingerprint is a collection of unique parameters like screen resolution, timezone, IP address, JavaScript configuration, and more, that slightly vary among users but remain typical enough to avoid detection. Headless browsers can mimic these parameters to appear more human-like, reducing the risk of detection.

Step 7 in Setting Up the Browser gives two examples of how to set up Chrome for scraping, one of them containing this line of code: chrome_options.add_argument("--headless=new")

Adding this chrome_option will run the browser in headless mode. You may not want to use it now, but it’s good to know how to enable it if necessary. 

Be Aware of Website’s Structural Changes

This Google Flights scraper relies heavily on CSS selectors – they help to find the specific input fields and fill them in. However, if Google makes adjustments to the Flights page, the scraper might break. That’s because the CSS selectors can change when a site developer modifies the HTML structure. 

If you plan to use this Google Flights scraper regularly, keep in mind that selectors can change over time, and you’ll need to update them to keep the scraper functional.
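One way to make selector changes fail more gracefully (a sketch, not part of the original scraper) is to wrap lookups in a small helper that returns None instead of crashing, so you can see which selector broke:

from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

def find_or_none(driver, css_selector):
    # Return the element if the selector still matches, otherwise None
    try:
        return driver.find_element(By.CSS_SELECTOR, css_selector)
    except NoSuchElementException:
        print(f"Selector no longer matches: {css_selector}")
        return None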

Conclusion

Scraping Google Flights with Python is no easy feat, especially for beginners, but it yields a great deal of information useful to travelers and businesses alike. Despite the project’s difficulty, the data will come in handy when planning a trip, gathering market intelligence, analyzing trends, or better understanding your customers’ needs.

The post How to Scrape Google Flights With Python: A Step-by-Step Tutorial appeared first on Proxyway.
