How to Scrape Google Flights With Python: A Step-by-Step Tutorial

This is a step-by-step tutorial on how to build a Google Flights scraper with Python.


Instead of having multiple browser tabs open to check every destination, you can scrape Google Flights with a Python-based scraper, and get structured flight data in minutes.

How to scrape Google Flights

Planning trips online has become significantly more convenient, but there are still roadblocks – booking flights can still be time-consuming due to the sheer amount of data. While platforms like Google Flights offer a neat way to check all necessary information and compare it across different airlines, manually looking through each date and destination can be daunting. By automating this process with a Google Flights scraper, gathering large volumes of data and comparing it becomes less of a hassle. 

Whether you're a traveler looking for a bargain on flight tickets, a business analyst, or a scraping enthusiast searching for a new challenge, this guide will help you build a scraper that collects Google Flights data from scratch.

Why Scrape Google Flights?

Google Flights offers a vast amount of valuable data – from flight times and prices to the environmental impact of the flight. By scraping flight pages you can extract prices, schedules, and availability, as well as plan trips and stay updated when changes are made. 

Platforms like Google Flights offer flight information based on your requirements (departure and arrival location, dates, number of passengers), but it's not always easy to compare – you need to expand the results to see all relevant information, such as layovers, and several expanded results at once quickly become hard to read. Scraping real-time data can help you find the best deals and plan itineraries better. And if you're a business owner, it can help you gather market intelligence and analyze customer behavior.

What Google Flights Data Can You Scrape?

There are dozens of reasons to scrape Google Flights data. While the intention might vary based on what you’re trying to accomplish, both travelers and businesses can benefit from it.

If you’re simply planning a trip, scraping Google Flights data might help you to:

  • Compare prices. Getting information about pricing is one of the key reasons why people choose to scrape Google Flights. Structured scraped results can help to evaluate ticket prices, and compare them across different airlines.
  • Check flight times. Another major reason to extract Google Flights data is flight times. You can collect departure and arrival times and dates, compare them, and select the option that fits your itinerary best.
  • Find out about stops. Most people prefer direct flights. Google Flights has data that allows you to check whether there will be any layovers on the way to your destination.
  • Review duration. Knowing how long the flight is going to take will help you plan the trip better, and see how the flight fits into your schedule. Such data can give you insights on the duration of your flights between specific locations.
  • Learn about emissions. Scraped data from Google Flights can help you to evaluate carbon emissions of the flights, and make more sustainable choices.


If you’re looking to scrape Google Flights for business purposes, you can:

  • Analyze user behavior patterns. There are specific times when people tend to travel to certain destinations, such as during winter holidays, summer vacations, and more. By reviewing these behavior patterns, companies can segment user bases and target advertisements better.
  • Improve pricing strategies. Flight information is relevant for more businesses than just airports and airlines. Hotels, taxi services, car rental companies, and travel insurance providers can track rising or falling demand for specific locations, and adjust their pricing accordingly.
  • Create bundle deals. Accurate flight data can help travel agencies create better travel deals by bundling tickets, hotels, transportation, and activities for customers.
  • Improve risk management. Travel insurance companies can leverage flight data to identify popular destinations, and adjust policies and pricing to better align with customer demand.
Benefits of scraping Google Flights Data for travelers and businesses

Is Scraping Google Flights Legal?

Data on Google Flights is public, and there are no laws prohibiting the collection of publicly available information. However, there are several things to keep in mind to avoid legal implications.

Here are several tips on how to scrape Google Flights data ethically:

  • Comply with Google’s terms of use. Take the time to go over Google’s terms of service to make sure you don’t violate any of their guidelines.
  • Read the robots.txt file. The file gives instructions to robots (such as scrapers) about which areas they can and cannot access (e.g., admin panels, password-protected pages). Be respectful and follow the given commands.

How to Scrape Google Flights with Python: Step-by-Step Guide

If you’re looking to build your own Google Flights scraper, here’s a comprehensive guide on how to do so from scratch.

In this example, we’ll use Python with Selenium to build the scraper. Python is a great choice due to its straightforward syntax – it’s relatively easy to write, maintain, and understand. Additionally, since Google Flights is a highly dynamic website, we’ll use Selenium to handle dynamic content and interactive elements, such as buttons.

Below is a table containing all information about the scraper we’re going to build.

Programming language: Python
Libraries: Selenium
Target URL: https://www.google.com/travel/flights/
Data to scrape:

  1. Departure date from the origin location
  2. Return date from the destination
  3. Operating airline
  4. Departure time
  5. Arrival time
  6. Flight duration
  7. Departure airport
  8. Arrival airport
  9. Layovers
  10. Cost of the trip
  11. Best offer

How to save data: CSV file

Prerequisites

Before the actual scraping begins, you’ll need to install the prerequisites. 

  1. Install Python. You can download the latest version from Python's official website. If you're not sure whether Python is installed on your computer, check by running python --version in your terminal (Terminal on MacOS or Command Prompt on Windows).
  2. Install Selenium. To use Selenium with Python for this scraper, install it by running pip install selenium in the Terminal.
  3. Install the Chrome WebDriver. Selenium controls browsers such as Chrome through a driver, and it can also run them headlessly. Download the Chrome WebDriver version that corresponds to your Chrome browser.
  4. Get a text editor. You'll need a text editor to write and execute your code. There's one preinstalled on your computer (TextEdit on MacOS or Notepad on Windows), but you can opt for a third-party editor, like Visual Studio Code, if you prefer.

Importing the Libraries

Once all your tools are installed, it's time to import the necessary libraries. Since we'll be using Python with Chrome, make sure the Chrome WebDriver is available on the system PATH, and then import Selenium's WebDriver module so the browser can be controlled from the script.

Step 1. Import Webdriver from Selenium module.

				
					from selenium import webdriver

				
			

Step 2. Then, import the By selector module from Selenium to simplify element selection.

				
					from selenium.webdriver.common.by import By

				
			

Step 3. Import all necessary Selenium modules before moving on to the next steps.

				
					from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from selenium.webdriver import Keys, ActionChains

				
			

Step 4. We want to save our results into a CSV file, so let’s import the CSV module, too.

				
					import csv

				
			

Setting Up Global Variables and Parameters

After importing all the necessary libraries, we need to set up global variables to store key values. These include the target URL, a timeout (to accommodate page loading time), and any specific parameters.

Step 5. So, let’s set up global variables.

				
					start_url = "https://www.google.com/travel/flights"
timeout = 10 #seconds

				
			

Step 6. Next, set up the parameters for the scraper – specifically, the criteria you’re looking for in the flights. These include departure and arrival locations, as well as travel dates.

				
					my_params = {
    'from': 'LAX',
    'to': 'Atlanta',
    'departure': '2024-11-27',
    'return': '2024-12-03'
}
				
			

Note: You can define parameters for one-way flights, too.

				
					my_params = {
    'from': 'LAX',
    'to': 'Atlanta',
    'departure': '2024-11-27',
    'return': None
}

				
			

When browsing Google Flights, you don't need to specify the exact airport for departure or arrival – you can simply enter a city (or even a country) instead, because the site's auto-complete feature simplifies location input by suggesting relevant options. For example, typing Los will display suggestions that match the input – LOS airport in Nigeria, Los Angeles in the U.S., or Los Cabos in Mexico.

You can edit these values as you see fit – your ‘from’ value can be set to ‘Los Angeles’, and the scraper will target any airport in Los Angeles for departure. You can also specify a different airport, like ‘JFK’ or change the dates completely. But, for the sake of this example, let’s use LAX for departure and any airport in Atlanta for arrival.
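For instance, a parameter set like the one below would search for flights from any Los Angeles airport to JFK. The dates here are only placeholders, so swap in whatever values suit your trip:

my_params = {
    'from': 'Los Angeles',      # any airport in Los Angeles will match
    'to': 'JFK',                # a specific destination airport
    'departure': '2024-12-10',  # placeholder dates - change them to your own
    'return': '2024-12-20'
}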

Setting Up the Browser

Step 7. Before we start scraping with Selenium, you need to prepare the browser. As mentioned earlier, we’ll be using Chrome in this example.

				
def prepare_browser() -> webdriver:
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--disable-gpu')
    driver = webdriver.Chrome(options=chrome_options)
    return driver

				
			

Note: This browser setup will allow you to see the scraping in action. However, you can add an additional chrome_options line to run Chrome in headless mode.

				
def prepare_browser() -> webdriver:
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--disable-gpu')
    chrome_options.add_argument('--headless=new')
    driver = webdriver.Chrome(options=chrome_options)
    return driver

				
			

Step 8. It’s also important to set up the main() function. It calls the prepare_browser function, which returns a Chrome driver. Additionally, we need to instruct the driver to execute the scraping, and close when it’s finished.

				
def main() -> None:
    driver = prepare_browser()
    scrape(driver)
    driver.quit()

if __name__ == '__main__':
    main()

				
			

Scraping Google Flights

When the browser is prepared, we can actually start scraping the results from the Google Flights page. 

Handling Cookies on Google Flights with Python

While the start_url is the Google Flights main page, the scraper might bump into a cookie consent page first. Hence, we need to instruct our scraper to handle it.

Step 9. Let’s provide the scraper with some extra information to handle the cookie consent page. Namely, find and copy the CSS selectors for the “Accept” button. We can do this by using Inspect Element.

cookie consent button in google flights
Cookie consent button on Google Flights page

If the scraper successfully clicks the "Accept" button on the cookie consent page, we'll still need to wait until the actual Flights page loads. In this example, we're using the "Search" button to indicate that our target page has loaded.

Step 10. Using the search button’s CSS selector, instruct the scraper to wait for it to appear before moving on to the next step. So, let’s write a function that will print “Search button found, continuing.” if everything went well, and “Something went wrong.” if the scraper couldn’t locate said button.

Search button in Google Flights

Here's what the function for accepting cookies and locating the "Search" button looks like:

				
def scrape(driver: webdriver) -> None:
    driver.get(start_url)
    if driver.current_url.startswith("https://consent.google.com"):
        print("Hit the consent page, handling it.")
        btn_consent_allow = driver.find_element(
            By.CSS_SELECTOR, 'button.VfPpkd-LgbsSe[jsname="b3VHJd"]')
        btn_consent_allow.click()
        try:
            WebDriverWait(driver, timeout).until(
                expected_conditions.presence_of_element_located((
                    By.CSS_SELECTOR, 'button[jsname="vLv7Lb"]'))
            )
            print("Search button found, continuing.")
            prepare_query(driver)
        except Exception as e:
            print(f"Something went wrong: {e}")

				
			

Continuing in the scrape() function, let's add some code instructing the scraper to locate and click the "Search" button, and print "Got the results back." when the scraping is finished.

				
        search_btn = driver.find_element(By.CSS_SELECTOR, 'button[jsname="vLv7Lb"]')
        search_btn.click()
        try:
            WebDriverWait(driver, timeout).until(
                expected_conditions.presence_of_element_located((
                    By.CSS_SELECTOR, 'ul.Rk10dc'))
            )
            print("Got the results back.")

				
			

Preparing the Search Query

At the beginning of our script, we defined our parameters: origin location ('from'), destination ('to'), a date for departure ('departure'), and a date for return ('return'). These parameters will help the scraper fill in the query fields. To allow scrape() to function properly, we need to instruct the scraper on how it should prepare the search query.

Step 11. While we have our values ready, the scraper needs to know where to use them. For that, we’ll need to find and copy another set of CSS selectors for “Where from?”, “Where to?”, and date fields.

How to find the CSS selector for the Google Flights "Where to?" field

We also need to prepare our scraper for two potential date_to options: the return date may be defined in my_params, or it may not be.

If it's None, we'll also need to change the selection from Round trip to One-way in the dropdown menu, so we'll need a CSS selector for that menu as well.

Dropdown menu CSS selector
Finding the CSS selector for the dropdown menu

Step 12. Instruct the scraper about how it should fill in the “Where from?”, “Where to?”, and date fields.

				
def prepare_query(driver) -> None:
    field_from = driver.find_element(By.CSS_SELECTOR, 'div.BGeFcf div[jsname="pT3pqd"] input')
    field_to = driver.find_element(By.CSS_SELECTOR, 'div.vxNK6d div[jsname="pT3pqd"] input')
    date_from = driver.find_element(By.CSS_SELECTOR, 'div[jsname="I3Yihd"] input')
    date_to = None
    if my_params['return'] is None or my_params['return'] == '':
        dropdown = driver.find_element(By.CSS_SELECTOR, 'div.VfPpkd-aPP78e')
        dropdown.click()
        ActionChains(driver)\
            .move_to_element(dropdown)\
            .send_keys(Keys.ARROW_DOWN)\
            .pause(1)\
            .send_keys(Keys.ENTER)\
            .perform()



				
			

The if statement checks whether a return date is defined in my_params. If it is, we need to find a CSS selector for the return date field instead of changing the value in the dropdown menu. The scraper will then fill in the form using data from my_params.

				
    else:
        date_to = driver.find_element(By.CSS_SELECTOR,
            'div.K2bCpe div[jsname="CpWD9d"] input')
    field_input(driver, field_from, my_params['from'])
    field_input(driver, field_to, my_params['to'])
    field_input(driver, date_from, my_params['departure'])
    if date_to is not None:
        field_input(driver, date_to, my_params['return'])
    ActionChains(driver)\
        .send_keys(Keys.ENTER)\
        .perform()
    print("Done preparing the search query")

				
			

Step 13. Once all the fields we need to fill in are defined, instruct the scraper to enter the information into the selected fields.

We'll use ActionChains to send the text that needs to be typed in. Additionally, let's instruct the scraper to press Enter so that the first suggested option from the dropdown is selected for each field.

				
def field_input(driver, element, text) -> None:
    element.click()
    ActionChains(driver)\
        .move_to_element(element)\
        .send_keys(text)\
        .pause(1)\
        .send_keys(Keys.ENTER)\
        .perform()

				
			

Note: In Step 10, we instructed the scraper to click on the “Search” button to run this search query.

Returning the Results

If you check the Google Flights page source, you’ll notice that the results come in an unordered list, where one list item contains all the information about a single trip – the dates, times, price, layovers, and more. When browsing the page, each list item should look something like this:

Flight result example
One flight result is one list item

Step 14. If we want these results to sit neatly in a table when we save them, we need to store them in a dictionary. To do this, we need to collect the CSS selectors for each element in the result.

				
					def get_flight_info(element, best) -> dict:

				
			

Let's begin with flight times. The departure time (times[0]) will be stored as time_leave, and the arrival time (times[1]) as time_arrive.

Finding CSS selectors for flight times on Google Flights results
				
					times = element.find_elements(By.CSS_SELECTOR, 
        'div.Ir0Voe span[role="text"]')

				
			

Let’s do the same thing with airports.

				
					airports = element.find_elements(By.CSS_SELECTOR, 
        'div.QylvBf span span[jscontroller="cNtv4b"]')

				
			

And the rest of the provided information – airlines, layovers, cost, and suggested best result.

				
					flight_info = {
        'airline': element.find_element(By.CSS_SELECTOR, 
            'div.Ir0Voe > div.sSHqwe.tPgKwe.ogfYpf > span:last-child').text,
        'date_leave': my_params['departure'], #This will be filled in from my_params
        'date_arrive': my_params['return'], #This will also be filled from my_params
        'time_leave': times[0].text,
        'time_arrive': times[1].text,
        'duration_string': element.find_element(By.CSS_SELECTOR, 
            'div.Ak5kof > div.gvkrdb.AdWm1c.tPgKwe.ogfYpf').text,
        'airport_leave': airports[0].text,
        'airport_arrive': airports[1].text,
        'layovers': element.find_element(By.CSS_SELECTOR, 
            'div.BbR8Ec > div.EfT7Ae.AdWm1c.tPgKwe > span').text,
        'cost': element.find_element(By.CSS_SELECTOR,
            'div.U3gSDe div div.YMlIz.FpEdX span').text,
        'best': best #True for flights from the suggested best list, or False for everything else
    }
    return flight_info

				
			

Extracting and Parsing the Page Data

Google Flights has a neat feature that provides you with the best results (the shortest flight duration, fewest layovers, the cheapest flight), as well as all available results based on your query. You may not like the suggested best results, so let's save both the best and all remaining results; the scraper collects both lists into list_elems.

Step 15. Let's join these two lists, and return them as a single list under one name – list_of_flights.

				
def find_lists(driver):
    list_elems = driver.find_elements(By.CSS_SELECTOR, 'ul.Rk10dc')
    list_of_flights = parse(list_elems[0], True) + parse(list_elems[1], False)
    return list_of_flights
				
			

It’s important to parse the downloaded page to collect only the necessary information – in this case, the flight lists. As mentioned before, we have two of them – the best results list and the rest. But we don’t want them to be separated in our final saved list of all flights. 

Step 16. Let’s parse our page data. The list_of_flights will contain the final results. 

				
def parse(list_elem: list, best: bool) -> list:
    list_items = list_elem.find_elements(By.CSS_SELECTOR, 'li.pIav2d')
    list_of_flights = []
    for list_item in list_items:
        list_of_flights.append(get_flight_info(list_item, best))
    return list_of_flights


				
			

Saving the Output to CSV

At the very beginning, we imported the CSV library to save our data. 

Step 17. Let's add a few extra lines of code so that all the flight information we previously defined in our dictionary is saved to a CSV file.

				
					def write_to_csv(flights):
    field_names = ['airline','date_leave','date_arrive','time_leave',
                   'time_arrive','duration_string','airport_leave',
                   'airport_arrive','layovers','cost','best']
    output_filename = 'flights.csv'
    with open (output_filename, 'w', newline='', encoding='utf-8') as f_out:
        writer = csv.DictWriter(f_out, fieldnames = field_names)
        writer.writeheader()
        writer.writerows(flights)

				
			
Parsed results saved in CSV file
Parsed results saved in a CSV file opened with Numbers (Mac)
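If you want a quick sanity check that the file was written correctly, you can read it back with the same csv module. This is an optional snippet, assuming the script above has already produced flights.csv:

import csv

# Print the first few saved flights to verify the CSV output.
with open('flights.csv', newline='', encoding='utf-8') as f_in:
    for i, row in enumerate(csv.DictReader(f_in)):
        print(row['airline'], row['time_leave'], '-', row['time_arrive'], row['cost'])
        if i == 4:
            break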

Here’s the entire script for this Google Flights scraper:

				
					from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from selenium.webdriver import Keys, ActionChains
import csv

start_url = "https://www.google.com/travel/flights"
timeout = 10

my_params = {
    'from': 'LAX',
    'to': 'Atlanta',
    'departure': '2024-11-27',
    'return': '2024-12-03'
}
my_params2 = {
    'from': 'LAX',
    'to': 'Atlanta',
    'departure': '2024-11-27',
    'return': None
}

def prepare_browser() -> webdriver:
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--disable-gpu')
    driver = webdriver.Chrome(options=chrome_options)
    return driver

def field_input(driver, element, text) -> None:
    element.click()
    ActionChains(driver)\
    .move_to_element(element)\
    .send_keys(text)\
    .pause(1)\
    .send_keys(Keys.ENTER)\
    .perform()

def prepare_query(driver) -> None:
    field_from = driver.find_element(By.CSS_SELECTOR, 'div.BGeFcf div[jsname="pT3pqd"] input')
    field_to = driver.find_element(By.CSS_SELECTOR, 'div.vxNK6d div[jsname="pT3pqd"] input')
    date_from = driver.find_element(By.CSS_SELECTOR, 'div[jsname="I3Yihd"] input')
    date_to = None
    if my_params['return'] is None or my_params['return'] == '':
        dropdown = driver.find_element(By.CSS_SELECTOR, 'div.VfPpkd-aPP78e')
        dropdown.click()
        ActionChains(driver)\
            .move_to_element(dropdown)\
            .send_keys(Keys.ARROW_DOWN)\
            .pause(1)\
            .send_keys(Keys.ENTER)\
            .perform()
    else:
        date_to = driver.find_element(By.CSS_SELECTOR, 'div.K2bCpe div[jsname="CpWD9d"] input')
    field_input(driver, field_from, my_params['from'])
    field_input(driver, field_to, my_params['to'])
    field_input(driver, date_from, my_params['departure'])
    if date_to is not None:
        field_input(driver, date_to, my_params['return'])
    ActionChains(driver)\
        .send_keys(Keys.ENTER)\
        .perform()
    print("Done preparing the search query")

def get_flight_info(element, best) -> dict:
    times = element.find_elements(By.CSS_SELECTOR, 'div.Ir0Voe span[role="text"]')
    airports = element.find_elements(By.CSS_SELECTOR, 'div.QylvBf span span[jscontroller="cNtv4b"]')
    flight_info = {
        'airline': element.find_element(By.CSS_SELECTOR, 'div.Ir0Voe > div.sSHqwe.tPgKwe.ogfYpf > span:last-child').text,
        'date_leave': my_params['departure'],
        'date_arrive': my_params['return'],
        'time_leave': times[0].text,
        'time_arrive': times[1].text,
        'duration_string': element.find_element(By.CSS_SELECTOR, 'div.Ak5kof > div.gvkrdb.AdWm1c.tPgKwe.ogfYpf').text,
        'airport_leave': airports[0].text,
        'airport_arrive': airports[1].text,
        'layovers': element.find_element(By.CSS_SELECTOR, 'div.BbR8Ec > div.EfT7Ae.AdWm1c.tPgKwe > span').text,
        'cost': element.find_element(By.CSS_SELECTOR, 'div.U3gSDe div div.YMlIz.FpEdX span').text,
        'best': best
    }
    return flight_info

def parse(list_elem: list, best: bool) -> list:
    list_items = list_elem.find_elements(By.CSS_SELECTOR, 'li.pIav2d')
    list_of_flights = []
    for list_item in list_items:
        list_of_flights.append(get_flight_info(list_item, best))
    return list_of_flights

def find_lists(driver):
    list_elems = driver.find_elements(By.CSS_SELECTOR, 'ul.Rk10dc')
    list_of_flights = parse(list_elems[0], True) + parse(list_elems[1], False)
    return list_of_flights

def write_to_csv(flights):
    field_names = ['airline', 'date_leave', 'date_arrive', 'time_leave',
                   'time_arrive', 'duration_string', 'airport_leave',
                   'airport_arrive', 'layovers', 'cost', 'best']
    output_filename = 'flights.csv'
    with open(output_filename, 'w', newline='', encoding='utf-8') as f_out:
        writer = csv.DictWriter(f_out, fieldnames=field_names)
        writer.writeheader()
        writer.writerows(flights)

def scrape(driver: webdriver) -> None:
    driver.get(start_url)
    if driver.current_url.startswith("https://consent.google.com"):
        print("Hit the consent page, dealing with it.")
        btn_consent_allow = driver.find_element(By.CSS_SELECTOR, 'button.VfPpkd-LgbsSe[jsname="b3VHJd"]')
        btn_consent_allow.click()
        try:
            WebDriverWait(driver, timeout).until(
                expected_conditions.presence_of_element_located((
                    By.CSS_SELECTOR, 'button[jsname="vLv7Lb"]'))
            )
            print("Search button found, continuing.")
            prepare_query(driver)
        except Exception as e:
            print(f"Something went wrong: {e}")
        search_btn = driver.find_element(By.CSS_SELECTOR, 'button[jsname="vLv7Lb"]')
        search_btn.click()
        try:
            WebDriverWait(driver, timeout).until(
                expected_conditions.presence_of_element_located((
                    By.CSS_SELECTOR, 'ul.Rk10dc'))
            )
            print("Got the results back.")
            flights = find_lists(driver)
            write_to_csv(flights)
        except Exception as e:
            print(f"Something went wrong: {e}")

def main() -> None:
    driver = prepare_browser()
    scrape(driver)
    driver.quit()

if __name__ == '__main__':
    main()

				
			

Avoiding the Roadblocks When Scraping Google Flights

Building a Google Flights scraper can be a pretty daunting task, especially if you're new to scraping, and it becomes even more difficult if you plan to scrape a lot. While we have already solved issues like the cookie consent page, other problems can arise if you're scraping at scale.

Use Proxies to Mask Your IP

Websites don’t like bot traffic, so they try to prevent it by using tools like Cloudflare. While scraping the Google Flights page once or twice probably won’t get you rate-limited or banned, it can happen if you try to scale. 

To prevent that, you can use proxy services – they will mask your original IP by routing the requests through different IP addresses, making them blend in with regular human traffic. Typically, human traffic comes from residential IPs, so this type of proxy is the least likely to be detected and blocked.

This is a step-by-step guide on how to set up and authenticate a proxy with Selenium using Python.
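As a rough illustration, an unauthenticated proxy can be passed to Chrome through ChromeOptions. The address below is just a placeholder (use your provider's endpoint), and authenticated proxies usually require a browser extension or a helper library instead:

from selenium import webdriver

def prepare_browser_with_proxy(proxy_address: str) -> webdriver.Chrome:
    # proxy_address is a placeholder like "123.45.67.89:8000" - use your provider's endpoint
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument(f'--proxy-server=http://{proxy_address}')
    chrome_options.add_argument('--disable-gpu')
    return webdriver.Chrome(options=chrome_options)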

Use the Headless Browser Mode

The Google Flights page is a dynamic website that heavily relies on JavaScript – not only for loading data, but also for anti-bot measures. Running your scraper in headless Chrome mode allows it to render JavaScript like a regular user's browser would, and even to modify the browser fingerprint.

A browser fingerprint is a collection of parameters like screen resolution, timezone, IP address, JavaScript configuration, and more, that vary slightly from user to user. Headless browsers can mimic these parameters to appear more human-like, reducing the risk of detection.

Step 7 in Setting Up the Browser gives two examples of how to set up Chrome for scraping, one of them containing this line of code: chrome_options.add_argument("--headless=new")

Adding this chrome_option will run the browser in headless mode. You may not want to use it now, but it’s good to know how to enable it if necessary. 
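For reference, here's what the browser setup from Step 7 might look like with headless mode and a fixed window size combined – a variation of the earlier function, not a new requirement:

from selenium import webdriver

def prepare_headless_browser() -> webdriver.Chrome:
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--headless=new')            # run Chrome without a visible window
    chrome_options.add_argument('--disable-gpu')
    chrome_options.add_argument('--window-size=1920,1080')   # mimic a typical desktop resolution
    return webdriver.Chrome(options=chrome_options)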

Be Aware of Website’s Structural Changes

This Google Flights scraper relies heavily on CSS selectors – they help to find the specific input fields and fill them in. However, if Google changes the Flights page, the scraper might break. That’s because the CSS selectors can shift when a site developer makes adjustments to the HTML structure. 

If you plan to use this Google Flights scraper regularly, keep in mind that selectors can change over time, and you’ll need to update them to keep the scraper functional.
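One way to make those updates less painful is to keep the selectors in a single place and wrap lookups so a missing element produces a readable warning instead of a crash. This is only a sketch layered on top of the script above, not part of it:

from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

# Keeping selectors together makes them easier to update when the page layout changes.
SELECTORS = {
    'search_button': 'button[jsname="vLv7Lb"]',
    'results_list': 'ul.Rk10dc',
}

def find_or_warn(driver, name: str):
    # Return the element for a named selector, or None with a warning if it no longer matches.
    try:
        return driver.find_element(By.CSS_SELECTOR, SELECTORS[name])
    except NoSuchElementException:
        print(f"Selector for '{name}' no longer matches - the page layout may have changed.")
        return None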

Conclusion

Scraping Google Flights with Python is no easy feat, especially for beginners, but it offers a great deal of information useful for travelers and businesses alike. Despite the project's difficulty, this data can not only help plan a trip but also support market intelligence, trend analysis, and a better understanding of customer needs.

Isabel Rivera
Caffeine-powered sneaker enthusiast

How to Use Inspect Element for Web Scraping

An easy-to-follow guide on how to use this browser feature.


Modern, interactive websites are intuitive and user-friendly, but have you ever wondered how much valuable data they hold? If you know how to use Inspect Element, you can reveal the underlying HTML and CSS structure that is crucial for web scraping.

Inspect element browser feature on a desktop

Developers and designers spend a lot of time making websites visually appealing and easy to use. However, underneath all that user experience stands the source code – the backbone structure that makes it all work. Thanks to Inspect Element, we, as users, can peek inside and see how websites are made.

Using Inspect Element has several benefits, especially if you're a web designer, developer, marketer, or someone building a custom web scraper. The tool allows you to look into how elements function, check how they appear on different screen resolutions, get data from the source code for scraping purposes, and more. So, let's dig deeper into how to use Inspect Element.

What Is Inspect Element?

Inspect Element is a web browser feature that lets you peek into the HTML structure of the website. It’s built into most modern desktop browsers, such as Chrome, Firefox, or Safari, and is a part of Developer Tools – a set of tools that allow developers to test, troubleshoot, and experiment with websites. 

By using Inspect Element, you can view and temporarily edit HTML, CSS, and JavaScript code, monitor a website's performance, as well as find layout information about specific elements (i.e., the dimensions and margins of a button).

Inspect Element can also be useful for web scraping. When building a scraper, you’ll likely need it to extract only the relevant data points rather than the entire HTML structure. The feature allows you to find and collect specific CSS selectors, metadata, or hidden elements that you want to scrape.

Why Use Inspect Element?

The Inspect Element feature makes web designers' and developers' lives easier by allowing them to temporarily modify element features in real time. Basically, it lets them test bug fixes and experiment with new features before making actual changes to the website.

The feature can also be useful for web scraping:

  • It can help you collect specific CSS selectors or other data points for your scraper.
  • Inspect Element (specifically, its Network panel) also helps you check whether your scraper's request headers look natural enough to avoid being blocked. The headers used by scrapers often look different from a regular user's, so you can compare them and make changes accordingly – see the sketch below.
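For example, here is a minimal Python sketch that sends a request with browser-like headers and prints what was actually sent, so you can compare it against the Network panel. The URL and header values are purely illustrative:

import requests

# Illustrative header values modeled on what a regular desktop browser sends.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}

response = requests.get("https://example.com", headers=headers)
print(response.request.headers)  # compare this with the request headers shown in the Network panel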
 

Inspect Element is handy for marketers and SEO specialists.

  • By reviewing the HTML and CSS code, marketing specialists can check if the website’s content is up to par. You can see if the colors and fonts follow the brand guidelines, check link information (a follow or a nofollow link), review if the website’s content looks good on desktop, responsive, and mobile views, and more. 
 

Website users can also find ways to benefit from it.

  • If a website makes it hard to download certain media, you can use Inspect Element to find the source URL, and download it from there.
  • You can also have some fun – try using Inspect Element to rewrite this entire article by changing the HTML element values.
 

There are other use cases for Inspect Element, too. Sometimes, you can bypass a paywall by modifying the source code to gain full access to the site. However, we discourage you from using Inspect Element for unethical purposes – not only can you violate the website's terms of use, but you may also face legal consequences.

Inspect Element use cases

How to Use Inspect Element: A Detailed Explanation

Using Inspect Element is quite simple. There are several different tabs in the tool that you can explore, but you'll get the most use out of these:

  • Elements panel allows inspecting and modifying HTML and CSS code.
  • Console panel allows inspecting and debugging JavaScript.
  • Sources panel allows inspecting and modifying CSS or JavaScript source files (pop-ups, analytical files).
  • Network panel allows monitoring HTTP requests.
  • Performance panel allows analyzing a website's performance – load and download speeds.
  • Application panel allows managing cookies, local storage, and other session data.
 

Keep in mind that these panels have different names in each browser. For example, the “Elements panel” in Chrome and Safari is called “Elements,” while Firefox calls it “Inspector.” However, the content there remains the same.

If it’s your first time using Inspect Element, you might want to use it on the same website with different browsers to familiarize yourself with the tool’s terms.

The easiest way to access the HTML code is to right-click any element on the website and select “Inspect” or “Inspect element.” However, it varies slightly across different browsers and operating systems. Let’s break them down.

Google Chrome & Mozilla Firefox

There are three ways to open Inspect Element on Google Chrome and Mozilla Firefox: 1) right-clicking to inspect, 2) using a shortcut, or 3) opening Developer Tools through the browser’s menu bar.

  • Using right-click: use your mouse or a trackpad, right-click any element on the website. Then, select “Inspect.”
  • Using a shortcut: you can use a keyboard shortcut to open Inspect Element. On MacOS, press command + option + i. On Windows, press Ctrl + Shift + i. Additionally, you can use F12 to open Developer Tools on both Windows and MacOS (sometimes you might need to press the Fn key as well).
  • Using the browser’s menu: locate the kebab menu (three stacked dots) on Chrome or hamburger menu (three stacked lines) on Firefox. Typically, it’s located on the right side of the URL bar. Find “More tools,” and then select “Developer Tools.”

Safari

If you’re a Windows user, you can’t use the latest versions of Safari, so opening Inspect Element on this browser is only possible on MacOS. The process is also significantly different than on Chrome or Firefox. 

Firstly, you need to enable Developer Tools in Safari settings. To do so, you can either press command and , (comma button), or in the Menu bar on top of your device’s screen (the bar with an Apple logo), click on Safari, and Settings… In the settings panel, navigate to the “Advanced” bar. There, check the “Show features for web developers” box. Now you’re able to open Inspect Element.

  • Using right-click: right-click any element on the website, and press “Inspect Element”
  • Using a shortcut: press command + option + i on your keyboard to open Inspect element. Safari doesn’t use the F12 key shortcut by default, but you can manually set it up through keyboard settings.
  • Using a menu bar on your device: once you have a page on Safari open, you can access various settings on the menu bar on the top of your screen (make sure Safari screen is active). There, you can press on “Develop” and click on “Show Web Inspector”.

Mobile devices (iOS and Android)

There’s no easy way to use Inspect Element on iOS or Android as most of the time it still requires a desktop. However, there are a few methods you can try.

Remote Debugging

You can try remote debugging using Chrome on Android or Safari on iOS. Remote debugging lets you inspect pages loaded on your phone from your computer. This method requires a computer, but it's the most functional option.

On an Android device: 

  1. Enable developer options. In Settings, click About phone, and tap Build number seven times. 
  2. Then, connect your mobile device to your computer via USB. 
  3. Open Chrome on your computer, and go to chrome://inspect. There, you’ll see your connected device. 
  4. Now you can go to your target website, and open Inspect Element. This way, you can inspect the mobile view on your desktop.

 

On an iOS device, the process is similar: 

  1. On your mobile device, go to Settings, and find Safari. 
  2. When Safari settings open, find Advanced, and toggle the Web Inspector there. 
  3. Then, connect your iOS device with your Mac, and open Safari on your computer. Go to Develop in the Menu bar, and select your iOS device.
  4. Now, go to your target website, and open Inspect Element.

Using Browser Developer Tools

Some mobile browsers also have an option to “Request Desktop Site”, which gives a layout similar to a desktop one, so you can do some basic inspection. The feature is helpful if you don’t have access to a computer but need to check different layouts of a website (mobile & desktop). That said, you still won’t be able to open Inspect Element.

You can also see how websites adapt to screen resolutions on a desktop. Many browsers have a “Responsive Design Mode” that can help to see if a website’s content looks good on different screen parameters. You can open Inspect Element while using the feature, but it requires a computer.

Using Third-Party Apps

There are some third-party applications for element inspection – you can download them for your iOS and Android devices. However, they can be pricey, and their functionality might be limited compared to using Developer Tools on a desktop.

Inspect Element Playground

Now that you know what Inspect Element is and how to use it, you can try putting your knowledge into practice. While some changes are very easy (e.g., changing the text), others might require a basic understanding of HTML and CSS. If you don't know how to use them, there are various free HTML, CSS, and JavaScript tutorials online.

Inspect Element Playground for Web Scraping

If you're looking to use Inspect Element for scraping purposes, you'll likely need to copy some selectors from the target website – they help the scraper extract only what's relevant to you. Seasoned developers will know how to find these selectors and write the scraping logic, but if you're new to this, ChatGPT can help with scraping, too.

In the playground below, you’ll find a table with terms and their descriptions. Use Inspect Element to find CSS selectors for them (hovering over code with your mouse will highlight the area that piece of code defines). 

Once you do, you can write a simple Python scraper to scrape all terms and descriptions from the table below, and save them in a CSV format.

Term – Description

Proxy server – A proxy server is a middleman between the client and the server (website). It masks the user's original IP address, increasing security and privacy, and allows bypassing geo-restrictions.
Web browser – A web browser is software that allows users to access and view websites on the internet.
IP address – An IP address is a unique string of numbers given to any device connected to the internet. IPs help identify devices on the internet.
Web scraper – A web scraper is software that automatically extracts data from websites. You can build a web scraper yourself using a language like Python or choose a pre-made one.
Finding the table and copying its CSS selector using Inspect Element
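For instance, a minimal sketch of such a scraper might look like the one below. The URL and selectors are assumptions – replace them with the page you're actually inspecting and the selectors you copied with Inspect Element:

import csv

import requests
from bs4 import BeautifulSoup

# Placeholder URL - use the address of the page that contains the table you inspected.
url = "https://example.com/glossary"

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

rows = []
for tr in soup.select("table tr"):                      # every row of the table
    cells = [td.get_text(strip=True) for td in tr.select("td")]
    if len(cells) == 2:                                 # keep only term/description pairs
        rows.append(cells)

with open("glossary.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Term", "Description"])
    writer.writerows(rows)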

Other Easy Ideas to Test Inspect Element

Below you’ll find several typical HTML elements that can be manipulated with Inspect Element – headings, tables, images. Here are some ideas for you to try out.

Idea No.1: Modify the size of the headings using the Inspect Element tool.

  1. Locate the <h2> heading in the Inspect Element section.
  2. The code should look like this: <h2> ... </h2>.
  3. Double-click on the <h2> tag and change the number (2) to any number between 1 and 6.
  4. Observe how the website updates as the heading size changes.

Idea No.2: Change the content inside paragraphs.

  1. Locate the <p> tag that contains the text: “This is a paragraph.”
  2. Double-click on the paragraph text.
  3. Replace the text with your own sentence or phrase.

Idea No.3: Add more rows and columns to the table.

  1. Locate the table in the source code. Table rows are defined by <tr> tags, and columns by <td> tags.
  2. Add a new <tr> tag to create a new row.
  3. Inside the new row, use multiple <td> tags to add columns.
  4. Experiment with changing the width of each <td> by adjusting its attributes with CSS (e.g., 'width="100px"').

Idea No.4: Customize the table’s background color using CSS.

  1. Find the <table> tag in the source code.
  2. Inside the "style" attribute, which already contains properties like "width" and "border-collapse", add a new rule: background-color: Tomato;.
  3. Make sure all properties remain inside the quotation marks, like this: style="border-collapse: collapse; width: 100%; background-color: Tomato;".
  4. Check how the background color changes.

Idea No.5: Replace an image using a different URL.

  1. Locate the <img> tag in the source code.
  2. Find the “src” attribute, which contains the current image source (URL).
  3. Double-click the URL, delete it, and replace it with the link to any other image.
  4. The new image should now appear in place of the original one.

THIS IS AN H2 HEADING

This Is an H3 Heading

This is a paragraph.

Cell 1 | Cell 2
Content | Content
More content | More content
A step-by-step-guide on how to temporarily replace an image with Inspect Element
A step-by-step-guide on how to temporarily replace this image with Inspect Element

Conclusion

The Inspect Element feature is an excellent tool for web developers, designers, marketers, and regular web users alike. It’s also a beneficial feature for people who build their own web scrapers – Inspect Element allows exploring, finding, and collecting necessary data points for scrapers from the raw HTML structure of the website.

Using Inspect Element is easiest with a desktop, but some basic inspection is also available on mobile devices. While the process slightly varies on different browsers and operating systems, right-clicking and selecting “Inspect” is the simplest way to see the HTML structure of sites and find relevant data pieces for scrapers.

Getting acquainted with Inspect Element and Developer Tools, in general, can provide insights into a website’s performance, help test and troubleshoot web scrapers, fix bugs, and improve SEO performance, as well as provide information about how websites behave.

Isabel Rivera
Caffeine-powered sneaker enthusiast

ChatGPT for Web Scraping: A Step-by-Step Guide

Let's see how to use ChatGPT for web scraping efficiently.


Since its release in 2022, ChatGPT has become a popular go-to tool for all kinds of queries. The chatbot can help you brainstorm ideas, write and edit text or code, and much more. So, it only makes sense to use ChatGPT for web scraping, too.

How to use ChatGPT for web scraping

Search engines like Google are a huge help if you are looking for information, but it can take ages to find the right answer. ChatGPT, on the other hand, simplifies this process. It’s trained on large amounts of data, and it can summarize, provide understandable answers to complex questions, and respond to follow-up questions, which makes it great for many tasks, web scraping included.

While not perfect, ChatGPT can write simple code, as well as explain the logic behind it. It’s an excellent supporting tool for beginners trying to learn or seasoned scraping enthusiasts looking to improve and test their skills. So, let’s learn all the basics about web scraping with ChatGPT and try to build a simple web scraper.

What Is ChatGPT Web Scraping?

ChatGPT web scraping is the process of automatically gathering data from websites using code written with ChatGPT's help. But if ChatGPT is a chatbot, what exactly does it do in this equation? Basically, you can ask ChatGPT to build you a scraper for a specific website you want data from. All you need to do is give instructions, collect certain data points for ChatGPT to use, and run the code.

How ChatGPT Can be Used for Web Scraping Tasks

ChatGPT itself isn’t a scraper and it can’t scrape websites directly, but it can help you write a simple scraper (or improve an old one) based on the instructions you give. However, while you don’t need a lot of knowledge about web scraping or writing code yourself, knowing how to create a good prompt is useful.

ChatGPT query for scraping a website
ChatGPT can’t scrape a website directly, but it can write you a scraper.

Scraping with ChatGPT: A Step-by-Step Guide

Let’s see how we can use ChatGPT for web scraping. In this step-by-step guide, we’ll use a website that loves to be scraped – https://books.toscrape.com/.

Scraping sandbox website
Books.toscrape.com website

This website is a sandbox environment specifically designed for practicing your skills and testing scrapers. It's a safe place to run different scraping experiments without worrying about violating the Terms of Service or a robots.txt file. There are other sandbox websites similar to books.toscrape.com, but for the sake of this experiment, let's use this one.

We prepared a list of sandboxes just for you.

Let's create a Python scraper that extracts book titles and their prices from the page. We'll be using Python because it's a beginner-friendly programming language with a simple syntax that excels in tasks like web scraping. We'll also use two Python libraries to assist, namely Requests for handling HTTP requests and Beautiful Soup for parsing (cleaning) the extracted data.

We'll save the extracted data in CSV – a simple text format that allows you to open, edit, manipulate, and move the data later on.

Step 1: Install the Preliminaries

Before jumping straight into web scraping, there are a couple of things you need to do first.

  • Create a ChatGPT account. ChatGPT requires users to log in before using the service. So first, log in to your existing account or, if you don’t have one yet, click “Sign up”, and create it.
  • Get a text editor. To execute the code, you’ll need a text editor. There’s one already pre-installed on your computer (TextEdit on MacOS or Notepad on Windows), but you can use any third-party editors like Notepad++ or Visual Studio Code. Such text editors tend to have more advanced features and usually highlight functions in different colors for better readability. However, extra functionality might not be necessary, so it’s completely up to you which tool to use.
  • Install the latest Python version. If you use an older version of MacOS (12.3 or older) or Linux, Python may be preinstalled on your device. If it’s not, download Python from the official website and install it to your computer. 
  • Install necessary libraries. Then, install Requests and BeautifulSoup by running pip install requests beautifulsoup4 in the terminal (search for Terminal on MacOS or Command Prompt on Windows).

Step 2: Locate Elements You Want to Scrape

Now, it’s time to gather some information from the target website about the elements you’re going to scrape. Specifically, the CSS selectors defining book names and pricing. ChatGPT needs this information to build a scraper that targets these exact elements.

Visit https://books.toscrape.com/. You’ll see a list of books right away. To get the elements for the scraper, we’ll need to use “Inspect Element”. The easiest way to do this is to right-click on one of the book titles, and select “Inspect.” This action will open the browser’s Developer Tools, and you’ll see the HTML code of the website there.

Then, locate the book title element. Hovering over the code will highlight the area it defines, so you should find it easily.  Once you locate the element, right-click on it, select “Copy,” and click “Copy selector.”

Copying book title CSS selector in books.toscrape.com

Next, paste the copied selector somewhere to save (a desktop sticky note or your text editor) because we’ll need it in the next step. Copy and write down the selector for book prices, too. At this point, you’ve collected all the necessary information to write a prompt for ChatGPT.

Step 3: Prepare a Prompt for ChatGPT

Now, let’s write a prompt for ChatGPT to create a web scraper. A good rule of thumb is to write a clear, well-explained prompt with keywords and phrases that outline everything the chatbot needs to know. 

Prompt question – Instruction for ChatGPT

What does ChatGPT need to do? – Write a web scraper that extracts the titles and prices of all books on the target website.
What is the target URL? – The target website is: https://books.toscrape.com/
What programming language should it use? – Use the Python programming language.
What libraries should it use? – Use the following libraries: Requests and Beautiful Soup.
What elements does it need to scrape? – The defined CSS selectors are:

  1. #default > div > div > div > div > section > div:nth-child(2) > ol > li:nth-child(1) > article > h3 > a
  2. #default > div > div > div > div > section > div:nth-child(2) > ol > li:nth-child(1) > article > div.product_price > p.price_color

What are the output requirements? – Save titles and prices in a CSV file.

A well-written prompt ensures that the response is less error-prone. Think about the task at hand and what questions need answering to build a scraper that does exactly what we need. Your answers will provide instructions for ChatGPT. Now, all you need to do is give these instructions to ChatGPT and get a response! Here’s the result:

				
					import requests
from bs4 import BeautifulSoup
import csv

# Define the URL of the target website
url = "https://books.toscrape.com/"

# Send a GET request to the website
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find all book containers on the page
    books = soup.select('#default > div > div > div > div > section > div:nth-child(2) > ol > li')

    # Prepare a list to hold the titles and prices
    books_data = []

    for book in books:
        # Extract the title using the specified CSS selector
        title = book.select_one('article > h3 > a')['title']

        # Extract the price using the specified CSS selector
        price = book.select_one('article > div.product_price > p.price_color').text

        # Append the extracted data to the list
        books_data.append([title, price])

    # Define the CSV file to save the data
    csv_file = "books.csv"

    # Save the data to a CSV file
    with open(csv_file, mode='w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(["Title", "Price"])  # Write the header
        writer.writerows(books_data)  # Write the data rows

    print(f"Data successfully saved to {csv_file}")
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")

				
			

Step 4: Review and Test the Code

Even if you don’t know a thing about writing code with Python, I highly recommend reviewing the code once ChatGPT generates it. Take some time to check if your prompt is clear, includes all necessary information, and if ChatGPT followed all your directions. 

For example, can you spot a mistake here?

ChatGPT-generated code with a mistake

When you try to run this code, it will show you an error. That’s because I forgot to mention that ChatGPT has to import BeautifulSoup for data parsing. 

If you run the code and something doesn’t work, you can ask ChatGPT for help – simply write a new prompt by asking to review and edit the code. Always double-check your requirements and, if you see error messages when trying to run the command, define them. You can always add extra information to the prompt, if necessary.

Step 5: Check Parsed Data

In our prompt, we asked ChatGPT to save data in a CSV file. So, the scraper ChatGPT generated should be able to do so after it’s done with scraping and parsing. Once you open the saved file, your results should look similar to this:

CSV file with parsed data opened in Numbers (MacOS)

If they do – congratulations, you successfully used ChatGPT for web scraping! If not, go back to Step 4 and troubleshoot. Common mistakes include forgetting to import the csv module in your scraper, extracting too much data (which shouldn't be the case here), or running out of memory on your system.

Now, you can try to experiment further: rewrite your scraper to extract book ratings, or book titles only from specific categories. You probably also noticed that this scraper only scraped the first page. Try asking ChatGPT to improve your scraper so it can collect titles and prices from all 50 pages – a rough sketch of what such pagination handling might look like is shown below. The more you practice, the better your understanding of scraping logic will be.
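A pagination-aware version generated from such a prompt could look roughly like this – it follows the site's "next" button until it runs out of pages (a sketch, not ChatGPT's exact output):

import csv
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = "https://books.toscrape.com/"
books_data = []

while url:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    for book in soup.select("article.product_pod"):
        title = book.select_one("h3 > a")["title"]
        price = book.select_one("p.price_color").text
        books_data.append([title, price])
    # Follow the "next" link until it disappears on the last page.
    next_link = soup.select_one("li.next > a")
    url = urljoin(url, next_link["href"]) if next_link else None

with open("books_all_pages.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Title", "Price"])
    writer.writerows(books_data)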

Once you feel confident, you can try more advanced data retrieval. For example, you can ask ChatGPT to write a scraper for dynamic content using Selenium. You can use it to scrape flight ticket prices from websites like Expedia or Skyscanner.

However, always keep in mind that real websites are trickier than sandboxes, and they rarely enjoy being scraped. For example, scraping Amazon is possible but much more complicated. Nevertheless, whichever website you choose to scrape next, always respect its Terms of Service and robots.txt file, comply with data protection laws, and follow good scraping practices.

Advanced Techniques for ChatGPT Web Scraping

The website we scraped is designed to test scrapers, so we didn’t encounter any roadblocks like CAPTCHAs or gotchas. Moreover, all data we scraped was simple HTML content. 

However, what if you need to scrape a website that uses dynamically-generated content based on JavaScript? What about bot-protection systems that prevent scrapers from extracting data? To tackle that, you’ll need some more advanced tools. But don’t worry – ChatGPT will still be able to help you out.

Handling Dynamic Content

If you’re planning to do some advanced scraping, a simple web scraper that only handles regular HTML content probably won’t be enough. Most websites nowadays use JavaScript to load their content – it powers things like infinite scrolling and elements that load as you interact with the page. These elements, however, are too much for simple scrapers to handle.

Try hovering your mouse over the proxy server icon below. While you can do that and see some extra information about the image, simple scrapers can’t. 

There are a couple of ways to scrape dynamic content. One is to use a headless browser that can execute JavaScript alongside your scraper. Another is to use pre-built tools that can handle and scrape dynamic content on websites. 

Using Headless Browsers

A headless browser helps scrapers “browse” the web. Basically, it’s a regular browser like the one you’re using to read this, but it doesn’t have a graphical user interface. So, instead of clicking buttons or typing in URL bars, you interact with the browser programmatically by writing code. 

Traditional web scraping methods rely on sending HTTP requests, so they can only retrieve information that is static – text, images, multimedia. Unfortunately, this means that dynamic elements like pop-ups are left behind. In this case, a headless browser can execute JavaScript by rendering it – just like a real user’s browser would. This allows scraping data from dynamic elements. You can use ChatGPT to assist you in writing code for headless browsers like Selenium, Playwright, Puppeteer, or any other.
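
If you’d rather not have a browser window pop up while the scraper runs, you can also ask ChatGPT to enable headless mode. Here’s a minimal sketch of what that looks like with Selenium’s Chrome options – the target URL is just an example:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Configure Chrome to run without a visible window
options = Options()
options.add_argument("--headless=new")  # use "--headless" on older Chrome versions

driver = webdriver.Chrome(options=options)
driver.get("https://books.toscrape.com/")  # example target
print(driver.title)  # page content is available once its JavaScript has rendered
driver.quit()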

Writing a Prompt for Web Scraping with Selenium

We can use the same prompt as before, and alter it. Instead of using Requests and Beautiful Soup, let’s ask ChatGPT to use Selenium – a powerful Python library used to automate web browsers. This time, let’s use another sandbox – https://quotes.toscrape.com. We’ll scrape quotes and their authors this time.

Step 1. Install the necessary libraries. 

  1. Run the command pip install selenium in the terminal to install the library.
  2. Download ChromeDriver. It lets Selenium control Chrome – the browser your scraper will drive.

NOTE: You’ll have to copy different CSS selectors for this scraping experiment. Repeat the same steps (using “Inspect Element” and copying data) for getting selectors for quotes and authors in this case.

Step 2. Write your prompt. It should look something like this:

Please write me a Python web scraper using Selenium. The scraper should extract quotes and authors from the target website.

Target website: https://quotes.toscrape.com/

CSS selectors: 

  1. body > div > div:nth-child(2) > div.col-md-8 > div:nth-child(1) > span.text
  2. body > div > div:nth-child(2) > div.col-md-8 > div:nth-child(1) > span:nth-child(2) > small

Output: save all quotes and authors in a CSV file.

Here’s how the ChatGPT-generated code should look in your code editor. Notice how the syntax differs from the one we wrote before:

				
					from selenium import webdriver
from selenium.webdriver.common.by import By
import csv

# Set up the web driver (ensure the driver is in your PATH or specify the path)
driver = webdriver.Chrome()

# URL to scrape
url = "https://quotes.toscrape.com/"

# Open the URL
driver.get(url)

# Create lists to store quotes and authors
quotes_list = []
authors_list = []

# Locate the quotes using the provided CSS selector
quotes = driver.find_elements(By.CSS_SELECTOR, "body > div > div:nth-child(2) > div.col-md-8 > div > span.text")
authors = driver.find_elements(By.CSS_SELECTOR, "body > div > div:nth-child(2) > div.col-md-8 > div > span:nth-child(2) > small")

# Extract the text and append it to the lists
for quote, author in zip(quotes, authors):
    quotes_list.append(quote.text)
    authors_list.append(author.text)

# Close the web driver
driver.quit()

# Save the quotes and authors to a CSV file
with open("quotes.csv", mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(["Quote", "Author"])
    for quote, author in zip(quotes_list, authors_list):
        writer.writerow([quote, author])

print("Scraping completed and saved to quotes.csv")

				
			

Step 3. Find your CSV file, and open it. Your results should sit in a neat table.

CSV file opened in Numbers (MacOS)

Using Pre-Built Tools

Instead of using ChatGPT to build a web scraper from scratch, you can get ready-made scrapers to help with your scraping projects.

Firstly, you can opt for a non-scraping option – some providers offer pre-scraped datasets from various industries. Instead of building a scraper, extracting, cleaning, and parsing data yourself, you can check if there are prepared and detailed datasets for your use case. If there aren’t datasets that suit your use case, and you still need to scrape data yourself, experiment with no-code scrapers. These tools require you to interact with the website, but they translate your clicks into scraping logic. They work by letting you browse, click, and scroll through the site like a regular user, while simultaneously extracting the data based on your interactions.

You can also get web scraping APIs or proxy-based APIs. These tools are made to handle any amount of scraping and are designed to work with most modern websites. They can handle JavaScript content, create a unique fingerprint, parse extracted data, manage proxies, bypass anti-scraping systems, and more. They do require some programming knowledge to set up and run, but you can use ChatGPT to help you out here.

Additionally, you can register to use your target website’s API for scraping. Instead of getting the entire HTML document, parsing it, and working around JavaScript elements and anti-scraping measures, an API lets you get specific, structured data without much hassle. Then, you can send requests with Python – ChatGPT can help you write them, too. Keep in mind that not all websites offer APIs, they may restrict the data you can access, and more often than not, they come with a cost.
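
To give a rough idea, calling a website’s API from Python usually boils down to an authenticated GET request. The endpoint, parameters, and header format below are made up for illustration – check your target’s API documentation for the real ones:

import requests

# Hypothetical endpoint and API key - replace with values from the API documentation
API_URL = "https://api.example.com/v1/products"
API_KEY = "your-api-key"

response = requests.get(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},  # authentication scheme varies by API
    params={"category": "books", "page": 1},         # example query parameters
    timeout=10,
)
response.raise_for_status()
print(response.json())  # structured data, no HTML parsing needed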

Avoiding Anti-Scraping Measures

While websites like books.toscrape.com love being scraped, most websites don’t. Instead of giving you a hall pass to look around and collect their data, they implement antibot measures. CAPTCHAs and services like Cloudflare protect them from malicious bots. Though, not all is lost if your target website has these measures implemented. 

Websites want human traffic because it brings revenue. Bot traffic, however, can overload the servers, preventing real people from accessing the site. If you want to scrape successfully, you need your scraper to look like a real human.

Use Proxies

A proxy server is a middleman between you and the web server. When you route your traffic through a proxy server, it masks your original IP address, so the web server can’t see who you really are. Many proxy service providers offer quality proxies that come in large pools, so you can switch addresses to prevent detection and IP blocks. Typically, real people use residential IPs, so this type of proxy is less likely to be blocked.

Residential proxies come from real devices like smartphones, laptops, or even smart TVs that are connected to the internet via Wi-Fi. By using residential IPs, your requests appear natural, especially when you’re doing some heavy-duty scraping.
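
In practice, pointing a Python scraper at a proxy is usually a one-line change. The sketch below assumes a made-up residential proxy gateway and credentials – your provider’s dashboard will give you the real ones:

import requests

# Hypothetical proxy credentials and gateway - replace with your provider's details
proxy = "http://username:password@gate.example-provider.com:7000"
proxies = {"http": proxy, "https": proxy}

# The request leaves through the proxy, so the target sees the proxy's IP, not yours
response = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=10)
print(response.json())  # prints the IP address the target website sees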

Spoof Your Browser’s Fingerprint

Have you ever noticed that websites tend to know it’s you visiting their site, even when you’re not logged in? That’s because your browser’s fingerprint shows some information about you.

A fingerprint is a collection of certain data points – screen resolution, OS, installed fonts, timezone, saved cookies – that help identify you on the web. Regular users like you have pretty detailed fingerprints that are unique, but still quite typical looking. Automated bots, however, tend to have browser fingerprints that either miss human-like information like cookies or precise user-agent strings or are inconsistent, so they stand out.

When you’re building your scraper, it’s important to find ways to spoof your browser’s fingerprint so it blends in. For example, there are Python libraries that can change the fingerprint by adjusting or rotating user-agent strings, or modify parameters like screen resolution and timezone. Antidetect browsers can also be useful – they let you create separate browsing environments with their own digital fingerprints without interlinking them.
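
A very simple version of user-agent rotation looks like the sketch below – the two strings are just examples of the kind of values you’d keep in a larger, regularly updated list:

import random
import requests

# A small example pool of user-agent strings (keep a bigger, up-to-date list in practice)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]

for url in ["https://books.toscrape.com/", "https://quotes.toscrape.com/"]:
    headers = {"User-Agent": random.choice(USER_AGENTS)}  # present a different identity per request
    response = requests.get(url, headers=headers, timeout=10)
    print(url, response.status_code)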

Other Tips for Web Scraping with ChatGPT

  • Don’t expect it to be perfect. Despite being trained on large amounts of data, ChatGPT is still an automated program, and not a developer. It can provide inaccurate information or code, even if your prompt is brilliant.
  • Treat ChatGPT like a junior-level specialist. Think of the chatbot as a capable, but rookie specialist that’s just starting out in the field. Make sure to give clear instructions, and always double-check the result (and your query!).
  • Rephrase, reiterate, and regenerate answers. ChatGPT is so popular because it’s very flexible. If you’re not happy with the answer it provides, you can give additional instructions or regenerate the response until you like it. For example, if you decide you want your scraper to be in Node.js rather than Python, you can ask ChatGPT to recreate it.
  • Ask follow-up questions. Let’s say ChatGPT helped you write a web scraper with Python, but you don’t understand what certain functions mean. Don’t be afraid to ask “What’s X?” or “What does Y do?” to get a better grasp on what you’re doing. By actually understanding the output, you’ll be able to see if it’s correct.

Conclusion

ChatGPT can’t scrape websites for you, but it’s a great tool to get you started with web scraping: while not a scraper itself, it can generate scraping code based on your query. Using the chatbot can be the first step into web scraping even if you can’t write code yourself, but it’s important to remember that it’s an automated program that can provide inaccurate information. 

The post ChatGPT for Web Scraping: A Step-by-Step Guide appeared first on Proxyway.

]]>
https://proxyway.com/guides/chatgpt-web-scraping/feed 0
The 9 Best Methods to Bypass Cloudflare When Web Scraping in 2024 https://proxyway.com/guides/how-to-bypass-cloudflare https://proxyway.com/guides/how-to-bypass-cloudflare#respond Fri, 06 Sep 2024 08:07:35 +0000 https://proxyway.com/?post_type=guides&p=25543 All you need to know about bypassing Cloudflare when web scraping.

The post The 9 Best Methods to Bypass Cloudflare When Web Scraping in 2024 appeared first on Proxyway.

]]>

Guides

Websites don’t like bots crawling around, so they use anti-bot measures like Cloudflare to stop them. If you’re looking to scrape data from Cloudflare-protected websites, you’ll need to know how to bypass it. Let’s find out how to achieve that.

How to bypass Cloudflare for web scraping

If you ever encountered a CAPTCHA with an orange cloud asking you to check the box that you’re human, congratulations – you’re already somewhat familiar with Cloudflare’s service.

Cloudflare is a company that offers content delivery network (CDN) service to increase web performance and security. Apart from making websites more optimized, Cloudflare protects them from things like DDoS attacks, as well as other malicious and unwanted bot traffic. But what about web scrapers?

Unfortunately, scrapers are identified as bots, and they often get blocked. While Cloudflare allows some verified bot traffic to pass through (i.e. Google’s crawlers that index web pages), bot-protection systems rarely know the difference between good and bad bots. And more often than not, web scrapers are interpreted as the bad ones. But if you’re looking to scrape Cloudflare-protected websites without getting blocked, there are ways to bypass this security layer.

What Is Cloudflare’s Antibot Protection?

Cloudflare is known for its Cloudflare Bot Management designed to shield websites from automated attacks and malicious bots. It provides advanced protection against a wide range of threats, including credential stuffing, web bypasses, and account takeover.

Cloudflare’s Bot Management has a Web Application Firewall (WAF) that uses traffic patterns and advanced algorithms to identify threats and stop malicious traffic from reaching a website’s origin server. 

Additionally, the DNS bypass feature differentiates between known good bots and potentially harmful bots. This allows legitimate bots, such as web crawlers, to access a website without being blocked by security measures, but stops suspicious ones.

Cloudflare errors and response codes

When Cloudflare prevents you from entering a specific website, you’re going to see an error code which explains the reasoning behind the block. Here are some of the most popular Cloudflare error codes you can encounter.

Cloudflare Error 1020: Access Denied. Cloudflare doesn’t specify the reasons behind this error, but it means that the client or the browser has been blocked by Cloudflare customer’s (website) Firewall rules. Likely, because the Cloudflare-protected website detects malicious activity, considers you a bot, or you’re denied access to a specific page (i.e., admin panel).

Cloudflare Error 1010: The owner of this website has banned your access based on your browser’s signature. Your request was blocked because of your browser’s signature. This often happens when a scraping tool exposes fingerprint information that lacks detail or isn’t as unique as a human-like fingerprint, and Cloudflare detects it. 

Cloudflare Error 1015: You are being rate limited. You sent too many requests from the same IP address, and got blocked. Real user traffic rarely gets limited, so you either need to send requests through different IPs or reduce the amount altogether.

Cloudflare Error 1009: Your country is blocked. A typical Cloudflare error code, but it has nothing to do with your scraper. The error means that the website owner wants traffic only from specific countries or regions, and yours isn’t on the list.

Most popular Cloudflare error codes

How Does Cloudflare Detect Web Scrapers?

Cloudflare uses two methods to detect web scrapers: passive and active. The passive method relies on backend fingerprinting tests, while the active one relies on client-side analysis.

Passive Detection Methods

Cloudflare checks your IP address reputation to distinguish if you’re a bot or a human. For example, humans typically browse the web through residential or mobile IP addresses, as these are provided to them by their internet service providers. In contrast, datacenter IPs are often used by automated systems, like web crawlers or scrapers. So, someone visiting a website with a datacenter address is likely not a human. 

Additionally, IPs that were previously used for phishing, scamming, or other malicious activity will be banned. So, if you use a proxy service with your web scraper, make sure the provider offers reputable IP addresses.

Cloudflare also keeps track of bot networks. It stores information about IP addresses, devices, and even behavior patterns associated with these networks. So, it will block IPs and ASNs suspected to be a part of a botnet or give them client-side challenges (“What’s 3+7?” or “Select all red shapes”) to solve in order to prove human activity.

HTTP request headers – the information your client’s browser sends – can also be used to distinguish bots from real people. A header set from real human traffic has a detailed user-agent string that defines the operating system, the browser’s version, and other parameters, along with saved cookies and an IP address that matches the stated geolocation. Bots, however, tend to have a lot of this information missing or mismatched.
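
That’s why scrapers often send a full, browser-like set of headers instead of a library’s bare defaults. Here’s a minimal sketch with the requests library – the values should mirror a real browser session you’ve inspected, and headers alone won’t get you past Cloudflare’s other checks:

import requests

# Headers copied from a real browser session look far less suspicious than defaults
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://www.google.com/",
}

response = requests.get("https://books.toscrape.com/", headers=headers, timeout=10)
print(response.status_code)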

Active Detection Methods

To check if the user is not a bot, Cloudflare-protected websites can present Turnstiles – non-interactive CAPTCHA alternatives. These tools are only presented if your traffic looks suspicious. For example, if you have lacking user-agent strings, unusual web interactions, or use datacenter IPs. After Cloudflare checks your parameters and decides you’re a human, you’ll see a “Success!” message after the Turnstile solves itself.

Cloudflare's Turnstile in action

Using proxies or a VPN service increases the chance of encountering Turnstiles. So, if you use proxies for web scraping, keep in mind that the scraper will need to be able to mimic human-like browsing to avoid them. 

Additionally, Cloudflare uses canvas fingerprinting to detect automated activity by checking the user’s device class. This class refers to the combination of your browser, OS, and even hardware parameters. It’s defined by three layers: hardware (GPU), low-level software (installed fonts, pixel rendering), and high-level software (web browser). A combination of these parameters creates a unique fingerprint, and Cloudflare can detect if you’re trying to spoof them.

And that’s not all. Cloudflare also has event tracking. It checks mouse activity, such as movements, scrolls, and clicks, as well as what keyboard buttons you press. Typically, humans need a mouse and a keyboard to browse the web, so if your scraper won’t click enough or move the mouse at all, it will raise suspicions.

How to Bypass Cloudflare Bot Management When Web Scraping

There are several ways to avoid Cloudflare’s challenges when web scraping. While no method is perfect, these tools can help bypass Cloudflare rather simply.

Using Headless Browsers

Headless browsers controlled by tools like Selenium or Puppeteer are regular browsers, just without a user interface. They have no buttons, URL bars, or other elements to interact with. Instead, you use them programmatically. But how does a headless browser help to bypass Cloudflare? There are several advantages to using one when dealing with Cloudflare-protected websites.

  • You can customize the browser’s HTTP headers. You can design them to mimic a real user’s user-agent string, language preferences, cookies, and more. By creating a human-like set of headers, you can trick Cloudflare into letting your scraping bot operate.
  • They make interactions appear human-like. Scrapers are typically very methodical about visiting sites. Humans, on the other hand, are more chaotic and have distinct browsing patterns. For example, we tend to visit the landing page first, then move to product selection, go back and forth until we choose, and so on. Headless browsers can imitate real browsing patterns and web interactions, such as browsing speed, human-like typing, and mouse movements.
  • Headless browsers can be fortified. There are several plugins for headless browsers like Selenium and Puppeteer that patch and optimize them to look more human-like. These plugins (you can find them on GitHub) can help spoof fingerprints, modify and rotate user-agent strings, emulate human-like browsing, and more. This is especially important for scraping tasks because it stops fingerprint leaking and fixes the parameters that would otherwise help Cloudflare identify the browser as a bot – see the sketch below.
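
As one example of such fortification, the community-maintained undetected-chromedriver package patches Selenium’s Chrome driver to remove common automation giveaways. A minimal sketch, assuming the package is installed (pip install undetected-chromedriver) and that your target allows scraping:

import undetected_chromedriver as uc

# Start a patched Chrome instance with fewer automation fingerprints
options = uc.ChromeOptions()
options.add_argument("--lang=en-US")  # small touches like locale help the browser blend in

driver = uc.Chrome(options=options)
driver.get("https://example.com/")  # placeholder URL - use your own target
print(driver.title)
driver.quit()

Keep in mind that patched drivers and stealth plugins tend to lag behind Cloudflare’s updates, so treat them as one tool among several rather than a guaranteed bypass.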

Using Proxies

Proxies are intermediaries between you and the web server. They route your traffic through a different IP and mask the original address and location. As mentioned before, humans almost never browse the web through datacenter IPs. So, in order to prevent detection, it’s important to pick residential or mobile proxies for your web scraper. 

  • Requests appear more natural. When scraping the web, your scraper will be sending loads of requests to the website’s server, and you will get blocked if you flood the server with too many of them. Therefore, it’s a good idea to use different IPs and rotate them. Most proxy services offer sizable IP pools, many geolocations, and an option to rotate proxies. This way, you can make your scraper’s requests appear more scattered and natural – as if they come from different people.
  • You can access geo-restricted content. Proxies can help you avoid Cloudflare Error 1009. Setting your IP address to the one allowed by the website will give you the opportunity to access and scrape websites if they are geo-restricted in your actual location.

Using Web Scraping APIs

Web scraping APIs are tools that combine data extraction logic and proxies with little input from you. There are many platforms to choose from, each with different strengths and capabilities. Some platforms will require programming knowledge to set up, while others offer ready-made scrapers with integrated proxy service, data parsing, and other beneficial tools.

Scraping a Cached Version

Since Cloudflare allows web crawlers, there’s likely an indexed and cached version of the website you want to scrape. So, instead of trying to bypass Cloudflare yourself, you can scrape the cached version.

To scrape the Google cache, add https://webcache.googleusercontent.com/search?q=cache: to the start of the URL you want to scrape.

For example, if you want to scrape Proxyway’s homepage, your URL should look like this: https://webcache.googleusercontent.com/search?q=cache:https://www.proxyway.com/

While this method is a cost-effective solution, it’s the most unpredictable. The cache might be too old, especially if you’re looking for fresh data. Or your target website isn’t cached at all. If you check the web cache page, it will tell you when the last snapshot was made, so make sure this data is still viable for use.
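
If you want to try it, building the cache URL in Python is a one-liner. Keep in mind that, as noted above, the snapshot may be stale or missing entirely, and the cache endpoint itself isn’t guaranteed to remain available:

import requests

target = "https://www.proxyway.com/"
cache_url = "https://webcache.googleusercontent.com/search?q=cache:" + target

response = requests.get(cache_url, timeout=10)
print(response.status_code)   # 200 means a cached snapshot was returned
print(response.text[:200])    # first part of the cached HTML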

Cached version of Proxyway's homepage

Additionally, some websites like LinkedIn tell Google crawlers not to cache their web pages, so you won’t be able to scrape cache altogether.

Alternative Methods to Bypass Cloudflare

If, for one reason or another, the abovementioned methods don’t work for you, you can try some alternative ways to bypass Cloudflare.

  • Cloudflare solvers are tools and libraries built to get past Cloudflare’s security measures, which work by spotting spoofed fingerprints or unhuman-like browsing. There are several Cloudflare solvers available, some open-source and some paid. However, they often fail to keep up with Cloudflare’s protection methods, which are becoming much harder to bypass.
  • Cloudflare only blocks requests that pass through its network. Instead, you can try sending requests straight to the origin server: find the IP address of the server hosting the content, then send your requests there directly, bypassing Cloudflare. However, this method is quite challenging – it’s not easy to find the origin server’s IP address. And even if you do, the host might reject your request. 
  • If you have the original IP, you can try to request data from the origin server. Once you find the original IP, you can try pasting it into the URL bar, but that might fail because the request lacks a valid Host header that indicates which specific website you want to access. Tools like cURL allow you to specify a host header while requesting the origin server’s IP address. Unfortunately, this method often fails because many origin servers only accept traffic from trusted IP addresses.
  • You can also reverse engineer Cloudflare’s antibot system to find the workaround. In essence, reverse engineering is taking the whole system apart to figure out how it works. Knowing a system inside out can help you find loopholes to bypass it. If you have the skill for it, reverse engineering can be a cost-effective solution, especially if you aim to scrape large volumes of data. However, Cloudflare’s antibot system was purposefully made to be difficult to understand and it tends to change frequently, so this approach is definitely not for everyone.
  • When you visit a Cloudflare-protected website, you must first wait in the Cloudflare waiting room. It’s a virtual traffic control system that temporarily stops you from accessing the target website. During the wait time, your browser solves challenges to prove you’re not a robot. If all goes well, you’ll be redirected to the web page. However, if you’re labeled as a bot, you’ll be given an “Access Denied” error. You can reverse engineer Cloudflare’s waiting room challenges by checking the network log and debugging the challenge’s script. However, this also requires a very high programming skill level.

Conclusion

There are several ways to bypass Cloudflare, but the one you choose depends on the amount of resources you have, how much scraping you’re planning to do, and even your skill level.

If you’re looking for cheaper options, you can scrape cached versions of websites, try to reverse engineer the antibot system, or get an open-source Cloudflare solver. While not perfect, these methods can be great for the tech-savvy and those looking to save a buck. On the other hand, proxies and web scraping APIs are more expensive but also much more efficient. They are reliable, relatively simple to understand, and the cost usually depends on how much traffic you use.

Picture of Isabel Rivera
Isabel Rivera
Caffeine-powered sneaker enthusiast

The post The 9 Best Methods to Bypass Cloudflare When Web Scraping in 2024 appeared first on Proxyway.

]]>
https://proxyway.com/guides/how-to-bypass-cloudflare/feed 0
Web Scraping JavaScript vs Python: Which is Best for Your Needs? https://proxyway.com/guides/web-scraping-javascript-vs-python https://proxyway.com/guides/web-scraping-javascript-vs-python#respond Tue, 03 Sep 2024 13:09:41 +0000 https://proxyway.com/?post_type=guides&p=25458 This is an in-depth guide on web scraping with JavaScript and Python

The post Web Scraping JavaScript vs Python: Which is Best for Your Needs? appeared first on Proxyway.

]]>

Guides

JavaScript and Python are two popular programming languages used for web scraping. But which one suits your needs better? Let’s find out. 

Web Scraping with JavaScript vs Python

JavaScript and Python, commonly used for web and mobile application development, data science, and various other tasks, are also popular for web scraping. There is no definitive answer to whether JavaScript or Python is better for web scraping. Usually, it boils down to the project complexity, the requirements (i.e., dynamic content handling), your familiarity and experience with the language, and even personal preference.

It’s important to understand that both languages are perfectly capable of various scraping tasks. Let’s dive into how each programming language works for web scraping, and which one would work better in your case.

What Is Python?

Python is a versatile programming language used for web development, data analysis, and web scraping. It’s pretty easy to grasp because of the relatively simple syntax. 

Even though starting out with Python requires minimal coding experience, it doesn’t mean the programming language itself is somewhat limited. Contrarily, it has extensive resources, such as libraries and frameworks for extra functionality, as well as large communities of professionals and enthusiasts who can help out if you’re facing challenges. 

In addition to its readability and resources, many choose Python for web scraping because it’s capable of handling data processing and manipulation (data cleaning, filtering, adding variables, analysis) really well. Python excels in automating repetitive tasks (i.e., data extraction) and working with large datasets.

What Is JavaScript?

JavaScript (you might see it abbreviated as JS) is a dynamic programming language primarily used to create dynamic content on websites, such as fill-in forms and animations. 

JavaScript may not be as intuitive as Python, and the learning curve is much steeper, but considering how many resources exist already, it’s frequently used for scraping, too. That’s because it’s very compatible with the web in general – most websites nowadays use JavaScript. Some of the most advanced scraping libraries, such as Puppeteer and Playwright, are also written for JavaScript, so it’s useful to have at least some understanding.

It also has a huge and supportive community to help you get started with your project. On top of that, JavaScript is fast, easily handles asynchronous operations, and is well-suited for scraping dynamic content from modern websites.

What is Node.js?

Node.js is an open-source runtime environment that allows you to run JavaScript on the server side, outside of a web browser. But why would you want that?

Each web browser has a dedicated engine to run JavaScript code. JavaScript is usually used on the client side to manipulate web page behavior – fetch new social media posts, run animations, and manipulate other interactive elements. You can try to disable it on your browser, and see how websites behave then.

Using the Node.js framework, however, allows you to build web servers, real-time applications, such as chats or live-streaming services using JavaScript. Ultimately, it makes sense to use JavaScript with the Node.js environment for scraping interactive web pages, due to JavaScript’s ability to handle real-time interactions on web applications natively.

Python vs JavaScript for Web Scraping: Main Features

While both can be used for web scraping, let’s take a closer look at JavaScript vs. Python features and how they compare.

Popularity

It might come as a surprise to newbies, but JavaScript and Python are two of the most popular programming languages in the world, with JavaScript holding the number one spot for several years now. No wonder JavaScript is so prevalent – it’s used everywhere on the web.

Python, despite being first introduced in 1991, is still seeing an increase in popularity. The language is actively maintained, so it remains a primary choice for data science, web scraping, and other tasks.

Most popular programming languages in 2022. Source: GitHub

Available Libraries

Both languages have multiple web scraping libraries that add additional functionality to their vanilla versions. Designated libraries help simplify workload, parse data, handle requests, and manage scraping projects.

Most popular web scraping libraries used with Python are BeautifulSoup (HTML parsing), Scrapy (data extraction), Selenium (browser automation; used with JavaScript and other languages too), Requests (handling requests), Pandas (data analysis and parsing). 

As for JavaScript, you’ll see Puppeteer (Chromium browser automation), Cheerio (HTML parsing), Playwright (browser automation), Axios (handling requests) libraries as some of the most used Node.js libraries.

Most popular JavaScript and Python libraries

Asynchronous Web Scraping

Synchronous web scraping works by sending an HTTP request, waiting for it to finish, and then moving to the next one. Asynchronous web scraping, on the other hand, basically means that your scraper can handle concurrent requests. In simple terms, you don’t need to wait for one request to complete, so you can scrape multiple pages in parallel. Both JavaScript and Python are able to perform web scraping tasks asynchronously, but there are some differences in execution.

When scraping with JavaScript, you’ll use the Node.js environment. It’s inherently asynchronous and capable of handling concurrent requests because it uses a non-blocking I/O (input/output) model. In addition, Node.js uses async/await and Promises, which simplify working with asynchronous code. Though JavaScript with Node.js generally handles asynchronous requests really well, it all comes down to how these operations are dealt with – it can be difficult for less experienced developers.

Python, on the other hand, isn’t inherently asynchronous, but it has asynchronous libraries, like asyncio, aiohttp, and httpx, as well as async/await syntax. Using these libraries can give a Python scraper asynchronous capabilities similar to a JavaScript one, though typically slower. However, things can become complex once you add browser automation tools like Selenium – Python is usually easy to read, so extensive configuration with extra libraries can hurt readability.
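
To make the comparison concrete, here’s a minimal sketch of asynchronous scraping in Python with asyncio and aiohttp – it fetches a few pages from a scraping sandbox concurrently instead of one by one:

import asyncio
import aiohttp

URLS = [f"https://books.toscrape.com/catalogue/page-{n}.html" for n in range(1, 4)]

async def fetch(session, url):
    # While one response is awaited, the other requests keep running
    async with session.get(url) as response:
        html = await response.text()
        return url, len(html)

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in URLS]
        for url, size in await asyncio.gather(*tasks):
            print(f"{url} -> {size} bytes of HTML")

asyncio.run(main())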

Ability to Handle Dynamic Content

Dynamic content is usually fetched by JavaScript, so you might think that only JavaScript-based scrapers can handle it. However, that’s not entirely true. 

While JavaScript excels in handling dynamic content natively with tools like Puppeteer, Python can do it, too. More specifically, with the help of headless browser libraries, like Selenium and Playwright. However, it’s worth noting that it can add complexity to the scraping process. Beginners might be discouraged due to the sheer amount of code management involved – the more libraries you use, the worse the readability gets.

Performance

In terms of performance, it’s hard to compare the two – both are great, but in different areas. 

Python is perfect for processing and manipulating scraped data – parsing, cleaning, and analyzing information from different web pages. It’s also great for writing scripts for repetitive tasks, such as scraping similar data from multiple web pages. Python is very effective for large web scraping projects because of additional frameworks and libraries. It makes it rather simple to deal with vast volumes of data.

But JavaScript also deserves some praise. While not as great with data processing, JavaScript-based scrapers are very efficient and fast. Node.js is built to handle loads of concurrent connections, so if you’re looking to scrape web applications with AJAX or JavaScript elements, it will easily handle them. It’s also scalable.

Ease of Use

So, JavaScript vs. Python – which is easier to use? The short answer is Python. The long one – it depends on your experience and what you’re aiming to do.

Python is easy to pick up even for people without any prior programming experience. Python has more available integrations, libraries, and frameworks to expand its capabilities. However, juggling multiple integrations can be tricky – the code becomes less readable, and scraping speed decreases.

JavaScript isn’t as easy to begin working with, but it’s more efficient and faster for complex and dynamic scraping tasks. Nevertheless, managing multiple concurrent tasks is a hassle because it can raise complexity with things like error handling, synchronization, and data consistency.

Ultimately, many in the scraping community agree that Python is easier to use because it’s more user-friendly.

Community Support and Documentation

Python has extensive documentation about its use; it’s full of syntax manuals and tutorials on the setup, library use, third-party modules, and more. The documentation also covers every existing Python version (even the ones from the 90s!) and the ones that are still in development. Not only that, but it also covers additional resources, such as book lists, audio/visual talks, and developer’s guides. It’s a massive collection of information with everything you need to know about coding with Python.

Here’s the thing about JavaScript documentation – it doesn’t exactly exist. However, the JSDoc on GitHub can help you understand it better. Don’t worry, Node.js – the runtime environment used for web scraping with JavaScript – has pretty awesome documentation. It’s vast, full of useful resources, and information on past releases. There’s also a code repository and issue tracker if you need that!

JavaScript and Python both have large communities that are supportive, and full of tech enthusiasts as well as seasoned developers who are ready to help you out. You can find them on Stack Overflow, GitHub, Reddit or Discord.

Web Scraping JavaScript vs. Python: Side-by-side Comparison

Libraries
  • Python: BeautifulSoup, Selenium, Scrapy, Requests, and more
  • JavaScript: Puppeteer, Playwright, Cheerio, Axios, Selenium, and more

Asynchronous scraping
  • Python: available with some libraries
  • JavaScript: available

Dynamic content handling
  • Python: available with headless browser libraries
  • JavaScript: available natively

Performance and scalability
  • Python: slower; scalable
  • JavaScript: faster; scalable

Usability
  • Python: easy to pick up; becomes more difficult with multiple integrations
  • JavaScript: harder to pick up; becomes even more difficult with multiple integrations

Documentation and community support
  • Python: extensive documentation; large community
  • JavaScript: no official documentation; large community

Starting Web Scraping With Python vs. JavaScript

If you want to test Python vs. JavaScript in practice, you can try a few scraping projects with each and see how your experience differs. We created a tutorial for beginners that will help you test the programming language.

An introductory guide to Python web scraping with a step-by-step tutorial.

Also, we compiled a list of small and large scale Python project ideas that you can use to practice your skills.

As for JavaScript, you can try following a step-by-step guide to scrape data from a website with Node.js environment. It will help you choose the essential libraries for your scraping project, write the code, and introduce other helpful tips for starting out.

Everything you need to know about web scraping with Node.js and JavaScript in one place.

By trying web scraping with both JavaScript and Python in practice, you’ll be able to make an informed decision about which one you should choose for your upcoming projects.

Conclusion

JavaScript and Python are popular choices for web scraping, but each comes with its own strengths, weaknesses, and distinctive features. 

Python’s simplicity and extensive libraries are great for beginners and advanced users alike, especially for data-heavy projects. On the other hand, JavaScript’s impeccable ability to handle dynamic content and asynchronous requests by default is highly useful when scraping modern web applications.

In reality, there’s no right or wrong option – both languages are good and full of various resources to expand their capabilities. Choose Python if code readability and scraped data analysis is your priority. But if you’re planning to scrape JavaScript-heavy web applications, you should pick JavaScript instead.

Picture of Isabel Rivera
Isabel Rivera
Caffeine-powered sneaker enthusiast

The post Web Scraping JavaScript vs Python: Which is Best for Your Needs? appeared first on Proxyway.

]]>
https://proxyway.com/guides/web-scraping-javascript-vs-python/feed 0
Differences Between SOCKS and HTTP Proxy: A Complete Guide https://proxyway.com/guides/socks-vs-http-proxy https://proxyway.com/guides/socks-vs-http-proxy#respond Fri, 23 Aug 2024 08:57:12 +0000 https://proxyway.com/?post_type=guides&p=24927 All you need to know about SOCKS and HTTP proxies.

The post Differences Between SOCKS and HTTP Proxy: A Complete Guide appeared first on Proxyway.

]]>

Guides

Proxy servers can be categorized by protocol, with HTTP and SOCKS being the two main ones. This guide will help you determine whether HTTP or SOCKS proxies fit your use case better.

SOCKS vs HTTP Proxy Main Differences

Internet protocols like SOCKS and HTTP define how data is transmitted online. Proxies based on these protocols act as intermediaries that forward user requests between clients and servers. However, while both protocols handle internet traffic and are suitable for tasks such as web scraping, you might wonder why choose one over another. 

To make an informed decision, it’s important to understand exactly what HTTP and SOCKS proxies are – what are the differences, strengths, and limitations of each. While the primary function of both HTTP and SOCKS proxies is to route internet traffic, there are technical aspects to consider. But let’s start at the beginning.

What Is HTTP Proxy?

Servers and browsers need a way to communicate with each other. HTTP, short for HyperText Transfer Protocol, transfers web traffic, such as text, images, multimedia, etc. between the client – you – and a web server. HTTP is compatible with all types of websites, so it doesn’t matter what kind of website you’re visiting.

HTTP is basically the foundation of the entire communication across the World Wide Web as websites are built with HTML (HyperText Markup Language). So, by sending a request to a website, HTTP brings the content to you – whether you’re visiting a site, watching an online video, or downloading a file – in plain text.

An HTTP proxy serves as an intermediary between you and the server. When your traffic is routed through an HTTP proxy, it masks your original IP and assigns you a new one.

An HTTP proxy can also understand and interpret the data that’s transmitted between the two parties. This is especially useful because it can detect spyware or suspicious data packets, and prevent them from reaching you. However, HTTP is not suitable for transmitting sensitive data – this protocol doesn’t encrypt data, so it can be susceptible to tampering.

Client – HTTP proxy – server communication path

HTTP vs HTTPS

Fundamentally, HTTP and HTTPS are the same protocol. Both are used for data transmission through the web, and follow the same request-response structure for communication between client and server. But HTTPS adds a security layer which encrypts transmitted data – the additional S stands for Secure. In essence, Google treats HTTP protocol as not secure. But let’s go a bit more into detail.

First things first, HTTP transmits web data in plain text. That means it can be read by anyone who has access to the transmission path (the route data takes to travel from you to the server and back). HTTPS, on the other hand, encrypts data using Secure Sockets Layer (SSL)/Transport Layer Security (TLS) protocols. When encrypted, data is secure from tampering or eavesdropping.

For example, with HTTPS, the client and server first perform a handshake to exchange cryptographic keys, and only then transmit data that the receiving side can decrypt back to its original form. HTTP doesn’t have any encryption, so the handshake process doesn’t exist.

What’s more, connection through HTTPS requires a trusted SSL/TLS certificate for identity verification and secure connection. Hence, you won’t be able to make a connection if the certificate is missing. As a less secure protocol, HTTP doesn’t require certificates.

Notice how most websites you visit nowadays start with https://. That’s mainly because HTTP is outdated and rarely used in modern web applications. 

HTTP protocol, compared to HTTPS, is much easier to tamper with, so it can’t be trusted to transmit sensitive data, like credit card, social security, passport, driver’s license numbers, usernames and passwords, and more. 

Identity verification: HTTP – not supported; HTTPS – supported
Data encryption: HTTP – not supported; HTTPS – supported
Security: HTTP – basic; HTTPS – SSL/TLS protocols
Use cases: HTTP – browsing simple websites; HTTPS – browsing and transmitting sensitive data

What Is SOCKS Proxy?

SOCKS (Socket Secure) is an internet protocol that routes traffic like HTTP(S), but instead of only being suitable for web traffic, SOCKS can handle nearly any type of data.

Similarly to HTTP(S) proxies, SOCKS proxies mask your IP address and support data encryption. However, the main appeal of SOCKS is its use of Transmission Control Protocol (TCP) that ensures a stable and fast communication channel, thus making SOCKS a primary choice for tasks like content streaming, peer-to-peer sharing and torrenting as well as online gaming when you need a stable connection with minimal interruptions. 

However, while HTTP(S) proxies can filter malicious data, SOCKS addresses simply transmit data without interpreting it. When comparing SOCKS vs. HTTP proxies, the key difference is flexibility: SOCKS supports various protocols like TCP, UDP, and DNS, and operates on any port, which is useful for streaming or online gaming.

SOCKS4 vs SOCKS5

SOCKS4 and SOCKS5 are versions of the SOCKS protocol, but SOCKS5 offers more features and flexibility.

SOCKS4 is an older version of the protocol that handles basic functionalities like routing TCP traffic between a client and a server that creates a stable communication path. However, it lacks support for more advanced features like UDP traffic or authentication that allow for time-sensitive and more secure communication.

In contrast, SOCKS5 supports both TCP and UDP traffic, so you can transmit different types of data. Additionally, SOCKS5 includes built-in authentication that helps create more secure connections and handle complex network requests. This makes SOCKS5 a preferred choice for tasks that require higher security, like secure remote access or private communication (VoIP).

Supported protocols: SOCKS4 – TCP; SOCKS5 – TCP, UDP
Authentication: SOCKS4 – not supported; SOCKS5 – supported
Security: SOCKS4 – basic, no encryption; SOCKS5 – supports encryption
Use cases: SOCKS4 – simple TCP routing; SOCKS5 – more flexible (P2P, streaming, etc.)

Why Use SOCKS and HTTP Proxies?

Proxies, whether SOCKS or HTTP, can improve your online privacy, security, and performance. They work by masking your IP address and geolocation.

Whether you choose SOCKS or HTTP depends on your specific needs – are you going to work with broad internet traffic or focus on web tasks only?

The Main Differences Between SOCKS vs HTTP Proxies

When deciding between SOCKS vs. HTTP proxies, the choice largely depends on the type of traffic you plan to route through the proxy.

SOCKS proxies are known for their versatility and ability to handle different types of internet traffic. Because SOCKS proxy can use both TCP and UDP protocols, this type is used in cases when you need reliable data transmission or large amounts of data – streaming content or downloading large files. 

In web scraping, SOCKS proxies can take some load off if you’re dealing with non-HTTP data or when you need a more flexible connection to handle complex requests. Their ability to operate on any port also adds to their flexibility.

HTTP proxies, on the other hand, are designed specifically for handling HTTP(S) traffic. They interpret and filter web data, and are a primary choice for tasks like web browsing and accessing geo-restricted content (for example, you can imagine how some workplaces or academic institutions block certain websites). Because HTTP proxies understand the content of transmitted data, they can also block malicious data packets, and optimize connections for target websites.
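
In code, the difference often comes down to the proxy URL scheme. Here’s a minimal Python sketch with the requests library (SOCKS support requires an extra package: pip install requests[socks]); the hostnames, ports, and credentials are placeholders:

import requests

# Placeholder endpoints - substitute your provider's proxy details
http_proxy = {"http": "http://user:pass@proxy.example.com:8080",
              "https": "http://user:pass@proxy.example.com:8080"}

socks_proxy = {"http": "socks5h://user:pass@proxy.example.com:1080",
               "https": "socks5h://user:pass@proxy.example.com:1080"}

# Same request, different protocol underneath
for name, proxies in [("HTTP proxy", http_proxy), ("SOCKS5 proxy", socks_proxy)]:
    response = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=10)
    print(name, response.json())

The socks5h scheme tells requests to resolve DNS through the proxy as well, which keeps lookups from revealing your location.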

SOCKS vs HTTP proxies: the key differences

The Bottom Line

Both SOCKS and HTTP proxies have their own strengths and weaknesses, and both excel at different tasks. However, understanding how SOCKS and HTTP proxies work can help you choose the best proxy type for your needs. 

If you need to transfer large amounts of different internet traffic and reduced latency, SOCKS proxies are your best choice. But if you’re aiming to work with regular web data, like text or multimedia, HTTP is a much better option.

Picture of Isabel Rivera
Isabel Rivera
Caffeine-powered sneaker enthusiast

The post Differences Between SOCKS and HTTP Proxy: A Complete Guide appeared first on Proxyway.

]]>
https://proxyway.com/guides/socks-vs-http-proxy/feed 0
The Ultimate Guide: How to Manage Multiple TikTok Accounts https://proxyway.com/guides/how-to-manage-multiple-tiktok-accounts https://proxyway.com/guides/how-to-manage-multiple-tiktok-accounts#respond Fri, 09 Aug 2024 13:24:48 +0000 https://proxyway.com/?post_type=guides&p=24643 The best practices for managing TikTok accounts with one device.

The post The Ultimate Guide: How to Manage Multiple TikTok Accounts appeared first on Proxyway.

]]>

Guides

TikTok has taken over the internet like a storm, so no wonder it has become a highly popular social media platform for businesses to advertise their products. But if you’re a social media manager for several brands, juggling multiple TikTok accounts can get tricky. This guide will give the know-how on running multiple TikTok accounts on one device.

Best Strategies for How to Manage Multiple TikTok Accounts

It is estimated that there are over 1 billion active TikTok users monthly. The platform also has a significantly higher engagement rate than other social media sites, because most of the content is short, very personalized, and it never stops. Creating videos on TikTok brings communities together, helps follow trends, catch up with friends, and promote business in an amusing and educational way for you and your followers.

But if you create TikTok content for clients or several personal accounts, doing so on a single device is not only tricky, but can lead to shadowbans.

Can You Have Multiple TikTok Accounts?

Yes, you can have multiple TikTok accounts. The platform technically allows users to create and manage up to three profiles. Though in reality there’s no limit to how many accounts you can have. All you need is a unique email address or a phone number for each profile, and enough real or emulated devices to manage them. 

TikTok also has an option to switch between accounts – you can find it by going to the mobile app, opening Settings and Privacy, and clicking on Switch account in the Login section. Notably, this function doesn’t exist on the web version.

The main reason why people use additional tools for multiple account management is because there’s a higher risk your accounts will be flagged as bots or get shadowbanned. A shadowbanned account is still active but it doesn’t reach other users organically, as it is not shown in the “For You” page. This results in fewer views, engagement, and visibility for your content. 

Why People Have Multiple TikTok Accounts

Just like with any other social media platform, there can be several reasons for managing multiple TikTok accounts. Here’s why people choose to have more than one TikTok account:

  • Separate regular and “dump” accounts. You might have heard about “dump” accounts before – it’s a separate personal account for posting content without hesitation or ruining your main account’s feed curation. Usually, “dump” accounts are private and designed for a close-knitted audience. While it’s still a personal account, it’s more exclusive.
  • Separate personal and professional accounts. Promoting a business on social media platforms is crucial nowadays. You can create a TikTok account for creating promotional content, but also have a separate account for personal use. This way, your feeds aren’t linked.
  • Managing accounts for clients. Digital marketing specialists and social media managers help businesses promote their brands and products. In this case, one person – the manager – might handle dozens of commercial profiles as part of their job. Keeping these accounts separate is a must to prevent shadowbanning.

The Risks of Managing Multiple TikTok Accounts

While many users have several TikTok accounts and use them hassle-free, managing multiple client profiles can prove to be more difficult. Mainly because of account bans and shadowbanning.

How to avoid bans and shadowbans on TikTok?

An account ban is quite straightforward – you get banned for community guidelines violations, spam, or copyright infringement. Shadowban, on the other hand, means you’re partially blocked on the app, but you can’t actually see it. If your account is shadowbanned, you can still interact with other creators, but your content doesn’t appear on the main “For You” page. Since most users use the “For You” page to watch relevant content, being invisible there results in significant engagement drop.

There are some obvious reasons why TikTok shadowbans accounts – some of them are disclosed in the community guidelines. However, marketing specialists found out that using the same device for more than three accounts can flag or shadowban some accounts. 

Here’s what you shouldn’t do if you want to run your TikTok accounts without complications:

  • Pretend to be someone else
  • Post offensive or adult content
  • Post harmful content or hate speech
  • Post stolen or duplicate content
  • Spam (posting, liking, commenting, following too fast and too much)
  • Fake engagement (engaging with your content from your other accounts)
  • Run multiple accounts on the same device

Not being compliant with community guidelines can get you banned, but there are other ways TikTok checks if your account is legitimate. The app tracks your device model and ID, browser type, network information, mobile carrier, operating system, IP address, and your activity on the app. If the app detects multiple account activity from the same IP and device, it can appear suspicious, potentially leading to your account or IP being banned. 

However, if you have a hundred accounts to run and only one device, getting client accounts banned isn’t an option. Hence, you’ll need some tools to assist you.

How to Manage Multiple TikTok Accounts

If you have more than three TikTok accounts under your control, switching accounts through settings won’t help much. So, how should you handle multiple accounts if you’re, say, a social media manager for multiple brands?

One of the obvious solutions – use more devices. A mobile app and desktop app will double the amount of accounts you can manage already. You can also try using a virtual private network (VPN). However, there are multiple reports that TikTok has VPN detection tools, and the app sometimes blacklists IPs associated with VPNs. Hence, your best bet when managing multiple TikTok accounts is to use proxy services and an antidetect browser.

Use an Antidetect Browser

You can manage multiple TikTok accounts without revealing your digital identity by using an antidetect browser. The tool allows you to create distinct browsing profiles with unique fingerprints – used browser, operating system, language, time settings, and more. This allows you to access and work with several TikTok accounts from a single device without being detected by the platform – each profile remains isolated, with fingerprints that are not linked to one another.

Unlike Chrome and other browsers that also support multiple profiles, an antidetect browser eliminates profile crossover. It allows you to create separate environments with individually saved cookies, different browser types and versions, time zones, locales, and other parameters that help to detect you. 

With multiple unique profiles, you can easily manage several TikTok accounts and have them open in different browser windows at the same time. This is one of the most popular ways to handle multiple social media profiles without facing bans. 

Moreover, setting up and managing an antidetect browser is far simpler than using multiple devices.

Use Proxy Services

You should also use proxies to mask your IP address and geolocation. Separate browser fingerprints are important, but IP addresses and geolocations are the primary targets when trying to identify someone online.

Some proxy providers offer proxies specifically optimized for social media management; they are effective for TikTok as well. Generally, TikTok proxies are mobile because it’s a mobile app, but residential addresses work as well. They’re also hardest to detect and almost never get blocked since they originate from real devices with authentic browsing histories. Using reputable mobile or residential proxies will reduce the risk of getting banned to a minimum. To achieve the best results, you should combine proxy services with an antidetect browser.

If you’ve never tried proxy services, you might be compelled to try free proxies. These services are free for a reason – usually, free IPs are abused on most social media platforms because everyone can access them. So, if your IP is not blocked yet, it likely will be soon.

Best Practices for Managing Multiple TikTok Accounts

Aside from adhering to TikTok’s community guidelines, here are some additional tips to help you manage multiple TikTok accounts effectively.

  • Assign a different IP address to each TikTok account. Keeping a unique IP and browser fingerprint for each account will help run TikTok accounts smoothly. TikTok is vigilant about running multiple accounts from one IP address, so it’s recommended to keep one IP per account for as long as possible.
  • Don’t use your accounts for unethical purposes. Creating multiple TikTok accounts for spamming, harassment, spreading hate speech, or illegal activities is not only against TikTok’s rules, but simply wrong. Expect account bans or, in some cases, even deletion if you use TikTok for shady business. 
  • Be prepared to fail. There is no foolproof method for managing multiple accounts. Even with antidetect browsers and proxies you may encounter occasional issues when accessing TikTok accounts. If you face problems, try waiting or rotating your IP address (this can be done via your proxy service or antidetect browser’s dashboard), and then attempt to access the account again.
  • Warm up new accounts. Warming up a new account simply means using it the way a regular user would. Start slow and grow from there – watch, like, and comment on videos, start creating content of your own, and interact with your followers. It’s important to warm up new accounts as it lets TikTok know they’re legitimate and not spammy.

This should help you effectively manage multiple TikTok accounts while minimizing the risk of bans, shadowbans, and account deletion.

The Bottom Line

Since TikTok doesn’t want you to have more than three accounts, managing multiple profiles can be quite a hassle. Though, antidetect browsers and proxies make the process much simpler – they’re simple to set up and use, and don’t require extra equipment. Though no solution is perfect, setting up assistive tools, following the best practices, and learning along the way will increase your chances of managing TikTok accounts successfully.


Frequently Asked Questions About Managing Multiple TikTok Accounts

You cannot have more than one TikTok account linked to one email address. Each TikTok account has to be created with a unique email or phone number. However, you can still manage multiple accounts with one device.

Isabel Rivera
Caffeine-powered sneaker enthusiast

The post The Ultimate Guide: How to Manage Multiple TikTok Accounts appeared first on Proxyway.

]]>
https://proxyway.com/guides/how-to-manage-multiple-tiktok-accounts/feed 0
What is AI Web Scraping: A Comprehensive Guide https://proxyway.com/guides/what-is-ai-web-scraping https://proxyway.com/guides/what-is-ai-web-scraping#respond Thu, 08 Aug 2024 10:17:42 +0000 https://proxyway.com/?post_type=guides&p=24706 Learn all about intelligent web scraping with AI-powered tools.

The post What is AI Web Scraping: A Comprehensive Guide appeared first on Proxyway.

]]>

Guides

Collecting accurate and up-to-date data can be a tedious task, but you can automate the process using various tools built for web scraping. However, if you’re working with complex websites, such as media streaming platforms, a typical custom-built scraper might require a lot of effort to set up.

What is AI web scraping

The more data on the web, the more ways to extract it – something as simple as a social media post or a product listing can be a valuable piece of data. A script instructed to retrieve this information can reduce manual work. While custom-built scrapers can do wonders, they can also break easily and miss crucial information. Additionally, many websites have built-in anti-scraping measures that are used to detect and block scrapers from completing their tasks. Not to mention that writing a script requires programming skills.

AI web scraping tools help avoid issues like detection because they respond to the task at hand and adjust themselves – whether it’s solving CAPTCHAs or automatically rendering JavaScript content. In this guide, you’ll learn everything you need to know about a more flexible and intelligent scraping approach – AI web scraping.

What Is Traditional Web Scraping

Traditional web scraping usually refers to automated data collection using custom-built scripts. In essence, you collect a list of URLs that you want to scrape, send a request to the target page, and your script pulls out the HTML code with all the web data. Then, if written in the script, the scraper cleans up the data to give you the information you initially asked for – it could be product listing names, prices, and whatnot. Once you have the code written, the process is quite straightforward, quick, and works as intended with most websites. However, it also comes with certain limitations.
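
To make that concrete, here’s a minimal sketch of such a script in Node.js. It assumes the axios and Cheerio libraries (covered later in this guide) and a made-up example.com page whose product names sit in .product-title elements – swap both for your real target:

import axios from 'axios'
import { load } from 'cheerio'

// Made-up target page and selector – adjust both to the site you're scraping
const url = 'https://example.com/products'

const resp = await axios.get(url)   // 1. fetch the raw HTML
const $ = load(resp.data)           // 2. parse it into a traversable structure

// 3. pull out only the data you asked for – here, product listing names
const names = $('.product-title')
    .map((i, el) => $(el).text().trim())
    .get()

console.log(names)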

Limitations of Traditional Web Scraping

While quite resource-intensive in the beginning, a custom web scraper can be a cost-efficient way to collect data from the web in the long run. If you’re skilled in programming or eager to learn, we recommend using Python or Node.js for writing your script – these languages are relatively simple, and have many powerful libraries for data collection and analysis. There are a lot of customizations you can do for a traditional scraper, but it’s important to understand that it will require constant attention.

  • They need add-ons for dynamic content: if you build a scraper yourself, you’re the only one responsible for its success. Let’s say you set the scraper up to work with a static HTML structure. When it runs into dynamic content, you’ll have to manually update and adjust the script to make it work again. Imagine a social media site where new posts load as you scroll – in this case, the website fetches post content via JavaScript. So, you’ll need to use a headless browser library to deal with dynamic elements. And trust me, this is harder than it sounds.
How YouTube looks without JavaScript
  • They’re made to work with one website layout: traditional scrapers get confused when websites change their layouts without refreshing the page. This leads to breakage, and missing or inaccurate information. Even with websites that have a simple HTML structure, you’ll have to readjust your scraper manually, if the website owner changes something (even relatively small) in the structure. 
  • They don’t “multitask” well: website layout changes probably won’t be much of an issue if you work with only one website. But if your case requires scraping loads of data on various websites with different structures, it will become tiresome to make adjustments. 
  • They’re more susceptible to antiscraping technologies: have you ever had to check a box to confirm you’re not a robot? While you’re capable of doing so, a traditional scraper usually isn’t. Websites employ various antiscraping technologies – CAPTCHAs, IP blocks, honeypot traps – to prevent robots from overloading their servers with unwanted traffic. In this case, you’ll need extra software like CAPTCHA solver and proxies to bypass web scraping roadblocks. However, it’s inconvenient and can make your script break faster.


If you already encountered these or similar issues with a custom-built scraper, or want to prevent them from happening, you should try AI web scraping.

What Is AI Web Scraping

Traditional scraping has come a long way and, to this day, is the primary choice for gathering web data. However, AI web scrapers significantly improve the process – they can scrape any website without you having to separately run headless browsers and CAPTCHA solvers or manually update the scraper.

Benefits of AI Web Scraping

  • Ability to handle dynamic content and adapt to structural changes: AI-based scrapers can handle both static and complex dynamic web content because they adjust to different content types, whereas traditional scrapers have to be manually reconfigured.
  • Extracted data is more accurate: AI web scrapers work faster and better because they learn from previous tasks. They can filter, contextualize, and parse information intelligently, similarly to how a human would. AI scrapers understand the context and can extract all relevant information regardless of how it’s presented. The process is more efficient, and no manual input is required – the scraper adjusts itself automatically.
  • They can outsmart anti-scraping technologies: AI scrapers can bypass anti-scraping measures, such as CAPTCHAs, honeypot traps (forms invisible to humans that only bots can try to fill in, which signals automated activity), or IP blocking, triggered by a suspiciously high number of requests from one address. They do so by adjusting browsing speed, mouse movements, and click patterns to imitate how a human would behave on a website. They can also choose the right proxy type, rotate them automatically, create unique browser fingerprints, and retry failed requests.

While seemingly foolproof, AI web scrapers aren’t without flaws. Usually, AI web scrapers are quite expensive, as specific features will cost extra. Also, you have less control over functionality and features – you’re stuck with what’s included in the service, and there’s no customization you can do yourself. It’s also worth mentioning that some websites (like Google) can be off limits with some pre-made tools.

Differences between Traditional and AI Web Scraping

In essence, traditional and AI scrapers are the same thing – they scrape data. However, traditional scrapers rely on predefined rules. They are conservative and do precisely what you ask them to do. AI web scrapers, on the other hand, can adapt to the task at hand, even if you didn’t adjust it – they’re more intelligent when encountering complex websites and data. 

Choosing AI Web Scraping Tools

If you need an AI web scraper, there are a couple of ways you can go about this. One way is to build a basic scraper with Python or another programming language from scratch, integrate a headless browser for dynamic content handling, a natural language processing model for semantic analysis and adaptable data extraction, and a machine learning model for data analysis, and then train it.

However, it’s a hassle and requires a significantly higher programming skill level. Fortunately, there’s another option – choose from multiple scraping tools already available on the market. They usually have great performance, well-maintained infrastructure, and are designed to handle large amounts of requests. Also, it’s a much better option for one-off jobs.

No-code AI Scrapers

No-code scrapers are a great choice for people without coding experience – they usually have a user-friendly interface and ready-to-use templates. With a no-code scraper, you visit a website, interact with the elements you want to scrape, and the scraper translates these interactions into scraping logic and structured data. The process is less automated than scripting, but it involves much less manual work than collecting everything by hand.

Not all no-code scrapers are AI-based, but most have intelligent features, such as pattern recognition, automatic adjustments, and the ability to scrape dynamic websites.

Web Scraping APIs and Proxy APIs

Web scraping APIs and proxy APIs are an automatic and programmatic way to scrape the web. They’re like remote web scrapers – you send a request to the API with the URL and other parameters like language, geolocation, or device type. 

They access the target website, download the data, and come back to you with the results. They handle proxies, web scraping logic, antiscraping measures, and you don’t interact with the website yourself but, instead, write a piece of code to instruct the scraper. 

The key difference between scraper APIs and proxy APIs is that the former integrates as an API, while the latter integrates as a proxy server through which your scraping code reroutes traffic.
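
As a rough illustration of that difference, here’s what the two integration styles might look like in Node.js with axios. The endpoint, parameter names, API key, and proxy address below are placeholders, not any specific provider’s API – check your provider’s documentation for the real ones:

import axios from 'axios'

// Scraper API style: you call the provider's endpoint and pass the target URL
// as a parameter (endpoint and parameters are made up for illustration).
const apiResp = await axios.get('https://api.example-scraper.com/v1/scrape', {
    params: { url: 'https://example.com', geo: 'us', device: 'desktop' },
    headers: { Authorization: 'Bearer YOUR_API_KEY' },
})
console.log(apiResp.data)

// Proxy API style: you keep requesting the target site directly, but reroute
// the traffic through the provider's proxy endpoint (credentials are made up).
const proxyResp = await axios.get('https://example.com', {
    proxy: {
        protocol: 'http',
        host: 'proxy.example-provider.com',
        port: 8000,
        auth: { username: 'USER', password: 'PASS' },
    },
})
console.log(proxyResp.data)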

Scraping Browsers

A scraping browser is a tool for automating web interactions and extracting data from websites. It uses browser engines, like the Chromium engine that powers Chrome, to navigate, interact with, and scrape websites, as well as to handle dynamic content and anti-scraping measures. Libraries like Puppeteer also have AI plugins, which can help you programmatically control a regular browser like Chrome or Firefox to perform sophisticated scraping tasks. An AI-powered scraping browser can mimic human actions – clicks, scrolls, filling out forms – and extract data without being detected by anti-bot measures. This is especially important if you’re aiming to scrape JavaScript-heavy websites with strong anti-scraping measures.

AI-Based Platforms

AI-based scraping platforms have a lot of features that make complex scraping tasks easier to handle. These tools help you write and execute scripts, manage scale, and control how the data is extracted and stored. They usually require a good technical understanding and the ability to write code, but they need less maintenance. They have ready-made functions, built-in parsers, and the ability to adapt to changes and adjust scraping scale and rules. Some AI-based platforms also have visual tools that make scraping accessible and less technically challenging.

Best Practices for AI Web Scraping

Websites don’t like being scraped. They want real humans to browse, engage, and make purchases. Scrapers, on the other hand, create unwanted traffic that can overload the servers and doesn’t bring any revenue. Nevertheless, web data collection isn’t illegal.

There are no laws that would prohibit you from scraping the web, but it’s essential to do it ethically and responsibly. Here are some tips on how to scrape ethically:

  • Respect robots.txt file: in simple terms, robots.txt is a file that websites use to instruct web crawlers and scrapers on what they can and cannot do. It helps websites ensure that some parts of the website remain private (a quick way to check it programmatically is sketched after this list).
  • Respect the Terms of Service: it goes without saying, but you should adhere to the rules given by the website owner. Some ToS might forbid automated data extraction, and you should respect that.
  • Scrape politely: when scraping, try to be as respectful to the website as possible – don’t overload the servers with too many requests, don’t access forbidden information, and respect the rules imposed by ToS and robots.txt files.
  • Respect personal data: scraping someone’s personal information without consent violates privacy laws and raises many ethical concerns. Always comply with personal data protection laws, such as the General Data Protection Regulation (GDPR) or The California Consumer Privacy Act (CCPA). Otherwise, you can hurt your business’s reputation and face legal implications.
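
For reference, here’s a minimal sketch of checking a site’s robots.txt before scraping. It assumes the axios library and uses example.com as a stand-in target – it simply prints the Disallow rules so you know which paths to avoid:

import axios from 'axios'

// Stand-in target; replace with the site you intend to scrape
const site = 'https://example.com'

const { data } = await axios.get(`${site}/robots.txt`)

// Print every Disallow rule so you know which paths to stay away from
data.split('\n')
    .filter(line => line.trim().toLowerCase().startsWith('disallow'))
    .forEach(line => console.log(line.trim()))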

Conclusion

AI and machine learning enhance data scraping by handling dynamic content, recognizing complex patterns, and adapting to structural changes. Intelligent features like CAPTCHA solving, automatic proxy management, and semantic content analysis improve the accuracy, speed, and flexibility of scraping. As a result, the data is more structured, easier to understand, and requires less manual work.

Isabel Rivera
Caffeine-powered sneaker enthusiast

The post What is AI Web Scraping: A Comprehensive Guide appeared first on Proxyway.

]]>
https://proxyway.com/guides/what-is-ai-web-scraping/feed 0
How to Manage Multiple Discord Accounts in 2024 https://proxyway.com/guides/how-to-manage-multiple-discord-accounts https://proxyway.com/guides/how-to-manage-multiple-discord-accounts#respond Wed, 17 Jul 2024 10:47:19 +0000 https://proxyway.com/?post_type=guides&p=24178 Learn the best methods for managing multiple Discord accounts.

The post How to Manage Multiple Discord Accounts in 2024 appeared first on Proxyway.

]]>

Guides

If you ever tried to manage hundreds of Discord accounts, you know the hassle of switching between different profiles. Not to mention the struggles of accessing the profiles via different IPs to avoid getting flagged for suspicious activity or simply being able to access the platform from the target location. Don’t worry, we’re here to help.

Best Ideas for How to Manage Multiple Discord Accounts

Discord is one of the most popular messaging platforms in the world. While primarily used among gamers, it’s quickly becoming a communication tool for businesses and building communities with certain shared interests such as sneaker copping and web scraping.

But with the growing popularity of Discord, the platform now has less leeway for connecting multiple profiles at the same time. So, what if you’re a social media manager who needs to manage Discord accounts for multiple clients? Well, it’s not exactly possible while using one IP address. Additionally, if the platform finds any of your actions unfit, you might get your IP banned, or even worse – all your client accounts blocked.

In this guide, we’ll introduce tools that will help to mask your digital identity – IP address and other information that allows websites to identify you – without compromising your clients’ accounts.

Can You Have Multiple Discord Accounts?

Yes, you can have more than one Discord account – it’s not against the rules. According to the official guidelines, you can have and manage up to five different accounts – all you need is a different email address or a phone number for each one.

Discord also rolled out an Account Switcher feature to help you switch between all your accounts seamlessly. You can access Account Switcher by clicking on your avatar icon, and selecting Switch Accounts. However, this feature only works on the Discord desktop app.

If you need to manage more than five Discord accounts on the same device, you will need some extra tools. We’ll get into it later.

What about Multiple Discord Accounts on Mobile?

You can switch accounts on mobile as well. Unfortunately, the process is more difficult as the Account Switcher is not available on the mobile Discord app. There are two ways you can go about managing multiple Discord accounts on mobile. First, you can manually log out and enter the credentials to access another account. Another option is to use an antidetect browser on your phone. It’s a tool specifically designed for managing multiple accounts. Bear in mind that very few providers offer antidetect browsers on mobile devices, and if they do, you’ll need an Android device.

Why People Create Multiple Discord Accounts

There might be several reasons for managing multiple Discord accounts. Here are the main motives why people choose to have more than one Discord account:

  • Separate personal and professional use. Some need Discord as a communication channel at work, while also using the app for gaming, chatting, or other non-work-related reasons. In this case, multiple Discord accounts help to separate work and personal lives.

  • Manage multiple communities. Maybe you have dozens of accounts because you manage Discord communities for a living or as a hobby. Keeping separate accounts with different digital fingerprints will prevent Discord from banning your IP for misuse, and making all accounts inaccessible. You’ll also be able to share the specific account with stakeholders without compromising other accounts under your control.

  • Privacy concerns. You might have a personal account for connecting with friends but also want to participate in discussions on certain servers without revealing yourself. Having a separate Discord account can help camouflage your real identity or create multiple online personas, as they’re not linked across servers.

  • Testing and development. Developers or server administrators might want to use multiple accounts to create and test bots, roles, or other server features without impacting their main Discord account.

How to Manage Multiple Discord Accounts

If you have more than five Discord accounts that you need to manage, the Account Switcher feature will not be of much use because it will simply not work as intended. Luckily, there are a couple of options that will help you out.

Firstly, it’s crucial to understand that your online activity is tracked to identify you. Websites check your IP address, saved cookies, and digital fingerprint (such as what browser, operating system, language settings you’re using, etc.) to ensure that you’re not a machine trying to do something malicious. 

What’s more, if you need to manage hundreds of accounts at the same time, you won’t be able to do so from a single device and IP address. So, an antidetect browser and proxies are your best bet – you’ll be able to assign a different IP address to each browser account and access all Discord accounts in one environment.

Use an Antidetect Browser

You can manage dozens of Discord accounts without revealing that it’s you behind all of them with an antidetect browser. It lets you create completely different profiles with unique fingerprints, and browse the web as usual from a single device.

While Chrome and other browsers also allow managing multiple profiles, an antidetect browser prevents profile crossover. You can create completely unique environments with separately saved cookies, different perceived browsers and their versions, time zones, locales, and other detectable aspects. Hence, each profile is isolated because the fingerprints (your identifiers) are unlinkable to each other.

With several unique profiles, you can manage more than five Discord accounts, and even have them open on multiple browser windows simultaneously – that’s how social media account managers work with multiple client profiles without getting banned. Also, it’s much easier to set up and manage an antidetect browser than to have multiple devices or virtual operating systems.

Use Proxy Services

You should also use proxies to hide your perceived IP address and location. In a very simplified way, think of browser fingerprints as your defining characteristics – the color of your hair, eyes, or what clothes you wear. An antidetect browser will mask these attributes. However, if all these different people are named John and they all live on Baker Street 221B, it looks suspicious. Proxies will help you get a new name and place to live. Namely, your IP address and geolocation.    

Some providers offer proxies optimized for social media management; they work well with Discord, too. Usually, Discord proxies are residential or mobile since they come from real devices with legitimate browsing histories. Hence, they are unlikely to get banned.

Combining proxy services with an antidetect browser will help you achieve the best results in spoofing your online identity.

Other Methods

Surely, you can get more devices or get someone to help you out with their own device, but it’ll cost a lot. Instead, here’s what you can do:

  • Use Discord’s desktop and mobile apps: Each device gives you the ability to access and manage five accounts. By using both mobile and desktop applications, you can manage up to 10 accounts at once.

  • Get Discord Canary or PTB: Canary (the Alpha version) and PTB (Public Test Build, or Beta) can be installed alongside the main Discord app. This way, you’ll be able to manage up to 15 accounts – five in each version.

  • Try app cloning: both Android and iOS support app cloning, so you can try cloning your Discord app. Unfortunately, this method might carry some security risks.

Discord and Discord Canary apps on Mac

Managing Multiple Accounts on Mobile Devices

Managing your Discord accounts on a mobile device is a bit trickier, but possible. There’s no function to quickly switch between accounts on the mobile app, but there are ways to make it easier – for example, using an antidetect browser with mobile support.

Providers like GoLogin or Kameleo offer applications for Android devices. This way, you’ll be able to create multiple profiles on your mobile device and manage those Discord accounts.

Best Practices for Managing Multiple Discord Accounts

Apart from following the terms of service and specific server rules, you should know a few other recommendations that will help you manage all your accounts.

  • Try to keep one IP per account for as long as possible. Keeping your IP as consistent as possible will ensure that your time on Discord runs smoothly. With the help of proxies and antidetect browsers, create and use separate profiles with unique IPs and fingerprints for each Discord account. Changing your IP too often for one account might lead to account suspension, as it signals suspicious activity to Discord.

  • Don’t use your accounts for black hat use cases. It goes without saying, but creating multiple Discord accounts for scamming, hacking, abuse, and other illegal activities is against Discord’s terms of service. Not only can you get banned from your favorite servers or receive account bans, but there might be legal implications as well.

  • Expect failed attempts. There’s no perfect way to manage multiple accounts, even with the help of an antidetect browser or proxies – your requests to run Discord might fail sometimes. Our suggestion: wait or change your IP address (you’ll find an option in your proxy service or antidetect browser’s dashboard), and try again.

The Bottom Line

While no option is perfect for managing more than five Discord accounts, antidetect browsers and proxies are your best bet. Both are relatively simple to set up and use, and require fewer resources compared to alternatives, like buying multiple devices.


Frequently Asked Questions About Multiple Discord Account Management

If you have 5 or fewer Discord accounts, you can easily switch between them with the Account Switcher feature on the desktop app. If you have more than five accounts, you can use antidetect browsers or proxies to manage multiple accounts.

Isabel Rivera
Caffeine-powered sneaker enthusiast

The post How to Manage Multiple Discord Accounts in 2024 appeared first on Proxyway.

]]>
https://proxyway.com/guides/how-to-manage-multiple-discord-accounts/feed 0
Cheerio Web Scraping: A Tutorial on How to Scrape Static Pages with Node.js  https://proxyway.com/guides/cheerio-web-scraping-a-tutorial https://proxyway.com/guides/cheerio-web-scraping-a-tutorial#respond Thu, 07 Mar 2024 20:08:51 +0000 https://proxyway.com/?post_type=guides&p=20004 This is a step-by-step example using Node.js libraries Cheerio and Axios.

The post Cheerio Web Scraping: A Tutorial on How to Scrape Static Pages with Node.js  appeared first on Proxyway.

]]>

Guides

Interested in gathering data from static web pages? Cheerio and Axios libraries are great for the job. Learn everything you need to start your web scraping project.

Cheerio web scraping

Node.js is a great runtime used in web scraping. It’s mostly known for its powerful headless browser libraries for dynamic websites. But you can also gather data from static web pages and clean it for further use. One very popular Node.js library is Cheerio.

In this guide, you’ll learn what web scraping with Cheerio is and what advantages it can bring to your web scraping project. We’ll also explain what other tools you need to successfully gather data from a static web page and how to build a web scraper with Cheerio and Axios. All the information will be wrapped in a real-life example, so you can practice your skills.

What is Web Scraping with Cheerio?

Web scraping with Cheerio involves several steps.

Cheerio isn’t a standalone library: it can only parse, or structure web content. You’ll first need to pick an HTTP client like Axios that will fetch the HTML from your target web page.

Once you have the HTML content, you can load it into Cheerio. Then, the library will let you use jQuery syntax to traverse and manipulate the HTML structure. Cheerio allows selecting page elements and extracting the data using CSS selectors. This can be text, links, images, or any other information. You can also filter elements based on attributes or values.

Some websites require you to handle pagination or scrape data nested in JavaScript. Cheerio can’t execute JavaScript; if this is the case, you’ll need a headless browser library like Playwright or Puppeteer.

Advantages of Web Scraping with Cheerio

If you’re familiar with jQuery, web scraping with Cheerio is pretty simple because it has a similar syntax. 

Cheerio is a pretty flexible tool. It’s built over parse5 for parsing HTML and htmlparser2 for XML documents. It combines their features; for example, parse5 ensures compliance with HTML standards. In essence, it can parse any HTML and XML document. What’s more, the library allows you to traverse the DOM structure, so you can select and manipulate specific elements. 

It also supports a range of CSS selectors, so you can easily pinpoint elements within the document.

Like most parsing libraries, Cheerio is light on resources. The tool runs on the server side and doesn’t require a full browser environment, so it can work through even large, complex pages without much overhead.

Web Scraping Static Pages Using Node.js Libraries Axios and Cheerio

In this step-by-step tutorial, we’ll scrape a list of books – their title, price, rating, stock, and URL – from books.toscrape.com. Even though Node.js has a built-in HTTP module, it isn’t easy to use, so few people choose it for fetching data. For this reason, we’ll be using Axios to fetch the page.

The main page of books.toscrape.com

Prerequisites

  • Node.js. Make sure you have the latest Node.js version installed on your system. You can get it from the official website.
  • Axios. You can add it by running npm install axios in your operating system’s terminal.
  • Cheerio. You can add it by running npm install cheerio.

Importing the Libraries

Step 1. First, let’s import the necessary libraries

1) Import Node.js HTTP client axios.

				
					import axios from 'axios'

				
			

2) Import Node.js parser Cheerio.

				
					import { load } from 'cheerio'

				
			

3) Import the built-in Node.js file system module for writing results into a CSV file.

				
					import fs from 'fs'

				
			

Downloading the Page

Step 1. Let’s download the target page.

				
					const start_url = "http://books.toscrape.com/"

				
			

Step 2. Create a list to store the data.

				
					const books_list = []

				
			

Step 3. Define the scrape() function.

				
					async function scrape(url) {

				
			

Step 4. Make an HTTP request and wait for the response.

				
					let resp = await axios.get(url)

				
			

Step 5. Extract HTML from the response.

				
					let resp_html = resp.data

				
			

Extracting the Data Points

Step 1.  Load the HTML into the Cheerio $ object.

				
					const $ = load(resp_html)

				
			

Step 2. Pass the Cheerio instance to the parse() function.

				
					parse($)
				
			

Step 3. Find the next page selector and the href attribute to scrape the next page.

				
try {
    let next_href = $('.next > a').attr("href")
    // In case the '/catalogue/' part of the URL is not found within
    // the href attribute value, add it to the href
    if (!next_href.includes('catalogue')){
        next_href = `catalogue/${next_href}`
    }
				
			

Step 4. Format the absolute URL of the next page we’re going to scrape.

				
					let next_url = start_url + next_href
console.log('Scrape: ' + next_url)
				
			

Step 5. Call the scrape() function again and pass it the next page’s URL. If there’s no next page, the catch block (shown in the full code at the end) simply ends the job.

				
await scrape(next_url)
				
			

Parsing the HTML

Step 1. Define the parsing function.

				
					function parse($){

				
			

Step 2. Now, we need to figure out where the data points are located. Let’s scrape four elements: the book title, price, rating, and availability. Right-click anywhere on the page and press Inspect. You can see that they’re all under a class called product_pod:

Inspecting the product_pod element.

Step 3. We can extract the whole class:

				
					$('.product_pod').map((i, element) => {

				
			
The printed product_pod content.

But the data you get will be messy, so let’s be more specific.

1) Extract the book title by finding the H3 tag within the element.

				
					const book_title = $(element).find('h3').text()
				
			

2) Then, extract the book price by getting rid of the pound sign.

				
					const book_price = $(element).find('.price_color').text().replace('£', '')
				
			

3) Now, get the book rating from the p tag with the classes star-rating and Num, where Num is the book’s rating. This part is a bit more complicated because the rating is stored in a class name that includes two words, and we only need one.

So, you first need to find the element with that class and get the value of the class attribute, which returns a string (for example, "star-rating One"). Then, split this string into a list of words using spaces as separators and grab the second word.

				
					const book_rating = $(element).find('p.star-rating').attr("class")
.split(' ')[1]
				
			

4) Extract book stock information by finding the element by the instock class and trim unnecessary whitespaces.

				
					const book_stock = $(element).find('.instock').text().trim()
				
			

5) Get the book URL by finding the a tag within the product_pod element and getting its href attribute which you’ll need to append to the start_url.

				
					const book_url = start_url + $(element).find('a').attr("href")
				
			

Step 4. Now, let’s append our data points to the list:

				
					books_list.push({
    "title": book_title,
    "price": book_price,
    "rating": book_rating,
    "stock": book_stock,
    "url": book_url
})
				
			

Step 5. End the iteration.

				
					})
    //console.log(books)
}
				
			

Saving the Output to a CSV File

Step 1. Now, let’s structure all our data by writing it to a CSV file.

				
					function write_to_csv(){
				
			

Step 2. Get the keys from the first books_list entry – these will become the header line of the CSV file.

				
					var csv = Object.keys(books_list[0]).join(', ') + '\n'
				
			

Step 3. Iterate through each book dictionary element.

				
					books_list.forEach(function(book) {
				
			

Step 4. Add a new line to the csv variable with the line break at the end.

				
					csv += `"${book['title']}", ${book['price']}, ${book['rating']}, ${book['stock']}, ${book['url']},\n`
})
//console.log(csv)
				
			

Step 5. Write the output to a CSV file.

				
					fs.writeFile('output.csv', csv, (err) => {
        if (err)
            console.log(err)
        else {
            console.log("Output written successfully")
        }
    })
}
				
			

Step 6. Then, pass the starting URL to the scrape() function and await it so that all of the scrapes finish before we move on to writing the output.

				
					await scrape(start_url)
				
			

Step 7. Call the function to write the output.

				
					write_to_csv()
				
			
The resulting output.csv file.
Here’s the full code:
				
					import axios from 'axios'
import { load } from 'cheerio'

// For writing into the output file
import fs from 'fs'

const start_url = "http://books.toscrape.com/"

const books_list = []

// Function for downloading the page; calls itself again for each next page
async function scrape(url) {
    // Requesting the page with the help of Axios and waiting for the response
    let resp = await axios.get(url)
    let resp_html = resp.data
    // Loading the html into Cheerio. $ - Cheerio object
    const $ = load(resp_html)
    // Passing the Cheerio instance to the parse() function
    parse($)

    try {
        // Try finding the  next page selector and
        // extract the href attribute for scraping the next page
        let next_href = $('.next > a').attr("href")
        // In case the '/catalogue/' part of the URL is not found within
        // the href attribute value, add it to the href
        if (!next_href.includes('catalogue')){
            next_href = `catalogue/${next_href}`
        }
        // Formatting the absolute URL of the next page we are going to scrape
        let next_url = start_url + next_href
        console.log('Scrape: ' + next_url)
        // Calling the scrape() function again and passing it the URL
        await scrape(next_url)

    } catch {
        // Next page selector not found, end job
        return
    }

}

// Function for parsing the html of the page.
function parse($){
    // The selector for each distinct book element on the page is an article
    // tag with the class of "product_pod". This line finds all such elements
    // and begins iterating through them.
    $('.product_pod').map((i, element) => {
        // To get the title, we find the h3 tag within the element and
        // extract its text.
        const book_title = $(element).find('h3').text()
        // Price is also simple, we just get rid of the pound sign
        const book_price = $(element).find('.price_color').text().replace('£', '')
        // The book ratings are easily scraped from the p tag with the classes
        // "star rating" and "Num" where "Num" is the rating the book has
        // received. To extract the rating, we first find the element with that
        // class, get the value of the "class" attribute which returns a string:
        // e.g. "star-rating One", split that string by whitespaces and assign
        // the second element of the resulting list to our variable.
        const book_rating = $(element).find('p.star-rating').attr("class")
            .split(' ')[1]
        // Simply finding the element by the "instock" class, extracting the
        // text and trimming the resulting string to strip away unnecessary
        // whitespaces.
        const book_stock = $(element).find('.instock').text().trim()
        // To extract the url of the book, we find the a tag within the
        // product_pod element and get its "href" attribute which we append to
        // the start_url
        const book_url = start_url + $(element).find('a').attr("href")
        // Appending the results dictionary to the books_list
        books_list.push({
            "title": book_title,
            "price": book_price,
            "rating": book_rating,
            "stock": book_stock,
            "url": book_url
        })
    })
    //console.log(books)
}

function write_to_csv(){
    // Getting the keys from the books object, this will become the first line of the csv file
    var csv = Object.keys(books_list[0]).join(', ') + '\n'
    // Iterating through each book dictionary element
    books_list.forEach(function(book) {
        // Adding a new line to the csv variable with the line break at the end
        csv += `"${book['title']}", ${book['price']}, ${book['rating']}, ${book['stock']}, ${book['url']},\n`
    })
    //console.log(csv)
    // Writing the output to a output.csv file
    fs.writeFile('output.csv', csv, (err) => {
        if (err)
            console.log(err)
        else {
            console.log("Output written successfully")
        }
    })
}

// Script starts here, we pass the URL we are going to start our scrape on to
// the scrape function and tell node for it to be awaited so that all of the
// scrapes finish before we move on to writing the output
await scrape(start_url)
// Calling the function to write the output
write_to_csv()
				
			

Web Scraping Dynamic Pages Using Node.js and Puppeteer

If you want to scrape dynamic pages, we have also prepared a step-by-step tutorial using a headless Node.js browser library, Puppeteer. The guide will walk you through extracting data from quotes.toscrape.com – specifically the quote text, author, and tag data points. We’ll also show you how to deal with delayed rendering, and where to integrate proxies in your script.
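
As a quick preview of the proxy part, here’s a minimal sketch of launching Puppeteer through a proxy. The proxy address and credentials below are placeholders, not a real endpoint:

import puppeteer from 'puppeteer'

// Placeholder proxy details – swap in your provider's host, port, and credentials
const browser = await puppeteer.launch({
    args: ['--proxy-server=http://proxy.example.com:8000'],
})
const page = await browser.newPage()
await page.authenticate({ username: 'USER', password: 'PASS' })

await page.goto('https://quotes.toscrape.com/js/')
console.log(await page.title())
await browser.close()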

This is a step-by-step guide to web scraping using the Node.js library Puppeteer.

The post Cheerio Web Scraping: A Tutorial on How to Scrape Static Pages with Node.js  appeared first on Proxyway.

]]>
https://proxyway.com/guides/cheerio-web-scraping-a-tutorial/feed 0
Cheerio vs Puppeteer for Web Scraping https://proxyway.com/guides/cheerio-vs-puppeteer-for-web-scraping https://proxyway.com/guides/cheerio-vs-puppeteer-for-web-scraping#respond Tue, 05 Mar 2024 08:21:16 +0000 https://proxyway.com/?post_type=guides&p=19884 Let’s see which Node.js library is better for your project.

The post Cheerio vs Puppeteer for Web Scraping appeared first on Proxyway.

]]>

Guides

Let’s see which Node.js library is better for your project. 

Cheerio vs Puppeteer

There is a plethora of tools to choose from when web scraping. The choice, by and large, depends on the target website: do you want to scrape a static page or one with elements like endless scrolling? Cheerio and Puppeteer are both Node.js libraries, yet they’re used for different purposes. Learn which tool better suits your needs with this guide.

What is Cheerio?

Cheerio is a data parsing library used to extract data from HTML or XML code. It isn’t a standalone tool – you first need to download the information with an HTTP client such as axios and only then load the HTML into Cheerio, find elements, and scrape the content.

The tool processes the HTML data and creates a structured document known as a DOM tree. This tree mirrors the HTML structure with every element, attribute, and text within the document.

Cheerio isn’t a web browser; it works in a server environment. The tool isn’t able to interpret results as a browser would, so it can’t deal with dynamic elements nested in JavaScript. 

What is Puppeteer?

Puppeteer is a library that allows you to control a headless browser programmatically. The tool was developed by Google in 2018.

The library automates browser interactions like moving the mouse, imitating keyboard inputs, filling out forms, taking screenshots, and other actions. Puppeteer first launches the browser, then opens a page, navigates to the URLs, interacts with the page, and extracts the data.

Until recently, Puppeteer only worked with Chrome and Chromium, but it’s now added experimental support for Firefox and Microsoft Edge.
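
To give you an idea of that flow, here’s a minimal sketch of Puppeteer in action – the target URL and selector are placeholders for illustration:

import puppeteer from 'puppeteer'

// Placeholder target and selector – adjust to the page you're automating
const browser = await puppeteer.launch()          // headless by default
const page = await browser.newPage()
await page.goto('https://example.com')

// Wait for the element to render, then pull its text out of the live DOM
await page.waitForSelector('h1')
const heading = await page.$eval('h1', el => el.textContent.trim())
console.log(heading)

await browser.close()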

Cheerio vs Puppeteer: A Comparison Table

                       Cheerio                     Puppeteer
Primary purpose        Web scraping (parsing)      Browser automation and testing
Difficulty setting up  Easy                        Difficult
Learning curve         Easy                        Difficult
JavaScript rendering   No                          Yes
Performance            Fast                        Slower
Proxy integration      Yes                         Yes
Community              Small                       Large
Best for               Parsing static web pages    Loading and parsing dynamic web pages

Cheerio vs Puppeteer for Web Scraping

Popularity

According to npmtrends.com data, Cheerio has always had more monthly downloads than Puppeteer. 

Monthly npm downloads: Cheerio vs Puppeteer (npmtrends.com).

GitHub data, on the other hand, shows otherwise – Puppeteer has more forks and stars. So, we could say that both Cheerio and Puppeteer are very popular tools in the web scraping community.

Cheerio: 1.7k forks, 27.5k stars

Puppeteer: 9.2k forks, 86.1k stars

Prerequisites and Installation

The installation process of Cheerio and Puppeteer is pretty much the same because both tools work with Node.js. You have to install the latest Node.js version from the official website.

To install the Node Package Manager (npm) package, simply open a terminal or command prompt and enter the following command:

If you’re using Puppeteer:

				
					npm install puppeteer

				
			

If you’re using Cheerio:

				
					npm install cheerio

				
			

Since Puppeteer can run on its own, you don’t need to install other dependencies. For Cheerio, you also need to install an HTTP client such as axios (or any other library you prefer):

				
					npm install axios

				
			

Features

Cheerio. It is built over parse5 for parsing HTML and htmlparser2 for XML documents. This makes the tool flexible, with robust parsing capabilities.

Cheerio allows you to use jQuery syntax to select and manipulate data within HTML documents. So, if you’re familiar with jQuery, using Cheerio is pretty simple.

The parser is faster than Puppeteer because it runs on the server side and uses less computing power.

Puppeteer. The library gives you absolute control over the browser. It runs in headless mode by default, but you can switch to headful. This will show you a visible browser user interface, so you can troubleshoot more easily while scraping.

A great feature of Puppeteer is that it offers plugins like puppeteer-extra-plugin-stealth for spoofing browser fingerprints. It reduces minute variations between a headless and a real browser. What’s more, since Puppeteer controls a browser, it can mimic user actions while on the page.

The library locates information by executing JavaScript. This allows Puppeteer to dynamically interact with web pages and locate elements in real time.

Performance

Cheerio. The tool is much faster than Puppeteer. The reason for that is that Cheerio doesn’t need to execute JavaScript – it simply uses a DOM parser and traverses raw HTML data. Additionally, Cheerio’s lightweight nature makes it ideal for simple scraping tasks where you need speed.

Puppeteer. Compared to other headless browser libraries like Selenium, Puppeteer is much faster because it uses less overhead. But Puppeteer deals with dynamic websites, which require more resources than static pages, so the library is slower than Cheerio.

Ease of Use

Cheerio. Web scraping newbies will find Cheerio much easier to use than Puppeteer. And predictably so, since Cheerio can only handle the HTML code; you won’t need to interact with pages based on dynamic elements such as lazy loading. What’s more, unlike Puppeteer, Cheerio was developed with web scraping in mind, so it has a less complex syntax.

Puppeteer. Using a headless browser library is more complicated overall. The HTML of modern websites changes a lot, so you need to maintain the scraper. Puppeteer uses a wait function (waitForSelector) to pause until the elements appear on the page. If the element doesn’t show up within the default 30 seconds, Puppeteer throws an error. So, poorly set delays or timeouts can hinder your web scraping efforts.
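
For example, here’s a small sketch of adjusting that wait – the URL and selector are placeholders:

import puppeteer from 'puppeteer'

const browser = await puppeteer.launch()
const page = await browser.newPage()
await page.goto('https://example.com')

try {
    // Override the default 30-second limit with a 10-second one
    await page.waitForSelector('.results', { timeout: 10_000 })
} catch (err) {
    console.log('Element did not appear in time:', err.message)
}

await browser.close()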

Community Support and Documentation

Cheerio. The tool has mediocre documentation. According to the creators, the website is still a work in progress. It covers basic and advanced tutorials but they aren’t as thorough as Puppeteer’s documentation. What’s more, the creators say that most guides were written with ChatGPT, which raises some questions about the quality of the content.

You can find the Cheerio community on GitHub and StackOverflow. Compared to Puppeteer, Cheerio has fewer questions on StackOverflow, but they are more often viewed. The community is helpful and responsive, making it easier to find answers to common issues.

Puppeteer. The library has a larger community than Cheerio. You’ll find thousands of discussions on StackOverflow.

Puppeteer has great documentation that includes examples, detailed explanations, and best practices of the tool. So, if you’re a first-time user, you should get the hang of the tool fast enough. Moreover, the official GitHub repository is actively maintained, with regular updates and bug fixes.

Alternatives to Cheerio and Puppeteer

If you want to try another parser, you could go with Python’s Beautiful Soup. It’s a popular option for beginners and developers. The tool is easy to use and can wrap several powerful HTML parsers, including Python’s built-in one. Besides, it has the ability to work with broken HTML pages.

A step-by-step guide to Beautiful Soup web scraping with a real-life example.

In terms of scraping dynamic websites, there are other tools that can handle JavaScript-rendered content. For example, you can use Playwright, which is very similar to Puppeteer – light on resources and fast. Or use Selenium – a tool that’s been out there for two decades. 

Can't decide between Playwright vs Selenium? We're here to help.

The post Cheerio vs Puppeteer for Web Scraping appeared first on Proxyway.

]]>
https://proxyway.com/guides/cheerio-vs-puppeteer-for-web-scraping/feed 0
Playwright Web Scraping: A Step-by-Step Tutorial with Node.js https://proxyway.com/guides/playwright-web-scraping-a-step-by-step-tutorial https://proxyway.com/guides/playwright-web-scraping-a-step-by-step-tutorial#respond Mon, 05 Feb 2024 09:58:00 +0000 https://proxyway.com/?post_type=guides&p=18685 Practice web scraping JavaScript-based websites with our step-by-step tutorial.

The post Playwright Web Scraping: A Step-by-Step Tutorial with Node.js appeared first on Proxyway.

]]>

Guides

Want to practise your skills with JavaScript-based websites? We’ll show you how. 

Playwright web scraping

Gathering data from websites that rely on dynamic elements or elaborate anti-bot systems isn’t an easy task. Before extracting the information, you need to render the entire page like a real user. A headless browser library, Playwright, is just right for the job. 

In this step-by-step guide, you’ll learn why Playwright is such a popular Node.js library used for web scraping JavaScript-rendered websites. You’ll also be able to practice your skills with a real-life example.

What is Web Scraping with Playwright?

Playwright's home page

Web scraping with Playwright is the process of gathering data from JavaScript-rendered websites. The tool works by programmatically controlling a headless browser – a browser without a user interface, such as a tab bar – so it doesn’t need to load visual elements. This saves a lot of resources when web scraping.

Playwright is a relatively new library, developed by Microsoft and first released in 2020. It’s used to automate actions on different browsers: emulate scrolling, clicking, downloading – all the actions you could do with a mouse. Playwright gives you full control over the browser in both headless and headful modes. And most importantly, it’s able to render JavaScript, which regular HTTP libraries can’t do. This makes Playwright a powerful tool for scraping dynamic content from modern websites.

Why Use Playwright for Web Scraping?

Playwright is used in web scraping for several reasons:

  • Cross-browser support. The library can emulate Chromium, Firefox, and WebKit (see the launch sketch after this list).
  • Cross-language support. Playwright supports JavaScript, Python, Java, TypeScript, and .NET.
  • Use it with any operating system. You can use Playwright with Windows, Linux, or macOS.
  • Supports asynchronous and synchronous approaches. Playwright is asynchronous by default; it allows you to make concurrent requests and scrape multiple pages in parallel. Or, you can make one request at a time to reduce complexity.
  • Good performer. The library uses a WebSocket connection that stays open while scraping. So you can send multiple requests in one go. This greatly improves the performance. 
  • Great for spoofing browser fingerprints. Playwright has packages like the playwright-extra for preventing bot detection. 
  • Good documentation. Even though Playwright is a newbie in the web scraping world, it has extensive documentation with many examples.
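
As a quick sketch of the cross-browser point, switching engines is a one-line change, and a proxy can be passed straight to launch(). The proxy details below are placeholders you’d replace with your own:

import playwright from 'playwright'

// Pick any of the three bundled engines: chromium, firefox, or webkit
const browser = await playwright.firefox.launch({
    headless: true,
    // Optional placeholder proxy – remove or replace with your provider's details
    proxy: { server: 'http://proxy.example.com:8000', username: 'USER', password: 'PASS' },
})

const page = await browser.newPage()
await page.goto('https://example.com')
console.log(await page.title())
await browser.close()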

Node.js and Playwright Web Scraping: A Step-by-Step Tutorial

This is a step-by-step guide with a real-life example. We’ll be gathering data with Playwright from a web scraping sandbox – scrapethissite.com.

In this tutorial, we’ll be scraping the film year, title, nominations, awards, and if a film has won the Best Picture awards. You’ll practice some web scraping skills like:

  • scraping a single page;
  • handling multiple pages;
  • waiting for elements to load;
  • loading dynamic content by clicking on buttons;
  • scraping tables;
  • handling errors;
  • writing the output to .json format.
The elements we’ll be scraping.

Prerequisites

To start using Playwright with Node.js, you need to:

  • Node.js. Make sure you have the latest Node.js version installed on your system. You can get it from the official website.
  • Playwright. You can add it by running npm install playwright in your operating system’s terminal. If the browser binaries aren’t downloaded automatically, run npx playwright install.

Importing the Libraries

Step 1. Import the necessary libraries: Playwright and the built-in file system module, which will let you write the output to a file later.

				
					import playwright from 'playwright'
import fs from 'fs'
				
			

Step 2. Then, enter the URL you want to scrape and create films_list that will hold the output.

				
					const url = 'https://www.scrapethissite.com/pages/ajax-javascript/'

var films_list = []
				
			

Preparing to Scrape

Step 1. In this web scraping example, we’ll be using the Chromium browser. To see what the browser is doing, we’ll use the headful mode by setting headless to false.

				
					async function prepare_browser() {
    const browser = await playwright.chromium.launch({
        headless: false, 
    })
    return browser
}

				
			

Step 2. Now, let’s write the main() function to create the browser. Then, we’ll use the created browser context to open a new page, and pass it to a function called get_page(). It will begin the scraping.

				
					async function main() {
    const browser = await prepare_browser() 
    const context = await browser.newContext()
    const page = await context.newPage()
    await get_page(page, url)
    await browser.close()
}

main()

				
			

Scraping a Single and Multiple Pages

Gathering Data from a Single Page

Step 1.  Now, let’s scrape one page: 

  • page.goto() tells the browser to go to the URL. 
  • To find and click the year buttons on the page, we need to create a year_btn_selector variable that contains the CSS selector of these elements. This will allow you to load the content. Inspect the page source by right-clicking anywhere on the page and selecting “Inspect”. 
The CSS selector of the year buttons.
				
					async function get_page(page, url) {
    await page.goto(url)
    const year_btn_selector = '.year-link'

				
			

Step 2. Then, tell Playwright to wait until at least one button element appears on the screen. Let’s set a timeout of 20 seconds. Once a single element has appeared, we can assume that the page has loaded and start scraping. If the wait times out, you can implement additional reload or retry functionality to address the issue (a minimal retry sketch follows the wait call in the snippet below).

				
					await page.locator(year_btn_selector).first().waitFor({'timeout': 20_000})
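// A possible extension (hypothetical helper, not part of the final tutorial code):
// retry the wait a couple of times and reload the page in between before giving up.
async function wait_with_retry(page, selector, retries = 2) {
    for (let attempt = 0; attempt <= retries; attempt++) {
        try {
            await page.locator(selector).first().waitFor({'timeout': 20_000})
            return
        } catch (err) {
            if (attempt === retries) throw err
            await page.reload()
        }
    }
}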
				
			

Gathering Data from Multiple Pages

Step 3.  Now, we can iterate through all the buttons and get their content: 

  • We’ll use the same variable year_btns.  It stores button information that we got using a CSS selector. Then, we’ll iterate through buttons to get their content.

To gather data from each table for every year (2010-2015), we’ll create a function called scrape_table. It’ll take the browser “page” and the “year” variables as parameters. In this context, the year variable corresponds to the button that represents a specific year.

				
					for (let year of year_btns) {
        await scrape_table(page, year)
    }
				
			

The whole function:

				
					async function get_page(page, url) {
    await page.goto(url)
    const year_btn_selector = '.year-link'
    await page.locator(year_btn_selector).first().waitFor({'timeout': 20_000})
    let year_btns = await page.locator(year_btn_selector).all()
    for (let year of year_btns) {
        await scrape_table(page, year)
    }
}

				
			

Scraping the Table

Step 1. First, get the year text (for example, “2015”) from the button and save it in the year_text variable.

Then, tell Playwright to click on the year button and wait until content appears. This time we’re locating a table.

				
					async function scrape_table(page, year){
    let year_text = await year.textContent()
    await year.click()
    const table_selector = 'table.table'
    await page.locator(table_selector).waitFor({'timeout': 20_000})
				
			
Inspecting the table to see the CSS selectors.

Step 2. When the table appears, we can scrape all the rows. The table row (<tr>) has the class of “film”, so we need to select it.

				
					let table_rows = await page.locator('.film').all()

				
			

Getting the Data

Step 1. Now, let’s iterate through the table rows and get movie information using the CSS selectors for each column. The information is stored in a film_info dictionary, including the year of release.

				
					for (let row of table_rows) {
        let film_info = {
            'film-year': year_text,
            'film-title': await row.locator('.film-title').textContent(),
            'film-nominations': await row.locator('.film-nominations').textContent(),
            'film-awards': await row.locator('.film-awards').textContent(),
        }
				
			

Step 2. Then, check if the icon (<i>) element is present in that row. If it is, the film has won the Best Picture award. So, we need to add another key, film_best_picture, to the dictionary and assign it true; otherwise, false.

				
					        if (await row.locator('i').count()>0){
            film_info['film-best-picture'] = true
        } else {
            film_info['film-best-picture'] = false
        }
				
			
The Best Picture icon.

Step 3. After that, print out the result in the console to see the output. The film_info dictionary gets pushed into the films_list list (the one we created at the very beginning).

				
					        console.log(film_info)
        films_list.push(film_info)
    }
				
			

The whole function:

				
					async function scrape_table(page, year){
    let year_text = await year.textContent()
    await year.click()
    const table_selector = 'table.table'
    await page.locator(table_selector).waitFor({'timeout': 20_000})
    let table_rows = await page.locator('.film').all()
    for (let row of table_rows) {
        let film_info = {
            'film-year': year_text,
            'film-title': await row.locator('.film-title').textContent(),
            'film-nominations': await row.locator('.film-nominations').textContent(),
            'film-awards': await row.locator('.film-awards').textContent(),
        }
        if (await row.locator('i').count()>0){
            film_info['film-best-picture'] = true
        } else {
            film_info['film-best-picture'] = false
        }
        console.log(film_info)
        films_list.push(film_info)
    }
}

				
			

JSON Output

Now that the scraping part is done, we can print out the list of objects (films_list) in a structured .json format.

The write_output() function will handle the task of writing the scraped data to a .json file. Here’s how it works:

				
					function write_output() {
    fs.writeFile('output.json', JSON.stringify(films_list, null, 2), (err) => {
        if (err) {
            console.log(err)
        } else {
            console.log("Output written successfully")
        }
    })
}

				
			
  • JSON.stringify(films_list, null, 2): converts films_list into a JSON-formatted string with an indentation of 2 spaces for better readability.
  • fs.writeFile('output.json', … ): writes the JSON-formatted string to a file named output.json.
  • (err) => { … }: a callback function that runs once the write finishes and receives any error that occurred.
  • console.log(err): logs the error if the write fails.
  • console.log("Output written successfully"): confirms the write completed without errors.
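If you prefer promises over callbacks, a roughly equivalent version is sketched below (assuming a reasonably recent Node.js where fs/promises is available; note the different import):

import { writeFile } from 'fs/promises'

async function write_output() {
    try {
        // Serialize the scraped films and write them to output.json
        await writeFile('output.json', JSON.stringify(films_list, null, 2))
        console.log("Output written successfully")
    } catch (err) {
        console.log(err)
    }
}

If you go this route, call it with await write_output() inside main().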

Full main function:

				
					async function main() {
    const browser = await prepare_browser() 
    const context = await browser.newContext()
    const page = await context.newPage()
    await get_page(page, url)
    await browser.close()
    write_output()
}

				
			
A fragment of the .json output.
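One optional tweak, not part of the tutorial itself: if a selector times out, main() above will throw and leave the browser open. Wrapping the scraping steps in try/finally guarantees a clean shutdown — a minimal sketch:

async function main() {
    const browser = await prepare_browser()
    try {
        const context = await browser.newContext()
        const page = await context.newPage()
        await get_page(page, url)
    } finally {
        // Runs whether scraping succeeded or threw an error
        await browser.close()
    }
    write_output()
}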
Here’s the full code:
				
					import playwright from 'playwright'
import fs from 'fs'

const url = 'https://www.scrapethissite.com/pages/ajax-javascript/'

var films_list = []

async function prepare_browser() {
    // Launch a visible Chromium window; switch headless to true to run in the background
    const browser = await playwright.chromium.launch({
        headless: false, 
    })
    return browser
}

async function scrape_table(page, year){
    // Read the year from the button, click it, and wait for the AJAX-loaded table
    let year_text = await year.textContent()
    await year.click()
    const table_selector = 'table.table'
    await page.locator(table_selector).waitFor({'timeout': 20_000})
    let table_rows = await page.locator('.film').all()
    for (let row of table_rows) {
        // Collect the column values for each film row
        let film_info = {
            'film-year': year_text,
            'film-title': await row.locator('.film-title').textContent(),
            'film-nominations': await row.locator('.film-nominations').textContent(),
            'film-awards': await row.locator('.film-awards').textContent(),
        }
        // The <i> icon only appears when the film won Best Picture
        if (await row.locator('i').count()>0){
            film_info['film-best-picture'] = true
        } else {
            film_info['film-best-picture'] = false
        }
        console.log(film_info)
        films_list.push(film_info)
    }
}

async function get_page(page, url) {
    // Open the page, wait for the year buttons to render, then scrape each year
    await page.goto(url)
    const year_btn_selector = '.year-link'
    await page.locator(year_btn_selector).first().waitFor({'timeout': 20_000})
    let year_btns = await page.locator(year_btn_selector).all()
    for (let year of year_btns) {
        await scrape_table(page, year)
    }
}

function write_output() {
    // Serialize the collected films and save them to output.json
    fs.writeFile('output.json', JSON.stringify(films_list, null, 2), (err) => {
        if (err) {
            console.log(err)
        } else {
            console.log("Output written successfully")
        }
    })
}

async function main() {
    const browser = await prepare_browser()
    const context = await browser.newContext()
    const page = await context.newPage()
    await get_page(page, url)
    await browser.close()
    write_output()
}

main()
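A quick note on running it: the script assumes the playwright package is installed (npm install playwright) and that a Chromium build has been downloaded (npx playwright install chromium). Because it uses import statements, Node.js needs to treat it as an ES module — add "type": "module" to your package.json or give the file an .mjs extension — and then launch it with node, for example node scraper.js (the file name here is just an example).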

				
			

Alternatives to Playwright

While Playwright is a powerful library for scraping dynamic elements, it has some tough competition. 

Puppeteer is another great tool for JavaScript-rendered websites. It’s backed by Google and has a larger, more active community.

If you’re torn between the two, we compare these Node.js headless browser libraries side by side so you can choose the better fit for web scraping.
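To give a feel for the syntax, here’s a rough Puppeteer sketch of this tutorial’s first step — loading the page and reading the year buttons. It’s an illustration rather than a drop-in replacement, but the selectors are the same ones used above:

import puppeteer from 'puppeteer'

const url = 'https://www.scrapethissite.com/pages/ajax-javascript/'

async function main() {
    // Launch a visible browser window, same as the Playwright example
    const browser = await puppeteer.launch({ headless: false })
    const page = await browser.newPage()
    await page.goto(url)
    // Wait for the year buttons, then read their text content
    await page.waitForSelector('.year-link', { timeout: 20000 })
    const years = await page.$$eval('.year-link', els => els.map(el => el.textContent))
    console.log(years)
    await browser.close()
}

main()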

Selenium is a veteran in the industry: it supports more programming languages, has a larger community for support, and works with practically any browser you can think of.

Look at how the two popular headless browser libraries compare next to each other.

If you want to try scraping static pages with Node.js, we’d recommend pairing axios with Cheerio. The former is one of the most popular Node.js HTTP clients and will fetch the page for you; the latter is a fast parser for extracting data from the downloaded HTML.

A step-by-step guide to web scraping with Node.js: two examples.
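Here’s a minimal sketch of how the two fit together (example.com stands in as a placeholder URL — swap in the static site you actually want to scrape, and verify the selectors in your browser’s dev tools):

import axios from 'axios'
import * as cheerio from 'cheerio'

async function scrape_static() {
    // Fetch the raw HTML with axios (example.com is just a placeholder)
    const response = await axios.get('https://example.com')
    // Load it into Cheerio and query it with the usual CSS selectors
    const $ = cheerio.load(response.data)
    console.log($('h1').text())
}

scrape_static()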

Isabel Rivera
Caffeine-powered sneaker enthusiast

The post Playwright Web Scraping: A Step-by-Step Tutorial with Node.js appeared first on Proxyway.
