How to Bypass CAPTCHAs When Web Scraping
No more pictures of traffic lights, please.
Unless you’re scraping tiny websites in the middle of Internet-nowhere, you’ve probably encountered a CAPTCHA. It’s one of the main ways domains try to protect themselves, popular for its effectiveness and simple implementation. CAPTCHAs make your spider go, “huh?” and clog up your data collection pipeline worse than a holiday turd. But that doesn’t mean there’s nothing you can do about them.
This article will teach you how to bypass CAPTCHAs or mitigate them using multiple methods. It includes general information about CAPTCHAs that you might find useful, such as what triggers a CAPTCHA challenge or what challenges you can expect. If that’s not relevant to you, feel free to skip to the parts that are.
- What Is CAPTCHA?
- What Is the Purpose of CAPTCHA?
- How do CAPTCHAs Work?
- What Triggers a CAPTCHA?
- The Main Types of CAPTCHA Challenges
- The Most Popular CAPTCHA Systems
- How to Bypass CAPTCHA
What Is CAPTCHA?
CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart. If you don’t know what a Turing test is, well, the acronym explains that too: it’s a test to determine whether the entity you’re interacting with is a computer or a human. In other words, whether that girl you’re trying to hook up with on Tinder is really a person or just an elaborate chatbot that’ll try to shill an expensive webcam site.
What Is the Purpose of CAPTCHA?
The main purpose of CAPTCHA tests is to separate human traffic from bots (yes, web scrapers are bots). They do so by presenting various challenges to website visitors. The challenges are designed to be easily solvable by humans but very hard for computers to crack. CAPTCHAs allow website administrators to curb unwelcome automated activities, such as spam, DDoS attacks, and sometimes web scraping.
CAPTCHAs also have secondary purposes. Originally, they helped to digitize badly scanned text passages that optical character recognition (OCR) technologies couldn’t crack. Nowadays, we provide free labor for Google’s machine learning algorithms by labeling objects in images. Talk about a noble cause.
How do CAPTCHAs Work?
CAPTCHAs function as a final test to determine if a website’s visitor is human or bot. They appear when a website detects unusual traffic; then they present the visitor with a challenge.
What Triggers a CAPTCHA?
The exact configuration of a CAPTCHA depends on the webmaster: it can protect the whole website or only specific pages. Sometimes, a page will always throw up a CAPTCHA, especially if it’s a registration form, comment form, or checkout page. But more often, it needs some kind of trigger to appear.
The main factors that cause a CAPTCHA are:
- Simple CAPTCHA triggers. These include unusual traffic, a high number of connections from a single IP address, or the use of low-quality datacenter IPs. For example, VPN users see more CAPTCHAs than regular website visitors because VPNs get their IPs from data centers. The same goes for corporate networks that share one IP address between many employees.
- Passive fingerprinting. A set of parameters used to evaluate your network and device. The most important are HTTP headers, user agent, and TLS and TCP/IP data.
These triggers don’t have to involve CAPTCHAs: a website can simply block a visitor from browsing altogether. A CAPTCHA gets added on top whenever fingerprinting or another protection method fails to conclusively prove that a visitor is non-human. Here are the combinations you can expect and their frequency:
|Combination|Frequency|
|---|---|
|Simple trigger + CAPTCHA|Most common|
|Passive fingerprinting + CAPTCHA|Common|
|Active fingerprinting + CAPTCHA|Relatively rare|
|Simple trigger + passive + active fingerprinting + CAPTCHA|Rare|
As you can see, many websites won’t bother implementing elaborate fingerprint checks. That’s because doing so requires a lot of resources, and it can also harm user experience. For example, Cloudflare uses active fingerprinting to trigger CAPTCHAs, and I’m sure many people aren’t thrilled to be constantly interrupted by its “Checking your browser” screen.
The Main Types of CAPTCHA Challenges
Once a CAPTCHA is triggered, it presents the visitor with a challenge.
There are many different types of CAPTCHA challenges, and it’d be hard to list all of them here. Instead, let’s lump them into several big categories you’re most likely to encounter:
Text entry CAPTCHAs
This type presents a string of distorted letters and numbers. To pass the challenge, you have to retype them into a text field.
Text-based CAPTCHAs are perhaps the oldest type, introduced by the original CAPTCHA. They’ve since lost popularity because the text is easy for bots to process and hard for humans to enter. However, they’re still widely used by web forums and even sites like Amazon.
Image CAPTCHAs
A typical example of an image challenge would be reCAPTCHA’s grid of images, where you have to select the squares that contain a certain object. If you succeed, you’re allowed through; otherwise, you get another grid or fail the test.
Image CAPTCHAs are very popular, and you’re likely to encounter them the most often. There are multiple variations of image-based challenges, such as defining an object’s boundaries or labeling what you see by category.
Audio CAPTCHAs
These challenges play an audio excerpt and then ask you to type in the letters, words, or numbers you’ve heard.
Audio CAPTCHAs rarely come standalone. Instead, they function as a fallback option to other types of challenges for users with accessibility needs. To make things harder for speech recognition software, audio tests sometimes add distortions to the sound.
Puzzle CAPTCHAs
This type of CAPTCHA includes math problems (addition, subtraction, and other operations), word puzzles, spatial tasks, and similar tests.
For example, a popular CAPTCHA system called FunCaptcha often asks website visitors to roll a ball with 3D models inside. Another CAPTCHA system, GeeTest, requires you to move a piece to complete a puzzle. Puzzle CAPTCHAs rely on motion and similar mechanics to avoid recognition tasks, which machine learning models have become very good at solving.
Button CAPTCHAs
This type is also called noCAPTCHA because it asks the visitor to click on a checkbox instead of presenting them with a challenge. So, if everything goes okay, a regular user won’t have to do anything more to pass. If the verification fails, a regular challenge (usually an image) will appear.
Button CAPTCHAs are widely used by Google’s reCAPTCHA and hCAPTCHA, two of the most popular systems on the internet. They reduce the friction of solving challenges and are quite effective at deterring bots. The system uses behavioral cues to monitor how visitors tick the checkbox; we briefly describe them further down, in the part about reCAPTCHA v2.
Invisible CAPTCHAs
An invisible CAPTCHA doesn’t even give you a checkbox to tick – in fact, a regular person shouldn’t see it at all. It works completely in the background, where the system monitors visitors and decides whether to present them with a challenge.
Invisible CAPTCHAs are the most recent advancement in the technology, championed by Google. Their aim is to be even less of an inconvenience for people. However, this kind of CAPTCHA has been criticized for using intrusive and privacy-violating technologies to filter bot traffic.
Social Media Sign Ins
A social media sign-in asks new users to register using their social media account before they can see content or use a service.
While not exactly a CAPTCHA in the strict sense, a social media sign-in is also used to separate human traffic from bots. It’s quite effective because it’s not enough to fill in a form with false information: you also need to have a fake social media account. So, you suddenly have to deal with two websites instead of one, and social platforms aren’t an easy nut to crack for automation tools.
The Most Popular CAPTCHA Systems
Here are some of the more prevalent CAPTCHA systems on the internet:
reCAPTCHA v2
reCAPTCHA v2 is Google’s own CAPTCHA solution. It was released in 2014 and has since replaced the company’s text-based reCAPTCHA v1. Even though there’s a third version already, version 2 remains a very popular tool for webmasters.
reCAPTCHA v2 is famous for being a “no CAPTCHA reCAPTCHA”. That’s because it gives a simple “I’m not a robot” checkbox instead of a challenge. Sneaker scalpers call the box a “one-click CAPTCHA”. It reduces the friction of having to solve a CAPTCHA every time.
What makes a verification fail or succeed? Google considers a user’s cookie history, mouse movements, and other behavioral data. If these parameters are missing or mismatched, they can trigger a CAPTCHA challenge. Due to its reliance on cookies, v2 has become infamous for serving more challenges to non-Chrome users.
In 2017, Google introduced invisible reCAPTCHA. It works like the regular v2, but instead of making visitors click the special box, webmasters can bind the check to any button click.
reCAPTCHA v3
Google released the third version of reCAPTCHA in 2018. Unlike v2, v3 hasn’t replaced its predecessor, and both versions 2 and 3 are available as alternative options.
v3 was designed to work without any interaction. It’s invisible to the user and continuously monitors their actions in the background. Based on them, it assigns the user a score from 0.0 to 1.0 that reflects how likely they are to be human. If the score is low, reCAPTCHA v3 lets the webmaster choose which action to take: present a v2 challenge, throttle the number of requests, block the user, or let them pass.
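To see what that looks like from the webmaster's side, here's a minimal sketch of score-based handling. reCAPTCHA v3 scores run from 0.0 to 1.0, and lower values mean likelier bot traffic. The function name and both thresholds are made up for illustration; every site picks its own cut-offs and actions.

```python
def choose_action(score: float,
                  challenge_below: float = 0.5,
                  block_below: float = 0.1) -> str:
    """Map a v3-style score (0.0 = likely bot, 1.0 = likely human) to an action.

    The thresholds are hypothetical defaults, not values prescribed by Google.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if score < block_below:
        return "block"
    if score < challenge_below:
        return "challenge"  # e.g. fall back to a v2 image challenge
    return "allow"
```

For a scraper, the takeaway is that the same request can be allowed on one site and challenged on another, simply because the webmasters chose different thresholds.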
hCAPTCHA
hCAPTCHA strongly resembles Google’s tools in the way it works. Like reCAPTCHA, it analyzes behavioral patterns to determine if a user is human or bot. If the check fails, the user receives a challenge. Most often it’s images, but hCAPTCHA uses an interesting mechanism where the challenge type depends on the highest bidder.
hCAPTCHA is relatively new – it was introduced only in 2018, as an answer to reCAPTCHA. By focusing on privacy, and being free of charge for most websites, hCAPTCHA quickly spread throughout the web. In April 2020, it became the provider of choice for Cloudflare, and today hCAPTCHA is perhaps the most widespread CAPTCHA system on the web.
Amazon CAPTCHA
Amazon’s own CAPTCHA system fails to compare in scale with the first three options. But the retail giant is a prime web scraping target, so here we are.
Unlike Google, which has moved on to other methods, Amazon still uses a text-based challenge. It’s hard to pinpoint what exactly triggers it, which for many people makes scraping Amazon rather unpredictable. Another problem with Amazon CAPTCHAs is that you don’t always know when you receive one. CAPTCHA or not, you’ll still get the 200 status code. So, don’t get too excited if your scraping job is going suspiciously well.
Amazon has been playing around with other implementations, so it’s unclear how long the text-based challenges will remain around. But for now, they’re here, and you’ll have to deal with them.
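Since a CAPTCHA page can come back with a 200 status code, it pays to check the response body before trusting it. Below is a rough sketch of such a check. The marker patterns are assumptions for illustration, not a confirmed list of what Amazon serves; inspect a few blocked responses from your own target and adjust them.

```python
import re

# Hypothetical markers of a CAPTCHA page; verify them against real responses.
CAPTCHA_MARKERS = (
    re.compile(r"enter the characters you see", re.IGNORECASE),
    re.compile(r"validatecaptcha", re.IGNORECASE),
)


def looks_like_captcha(html: str) -> bool:
    """Return True if the body matches any known CAPTCHA marker."""
    return any(pattern.search(html) for pattern in CAPTCHA_MARKERS)


def classify_response(status_code: int, html: str) -> str:
    """Classify a response as 'error', 'captcha', or 'ok'."""
    if status_code != 200:
        return "error"
    if looks_like_captcha(html):
        return "captcha"  # a 200 status alone proves nothing
    return "ok"
```

Counting the "captcha" results per job gives you an early warning when a scrape that looks healthy is actually collecting challenge pages.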
How to Bypass CAPTCHA
If your web scraper is encountering CAPTCHAs, your first recourse should be to rotate your IP address. This helps surprisingly often, especially if you’re using a quality proxy network.
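Here's one way IP rotation could look, as a sketch using only Python's standard library. The proxy URLs are placeholders, and `is_blocked` stands for whatever CAPTCHA check you use on the response body; both are assumptions for the example rather than part of any particular provider's setup.

```python
import itertools
import urllib.request
from typing import Callable, Optional, Sequence

# Placeholder endpoints; substitute the ones your proxy provider gives you.
PROXIES = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
]


def fetch_via_proxy(url: str, proxy: str, timeout: float = 15.0) -> str:
    """Fetch a URL through a single proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_with_rotation(url: str,
                        proxies: Sequence[str],
                        is_blocked: Callable[[str], bool],
                        fetch: Callable[[str, str], str] = fetch_via_proxy,
                        max_attempts: int = 3) -> Optional[str]:
    """Retry through the next proxy whenever the body looks blocked."""
    pool = itertools.cycle(proxies)
    for _ in range(max_attempts):
        proxy = next(pool)
        try:
            body = fetch(url, proxy)
        except OSError:
            continue  # connection or HTTP error: move on to the next proxy
        if not is_blocked(body):
            return body
    return None  # every attempt was blocked or failed
```

Many proxy networks rotate IPs for you behind a single endpoint, in which case the loop boils down to a plain retry.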
Otherwise, there are two main approaches to bypassing CAPTCHAs: you can either try to solve the challenge or avoid it altogether.
Solving the Challenge
Solving a challenge means confronting it head on. This assumes the CAPTCHA is unavoidable or your web scraping setup isn’t sophisticated enough to fool the website’s protection mechanisms.
The simplest (and relatively low-tech) method would be to get a CAPTCHA solving service. Websites like 2Captcha and Anti-CAPTCHA use real humans to solve the challenges for you; you just have to submit the challenge’s hash through their API and receive a solution in return. Dealing with CAPTCHAs this way costs 1-3 dollars per 1,000 challenges.
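Because humans need time to solve a challenge, these services generally work as submit-then-poll. The sketch below shows that flow in a provider-neutral way: `submit` and `poll` are hypothetical callables you'd write around your chosen service's API, and the timings are arbitrary. It doesn't reproduce any specific provider's endpoints.

```python
import time
from typing import Any, Callable, Optional


def solve_with_service(submit: Callable[[Any], str],
                       poll: Callable[[str], Optional[str]],
                       challenge: Any,
                       poll_interval: float = 5.0,
                       max_wait: float = 120.0,
                       sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """Submit a challenge, then poll until a solution arrives or time runs out.

    submit(challenge) returns a task ID; poll(task_id) returns the solution,
    or None while the task is still being worked on.
    """
    task_id = submit(challenge)
    waited = 0.0
    while waited < max_wait:
        sleep(poll_interval)  # give the human solver some time
        waited += poll_interval
        solution = poll(task_id)
        if solution is not None:
            return solution
    return None  # gave up; the caller can retry or skip the page
```

Keeping the HTTP details inside `submit` and `poll` means you can swap providers without touching the rest of your scraper.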
As for the more advanced methods:
- Text-based challenges can be overcome with machine learning. You can download the images that contain the text, segment them, and train a neural network to recognize the letters. It takes time but solves the problem for good.
- Image-based challenges have several solutions. You can find or train a convolutional neural network to recognize images. Alternatively, almost all image-based CAPTCHAs have an accessibility mode for people with disabilities. Instead of solving images, it lets you download an audio file and process it with any online (free) speech-to-text API. Just note that Google has strengthened its active fingerprinting algorithms, so you might not always get the audio option.
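To give a feel for the "segment them" step of the text-based approach, here's a deliberately naive sketch. It assumes the image has already been binarized into rows of 0s (background) and 1s (ink) and splits it wherever a column has no ink. Real CAPTCHAs with touching or warped characters will defeat this, so treat it as a starting point rather than a working solver.

```python
from typing import List, Tuple


def segment_columns(image: List[List[int]],
                    min_width: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges that contain ink; end is exclusive."""
    if not image:
        return []
    width = len(image[0])
    # A column "has ink" if any row has a non-zero pixel in it.
    has_ink = [any(row[x] for row in image) for x in range(width)]

    segments = []
    start = None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x  # a character begins
        elif not ink and start is not None:
            if x - start >= min_width:
                segments.append((start, x))
            start = None  # a character ended at the blank column
    if start is not None and width - start >= min_width:
        segments.append((start, width))  # character touching the right edge
    return segments
```

Each range can then be cropped out and fed to a classifier as a single-character image.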
Avoiding the Challenge
Of course, the ideal outcome would be to avoid CAPTCHAs altogether. This is harder to achieve than simply brute forcing through them but usually more rewarding. We suggest trying the following:
- Use quality IP addresses. IP recognition is often the first line of defense that websites use. With a good and “clean” residential IP address, you’ll be less likely to encounter a CAPTCHA.
- Limit the number of requests you make. You shouldn’t barrage the website with a high number of requests from the same IP. Vary the delay between your requests and add organic-looking pauses. Don’t scrape too fast or around the clock without breaks.
- Improve your web scraper’s fingerprint. Try to be as organic as possible when you scrape: match TLS parameters and HTTP headers, keep a database of real user agents, and discard cookies when they’re no longer needed.
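The last two tips can be sketched in a few lines. The delay values, header set, and user agent strings below are illustrative assumptions: real user agents go stale quickly, so you'd refresh the list from actual browser traffic, and matching TLS parameters is beyond what this snippet covers.

```python
import random
from typing import Dict, Optional

# Example user agent strings; replace them with a maintained, up-to-date list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


def next_delay(base: float = 3.0, jitter: float = 2.0,
               rng: Optional[random.Random] = None) -> float:
    """Return a pause in seconds between base and base + jitter."""
    source = rng if rng is not None else random
    return base + source.uniform(0, jitter)


def build_headers(rng: Optional[random.Random] = None) -> Dict[str, str]:
    """Build browser-like headers with a randomly chosen user agent."""
    source = rng if rng is not None else random
    return {
        "User-Agent": source.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    }
```

In a scraping loop, you'd call `time.sleep(next_delay())` between requests and pass `build_headers()` along with each one, ideally keeping the same user agent for the lifetime of a session so the fingerprint stays consistent.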
Now you know what CAPTCHAs are, their main types, and some ways to bypass them. Note that not all CAPTCHAs are created equal: their triggers and difficulty depend both on the website’s security and your actions. Keep that in mind and good luck with your web scraping!