How to Scrape Google Flights With Python: A Step-by-Step Tutorial

This is a step-by-step tutorial on how to build a Google Flights scraper with Python.


Instead of having multiple browser tabs open to check every destination, you can scrape Google Flights with a Python-based scraper, and get structured flight data in minutes.

How to scrape Google Flights

Planning trips online has become significantly more convenient, but there are still roadblocks – booking flights can still be time-consuming due to the sheer amount of data. While platforms like Google Flights offer a neat way to check all necessary information and compare it across different airlines, manually looking through each date and destination can be daunting. By automating this process with a Google Flights scraper, gathering large volumes of data and comparing it becomes less of a hassle. 

Whether you're a traveler looking for a bargain on flight tickets, a business analyst, or a scraping enthusiast searching for a new challenge, this guide will help you build a scraper that collects Google Flights data from scratch.

Why Scrape Google Flights?

Google Flights offers a vast amount of valuable data – from flight times and prices to the environmental impact of the flight. By scraping flight pages you can extract prices, schedules, and availability, as well as plan trips and stay updated when changes are made. 

Platforms like Google Flights offer flight information based on your requirements (departure and arrival location, dates, number of passengers), but it's not always easy to compare – you need to expand the results to see all relevant information, such as layovers, and several expanded results at once quickly become hard to read. Scraping real-time data can help you find the best deals and plan itineraries better. And if you're a business owner, it can help you gather market intelligence and analyze customer behavior.

What Google Flights Data Can You Scrape?

There are dozens of reasons to scrape Google Flights data. While the intention might vary based on what you’re trying to accomplish, both travelers and businesses can benefit from it.

If you’re simply planning a trip, scraping Google Flights data might help you to:

  • Compare prices. Getting information about pricing is one of the key reasons why people choose to scrape Google Flights. Structured scraped results can help to evaluate ticket prices, and compare them across different airlines.
  • Check flight times. Another major reason to extract Google Flights data is flight times. You can collect departure and arrival times and dates, compare them, and select the option that fits your itinerary best.
  • Find out about stops. Most people prefer direct flights. Google Flights has data that allows you to check whether there will be any layovers on the way to your destination.
  • Review duration. Knowing how long the flight is going to take will help you plan the trip better, and see how the flight fits into your schedule. Such data can give you insights on the duration of your flights between specific locations.
  • Learn about emissions. Scraped data from Google Flights can help you to evaluate carbon emissions of the flights, and make more sustainable choices.


If you’re looking to scrape Google Flights for business purposes, you can:

  • Analyze user behavior patterns. There are specific times when people tend to travel to certain destinations, such as during winter holidays, summer vacations, and more. By reviewing these behavior patterns, companies can segment user bases and target advertisements better.
  • Improve pricing strategies. Flight information is relevant for more businesses than just airports and airlines. Hotels, taxi services, car rental companies, and travel insurance providers can track rising or falling demand for specific locations, and adjust their pricing accordingly.
  • Create bundle deals. Accurate flight data can help travel agencies create better travel deals by bundling tickets, hotels, transportation, and activities for customers.
  • Improve risk management. Travel insurance companies can leverage flight data to identify popular destinations, and adjust policies and pricing to better align with customer demand.
Benefits of scraping Google Flights Data for travelers and businesses

Is Scraping Google Flights Legal?

Data on Google Flights is public, and there are no laws prohibiting the collection of publicly available information. However, there are several things to keep in mind to avoid legal implications.

Here are several tips on how to scrape Google Flights data ethically:

  • Comply with Google’s terms of use. Take the time to go over Google’s terms of service to make sure you don’t violate any of their guidelines.
  • Read the robots.txt file. The file gives instructions to robots (such as scrapers) about which areas they can and cannot access (e.g., admin panels, password-protected pages). Be respectful and follow the given commands.

How to Scrape Google Flights with Python: Step-by-Step Guide

If you’re looking to build your own Google Flights scraper, here’s a comprehensive guide on how to do so from scratch.

In this example, we’ll use Python with Selenium to build the scraper. Python is a great choice due to its straightforward syntax – it’s relatively easy to write, maintain, and understand. Additionally, since Google Flights is a highly dynamic website, we’ll use Selenium to handle dynamic content and interactive elements, such as buttons.

Below is a table containing all information about the scraper we’re going to build.

Programming language: Python
Libraries: Selenium
Target URL: https://www.google.com/travel/flights/
Data to scrape:

  1. Departure date from the origin location
  2. Return date from the destination
  3. Operating airline
  4. Departure time
  5. Arrival time
  6. Flight duration
  7. Departure airport
  8. Arrival airport
  9. Layovers
  10. Cost of the trip
  11. Best offer

How to save data: CSV file

Prerequisites

Before the actual scraping begins, you’ll need to install the prerequisites. 

  1. Install Python. You can download the latest version from Python's official website. If you're not sure whether Python is installed on your computer, check by running python --version in your terminal (Terminal on MacOS or Command Prompt on Windows).
  2. Install Selenium. To use Selenium with Python for this scraper, install it by running pip install selenium in the Terminal.
  3. Install the Chrome WebDriver. Selenium controls browsers such as Chrome through a driver, and it can also run them headlessly. Download the Chrome WebDriver version that corresponds to your Chrome browser.
  4. Get a text editor. You'll need a text editor to write and execute your code. There's one preinstalled on your computer (TextEdit on MacOS or Notepad on Windows), but you can opt for a third-party editor, like Visual Studio Code, if you prefer.

Importing the Libraries

Once all your tools are installed, it's time to import the necessary libraries. Since we'll be using Python with Chrome, make sure the Chrome WebDriver is available on the system PATH, and then import Selenium's WebDriver module so the browser can be controlled from the script.

Step 1. Import Webdriver from Selenium module.

				
					from selenium import webdriver

				
			

Step 2. Then, import the By selector module from Selenium to simplify element selection.

				
					from selenium.webdriver.common.by import By

				
			

Step 3. Import all necessary Selenium modules before moving on to the next steps.

				
					from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from selenium.webdriver import Keys, ActionChains

				
			

Step 4. We want to save our results into a CSV file, so let’s import the CSV module, too.

				
					import csv

				
			

Setting Up Global Variables and Parameters

After importing all the necessary libraries, we need to set up global variables to store key values. These include the target URL, a timeout (to accommodate page loading time), and any specific parameters.

Step 5. So, let’s set up global variables.

				
					start_url = "https://www.google.com/travel/flights"
timeout = 10 #seconds

				
			

Step 6. Next, set up the parameters for the scraper – specifically, the criteria you’re looking for in the flights. These include departure and arrival locations, as well as travel dates.

				
					my_params = {
    'from': 'LAX',
    'to': 'Atlanta',
    'departure': '2024-11-27',
    'return': '2024-12-03'
}
				
			

Note: You can define parameters for one-way flights, too.

				
					my_params = {
    'from': 'LAX',
    'to': 'Atlanta',
    'departure': '2024-11-27',
    'return': None
}

				
			

When browsing Google Flights, you don't need to specify the exact airport for departure or arrival – you can simply enter a city (or even a country) instead, because the site's auto-complete feature simplifies location input by suggesting relevant options. For example, typing Los will display suggestions that match the input – LOS airport in Nigeria, Los Angeles in the U.S., or Los Cabos in Mexico.

You can edit these values as you see fit – your ‘from’ value can be set to ‘Los Angeles’, and the scraper will target any airport in Los Angeles for departure. You can also specify a different airport, like ‘JFK’ or change the dates completely. But, for the sake of this example, let’s use LAX for departure and any airport in Atlanta for arrival.
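For instance, a parameter set like the one below would search for flights from any Los Angeles airport to JFK. The dates here are only placeholders, so swap in whatever values suit your trip:

my_params = {
    'from': 'Los Angeles',      # any airport in Los Angeles will match
    'to': 'JFK',                # a specific destination airport
    'departure': '2024-12-10',  # placeholder dates - change them to your own
    'return': '2024-12-20'
}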

Setting Up the Browser

Step 7. Before we start scraping with Selenium, you need to prepare the browser. As mentioned earlier, we’ll be using Chrome in this example.

				
def prepare_browser() -> webdriver:
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--disable-gpu')
    driver = webdriver.Chrome(options=chrome_options)
    return driver

				
			

Note: This browser setup will allow you to see the scraping in action. However, you can add an additional chrome_options line to run Chrome in headless mode.

				
def prepare_browser() -> webdriver:
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--disable-gpu')
    chrome_options.add_argument('--headless=new')
    driver = webdriver.Chrome(options=chrome_options)
    return driver

				
			

Step 8. It’s also important to set up the main() function. It calls the prepare_browser function, which returns a Chrome driver. Additionally, we need to instruct the driver to execute the scraping, and close when it’s finished.

				
def main() -> None:
    driver = prepare_browser()
    scrape(driver)
    driver.quit()

if __name__ == '__main__':
    main()

				
			

Scraping Google Flights

When the browser is prepared, we can actually start scraping the results from the Google Flights page. 

Handling Cookies on Google Flights with Python

While the start_url is the Google Flights main page, the scraper might bump into a cookie consent page first. Hence, we need to instruct our scraper to handle it.

Step 9. Let’s provide the scraper with some extra information to handle the cookie consent page. Namely, find and copy the CSS selectors for the “Accept” button. We can do this by using Inspect Element.

cookie consent button in google flights
Cookie consent button on Google Flights page

If the scraper successfully clicks the "Accept" button on the cookie consent page, we'll still need to wait until the actual Flights page loads. In this example, we're using the "Search" button to indicate that our target page has loaded.

Step 10. Using the search button’s CSS selector, instruct the scraper to wait for it to appear before moving on to the next step. So, let’s write a function that will print “Search button found, continuing.” if everything went well, and “Something went wrong.” if the scraper couldn’t locate said button.

Search button in Google Flights

Here's what the function for accepting cookies and locating the "Search" button looks like:

				
def scrape(driver: webdriver) -> None:
    driver.get(start_url)
    if driver.current_url.startswith("https://consent.google.com"):
        print("Hit the consent page, handling it.")
        btn_consent_allow = driver.find_element(
            By.CSS_SELECTOR, 'button.VfPpkd-LgbsSe[jsname="b3VHJd"]')
        btn_consent_allow.click()
        try:
            WebDriverWait(driver, timeout).until(
                expected_conditions.presence_of_element_located((
                    By.CSS_SELECTOR, 'button[jsname="vLv7Lb"]'))
            )
            print("Search button found, continuing.")
            prepare_query(driver)
        except Exception as e:
            print(f"Something went wrong: {e}")

				
			

Continuing in the scrape() function, let's add some code instructing the scraper to locate and click the "Search" button, and print "Got the results back." when the scraping is finished.

				
        search_btn = driver.find_element(By.CSS_SELECTOR, 'button[jsname="vLv7Lb"]')
        search_btn.click()
        try:
            WebDriverWait(driver, timeout).until(
                expected_conditions.presence_of_element_located((
                    By.CSS_SELECTOR, 'ul.Rk10dc'))
            )
            print("Got the results back.")

				
			

Preparing the Search Query

At the beginning of our script, we defined our parameters: origin location ('from'), destination ('to'), a date for departure ('departure'), and a date for return ('return'). These parameters will help the scraper fill in the query fields. To allow scrape() to function properly, we need to instruct the scraper on how it should prepare the search query.

Step 11. While we have our values ready, the scraper needs to know where to use them. For that, we’ll need to find and copy another set of CSS selectors for “Where from?”, “Where to?”, and date fields.

How to find the CSS selector for the Google Flights "Where to?" field

We also need to prepare our scraper for two potential date_to options: the return date may be defined in my_params, or it may not be.

If it's None, we'll also need to change the selection from Round trip to One-way in the dropdown menu, so we'll need a CSS selector for that menu as well.

Dropdown menu CSS selector
Finding the CSS selector for the dropdown menu

Step 12. Instruct the scraper about how it should fill in the “Where from?”, “Where to?”, and date fields.

				
def prepare_query(driver) -> None:
    field_from = driver.find_element(By.CSS_SELECTOR, 'div.BGeFcf div[jsname="pT3pqd"] input')
    field_to = driver.find_element(By.CSS_SELECTOR, 'div.vxNK6d div[jsname="pT3pqd"] input')
    date_from = driver.find_element(By.CSS_SELECTOR, 'div[jsname="I3Yihd"] input')
    date_to = None
    if my_params['return'] is None or my_params['return'] == '':
        dropdown = driver.find_element(By.CSS_SELECTOR, 'div.VfPpkd-aPP78e')
        dropdown.click()
        ActionChains(driver)\
            .move_to_element(dropdown)\
            .send_keys(Keys.ARROW_DOWN)\
            .pause(1)\
            .send_keys(Keys.ENTER)\
            .perform()



				
			

The if statement checks whether a return date is defined in my_params. If it is, we need to find a CSS selector for the return date field instead of changing the value in the dropdown menu. The scraper will then fill in the form using data from my_params.

				
    else:
        date_to = driver.find_element(By.CSS_SELECTOR,
            'div.K2bCpe div[jsname="CpWD9d"] input')
    field_input(driver, field_from, my_params['from'])
    field_input(driver, field_to, my_params['to'])
    field_input(driver, date_from, my_params['departure'])
    if date_to is not None:
        field_input(driver, date_to, my_params['return'])
    ActionChains(driver)\
        .send_keys(Keys.ENTER)\
        .perform()
    print("Done preparing the search query")

				
			

Step 13. Once all the fields we need to fill in are defined, instruct the scraper to enter the information into the selected fields.

We'll use ActionChains to send the text that needs to be typed in. Additionally, let's instruct the scraper to press Enter so that the first suggested option from the dropdown is selected for each field.

				
def field_input(driver, element, text) -> None:
    element.click()
    ActionChains(driver)\
        .move_to_element(element)\
        .send_keys(text)\
        .pause(1)\
        .send_keys(Keys.ENTER)\
        .perform()

				
			

Note: In Step 10, we instructed the scraper to click on the “Search” button to run this search query.

Returning the Results

If you check the Google Flights page source, you’ll notice that the results come in an unordered list, where one list item contains all the information about a single trip – the dates, times, price, layovers, and more. When browsing the page, each list item should look something like this:

Flight result example
One flight result is one list item

Step 14. If we want these results to sit neatly in a table when we save them, we need to store them in a dictionary. To do this, we need to collect the CSS selectors for each element in the result.

				
					def get_flight_info(element, best) -> dict:

				
			

Let's begin with flight times. The departure time (times[0]) will be stored as time_leave, and the arrival time (times[1]) as time_arrive.

Finding CSS selectors for flight times on Google Flights results
				
					times = element.find_elements(By.CSS_SELECTOR, 
        'div.Ir0Voe span[role="text"]')

				
			

Let’s do the same thing with airports.

				
					airports = element.find_elements(By.CSS_SELECTOR, 
        'div.QylvBf span span[jscontroller="cNtv4b"]')

				
			

And the rest of the provided information – airlines, layovers, cost, and suggested best result.

				
					flight_info = {
        'airline': element.find_element(By.CSS_SELECTOR, 
            'div.Ir0Voe > div.sSHqwe.tPgKwe.ogfYpf > span:last-child').text,
        'date_leave': my_params['departure'], #This will be filled in from my_params
        'date_arrive': my_params['return'], #This will also be filled from my_params
        'time_leave': times[0].text,
        'time_arrive': times[1].text,
        'duration_string': element.find_element(By.CSS_SELECTOR, 
            'div.Ak5kof > div.gvkrdb.AdWm1c.tPgKwe.ogfYpf').text,
        'airport_leave': airports[0].text,
        'airport_arrive': airports[1].text,
        'layovers': element.find_element(By.CSS_SELECTOR, 
            'div.BbR8Ec > div.EfT7Ae.AdWm1c.tPgKwe > span').text,
        'cost': element.find_element(By.CSS_SELECTOR,
            'div.U3gSDe div div.YMlIz.FpEdX span').text,
        'best': best #True for flights from the suggested best list, or False for everything else
    }
    return flight_info

				
			

Extracting and Parsing the Page Data

Google Flights has a neat feature that provides you with the best results (the shortest flight duration, fewest layovers, the cheapest flight), as well as all available results based on your query. You may not like the suggested best results, so let's save both the best and all remaining results; the scraper collects both lists into list_elems.

Step 15. Let's join these two lists, and return them as a single list under one name – list_of_flights.

				
def find_lists(driver):
    list_elems = driver.find_elements(By.CSS_SELECTOR, 'ul.Rk10dc')
    list_of_flights = parse(list_elems[0], True) + parse(list_elems[1], False)
    return list_of_flights
				
			

It’s important to parse the downloaded page to collect only the necessary information – in this case, the flight lists. As mentioned before, we have two of them – the best results list and the rest. But we don’t want them to be separated in our final saved list of all flights. 

Step 16. Let’s parse our page data. The list_of_flights will contain the final results. 

				
def parse(list_elem: list, best: bool) -> list:
    list_items = list_elem.find_elements(By.CSS_SELECTOR, 'li.pIav2d')
    list_of_flights = []
    for list_item in list_items:
        list_of_flights.append(get_flight_info(list_item, best))
    return list_of_flights


				
			

Saving the Output to CSV

At the very beginning, we imported the CSV library to save our data. 

Step 17. Let's add a few extra lines of code so that all the flight information we previously defined in our dictionary is saved to a CSV file.

				
					def write_to_csv(flights):
    field_names = ['airline','date_leave','date_arrive','time_leave',
                   'time_arrive','duration_string','airport_leave',
                   'airport_arrive','layovers','cost','best']
    output_filename = 'flights.csv'
    with open (output_filename, 'w', newline='', encoding='utf-8') as f_out:
        writer = csv.DictWriter(f_out, fieldnames = field_names)
        writer.writeheader()
        writer.writerows(flights)

				
			
Parsed results saved in CSV file
Parsed results saved in a CSV file opened with Numbers (Mac)
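If you want a quick sanity check that the file was written correctly, you can read it back with the same csv module. This is an optional snippet, assuming the script above has already produced flights.csv:

import csv

# Print the first few saved flights to verify the CSV output.
with open('flights.csv', newline='', encoding='utf-8') as f_in:
    for i, row in enumerate(csv.DictReader(f_in)):
        print(row['airline'], row['time_leave'], '-', row['time_arrive'], row['cost'])
        if i == 4:
            break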

Here’s the entire script for this Google Flights scraper:

				
					from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from selenium.webdriver import Keys, ActionChains
import csv

start_url = "https://www.google.com/travel/flights"
timeout = 10

my_params = {
    'from': 'LAX',
    'to': 'Atlanta',
    'departure': '2024-11-27',
    'return': '2024-12-03'
}
my_params2 = {
    'from': 'LAX',
    'to': 'Atlanta',
    'departure': '2024-11-27',
    'return': None
}

def prepare_browser() -> webdriver:
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--disable-gpu')
    driver = webdriver.Chrome(options=chrome_options)
    return driver

def field_input(driver, element, text) -> None:
    element.click()
    ActionChains(driver)\
    .move_to_element(element)\
    .send_keys(text)\
    .pause(1)\
    .send_keys(Keys.ENTER)\
    .perform()

def prepare_query(driver) -> None:
    field_from = driver.find_element(By.CSS_SELECTOR, 'div.BGeFcf div[jsname="pT3pqd"] input')
    field_to = driver.find_element(By.CSS_SELECTOR, 'div.vxNK6d div[jsname="pT3pqd"] input')
    date_from = driver.find_element(By.CSS_SELECTOR, 'div[jsname="I3Yihd"] input')
    date_to = None
    if my_params['return'] is None or my_params['return'] == '':
        dropdown = driver.find_element(By.CSS_SELECTOR, 'div.VfPpkd-aPP78e')
        dropdown.click()
        ActionChains(driver)\
            .move_to_element(dropdown)\
            .send_keys(Keys.ARROW_DOWN)\
            .pause(1)\
            .send_keys(Keys.ENTER)\
            .perform()
    else:
        date_to = driver.find_element(By.CSS_SELECTOR, 'div.K2bCpe div[jsname="CpWD9d"] input')
    field_input(driver, field_from, my_params['from'])
    field_input(driver, field_to, my_params['to'])
    field_input(driver, date_from, my_params['departure'])
    if date_to is not None:
        field_input(driver, date_to, my_params['return'])
    ActionChains(driver)\
        .send_keys(Keys.ENTER)\
        .perform()
    print("Done preparing the search query")

def get_flight_info(element, best) -> dict:
    times = element.find_elements(By.CSS_SELECTOR, 'div.Ir0Voe span[role="text"]')
    airports = element.find_elements(By.CSS_SELECTOR, 'div.QylvBf span span[jscontroller="cNtv4b"]')
    flight_info = {
        'airline': element.find_element(By.CSS_SELECTOR, 'div.Ir0Voe > div.sSHqwe.tPgKwe.ogfYpf > span:last-child').text,
        'date_leave': my_params['departure'],
        'date_arrive': my_params['return'],
        'time_leave': times[0].text,
        'time_arrive': times[1].text,
        'duration_string': element.find_element(By.CSS_SELECTOR, 'div.Ak5kof > div.gvkrdb.AdWm1c.tPgKwe.ogfYpf').text,
        'airport_leave': airports[0].text,
        'airport_arrive': airports[1].text,
        'layovers': element.find_element(By.CSS_SELECTOR, 'div.BbR8Ec > div.EfT7Ae.AdWm1c.tPgKwe > span').text,
        'cost': element.find_element(By.CSS_SELECTOR, 'div.U3gSDe div div.YMlIz.FpEdX span').text,
        'best': best
    }
    return flight_info

def parse(list_elem: list, best: bool) -> list:
    list_items = list_elem.find_elements(By.CSS_SELECTOR, 'li.pIav2d')
    list_of_flights = []
    for list_item in list_items:
        list_of_flights.append(get_flight_info(list_item, best))
    return list_of_flights

def find_lists(driver):
    list_elems = driver.find_elements(By.CSS_SELECTOR, 'ul.Rk10dc')
    list_of_flights = parse(list_elems[0], True) + parse(list_elems[1], False)
    return list_of_flights

def write_to_csv(flights):
    field_names = ['airline', 'date_leave', 'date_arrive', 'time_leave',
                   'time_arrive', 'duration_string', 'airport_leave',
                   'airport_arrive', 'layovers', 'cost', 'best']
    output_filename = 'flights.csv'
    with open(output_filename, 'w', newline='', encoding='utf-8') as f_out:
        writer = csv.DictWriter(f_out, fieldnames=field_names)
        writer.writeheader()
        writer.writerows(flights)

def scrape(driver: webdriver) -> None:
    driver.get(start_url)
    if driver.current_url.startswith("https://consent.google.com"):
        print("Hit the consent page, dealing with it.")
        btn_consent_allow = driver.find_element(By.CSS_SELECTOR, 'button.VfPpkd-LgbsSe[jsname="b3VHJd"]')
        btn_consent_allow.click()
        try:
            WebDriverWait(driver, timeout).until(
                expected_conditions.presence_of_element_located((
                    By.CSS_SELECTOR, 'button[jsname="vLv7Lb"]'))
            )
            print("Search button found, continuing.")
            prepare_query(driver)
        except Exception as e:
            print(f"Something went wrong: {e}")
        search_btn = driver.find_element(By.CSS_SELECTOR, 'button[jsname="vLv7Lb"]')
        search_btn.click()
        try:
            WebDriverWait(driver, timeout).until(
                expected_conditions.presence_of_element_located((
                    By.CSS_SELECTOR, 'ul.Rk10dc'))
            )
            print("Got the results back.")
            flights = find_lists(driver)
            write_to_csv(flights)
        except Exception as e:
            print(f"Something went wrong: {e}")

def main() -> None:
    driver = prepare_browser()
    scrape(driver)
    driver.quit()

if __name__ == '__main__':
    main()

				
			

Avoiding the Roadblocks When Scraping Google Flights

Building a Google Flights scraper can be a pretty daunting task, especially if you're new to scraping, and it becomes even more difficult if you plan to scrape a lot. While we have already solved issues like the cookie consent page, other problems can arise if you're scraping at scale.

Use Proxies to Mask Your IP

Websites don’t like bot traffic, so they try to prevent it by using tools like Cloudflare. While scraping the Google Flights page once or twice probably won’t get you rate-limited or banned, it can happen if you try to scale. 

To prevent that, you can use proxy services – they will mask your original IP by routing the requests through different IP addresses, making them blend in with regular human traffic. Typically, human traffic comes from residential IPs, so this type of proxy is the least likely to be detected and blocked.

This is a step-by-step guide on how to set up and authenticate a proxy with Selenium using Python.
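As a rough illustration, an unauthenticated proxy can be passed to Chrome through ChromeOptions. The address below is just a placeholder (use your provider's endpoint), and authenticated proxies usually require a browser extension or a helper library instead:

from selenium import webdriver

def prepare_browser_with_proxy(proxy_address: str) -> webdriver.Chrome:
    # proxy_address is a placeholder like "123.45.67.89:8000" - use your provider's endpoint
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument(f'--proxy-server=http://{proxy_address}')
    chrome_options.add_argument('--disable-gpu')
    return webdriver.Chrome(options=chrome_options)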

Use the Headless Browser Mode

The Google Flights page is a dynamic website that heavily relies on JavaScript – not only for loading data, but also for anti-bot measures. Running your scraper in headless Chrome mode allows it to render JavaScript like a regular user's browser would, and even to modify the browser fingerprint.

A browser fingerprint is a collection of parameters like screen resolution, timezone, IP address, JavaScript configuration, and more, that vary slightly from user to user. Headless browsers can mimic these parameters to appear more human-like, reducing the risk of detection.

Step 7 in Setting Up the Browser gives two examples of how to set up Chrome for scraping, one of them containing this line of code: chrome_options.add_argument("--headless=new")

Adding this chrome_option will run the browser in headless mode. You may not want to use it now, but it’s good to know how to enable it if necessary. 
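For reference, here's what the browser setup from Step 7 might look like with headless mode and a fixed window size combined – a variation of the earlier function, not a new requirement:

from selenium import webdriver

def prepare_headless_browser() -> webdriver.Chrome:
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--headless=new')            # run Chrome without a visible window
    chrome_options.add_argument('--disable-gpu')
    chrome_options.add_argument('--window-size=1920,1080')   # mimic a typical desktop resolution
    return webdriver.Chrome(options=chrome_options)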

Be Aware of Website’s Structural Changes

This Google Flights scraper relies heavily on CSS selectors – they help to find the specific input fields and fill them in. However, if Google changes the Flights page, the scraper might break. That’s because the CSS selectors can shift when a site developer makes adjustments to the HTML structure. 

If you plan to use this Google Flights scraper regularly, keep in mind that selectors can change over time, and you’ll need to update them to keep the scraper functional.
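One way to make those updates less painful is to keep the selectors in a single place and wrap lookups so a missing element produces a readable warning instead of a crash. This is only a sketch layered on top of the script above, not part of it:

from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

# Keeping selectors together makes them easier to update when the page layout changes.
SELECTORS = {
    'search_button': 'button[jsname="vLv7Lb"]',
    'results_list': 'ul.Rk10dc',
}

def find_or_warn(driver, name: str):
    # Return the element for a named selector, or None with a warning if it no longer matches.
    try:
        return driver.find_element(By.CSS_SELECTOR, SELECTORS[name])
    except NoSuchElementException:
        print(f"Selector for '{name}' no longer matches - the page layout may have changed.")
        return None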

Conclusion

Scraping Google Flights with Python is no easy feat, especially for beginners, but it offers a great deal of information useful for travelers and businesses alike. Despite the project's difficulty, this data can not only help plan a trip but also support market intelligence, trend analysis, and a better understanding of customer needs.

Isabel Rivera
Caffeine-powered sneaker enthusiast

How to Use Inspect Element for Web Scraping

An easy-to-follow guide on how to use this browser feature.


Modern, interactive websites are intuitive and user-friendly, but have you ever wondered how much valuable data they hold? If you know how to use Inspect Element, you can reveal the underlying HTML and CSS structure that is crucial for web scraping.

Inspect element browser feature on a desktop

Developers and designers spend a lot of time making websites visually appealing and easy to use. However, underneath all that user experience stands the source code – the backbone structure that makes it all work. Thanks to Inspect Element, we, as users, can peek inside and see how websites are made.

Using Inspect Element has several benefits, especially if you're a web designer, developer, marketer, or someone building a custom web scraper. The tool allows you to look into how elements function, check how they appear on different screen resolutions, get data from the source code for scraping purposes, and more. So, let's dig deeper into how to use Inspect Element.

What Is Inspect Element?

Inspect Element is a web browser feature that lets you peek into the HTML structure of the website. It’s built into most modern desktop browsers, such as Chrome, Firefox, or Safari, and is a part of Developer Tools – a set of tools that allow developers to test, troubleshoot, and experiment with websites. 

By using Inspect Element, you can view and temporarily edit HTML, CSS, and JavaScript code, monitor a website's performance, as well as find layout information about specific elements (i.e., the dimensions and margins of a button).

Inspect Element can also be useful for web scraping. When building a scraper, you’ll likely need it to extract only the relevant data points rather than the entire HTML structure. The feature allows you to find and collect specific CSS selectors, metadata, or hidden elements that you want to scrape.

Why Use Inspect Element?

The Inspect Element feature makes web designers' and developers' lives easier by allowing them to temporarily modify element features in real time. Basically, it lets them test bug fixes and experiment with new features before making actual changes to the website.

The feature can also be useful for web scraping:

  • It can help you collect specific CSS selectors or other data points for your scraper.
  • Inspect Element (specifically, its Network panel) also helps you check whether your scraper's request headers look natural enough to avoid being blocked. The headers used by scrapers often look different from a regular user's, so you can compare them and make changes accordingly – see the sketch below.
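For example, here is a minimal Python sketch that sends a request with browser-like headers and prints what was actually sent, so you can compare it against the Network panel. The URL and header values are purely illustrative:

import requests

# Illustrative header values modeled on what a regular desktop browser sends.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}

response = requests.get("https://example.com", headers=headers)
print(response.request.headers)  # compare this with the request headers shown in the Network panel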
 

Inspect Element is handy for marketers and SEO specialists.

  • By reviewing the HTML and CSS code, marketing specialists can check if the website’s content is up to par. You can see if the colors and fonts follow the brand guidelines, check link information (a follow or a nofollow link), review if the website’s content looks good on desktop, responsive, and mobile views, and more. 
 

Website users can also find ways to benefit from it.

  • If a website makes it hard to download certain media, you can use Inspect Element to find the source URL, and download it from there.
  • You can also have some fun – try using Inspect Element to rewrite this entire article by changing the HTML element values.
 

There are other use cases for Inspect Element, too. Sometimes, you can bypass a paywall by modifying the source code to gain full access to the site. However, we discourage you from using Inspect Element for unethical purposes – not only can you violate the website's terms of use, but you may also face legal consequences.

Inspect Element use cases

How to Use Inspect Element: A Detailed Explanation

Using Inspect Element is quite simple. There are several different tabs in the tool that you can explore, but you'll get the most use out of these:

  • Elements panel allows inspecting and modifying HTML and CSS code.
  • Console panel allows inspecting and debugging JavaScript.
  • Sources panel allows inspecting and modifying CSS or JavaScript source files (pop-ups, analytical files).
  • Network panel allows monitoring HTTP requests.
  • Performance panel allows analyzing a website's performance – load and download speeds.
  • Application panel allows managing cookies, local storage, and other session data.
 

Keep in mind that these panels have different names in each browser. For example, the “Elements panel” in Chrome and Safari is called “Elements,” while Firefox calls it “Inspector.” However, the content there remains the same.

If it’s your first time using Inspect Element, you might want to use it on the same website with different browsers to familiarize yourself with the tool’s terms.

The easiest way to access the HTML code is to right-click any element on the website and select “Inspect” or “Inspect element.” However, it varies slightly across different browsers and operating systems. Let’s break them down.

Google Chrome & Mozilla Firefox

There are three ways to open Inspect Element on Google Chrome and Mozilla Firefox: 1) right-clicking to inspect, 2) using a shortcut, or 3) opening Developer Tools through the browser’s menu bar.

  • Using right-click: use your mouse or a trackpad, right-click any element on the website. Then, select “Inspect.”
  • Using a shortcut: you can use a keyboard shortcut to open Inspect Element. On MacOS, press command + option + i. On Windows, press Ctrl + Shift + i. Additionally, you can use F12 to open Developer Tools on both Windows and MacOS (sometimes you might need to press the Fn key as well).
  • Using the browser’s menu: locate the kebab menu (three stacked dots) on Chrome or hamburger menu (three stacked lines) on Firefox. Typically, it’s located on the right side of the URL bar. Find “More tools,” and then select “Developer Tools.”

Safari

If you’re a Windows user, you can’t use the latest versions of Safari, so opening Inspect Element on this browser is only possible on MacOS. The process is also significantly different than on Chrome or Firefox. 

Firstly, you need to enable Developer Tools in Safari settings. To do so, you can either press command and , (comma button), or in the Menu bar on top of your device’s screen (the bar with an Apple logo), click on Safari, and Settings… In the settings panel, navigate to the “Advanced” bar. There, check the “Show features for web developers” box. Now you’re able to open Inspect Element.

  • Using right-click: right-click any element on the website, and press “Inspect Element”
  • Using a shortcut: press command + option + i on your keyboard to open Inspect element. Safari doesn’t use the F12 key shortcut by default, but you can manually set it up through keyboard settings.
  • Using a menu bar on your device: once you have a page on Safari open, you can access various settings on the menu bar on the top of your screen (make sure Safari screen is active). There, you can press on “Develop” and click on “Show Web Inspector”.

Mobile devices (iOS and Android)

There’s no easy way to use Inspect Element on iOS or Android as most of the time it still requires a desktop. However, there are a few methods you can try.

Remote Debugging

You can try remote debugging using Chrome on Android or Safari on iOS. Remote debugging lets you inspect pages loaded on your phone from your computer. This method requires a computer, but it's the most functional option.

On an Android device: 

  1. Enable developer options. In Settings, click About phone, and tap Build number seven times. 
  2. Then, connect your mobile device to your computer via USB. 
  3. Open Chrome on your computer, and go to chrome://inspect. There, you’ll see your connected device. 
  4. Now you can go to your target website, and open Inspect Element. This way, you can inspect the mobile view on your desktop.

 

On an iOS device, the process is similar: 

  1. On your mobile device, go to Settings, and find Safari. 
  2. When Safari settings open, find Advanced, and toggle the Web Inspector there. 
  3. Then, connect your iOS device with your Mac, and open Safari on your computer. Go to Develop in the Menu bar, and select your iOS device.
  4. Now, go to your target website, and open Inspect Element.

Using Browser Developer Tools

Some mobile browsers also have an option to “Request Desktop Site”, which gives a layout similar to a desktop one, so you can do some basic inspection. The feature is helpful if you don’t have access to a computer but need to check different layouts of a website (mobile & desktop). That said, you still won’t be able to open Inspect Element.

You can also see how websites adapt to screen resolutions on a desktop. Many browsers have a “Responsive Design Mode” that can help to see if a website’s content looks good on different screen parameters. You can open Inspect Element while using the feature, but it requires a computer.

Using Third-Party Apps

There are some third-party applications for element inspection – you can download them for your iOS and Android devices. However, they can be pricey, and their functionality might be limited compared to using Developer Tools on a desktop.

Inspect Element Playground

Now that you know what Inspect Element is and how to use it, you can try putting your knowledge into practice. While some changes are very easy (e.g., changing the text), others might require a basic understanding of HTML and CSS. If you don't know how to use them, there are various free HTML, CSS, and JavaScript tutorials online.

Inspect Element Playground for Web Scraping

If you're looking to use Inspect Element for scraping purposes, you'll likely need to copy some selectors from the target website – they help the scraper extract only what's relevant to you. Seasoned developers will know how to find these selectors and write the scraping logic, but if you're new to this, ChatGPT can help with scraping, too.

In the playground below, you’ll find a table with terms and their descriptions. Use Inspect Element to find CSS selectors for them (hovering over code with your mouse will highlight the area that piece of code defines). 

Once you do, you can write a simple Python scraper to scrape all terms and descriptions from the table below, and save them in a CSV format.

Term – Description

Proxy server – A proxy server is a middleman between the client and the server (website). It masks the user's original IP address, increasing security and privacy, and allows bypassing geo-restrictions.
Web browser – A web browser is software that allows users to access and view websites on the internet.
IP address – An IP address is a unique string of numbers given to any device connected to the internet. IPs help identify devices on the internet.
Web scraper – A web scraper is software that automatically extracts data from websites. You can build a web scraper yourself using a language like Python or choose a pre-made one.
Finding the table and copying its CSS selector using Inspect Element
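For instance, a minimal sketch of such a scraper might look like the one below. The URL and selectors are assumptions – replace them with the page you're actually inspecting and the selectors you copied with Inspect Element:

import csv

import requests
from bs4 import BeautifulSoup

# Placeholder URL - use the address of the page that contains the table you inspected.
url = "https://example.com/glossary"

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

rows = []
for tr in soup.select("table tr"):                      # every row of the table
    cells = [td.get_text(strip=True) for td in tr.select("td")]
    if len(cells) == 2:                                 # keep only term/description pairs
        rows.append(cells)

with open("glossary.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Term", "Description"])
    writer.writerows(rows)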

Other Easy Ideas to Test Inspect Element

Below you’ll find several typical HTML elements that can be manipulated with Inspect Element – headings, tables, images. Here are some ideas for you to try out.

Idea No.1: Modify the size of the headings using the Inspect Element tool.

  1. Locate the <h2> heading in the Inspect Element section.
  2. The code should look like this: <h2> ... </h2>.
  3. Double-click on the <h2> tag and change the number (2) to any number between 1 and 6.
  4. Observe how the website updates as the heading size changes.

Idea No.2: Change the content inside paragraphs.

  1. Locate the <p> tag that contains the text: “This is a paragraph.”
  2. Double-click on the paragraph text.
  3. Replace the text with your own sentence or phrase.

Idea No.3: Add more rows and columns to the table.

  1. Locate the table in the source code. Table rows are defined by <tr> tags, and columns by <td> tags.
  2. Add a new <tr> tag to create a new row.
  3. Inside the new row, use multiple <td> tags to add columns.
  4. Experiment with changing the width of each <td> by adjusting its attributes with CSS (e.g., 'width="100px"').

Idea No.4: Customize the table’s background color using CSS.

  1. Find the <table> tag in the source code.
  2. Inside the "style" attribute, which already contains properties like "width" and "border-collapse", add a new rule: background-color: Tomato;.
  3. Make sure all properties remain inside the quotation marks, like this: style="border-collapse: collapse; width: 100%; background-color: Tomato;".
  4. Check how the background color changes.

Idea No.5: Replace an image using a different URL.

  1. Locate the <img> tag in the source code.
  2. Find the “src” attribute, which contains the current image source (URL).
  3. Double-click the URL, delete it, and replace it with the link to any other image.
  4. The new image should now appear in place of the original one.

THIS IS AN H2 HEADING

This Is an H3 Heading

This is a paragraph.

Cell 1 | Cell 2
Content | Content
More content | More content
A step-by-step-guide on how to temporarily replace an image with Inspect Element
A step-by-step-guide on how to temporarily replace this image with Inspect Element

Conclusion

The Inspect Element feature is an excellent tool for web developers, designers, marketers, and regular web users alike. It’s also a beneficial feature for people who build their own web scrapers – Inspect Element allows exploring, finding, and collecting necessary data points for scrapers from the raw HTML structure of the website.

Using Inspect Element is easiest with a desktop, but some basic inspection is also available on mobile devices. While the process slightly varies on different browsers and operating systems, right-clicking and selecting “Inspect” is the simplest way to see the HTML structure of sites and find relevant data pieces for scrapers.

Getting acquainted with Inspect Element and Developer Tools, in general, can provide insights into a website’s performance, help test and troubleshoot web scrapers, fix bugs, and improve SEO performance, as well as provide information about how websites behave.

Isabel Rivera
Caffeine-powered sneaker enthusiast

ChatGPT for Web Scraping: A Step-by-Step Guide

Let's see how to use ChatGPT for web scraping efficiently.


Since its release in 2022, ChatGPT has become a popular go-to tool for all kinds of queries. The chatbot can help you brainstorm ideas, write and edit text or code, and much more. So, it only makes sense to use ChatGPT for web scraping, too.

How to use ChatGPT for web scraping

Search engines like Google are a huge help if you are looking for information, but it can take ages to find the right answer. ChatGPT, on the other hand, simplifies this process. It’s trained on large amounts of data, and it can summarize, provide understandable answers to complex questions, and respond to follow-up questions, which makes it great for many tasks, web scraping included.

While not perfect, ChatGPT can write simple code, as well as explain the logic behind it. It’s an excellent supporting tool for beginners trying to learn or seasoned scraping enthusiasts looking to improve and test their skills. So, let’s learn all the basics about web scraping with ChatGPT and try to build a simple web scraper.

What Is ChatGPT Web Scraping?

ChatGPT web scraping is the process of automatically gathering data from websites using code written with ChatGPT's help. But if ChatGPT is a chatbot, what exactly does it do in this equation? Basically, you can ask ChatGPT to build you a scraper for a specific website you want data from. All you need to do is give instructions, collect certain data points for ChatGPT to use, and run the code.

How ChatGPT Can be Used for Web Scraping Tasks

ChatGPT itself isn’t a scraper and it can’t scrape websites directly, but it can help you write a simple scraper (or improve an old one) based on the instructions you give. However, while you don’t need a lot of knowledge about web scraping or writing code yourself, knowing how to create a good prompt is useful.

ChatGPT query for scraping a website
ChatGPT can’t scrape a website directly, but it can write you a scraper.

Scraping with ChatGPT: A Step-by-Step Guide

Let’s see how we can use ChatGPT for web scraping. In this step-by-step guide, we’ll use a website that loves to be scraped – https://books.toscrape.com/.

Scraping sandbox website
Books.toscrape.com website

This website is a sandbox environment specifically designed for practicing your skills and testing scrapers. It's a safe place to run different scraping experiments without worrying about violating the Terms of Service or a robots.txt file. There are other sandbox websites similar to books.toscrape.com, but for the sake of this experiment, let's use this one.

We prepared a list of sandboxes just for you.

Let's create a Python scraper that extracts book titles and their prices from the page. We'll be using Python because it's a beginner-friendly programming language with a simple syntax that excels in tasks like web scraping. We'll also use two Python libraries to assist, namely Requests for handling HTTP requests and Beautiful Soup for parsing (cleaning) the extracted data.

We'll save the extracted data in CSV – a simple text format that allows you to open, edit, manipulate, and move the data later on.

Step 1: Install the Preliminaries

Before jumping straight into web scraping, there are a couple of things you need to do first.

  • Create a ChatGPT account. ChatGPT requires users to log in before using the service. So first, log in to your existing account or, if you don’t have one yet, click “Sign up”, and create it.
  • Get a text editor. To execute the code, you’ll need a text editor. There’s one already pre-installed on your computer (TextEdit on MacOS or Notepad on Windows), but you can use any third-party editors like Notepad++ or Visual Studio Code. Such text editors tend to have more advanced features and usually highlight functions in different colors for better readability. However, extra functionality might not be necessary, so it’s completely up to you which tool to use.
  • Install the latest Python version. If you use an older version of MacOS (12.3 or older) or Linux, Python may be preinstalled on your device. If it’s not, download Python from the official website and install it to your computer. 
  • Install necessary libraries. Then, install Requests and BeautifulSoup by running pip install requests beautifulsoup4 in the terminal (search for Terminal on MacOS or Command Prompt on Windows).

Step 2: Locate Elements You Want to Scrape

Now, it’s time to gather some information from the target website about the elements you’re going to scrape. Specifically, the CSS selectors defining book names and pricing. ChatGPT needs this information to build a scraper that targets these exact elements.

Visit https://books.toscrape.com/. You’ll see a list of books right away. To get the elements for the scraper, we’ll need to use “Inspect Element”. The easiest way to do this is to right-click on one of the book titles, and select “Inspect.” This action will open the browser’s Developer Tools, and you’ll see the HTML code of the website there.

Then, locate the book title element. Hovering over the code will highlight the area it defines, so you should find it easily.  Once you locate the element, right-click on it, select “Copy,” and click “Copy selector.”

Copying book title CSS selector in books.toscrape.com

Next, paste the copied selector somewhere to save (a desktop sticky note or your text editor) because we’ll need it in the next step. Copy and write down the selector for book prices, too. At this point, you’ve collected all the necessary information to write a prompt for ChatGPT.

Step 3: Prepare a Prompt for ChatGPT

Now, let’s write a prompt for ChatGPT to create a web scraper. A good rule of thumb is to write a clear, well-explained prompt with keywords and phrases that outline everything the chatbot needs to know. 

Prompt question – Instruction for ChatGPT

What does ChatGPT need to do? – Write a web scraper that extracts the titles and prices of all books on the target website.
What is the target URL? – The target website is: https://books.toscrape.com/
What programming language should it use? – Use the Python programming language.
What libraries should it use? – Use the following libraries: Requests and Beautiful Soup.
What elements does it need to scrape? – The defined CSS selectors are:

  1. #default > div > div > div > div > section > div:nth-child(2) > ol > li:nth-child(1) > article > h3 > a
  2. #default > div > div > div > div > section > div:nth-child(2) > ol > li:nth-child(1) > article > div.product_price > p.price_color

What are the output requirements? – Save titles and prices in a CSV file.

A well-written prompt ensures that the response is less error-prone. Think about the task at hand and what questions need answering to build a scraper that does exactly what we need. Your answers will provide instructions for ChatGPT. Now, all you need to do is give these instructions to ChatGPT and get a response! Here’s the result:

				
					import requests
from bs4 import BeautifulSoup
import csv

# Define the URL of the target website
url = "https://books.toscrape.com/"

# Send a GET request to the website
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find all book containers on the page
    books = soup.select('#default > div > div > div > div > section > div:nth-child(2) > ol > li')

    # Prepare a list to hold the titles and prices
    books_data = []

    for book in books:
        # Extract the title using the specified CSS selector
        title = book.select_one('article > h3 > a')['title']

        # Extract the price using the specified CSS selector
        price = book.select_one('article > div.product_price > p.price_color').text

        # Append the extracted data to the list
        books_data.append([title, price])

    # Define the CSV file to save the data
    csv_file = "books.csv"

    # Save the data to a CSV file
    with open(csv_file, mode='w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(["Title", "Price"])  # Write the header
        writer.writerows(books_data)  # Write the data rows

    print(f"Data successfully saved to {csv_file}")
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")

				
			

Step 4: Review and Test the Code

Even if you don’t know a thing about writing code with Python, I highly recommend reviewing the code once ChatGPT generates it. Take some time to check if your prompt is clear, includes all necessary information, and if ChatGPT followed all your directions. 

For example, can you spot a mistake here?

ChatGPT-generated code with a mistake

When you try to run this code, it will show you an error. That’s because I forgot to mention that ChatGPT has to import BeautifulSoup for data parsing. 

If you run the code and something doesn’t work, you can ask ChatGPT for help – simply write a new prompt by asking to review and edit the code. Always double-check your requirements and, if you see error messages when trying to run the command, define them. You can always add extra information to the prompt, if necessary.

Step 5: Check Parsed Data

In our prompt, we asked ChatGPT to save data in a CSV file. So, the scraper ChatGPT generated should be able to do so after it’s done with scraping and parsing. Once you open the saved file, your results should look similar to this:

CSV file with parsed data opened in Numbers (MacOS)

If they do – congratulations, you successfully used ChatGPT for web scraping! If not, go back to Step 4 and troubleshoot. Common mistakes include forgetting to import the csv module in your scraper, extracting too much data (which shouldn't be the case here), or running out of memory on your system.

Now, you can try to experiment further: rewrite your scraper to extract book ratings, or book titles only from specific categories. You probably also noticed that this scraper only scraped the first page. Try asking ChatGPT to improve your scraper so it can collect titles and prices from all 50 pages – a rough sketch of what such pagination handling might look like is shown below. The more you practice, the better your understanding of scraping logic will be.
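A pagination-aware version generated from such a prompt could look roughly like this – it follows the site's "next" button until it runs out of pages (a sketch, not ChatGPT's exact output):

import csv
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = "https://books.toscrape.com/"
books_data = []

while url:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    for book in soup.select("article.product_pod"):
        title = book.select_one("h3 > a")["title"]
        price = book.select_one("p.price_color").text
        books_data.append([title, price])
    # Follow the "next" link until it disappears on the last page.
    next_link = soup.select_one("li.next > a")
    url = urljoin(url, next_link["href"]) if next_link else None

with open("books_all_pages.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Title", "Price"])
    writer.writerows(books_data)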

Once you feel confident, you can try more advanced data retrieval. For example, you can ask ChatGPT to write a scraper for dynamic content using Selenium. You can use it to scrape flight ticket prices from websites like Expedia or Skyscanner.

However, always keep in mind that real websites are trickier than sandboxes, and they rarely enjoy being scraped. For example, scraping Amazon is possible but much more complicated. Nevertheless, whichever website you choose to scrape next, always respect its Terms of Service and robots.txt file, comply with data protection laws, and follow good scraping practices.

Advanced Techniques for ChatGPT Web Scraping

The website we scraped is designed to test scrapers, so we didn’t encounter any roadblocks like CAPTCHAs or gotchas. Moreover, all data we scraped was simple HTML content. 

However, what if you need to scrape a website that uses dynamically-generated content based on JavaScript? What about bot-protection systems that prevent scrapers from extracting data? To tackle that, you’ll need some more advanced tools. But don’t worry – ChatGPT will still be able to help you out.

Handling Dynamic Content

If you’re planning to do some advanced scraping, a simple web scraper that only handles regular HTML content probably won’t be enough. Most websites nowadays use JavaScript to load their content – it powers things like infinite scrolling and elements that load as you interact with the page. These elements, however, are too much for simple scrapers to handle.

Try hovering your mouse over the proxy server icon below. While you can do that and see some extra information about the image, simple scrapers can’t. 

There are a couple of ways to scrape dynamic content. One is to use a headless browser that can execute JavaScript alongside your scraper. Another is to use pre-built tools that can handle and scrape dynamic content on websites. 

Using Headless Browsers

A headless browser helps scrapers “browse” the web. Basically, it’s a regular browser like the one you’re using to read this, but it doesn’t have a graphical user interface. So, instead of clicking buttons or typing in URL bars, you interact with the browser programmatically by writing code. 

Traditional web scraping methods rely on sending HTTP requests, so they can only retrieve information that is static – text, images, multimedia. Unfortunately, this means that dynamic elements like pop-ups are left behind. In this case, a headless browser can execute JavaScript by rendering it – just like a real user’s browser would. This allows scraping data from dynamic elements. You can use ChatGPT to assist you in writing code for headless browsers like Selenium, Playwright, Puppeteer, or any other.
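
If you’d rather not have a browser window pop up while the scraper runs, you can also ask ChatGPT to enable headless mode. Here’s a minimal sketch of what that looks like with Selenium’s Chrome options – the target URL is just an example:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Configure Chrome to run without a visible window
options = Options()
options.add_argument("--headless=new")  # use "--headless" on older Chrome versions

driver = webdriver.Chrome(options=options)
driver.get("https://books.toscrape.com/")  # example target
print(driver.title)  # page content is available once its JavaScript has rendered
driver.quit()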

Writing a Prompt for Web Scraping with Selenium

We can use the same prompt as before, and alter it. Instead of using Requests and Beautiful Soup, let’s ask ChatGPT to use Selenium – a powerful Python library used to automate web browsers. This time, let’s use another sandbox – https://quotes.toscrape.com. We’ll scrape quotes and their authors this time.

Step 1. Install the necessary libraries. 

  1. Run the command pip install selenium in the terminal to install the library.
  2. Download ChromeDriver. It lets Selenium control Chrome – the browser your scraper will drive.

NOTE: You’ll have to copy different CSS selectors for this scraping experiment. Repeat the same steps (using “Inspect Element” and copying data) for getting selectors for quotes and authors in this case.

Step 2. Write your prompt. It should look something like this:

Please write me a Python web scraper using Selenium. The scraper should extract quotes and authors from the target website.

Target website: https://quotes.toscrape.com/

CSS selectors: 

  1. body > div > div:nth-child(2) > div.col-md-8 > div:nth-child(1) > span.text
  2. body > div > div:nth-child(2) > div.col-md-8 > div:nth-child(1) > span:nth-child(2) > small

Output: save all quotes and authors in a CSV file.

Here’s how the ChatGPT-generated code should look in your code editor. Notice how the syntax differs from the one we wrote before:

				
					from selenium import webdriver
from selenium.webdriver.common.by import By
import csv

# Set up the web driver (ensure the driver is in your PATH or specify the path)
driver = webdriver.Chrome()

# URL to scrape
url = "https://quotes.toscrape.com/"

# Open the URL
driver.get(url)

# Create lists to store quotes and authors
quotes_list = []
authors_list = []

# Locate the quotes using the provided CSS selector
quotes = driver.find_elements(By.CSS_SELECTOR, "body > div > div:nth-child(2) > div.col-md-8 > div > span.text")
authors = driver.find_elements(By.CSS_SELECTOR, "body > div > div:nth-child(2) > div.col-md-8 > div > span:nth-child(2) > small")

# Extract the text and append it to the lists
for quote, author in zip(quotes, authors):
    quotes_list.append(quote.text)
    authors_list.append(author.text)

# Close the web driver
driver.quit()

# Save the quotes and authors to a CSV file
with open("quotes.csv", mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(["Quote", "Author"])
    for quote, author in zip(quotes_list, authors_list):
        writer.writerow([quote, author])

print("Scraping completed and saved to quotes.csv")

				
			

Step 3. Find your CSV file, and open it. Your results should sit in a neat table.

CSV file opened in Numbers (MacOS)

Using Pre-Built Tools

Instead of using ChatGPT to build a web scraper from scratch, you can get ready-made scrapers to help with your scraping projects.

Firstly, you can opt for a non-scraping option – some providers offer pre-scraped datasets from various industries. Instead of building a scraper, extracting, cleaning, and parsing data yourself, you can check if there are prepared and detailed datasets for your use case. If there aren’t datasets that suit your use case, and you still need to scrape data yourself, experiment with no-code scrapers. These tools require you to interact with the website, but they translate your clicks into scraping logic. They work by letting you browse, click, and scroll through the site like a regular user, while simultaneously extracting the data based on your interactions.

You can also get web scraping APIs or proxy-based APIs. These tools are made to handle any amount of scraping and are designed to work with most modern websites. They can handle JavaScript content, create a unique fingerprint, parse extracted data, manage proxies, bypass anti-scraping systems, and more. They do require some programming knowledge to set up and run, but you can use ChatGPT to help you out here.

Additionally, you can register to use your target website’s API for scraping. Instead of getting the entire HTML document, parsing it, and working around JavaScript elements and anti-scraping measures, an API lets you get specific, structured data without much hassle. Then, you can send requests with Python – ChatGPT can help you write them, too. Keep in mind that not all websites offer APIs, they may restrict the data you can access, and more often than not, they come with a cost.
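
To give a rough idea, calling a website’s API from Python usually boils down to an authenticated GET request. The endpoint, parameters, and header format below are made up for illustration – check your target’s API documentation for the real ones:

import requests

# Hypothetical endpoint and API key - replace with values from the API documentation
API_URL = "https://api.example.com/v1/products"
API_KEY = "your-api-key"

response = requests.get(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},  # authentication scheme varies by API
    params={"category": "books", "page": 1},         # example query parameters
    timeout=10,
)
response.raise_for_status()
print(response.json())  # structured data, no HTML parsing needed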

Avoiding Anti-Scraping Measures

While websites like books.toscrape.com love being scraped, most websites don’t. Instead of giving you a hall pass to look around and collect their data, they implement antibot measures. CAPTCHAs and services like Cloudflare protect them from malicious bots. Though, not all is lost if your target website has these measures implemented. 

Websites want human traffic because it brings revenue. Bot traffic, however, can overload the servers, preventing real people from accessing the site. If you want to scrape successfully, you need your scraper to look like a real human.

Use Proxies

A proxy server is a middleman between you and the web server. When you route your traffic through a proxy server, it masks your original IP address, so the web server can’t see who you really are. Many proxy service providers offer quality proxies that come in large pools, so you can switch addresses to prevent detection and IP blocks. Typically, real people use residential IPs, so this type of proxy is less likely to be blocked.

Residential proxies come from real devices like smartphones, laptops, or even smart TVs that are connected to the internet via Wi-Fi. By using residential IPs, your requests appear natural, especially when you’re doing some heavy-duty scraping.
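
In practice, pointing a Python scraper at a proxy is usually a one-line change. The sketch below assumes a made-up residential proxy gateway and credentials – your provider’s dashboard will give you the real ones:

import requests

# Hypothetical proxy credentials and gateway - replace with your provider's details
proxy = "http://username:password@gate.example-provider.com:7000"
proxies = {"http": proxy, "https": proxy}

# The request leaves through the proxy, so the target sees the proxy's IP, not yours
response = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=10)
print(response.json())  # prints the IP address the target website sees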

Spoof Your Browser’s Fingerprint

Have you ever noticed that websites tend to know it’s you visiting their site, even when you’re not logged in? That’s because your browser’s fingerprint shows some information about you.

A fingerprint is a collection of certain data points – screen resolution, OS, installed fonts, timezone, saved cookies – that help identify you on the web. Regular users like you have pretty detailed fingerprints that are unique, but still quite typical looking. Automated bots, however, tend to have browser fingerprints that either miss human-like information like cookies or precise user-agent strings or are inconsistent, so they stand out.

When you’re building your scraper, it’s important to find ways to spoof your browser’s fingerprint so it blends in. For example, there are Python libraries that can change the fingerprint by adjusting or rotating user-agent strings, or modify parameters like screen resolution and timezone. Antidetect browsers can also be useful – they let you create separate browsing environments with their own digital fingerprints without interlinking them.
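
A very simple version of user-agent rotation looks like the sketch below – the two strings are just examples of the kind of values you’d keep in a larger, regularly updated list:

import random
import requests

# A small example pool of user-agent strings (keep a bigger, up-to-date list in practice)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]

for url in ["https://books.toscrape.com/", "https://quotes.toscrape.com/"]:
    headers = {"User-Agent": random.choice(USER_AGENTS)}  # present a different identity per request
    response = requests.get(url, headers=headers, timeout=10)
    print(url, response.status_code)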

Other Tips for Web Scraping with ChatGPT

  • Don’t expect it to be perfect. Despite being trained on large amounts of data, ChatGPT is still an automated program, and not a developer. It can provide inaccurate information or code, even if your prompt is brilliant.
  • Treat ChatGPT like a junior-level specialist. Think of the chatbot as a capable, but rookie specialist that’s just starting out in the field. Make sure to give clear instructions, and always double-check the result (and your query!).
  • Rephrase, reiterate, and regenerate answers. ChatGPT is so popular because it’s very flexible. If you’re not happy with the answer it provides, you can give additional instructions or regenerate the response until you like it. For example, if you decide you want your scraper to be in Node.js rather than Python, you can ask ChatGPT to recreate it.
  • Ask follow-up questions. Let’s say ChatGPT helped you write a web scraper with Python, but you don’t understand what certain functions mean. Don’t be afraid to ask “What’s X?” or “What does Y do?” to get a better grasp on what you’re doing. By actually understanding the output, you’ll be able to see if it’s correct.

Conclusion

ChatGPT can’t scrape websites for you, but it’s a great tool to get you started with web scraping: while not a scraper itself, it can generate scraping code based on your query. Using the chatbot can be the first step into web scraping even if you can’t write code yourself, but it’s important to remember that it’s an automated program that can provide inaccurate information. 

The post ChatGPT for Web Scraping: A Step-by-Step Guide appeared first on Proxyway.

]]>
https://proxyway.com/guides/chatgpt-web-scraping/feed 0
The 9 Best Methods to Bypass Cloudflare When Web Scraping in 2024 https://proxyway.com/guides/how-to-bypass-cloudflare https://proxyway.com/guides/how-to-bypass-cloudflare#respond Fri, 06 Sep 2024 08:07:35 +0000 https://proxyway.com/?post_type=guides&p=25543 All you need to know about bypassing Cloudflare when web scraping.

The post The 9 Best Methods to Bypass Cloudflare When Web Scraping in 2024 appeared first on Proxyway.

]]>

Guides

Websites don’t like bots crawling around, so they use anti-bot measures like Cloudflare to stop them. If you’re looking to scrape data from Cloudflare-protected websites, you’ll need to know how to bypass it. Let’s find out how to achieve that.

How to bypass Cloudflare for web scraping

If you ever encountered a CAPTCHA with an orange cloud asking you to check the box that you’re human, congratulations – you’re already somewhat familiar with Cloudflare’s service.

Cloudflare is a company that offers content delivery network (CDN) service to increase web performance and security. Apart from making websites more optimized, Cloudflare protects them from things like DDoS attacks, as well as other malicious and unwanted bot traffic. But what about web scrapers?

Unfortunately, scrapers are identified as bots, and they often get blocked. While Cloudflare allows some verified bot traffic to pass through (i.e. Google’s crawlers that index web pages), bot-protection systems rarely know the difference between good and bad bots. And more often than not, web scrapers are interpreted as the bad ones. But if you’re looking to scrape Cloudflare-protected websites without getting blocked, there are ways to bypass this security layer.

What Is Cloudflare’s Antibot Protection?

Cloudflare is known for its Cloudflare Bot Management designed to shield websites from automated attacks and malicious bots. It provides advanced protection against a wide range of threats, including credential stuffing, web bypasses, and account takeover.

Cloudflare’s Bot Management has a Web Application Firewall (WAF) that uses traffic patterns and advanced algorithms to identify threats and stop malicious traffic from reaching a website’s origin server. 

Additionally, the DNS bypass feature differentiates between known good bots and potentially harmful bots. This allows legitimate bots, such as web crawlers, to access a website without being blocked by security measures, but stops suspicious ones.

Cloudflare errors and response codes

When Cloudflare prevents you from entering a specific website, you’re going to see an error code which explains the reasoning behind the block. Here are some of the most popular Cloudflare error codes you can encounter.

Cloudflare Error 1020: Access Denied. Cloudflare doesn’t specify the reasons behind this error, but it means that the client or the browser has been blocked by Cloudflare customer’s (website) Firewall rules. Likely, because the Cloudflare-protected website detects malicious activity, considers you a bot, or you’re denied access to a specific page (i.e., admin panel).

Cloudflare Error 1010: The owner of this website has banned your access based on your browser’s signature. Your request was blocked because of your browser’s signature. This often happens when a scraping tool exposes fingerprint information that lacks detail or isn’t as unique as a human-like fingerprint, and Cloudflare detects it. 

Cloudflare Error 1015: You are being rate limited. You sent too many requests from the same IP address, and got blocked. Real user traffic rarely gets limited, so you either need to send requests through different IPs or reduce the amount altogether.

Cloudflare Error 1009: Your country is blocked. A typical Cloudflare error code, but it has nothing to do with your scraper. The error means that the website owner wants traffic only from specific countries or regions, and yours isn’t on the list.

Most popular Cloudflare error codes

How Does Cloudflare Detect Web Scrapers?

Cloudflare uses two methods to detect web scrapers: passive and active. The passive method relies on backend fingerprinting tests, while the active one relies on client-side analysis.

Passive Detection Methods

Cloudflare checks your IP address reputation to distinguish if you’re a bot or a human. For example, humans typically browse the web through residential or mobile IP addresses, as these are provided to them by their internet service providers. In contrast, datacenter IPs are often used by automated systems, like web crawlers or scrapers. So, someone visiting a website with a datacenter address is likely not a human. 

Additionally, IPs that were previously used for phishing, scamming, or other malicious activity will be banned. So, if you use a proxy service with your web scraper, make sure the provider offers reputable IP addresses.

Cloudflare also keeps track of bot networks. It stores information about IP addresses, devices, and even behavior patterns associated with these networks. So, it will block IPs and ASNs suspected to be a part of a botnet or give them client-side challenges (“What’s 3+7?” or “Select all red shapes”) to solve in order to prove human activity.

HTTP request headers – the information your client’s browser sends – can also be used to distinguish bots from real people. A header set from real human traffic has a detailed user-agent string that defines the operating system, the browser’s version, and other parameters, along with saved cookies and an IP address that matches the stated geolocation. Bots, however, tend to have a lot of this information missing or mismatched.
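
That’s why scrapers often send a full, browser-like set of headers instead of a library’s bare defaults. Here’s a minimal sketch with the requests library – the values should mirror a real browser session you’ve inspected, and headers alone won’t get you past Cloudflare’s other checks:

import requests

# Headers copied from a real browser session look far less suspicious than defaults
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://www.google.com/",
}

response = requests.get("https://books.toscrape.com/", headers=headers, timeout=10)
print(response.status_code)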

Active Detection Methods

To check if the user is not a bot, Cloudflare-protected websites can present Turnstiles – non-interactive CAPTCHA alternatives. These tools are only presented if your traffic looks suspicious. For example, if you have lacking user-agent strings, unusual web interactions, or use datacenter IPs. After Cloudflare checks your parameters and decides you’re a human, you’ll see a “Success!” message after the Turnstile solves itself.

Cloudflare's Turnstile in action

Using proxies or a VPN service increases the chance of encountering Turnstiles. So, if you use proxies for web scraping, keep in mind that the scraper will need to be able to mimic human-like browsing to avoid them. 

Additionally, Cloudflare uses canvas fingerprinting to detect automated activity by checking the user’s device class. This class refers to the combination of your browser, OS, and even hardware parameters. It’s defined by three layers: hardware (GPU), low-level software (installed fonts, pixel rendering), and high-level software (web browser). A combination of these parameters creates a unique fingerprint, and Cloudflare can detect if you’re trying to spoof them.

And that’s not all. Cloudflare also has event tracking. It checks mouse activity, such as movements, scrolls, and clicks, as well as what keyboard buttons you press. Typically, humans need a mouse and a keyboard to browse the web, so if your scraper won’t click enough or move the mouse at all, it will raise suspicions.

How to Bypass Cloudflare Bot Management When Web Scraping

There are several ways to avoid Cloudflare’s challenges when web scraping. While no method is perfect, these tools can help bypass Cloudflare rather simply.

Using Headless Browsers

Headless browsers controlled by tools like Selenium or Puppeteer are regular browsers, just without a user interface. They have no buttons, URL bars, or other elements to interact with. Instead, you use them programmatically. But how does a headless browser help to bypass Cloudflare? There are several advantages to using one when dealing with Cloudflare-protected websites.

  • You can customize the browser’s HTTP headers. You can design them to mimic a real user’s user-agent string, language preferences, cookies, and more. By creating a human-like set of headers, you can trick Cloudflare into letting your scraping bot operate.
  • They make interactions appear human-like. Scrapers are typically very methodical about visiting sites. Humans, on the other hand, are more chaotic and have distinct browsing patterns. For example, we tend to visit the landing page first, then move to product selection, go back and forth until we choose, and so on. Headless browsers can imitate real browsing patterns and web interactions, such as browsing speed, human-like typing, and mouse movements.
  • Headless browsers can be fortified. There are several plugins for headless browsers like Selenium and Puppeteer that patch and optimize them to look more human-like. These plugins (you can find them on GitHub) can help spoof fingerprints, modify and rotate user-agent strings, emulate human-like browsing, and more. This is especially important for scraping tasks because it stops fingerprint leaking and fixes the parameters that would otherwise help Cloudflare identify the browser as a bot – see the sketch below.
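
As one example of such fortification, the community-maintained undetected-chromedriver package patches Selenium’s Chrome driver to remove common automation giveaways. A minimal sketch, assuming the package is installed (pip install undetected-chromedriver) and that your target allows scraping:

import undetected_chromedriver as uc

# Start a patched Chrome instance with fewer automation fingerprints
options = uc.ChromeOptions()
options.add_argument("--lang=en-US")  # small touches like locale help the browser blend in

driver = uc.Chrome(options=options)
driver.get("https://example.com/")  # placeholder URL - use your own target
print(driver.title)
driver.quit()

Keep in mind that patched drivers and stealth plugins tend to lag behind Cloudflare’s updates, so treat them as one tool among several rather than a guaranteed bypass.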

Using Proxies

Proxies are intermediaries between you and the web server. They route your traffic through a different IP and mask the original address and location. As mentioned before, humans almost never browse the web through datacenter IPs. So, in order to prevent detection, it’s important to pick residential or mobile proxies for your web scraper. 

  • Requests appear more natural. When scraping the web, your scraper will be sending loads of requests to the website’s server, and you will get blocked if you flood the server with too many of them. Therefore, it’s a good idea to use different IPs and rotate them. Most proxy services offer sizable IP pools, many geolocations, and an option to rotate proxies. This way, you can make your scraper’s requests appear more scattered and natural – as if they come from different people.
  • You can access geo-restricted content. Proxies can help you avoid Cloudflare Error 1009. Setting your IP address to the one allowed by the website will give you the opportunity to access and scrape websites if they are geo-restricted in your actual location.

Using Web Scraping APIs

Web scraping APIs are tools that combine data extraction logic and proxies with little input from you. There are many platforms to choose from, each with different strengths and capabilities. Some platforms will require programming knowledge to set up, while others offer ready-made scrapers with integrated proxy service, data parsing, and other beneficial tools.

Scraping a Cached Version

Since Cloudflare allows web crawlers, there’s likely an indexed and cached version of the website you want to scrape. So, instead of trying to bypass Cloudflare yourself, you can scrape the cached version.

To scrape the Google cache, add https://webcache.googleusercontent.com/search?q=cache: to the start of the URL you want to scrape.

For example, if you want to scrape Proxyway’s homepage, your URL should look like this: https://webcache.googleusercontent.com/search?q=cache:https://www.proxyway.com/

While this method is a cost-effective solution, it’s the most unpredictable. The cache might be too old, especially if you’re looking for fresh data. Or your target website isn’t cached at all. If you check the web cache page, it will tell you when the last snapshot was made, so make sure this data is still viable for use.
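
If you want to try it, building the cache URL in Python is a one-liner. Keep in mind that, as noted above, the snapshot may be stale or missing entirely, and the cache endpoint itself isn’t guaranteed to remain available:

import requests

target = "https://www.proxyway.com/"
cache_url = "https://webcache.googleusercontent.com/search?q=cache:" + target

response = requests.get(cache_url, timeout=10)
print(response.status_code)   # 200 means a cached snapshot was returned
print(response.text[:200])    # first part of the cached HTML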

Cached version of Proxyway's homepage

Additionally, some websites like LinkedIn tell Google crawlers not to cache their web pages, so you won’t be able to scrape cache altogether.

Alternative Methods to Bypass Cloudflare

If, for one reason or another, the abovementioned methods don’t work for you, you can try some alternative ways to bypass Cloudflare.

  • Cloudflare solvers are tools and libraries built to get past Cloudflare’s security measures, which work by spotting spoofed fingerprints or unhuman-like browsing. There are several Cloudflare solvers available, some open-source and some paid. However, they often fail to keep up with Cloudflare’s protection methods, which are becoming much harder to bypass.
  • Cloudflare only blocks requests that pass through its network. Instead, you can try sending requests straight to the origin server: find the IP address of the server hosting the content, then send your requests there directly, bypassing Cloudflare. However, this method is quite challenging – it’s not easy to find the origin server’s IP address. And even if you do, the host might reject your request. 
  • If you have the original IP, you can try to request data from the origin server. Once you find the original IP, you can try pasting it into the URL bar, but that might fail because the request lacks a valid Host header that indicates which specific website you want to access. Tools like cURL allow you to specify a host header while requesting the origin server’s IP address. Unfortunately, this method often fails because many origin servers only accept traffic from trusted IP addresses.
  • You can also reverse engineer Cloudflare’s antibot system to find the workaround. In essence, reverse engineering is taking the whole system apart to figure out how it works. Knowing a system inside out can help you find loopholes to bypass it. If you have the skill for it, reverse engineering can be a cost-effective solution, especially if you aim to scrape large volumes of data. However, Cloudflare’s antibot system was purposefully made to be difficult to understand and it tends to change frequently, so this approach is definitely not for everyone.
  • When you visit a Cloudflare-protected website, you must first wait in the Cloudflare waiting room. It’s a virtual traffic control system that temporarily stops you from accessing the target website. During the wait time, your browser solves challenges to prove you’re not a robot. If all goes well, you’ll be redirected to the web page. However, if you’re labeled as a bot, you’ll be given an “Access Denied” error. You can reverse engineer Cloudflare’s waiting room challenges by checking the network log and debugging the challenge’s script. However, this also requires a very high programming skill level.

Conclusion

There are several ways to bypass Cloudflare, but the one you choose depends on the amount of resources you have, how much scraping you’re planning to do, and even your skill level.

If you’re looking for cheaper options, you can scrape cached versions of websites, try to reverse engineer the antibot system, or get an open-source Cloudflare solver. While not perfect, these methods can be great for the tech-savvy and those looking to save a buck. On the other hand, proxies and web scraping APIs are more expensive but also much more efficient. They are reliable, relatively simple to understand, and the cost usually depends on how much traffic you use.

Picture of Isabel Rivera
Isabel Rivera
Caffeine-powered sneaker enthusiast

The post The 9 Best Methods to Bypass Cloudflare When Web Scraping in 2024 appeared first on Proxyway.

]]>
https://proxyway.com/guides/how-to-bypass-cloudflare/feed 0
Web Scraping JavaScript vs Python: Which is Best for Your Needs? https://proxyway.com/guides/web-scraping-javascript-vs-python https://proxyway.com/guides/web-scraping-javascript-vs-python#respond Tue, 03 Sep 2024 13:09:41 +0000 https://proxyway.com/?post_type=guides&p=25458 This is an in-depth guide on web scraping with JavaScript and Python

The post Web Scraping JavaScript vs Python: Which is Best for Your Needs? appeared first on Proxyway.

]]>

Guides

JavaScript and Python are two popular programming languages used for web scraping. But which one suits your needs better? Let’s find out. 

Web Scraping with JavaScript vs Python

JavaScript and Python, commonly used for web and mobile application development, data science, and various other tasks, are also popular for web scraping. There is no definitive answer to whether JavaScript or Python is better for web scraping. Usually, it boils down to the project complexity, the requirements (i.e., dynamic content handling), your familiarity and experience with the language, and even personal preference.

It’s important to understand that both languages are perfectly capable of various scraping tasks. Let’s dive into how each programming language works for web scraping, and which one would work better in your case.

What Is Python?

Python is a versatile programming language used for web development, data analysis, and web scraping. It’s pretty easy to grasp because of the relatively simple syntax. 

Even though starting out with Python requires minimal coding experience, it doesn’t mean the programming language itself is somewhat limited. Contrarily, it has extensive resources, such as libraries and frameworks for extra functionality, as well as large communities of professionals and enthusiasts who can help out if you’re facing challenges. 

In addition to its readability and resources, many choose Python for web scraping because it’s capable of handling data processing and manipulation (data cleaning, filtering, adding variables, analysis) really well. Python excels in automating repetitive tasks (i.e., data extraction) and working with large datasets.

What Is JavaScript?

JavaScript (you might see it abbreviated as JS) is a dynamic programming language primarily used to create dynamic content on websites, such as fill-in forms and animations. 

JavaScript may not be as intuitive as Python, and the learning curve is much steeper, but considering how many resources exist already, it’s frequently used for scraping, too. That’s because it’s very compatible with the web in general – most websites nowadays use JavaScript. Some of the most advanced scraping libraries, such as Puppeteer and Playwright, are also written for JavaScript, so it’s useful to have at least some understanding.

It also has a huge and supportive community to help you get started with your project. On top of that, JavaScript is fast, easily handles asynchronous operations, and is well-suited for scraping dynamic content from modern websites.

What is Node.js?

Node.js is an open-source runtime environment that allows you to run JavaScript on the server side, outside of a web browser. But why would you want that?

Each web browser has a dedicated engine to run JavaScript code. JavaScript is usually used on the client side to manipulate web page behavior – fetch new social media posts, run animations, and manipulate other interactive elements. You can try to disable it on your browser, and see how websites behave then.

Using the Node.js framework, however, allows you to build web servers, real-time applications, such as chats or live-streaming services using JavaScript. Ultimately, it makes sense to use JavaScript with the Node.js environment for scraping interactive web pages, due to JavaScript’s ability to handle real-time interactions on web applications natively.

Python vs JavaScript for Web Scraping: Main Features

While both can be used for web scraping, let’s take a closer look at JavaScript vs. Python features and how they compare.

Popularity

It might come as a surprise to newbies, but JavaScript and Python are two of the most popular programming languages in the world, with JavaScript holding the number one spot for several years now. No wonder JavaScript is so prevalent – it’s used everywhere on the web.

Python, despite being first introduced in 1991, is still seeing an increase in popularity. The language is actively maintained, so it remains a primary choice for data science, web scraping, and other tasks.

Most popular programming languages in 2022. Source: GitHub

Available Libraries

Both languages have multiple web scraping libraries that add additional functionality to their vanilla versions. Designated libraries help simplify workload, parse data, handle requests, and manage scraping projects.

Most popular web scraping libraries used with Python are BeautifulSoup (HTML parsing), Scrapy (data extraction), Selenium (browser automation; used with JavaScript and other languages too), Requests (handling requests), Pandas (data analysis and parsing). 

As for JavaScript, you’ll see Puppeteer (Chromium browser automation), Cheerio (HTML parsing), Playwright (browser automation), Axios (handling requests) libraries as some of the most used Node.js libraries.

Most popular JavaScript and Python libraries

Asynchronous Web Scraping

Synchronous web scraping works by sending an HTTP request, waiting for it to finish, and then moving to the next one. Asynchronous web scraping, on the other hand, basically means that your scraper can handle concurrent requests. In simple terms, you don’t need to wait for one request to complete, so you can scrape multiple pages in parallel. Both JavaScript and Python are able to perform web scraping tasks asynchronously, but there are some differences in execution.

When scraping with JavaScript, you’ll use the Node.js environment. It’s inherently asynchronous and capable of handling concurrent requests because it uses a non-blocking I/O (input/output) model. In addition, Node.js uses async/await and Promises, which simplify working with asynchronous code. Though JavaScript with Node.js generally handles asynchronous requests really well, it all comes down to how these operations are dealt with – it can be difficult for less experienced developers.

Python, on the other hand, isn’t inherently asynchronous, but it has asynchronous libraries, like asyncio, aiohttp, and httpx, as well as async/await syntax. Using these libraries can give a Python scraper asynchronous capabilities similar to a JavaScript one, though typically slower. However, things can become complex once you add browser automation tools like Selenium – Python is usually easy to read, so extensive configuration with extra libraries can hurt readability.
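
To make the comparison concrete, here’s a minimal sketch of asynchronous scraping in Python with asyncio and aiohttp – it fetches a few pages from a scraping sandbox concurrently instead of one by one:

import asyncio
import aiohttp

URLS = [f"https://books.toscrape.com/catalogue/page-{n}.html" for n in range(1, 4)]

async def fetch(session, url):
    # While one response is awaited, the other requests keep running
    async with session.get(url) as response:
        html = await response.text()
        return url, len(html)

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in URLS]
        for url, size in await asyncio.gather(*tasks):
            print(f"{url} -> {size} bytes of HTML")

asyncio.run(main())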

Ability to Handle Dynamic Content

Dynamic content is usually fetched by JavaScript, so you might think that only JavaScript-based scrapers can handle it. However, that’s not entirely true. 

While JavaScript excels in handling dynamic content natively with tools like Puppeteer, Python can do it, too. More specifically, with the help of headless browser libraries, like Selenium and Playwright. However, it’s worth noting that it can add complexity to the scraping process. Beginners might be discouraged due to the sheer amount of code management involved – the more libraries you use, the worse the readability gets.

Performance

In terms of performance, it’s hard to compare the two – both are great, but in different areas. 

Python is perfect for processing and manipulating scraped data – parsing, cleaning, and analyzing information from different web pages. It’s also great for writing scripts for repetitive tasks, such as scraping similar data from multiple web pages. Python is very effective for large web scraping projects because of additional frameworks and libraries. It makes it rather simple to deal with vast volumes of data.

But JavaScript also deserves some praise. While not as great with data processing, JavaScript-based scrapers are very efficient and fast. Node.js is built to handle loads of concurrent connections, so if you’re looking to scrape web applications with AJAX or JavaScript elements, it will easily handle them. It’s also scalable.

Ease of Use

So, JavaScript vs. Python – which is easier to use? The short answer is Python. The long one – it depends on your experience and what you’re aiming to do.

Python is easy to pick up even for people without any prior programming experience. Python has more available integrations, libraries, and frameworks to expand its capabilities. However, juggling multiple integrations can be tricky – the code becomes less readable, and scraping speed decreases.

JavaScript isn’t as easy to begin working with, but it’s more efficient and faster for complex and dynamic scraping tasks. Nevertheless, managing multiple concurrent tasks is a hassle because it can raise complexity with things like error handling, synchronization, and data consistency.

Ultimately, many in the scraping community agree that Python is easier to use because it’s more user-friendly.

Community Support and Documentation

Python has extensive documentation about its use; it’s full of syntax manuals and tutorials on the setup, library use, third-party modules, and more. The documentation also covers every existing Python version (even the ones from the 90s!) and the ones that are still in development. Not only that, but it also covers additional resources, such as book lists, audio/visual talks, and developer’s guides. It’s a massive collection of information with everything you need to know about coding with Python.

Here’s the thing about JavaScript documentation – it doesn’t exactly exist. However, the JSDoc on GitHub can help you understand it better. Don’t worry, Node.js – the runtime environment used for web scraping with JavaScript – has pretty awesome documentation. It’s vast, full of useful resources, and information on past releases. There’s also a code repository and issue tracker if you need that!

JavaScript and Python both have large communities that are supportive, and full of tech enthusiasts as well as seasoned developers who are ready to help you out. You can find them on Stack Overflow, GitHub, Reddit or Discord.

Web Scraping JavaScript vs. Python: Side-by-side Comparison

Libraries
  • Python: BeautifulSoup, Selenium, Scrapy, Requests, and more
  • JavaScript: Puppeteer, Playwright, Cheerio, Axios, Selenium, and more

Asynchronous scraping
  • Python: available with some libraries
  • JavaScript: available

Dynamic content handling
  • Python: available with headless browser libraries
  • JavaScript: available natively

Performance and scalability
  • Python: slower; scalable
  • JavaScript: faster; scalable

Usability
  • Python: easy to pick up; becomes more difficult with multiple integrations
  • JavaScript: harder to pick up; becomes even more difficult with multiple integrations

Documentation and community support
  • Python: extensive documentation; large community
  • JavaScript: no official documentation; large community

Starting Web Scraping With Python vs. JavaScript

If you want to test Python vs. JavaScript in practice, you can try a few scraping projects with each and see how your experience differs. We created a tutorial for beginners that will help you test the programming language.

An introductory guide to Python web scraping with a step-by-step tutorial.

Also, we compiled a list of small and large scale Python project ideas that you can use to practice your skills.

As for JavaScript, you can try following a step-by-step guide to scrape data from a website with Node.js environment. It will help you choose the essential libraries for your scraping project, write the code, and introduce other helpful tips for starting out.

Everything you need to know about web scraping with Node.js and JavaScript in one place.

By trying web scraping with both JavaScript and Python in practice, you’ll be able to make an informed decision about which one you should choose for your upcoming projects.

Conclusion

JavaScript and Python are popular choices for web scraping, but each comes with its own strengths, weaknesses, and distinctive features. 

Python’s simplicity and extensive libraries are great for beginners and advanced users alike, especially for data-heavy projects. On the other hand, JavaScript’s impeccable ability to handle dynamic content and asynchronous requests by default is highly useful when scraping modern web applications.

In reality, there’s no right or wrong option – both languages are good and full of various resources to expand their capabilities. Choose Python if code readability and scraped data analysis is your priority. But if you’re planning to scrape JavaScript-heavy web applications, you should pick JavaScript instead.

Picture of Isabel Rivera
Isabel Rivera
Caffeine-powered sneaker enthusiast

The post Web Scraping JavaScript vs Python: Which is Best for Your Needs? appeared first on Proxyway.

]]>
https://proxyway.com/guides/web-scraping-javascript-vs-python/feed 0
Differences Between SOCKS and HTTP Proxy: A Complete Guide https://proxyway.com/guides/socks-vs-http-proxy https://proxyway.com/guides/socks-vs-http-proxy#respond Fri, 23 Aug 2024 08:57:12 +0000 https://proxyway.com/?post_type=guides&p=24927 All you need to know about SOCKS and HTTP proxies.

The post Differences Between SOCKS and HTTP Proxy: A Complete Guide appeared first on Proxyway.

]]>

Guides

Proxy servers can be categorized by protocol, with HTTP and SOCKS being the two main ones. This guide will help you determine whether HTTP or SOCKS proxies fit your use case better.

SOCKS vs HTTP Proxy Main Differences

Internet protocols like SOCKS and HTTP define how data is transmitted online. Proxies based on these protocols act as intermediaries that forward user requests between clients and servers. However, while both protocols handle internet traffic and are suitable for tasks such as web scraping, you might wonder why choose one over another. 

To make an informed decision, it’s important to understand exactly what HTTP and SOCKS proxies are – what are the differences, strengths, and limitations of each. While the primary function of both HTTP and SOCKS proxies is to route internet traffic, there are technical aspects to consider. But let’s start at the beginning.

What Is HTTP Proxy?

Servers and browsers need a way to communicate with each other. HTTP, short for HyperText Transfer Protocol, transfers web traffic, such as text, images, multimedia, etc. between the client – you – and a web server. HTTP is compatible with all types of websites, so it doesn’t matter what kind of website you’re visiting.

HTTP is basically the foundation of the entire communication across the World Wide Web as websites are built with HTML (HyperText Markup Language). So, by sending a request to a website, HTTP brings the content to you – whether you’re visiting a site, watching an online video, or downloading a file – in plain text.

An HTTP proxy serves as an intermediary between you and the server. When your traffic is routed through an HTTP proxy, it masks your original IP and assigns you a new one.

An HTTP proxy can also understand and interpret the data that’s transmitted between the two parties. This is especially useful because it can detect spyware or suspicious data packets, and prevent them from reaching you. However, HTTP is not suitable for transmitting sensitive data – this protocol doesn’t encrypt data, so it can be susceptible to tampering.

Client – HTTP proxy – server communication path

HTTP vs HTTPS

Fundamentally, HTTP and HTTPS are the same protocol. Both are used for data transmission through the web, and follow the same request-response structure for communication between client and server. But HTTPS adds a security layer which encrypts transmitted data – the additional S stands for Secure. In essence, Google treats HTTP protocol as not secure. But let’s go a bit more into detail.

First things first, HTTP transmits web data in plain text. That means it can be read by anyone who has access to the transmission path (the route data takes to travel from you to the server and back). HTTPS, on the other hand, encrypts data using Secure Sockets Layer (SSL)/Transport Layer Security (TLS) protocols. When encrypted, data is secure from tampering or eavesdropping.

For example, with HTTPS, the client and server first perform a handshake to exchange cryptographic keys, and only then transmit data that the receiving side can decrypt back to its original form. HTTP doesn’t have any encryption, so the handshake process doesn’t exist.

What’s more, connection through HTTPS requires a trusted SSL/TLS certificate for identity verification and secure connection. Hence, you won’t be able to make a connection if the certificate is missing. As a less secure protocol, HTTP doesn’t require certificates.

Notice how most websites you visit nowadays start with https://. That’s mainly because HTTP is outdated and rarely used in modern web applications. 

HTTP protocol, compared to HTTPS, is much easier to tamper with, so it can’t be trusted to transmit sensitive data, like credit card, social security, passport, driver’s license numbers, usernames and passwords, and more. 

Identity verification: HTTP – not supported; HTTPS – supported
Data encryption: HTTP – not supported; HTTPS – supported
Security: HTTP – basic; HTTPS – SSL/TLS protocols
Use cases: HTTP – browsing simple websites; HTTPS – browsing and transmitting sensitive data

What Is SOCKS Proxy?

SOCKS (Socket Secure) is an internet protocol that routes traffic like HTTP(S), but instead of only being suitable for web traffic, SOCKS can handle nearly any type of data.

Similarly to HTTP(S) proxies, SOCKS proxies mask your IP address and support data encryption. However, the main appeal of SOCKS is its use of Transmission Control Protocol (TCP) that ensures a stable and fast communication channel, thus making SOCKS a primary choice for tasks like content streaming, peer-to-peer sharing and torrenting as well as online gaming when you need a stable connection with minimal interruptions. 

However, while HTTP(S) proxies can filter malicious data, SOCKS addresses simply transmit data without interpreting it. When comparing SOCKS vs. HTTP proxies, the key difference is flexibility: SOCKS supports various protocols like TCP, UDP, and DNS, and operates on any port, which is useful for streaming or online gaming.

SOCKS4 vs SOCKS5

SOCKS4 and SOCKS5 are versions of the SOCKS protocol, but SOCKS5 offers more features and flexibility.

SOCKS4 is an older version of the protocol that handles basic functionalities like routing TCP traffic between a client and a server that creates a stable communication path. However, it lacks support for more advanced features like UDP traffic or authentication that allow for time-sensitive and more secure communication.

In contrast, SOCKS5 supports both TCP and UDP traffic, so you can transmit different types of data. Additionally, SOCKS5 includes built-in authentication that helps create more secure connections and handle complex network requests. This makes SOCKS5 a preferred choice for tasks that require higher security, like secure remote access or private communication (VoIP).

Supported protocols: SOCKS4 – TCP; SOCKS5 – TCP, UDP
Authentication: SOCKS4 – not supported; SOCKS5 – supported
Security: SOCKS4 – basic, no encryption; SOCKS5 – supports encryption
Use cases: SOCKS4 – simple TCP routing; SOCKS5 – more flexible (P2P, streaming, etc.)

Why Use SOCKS and HTTP Proxies?

Proxies, whether SOCKS or HTTP, can improve your online privacy, security, and performance. They work by masking your IP address and geolocation.

Whether you choose SOCKS or HTTP depends on your specific needs – are you going to work with broad internet traffic or focus on web tasks only?

The Main Differences Between SOCKS vs HTTP Proxies

When deciding between SOCKS vs. HTTP proxies, the choice largely depends on the type of traffic you plan to route through the proxy.

SOCKS proxies are known for their versatility and ability to handle different types of internet traffic. Because SOCKS proxy can use both TCP and UDP protocols, this type is used in cases when you need reliable data transmission or large amounts of data – streaming content or downloading large files. 

In web scraping, SOCKS proxies can take some load off if you’re dealing with non-HTTP data or when you need a more flexible connection to handle complex requests. Their ability to operate on any port also adds to their flexibility.

HTTP proxies, on the other hand, are designed specifically for handling HTTP(S) traffic. They interpret and filter web data, and are a primary choice for tasks like web browsing and accessing geo-restricted content (for example, you can imagine how some workplaces or academic institutions block certain websites). Because HTTP proxies understand the content of transmitted data, they can also block malicious data packets, and optimize connections for target websites.
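
In code, the difference often comes down to the proxy URL scheme. Here’s a minimal Python sketch with the requests library (SOCKS support requires an extra package: pip install requests[socks]); the hostnames, ports, and credentials are placeholders:

import requests

# Placeholder endpoints - substitute your provider's proxy details
http_proxy = {"http": "http://user:pass@proxy.example.com:8080",
              "https": "http://user:pass@proxy.example.com:8080"}

socks_proxy = {"http": "socks5h://user:pass@proxy.example.com:1080",
               "https": "socks5h://user:pass@proxy.example.com:1080"}

# Same request, different protocol underneath
for name, proxies in [("HTTP proxy", http_proxy), ("SOCKS5 proxy", socks_proxy)]:
    response = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=10)
    print(name, response.json())

The socks5h scheme tells requests to resolve DNS through the proxy as well, which keeps lookups from revealing your location.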

SOCKS vs HTTP proxies: the key differences

The Bottom Line

Both SOCKS and HTTP proxies have their own strengths and weaknesses, and both excel at different tasks. However, understanding how SOCKS and HTTP proxies work can help you choose the best proxy type for your needs. 

If you need to transfer large amounts of different internet traffic and reduced latency, SOCKS proxies are your best choice. But if you’re aiming to work with regular web data, like text or multimedia, HTTP is a much better option.

Picture of Isabel Rivera
Isabel Rivera
Caffeine-powered sneaker enthusiast

The post Differences Between SOCKS and HTTP Proxy: A Complete Guide appeared first on Proxyway.

]]>
https://proxyway.com/guides/socks-vs-http-proxy/feed 0
The Ultimate Guide: How to Manage Multiple TikTok Accounts https://proxyway.com/guides/how-to-manage-multiple-tiktok-accounts https://proxyway.com/guides/how-to-manage-multiple-tiktok-accounts#respond Fri, 09 Aug 2024 13:24:48 +0000 https://proxyway.com/?post_type=guides&p=24643 The best practices for managing TikTok accounts with one device.

The post The Ultimate Guide: How to Manage Multiple TikTok Accounts appeared first on Proxyway.

]]>

Guides

TikTok has taken over the internet like a storm, so no wonder it has become a highly popular social media platform for businesses to advertise their products. But if you’re a social media manager for several brands, juggling multiple TikTok accounts can get tricky. This guide will give the know-how on running multiple TikTok accounts on one device.

Best Strategies for How to Manage Multiple TikTok Accounts

It is estimated that there are over 1 billion active TikTok users monthly. The platform also has a significantly higher engagement rate than other social media sites, because most of the content is short, very personalized, and it never stops. Creating videos on TikTok brings communities together, helps follow trends, catch up with friends, and promote business in an amusing and educational way for you and your followers.

But if you create TikTok content for clients or several personal accounts, doing so on a single device is not only tricky, but can lead to shadowbans.

Can You Have Multiple TikTok Accounts?

Yes, you can have multiple TikTok accounts. The platform technically allows users to create and manage up to three profiles. Though in reality there’s no limit to how many accounts you can have. All you need is a unique email address or a phone number for each profile, and enough real or emulated devices to manage them. 

TikTok also has an option to switch between accounts – you can find it by going to the mobile app, opening Settings and Privacy, and clicking on Switch account in the Login section. Notably, this function doesn’t exist on the web version.

The main reason why people use additional tools for multiple account management is because there’s a higher risk your accounts will be flagged as bots or get shadowbanned. A shadowbanned account is still active but it doesn’t reach other users organically, as it is not shown in the “For You” page. This results in fewer views, engagement, and visibility for your content. 

Why People Have Multiple TikTok Accounts

Just like with any other social media platform, there can be several reasons for managing multiple TikTok accounts. Here’s why people choose to have more than one TikTok account:

  • Separate regular and “dump” accounts. You might have heard about “dump” accounts before – it’s a separate personal account for posting content without hesitation or ruining your main account’s feed curation. Usually, “dump” accounts are private and designed for a close-knitted audience. While it’s still a personal account, it’s more exclusive.
  • Separate personal and professional accounts. Promoting a business on social media platforms is crucial nowadays. You can create a TikTok account for creating promotional content, but also have a separate account for personal use. This way, your feeds aren’t linked.
  • Managing accounts for clients. Digital marketing specialists and social media managers help businesses promote their brands and products. In this case, one person – the manager – might handle dozens of commercial profiles as part of their job. Keeping these accounts separate is a must to prevent shadowbanning.

The Risks of Managing Multiple TikTok Accounts

While many users have several TikTok accounts and use them hassle-free, managing multiple client profiles can prove to be more difficult. Mainly because of account bans and shadowbanning.

How to avoid bans and shadowbans on TikTok?

An account ban is quite straightforward – you get banned for community guidelines violations, spam, or copyright infringement. Shadowban, on the other hand, means you’re partially blocked on the app, but you can’t actually see it. If your account is shadowbanned, you can still interact with other creators, but your content doesn’t appear on the main “For You” page. Since most users use the “For You” page to watch relevant content, being invisible there results in significant engagement drop.

There are some obvious reasons why TikTok shadowbans accounts – some of them are disclosed in the community guidelines. However, marketing specialists found out that using the same device for more than three accounts can flag or shadowban some accounts. 

Here’s what you shouldn’t do if you want to run your TikTok accounts without complications:

  • Pretend to be someone else
  • Post offensive or adult content
  • Post harmful content or hate speech
  • Post stolen or duplicate content
  • Spam (posting, liking, commenting, following too fast and too much)
  • Fake engagement (engaging with your content from your other accounts)
  • Run multiple accounts on the same device

Not being compliant with community guidelines can get you banned, but there are other ways TikTok checks if your account is legitimate. The app tracks your device model and ID, browser type, network information, mobile carrier, operating system, IP address, and your activity on the app. If the app detects multiple account activity from the same IP and device, it can appear suspicious, potentially leading to your account or IP being banned. 

However, if you have a hundred accounts to run and only one device, getting client accounts banned isn’t an option. Hence, you’ll need some tools to assist you.

How to Manage Multiple TikTok Accounts

If you have more than three TikTok accounts under your control, switching accounts through settings won’t help much. So, how should you handle multiple accounts if you’re, say, a social media manager for multiple brands?

One of the obvious solutions – use more devices. A mobile app and desktop app will double the amount of accounts you can manage already. You can also try using a virtual private network (VPN). However, there are multiple reports that TikTok has VPN detection tools, and the app sometimes blacklists IPs associated with VPNs. Hence, your best bet when managing multiple TikTok accounts is to use proxy services and an antidetect browser.

Use an Antidetect Browser

You can manage multiple TikTok accounts without revealing your digital identity by using an antidetect browser. The tool allows you to create distinct browsing profiles with unique fingerprints – used browser, operating system, language, time settings, and more. This allows you to access and work with several TikTok accounts from a single device without being detected by the platform – each profile remains isolated, with fingerprints that are not linked to one another.

Unlike Chrome and other browsers that also support multiple profiles, an antidetect browser eliminates profile crossover. It allows you to create separate environments with individually saved cookies, different browser types and versions, time zones, locales, and other parameters that help to detect you. 

With multiple unique profiles, you can easily manage several TikTok accounts and have them open in different browser windows at the same time. This is one of the most popular ways to handle multiple social media profiles without facing bans. 

Moreover, setting up and managing an antidetect browser is far simpler than using multiple devices.

Use Proxy Services

You should also use proxies to mask your IP address and geolocation. Separate browser fingerprints are important, but IP addresses and geolocations are the primary targets when trying to identify someone online.

Some proxy providers offer proxies specifically optimized for social media management; they are effective for TikTok as well. Generally, TikTok proxies are mobile because it’s a mobile app, but residential addresses work as well. They’re also hardest to detect and almost never get blocked since they originate from real devices with authentic browsing histories. Using reputable mobile or residential proxies will reduce the risk of getting banned to a minimum. To achieve the best results, you should combine proxy services with an antidetect browser.

If you’ve never tried proxy services, you might be compelled to try free proxies. These services are free for a reason – usually, free IPs are abused on most social media platforms because everyone can access them. So, if your IP is not blocked yet, it likely will be soon.

Best Practices for Managing Multiple TikTok Accounts

Aside from adhering to TikTok’s community guidelines, here are some additional tips to help you manage multiple TikTok accounts effectively.

  • Assign a different IP address to each TikTok account. Keeping a unique IP and browser fingerprint for each account will help run TikTok accounts smoothly. TikTok is vigilant about running multiple accounts from one IP address, so it’s recommended to keep one IP per account for as long as possible.
  • Don’t use your accounts for unethical purposes. Creating multiple TikTok accounts for spamming, harassment, spreading hate speech, or illegal activities is not only against TikTok’s rules, but simply wrong. Expect account bans or, in some cases, even deletion if you use TikTok for shady business. 
  • Be prepared to fail. There is no foolproof method for managing multiple accounts. Even with antidetect browsers and proxies you may encounter occasional issues when accessing TikTok accounts. If you face problems, try waiting or rotating your IP address (this can be done via your proxy service or antidetect browser’s dashboard), and then attempt to access the account again.
  • Warm up new accounts. Warming up a new account simply means using it the way a regular user would. Start slow and grow from there – watch, like, and comment on videos, start creating content of your own, and interact with your followers. It’s important to warm up new accounts as it lets TikTok know they’re legitimate and not spammy.

This should help you effectively manage multiple TikTok accounts while minimizing the risk of bans, shadowbans, and account deletion.

The Bottom Line

Since TikTok doesn’t want you to have more than three accounts, managing multiple profiles can be quite a hassle. Though, antidetect browsers and proxies make the process much simpler – they’re simple to set up and use, and don’t require extra equipment. Though no solution is perfect, setting up assistive tools, following the best practices, and learning along the way will increase your chances of managing TikTok accounts successfully.


Frequently Asked Questions About Managing Multiple TikTok Accounts

You cannot have more than one TikTok account linked to one email address. Each TikTok account has to be created with a unique email or phone number. However, you can still manage multiple accounts with one device.

Isabel Rivera
Caffeine-powered sneaker enthusiast

The post The Ultimate Guide: How to Manage Multiple TikTok Accounts appeared first on Proxyway.

]]>
https://proxyway.com/guides/how-to-manage-multiple-tiktok-accounts/feed 0
What is AI Web Scraping: A Comprehensive Guide https://proxyway.com/guides/what-is-ai-web-scraping https://proxyway.com/guides/what-is-ai-web-scraping#respond Thu, 08 Aug 2024 10:17:42 +0000 https://proxyway.com/?post_type=guides&p=24706 Learn all about intelligent web scraping with AI-powered tools.

The post What is AI Web Scraping: A Comprehensive Guide appeared first on Proxyway.

]]>

Guides

Collecting accurate and up-to-date data can be a tedious task, but you can automate the process using various tools built for web scraping. However, if you’re working with complex websites, such as media streaming platforms, a typical custom-built scraper might require a lot of effort to set up.

What is AI web scraping

The more data on the web, the more ways to extract it – something as simple as a social media post or a product listing can be a valuable piece of data. A script instructed to retrieve this information can reduce manual work. While custom-built scrapers can do wonders, they can also break easily and miss crucial information. Additionally, many websites have built-in anti-scraping measures that are used to detect and block scrapers from completing their tasks. Not to mention that writing a script requires programming skills.

AI web scraping tools help avoid issues like detection because they respond to the task at hand and adjust themselves – whether it’s solving CAPTCHAs or automatically rendering JavaScript content. In this guide, you’ll learn everything you need to know about a more flexible and intelligent scraping approach – AI web scraping.

What Is Traditional Web Scraping

Traditional web scraping usually refers to automated data collection using custom-built scripts. In essence, you collect a list of URLs that you want to scrape, send a request to the target page, and your script pulls out the HTML code with all the web data. Then, if written in the script, the scraper cleans up the data to give you the information you initially asked for – it could be product listing names, prices, and whatnot. Once you have the code written, the process is quite straightforward, quick, and works as intended with most websites. However, it also comes with certain limitations.
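
To make that concrete, here’s a minimal sketch of such a script in Node.js. It assumes the axios and Cheerio libraries (covered later in this guide) and a made-up example.com page whose product names sit in .product-title elements – swap both for your real target:

import axios from 'axios'
import { load } from 'cheerio'

// Made-up target page and selector – adjust both to the site you're scraping
const url = 'https://example.com/products'

const resp = await axios.get(url)   // 1. fetch the raw HTML
const $ = load(resp.data)           // 2. parse it into a traversable structure

// 3. pull out only the data you asked for – here, product listing names
const names = $('.product-title')
    .map((i, el) => $(el).text().trim())
    .get()

console.log(names)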

Limitations of Traditional Web Scraping

While quite resource-intensive in the beginning, a custom web scraper can be a cost-efficient way to collect data from the web in the long run. If you’re skilled in programming or eager to learn, we recommend using Python or Node.js for writing your script – these languages are relatively simple, and have many powerful libraries for data collection and analysis. There are a lot of customizations you can do for a traditional scraper, but it’s important to understand that it will require constant attention.

  • They need add-ons for dynamic content: if you build a scraper yourself, you’re the only one responsible for its success. Let’s say you set the scraper up to work with a static HTML structure. When it runs into dynamic content, you’ll have to manually update and adjust the script to make it work again. Imagine a social media site where new posts load as you scroll – in this case, the website fetches post content via JavaScript. So, you’ll need to use a headless browser library to deal with dynamic elements. And trust me, this is harder than it sounds.
How YouTube looks without JavaScript
  • They’re made to work with one website layout: traditional scrapers get confused when websites change their layouts without refreshing the page. This leads to breakage, and missing or inaccurate information. Even with websites that have a simple HTML structure, you’ll have to readjust your scraper manually, if the website owner changes something (even relatively small) in the structure. 
  • They don’t “multitask” well: website layout changes probably won’t be much of an issue if you work with only one website. But if your case requires scraping loads of data on various websites with different structures, it will become tiresome to make adjustments. 
  • They’re more susceptible to antiscraping technologies: have you ever had to check a box to confirm you’re not a robot? While you’re capable of doing so, a traditional scraper usually isn’t. Websites employ various antiscraping technologies – CAPTCHAs, IP blocks, honeypot traps – to prevent robots from overloading their servers with unwanted traffic. In this case, you’ll need extra software like CAPTCHA solver and proxies to bypass web scraping roadblocks. However, it’s inconvenient and can make your script break faster.


If you already encountered these or similar issues with a custom-built scraper, or want to prevent them from happening, you should try AI web scraping.

What Is AI Web Scraping

Traditional scraping has come a long way and, to this day, is the primary choice for gathering web data. However, AI web scrapers significantly improve the process – they can scrape any website without you having to separately run headless browsers and CAPTCHA solvers or manually update the scraper.

Benefits of AI Web Scraping

  • Ability to handle dynamic content and adapt to structural changes: AI-based scrapers can handle both static and complex dynamic web content because they adjust to different content types, whereas traditional scrapers have to be manually reconfigured.
  • Extracted data is more accurate: AI web scrapers work faster and better because they learn from previous tasks. They can filter, contextualize, and parse information intelligently, similarly to how a human would. AI scrapers understand the context and can extract all relevant information regardless of how it’s presented. The process is more efficient, and no manual input is required – the scraper adjusts itself automatically.
  • They can outsmart anti-scraping technologies: AI scrapers can bypass anti-scraping measures, such as CAPTCHAs, honeypot traps (forms invisible to humans that only bots can try to fill in, which signals automated activity), or IP blocking, triggered by a suspiciously high number of requests from one address. They do so by adjusting browsing speed, mouse movements, and click patterns to imitate how a human would behave on a website. They can also choose the right proxy type, rotate them automatically, create unique browser fingerprints, and retry failed requests.

While seemingly foolproof, AI web scrapers aren’t without flaws. Usually, AI web scrapers are quite expensive, as specific features will cost extra. Also, you have less control over functionality and features – you’re stuck with what’s included in the service, and there’s no customization you can do yourself. It’s also worth mentioning that some websites (like Google) can be off limits with some pre-made tools.

Differences between Traditional and AI Web Scraping

In essence, traditional and AI scrapers are the same thing – they scrape data. However, traditional scrapers rely on predefined rules. They are conservative and do precisely what you ask them to do. AI web scrapers, on the other hand, can adapt to the task at hand, even if you didn’t adjust it – they’re more intelligent when encountering complex websites and data. 

Choosing AI Web Scraping Tools

If you need an AI web scraper, there are a couple of ways you can go about this. One way is to build a basic scraper with Python or another programming language from scratch, integrate a headless browser for dynamic content handling, a natural language processing model for semantic analysis and adaptable data extraction, and a machine learning model for data analysis, and then train it.

However, it’s a hassle and requires a significantly higher programming skill level. Fortunately, there’s another option – choose from multiple scraping tools already available on the market. They usually have great performance, well-maintained infrastructure, and are designed to handle large amounts of requests. Also, it’s a much better option for one-off jobs.

No-code AI Scrapers

No-code scrapers are a great choice for people without coding experience – they usually have a user-friendly interface and ready-to-use templates. With a no-code scraper, you visit a website, interact with the elements you want to scrape, and the scraper translates these interactions into scraping logic and structured data. The process is less automated than scripting, but it involves much less manual work than collecting everything by hand.

Not all no-code scrapers are AI-based, but most have intelligent features, such as pattern recognition, automatic adjustments, and the ability to scrape dynamic websites.

Web Scraping APIs and Proxy APIs

Web scraping APIs and proxy APIs are an automatic and programmatic way to scrape the web. They’re like remote web scrapers – you send a request to the API with the URL and other parameters like language, geolocation, or device type. 

They access the target website, download the data, and come back to you with the results. They handle proxies, web scraping logic, antiscraping measures, and you don’t interact with the website yourself but, instead, write a piece of code to instruct the scraper. 

The key difference between scraper APIs and proxy APIs is that the former integrates as an API, while the latter integrates as a proxy server through which your scraping code reroutes traffic.
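
As a rough illustration of that difference, here’s what the two integration styles might look like in Node.js with axios. The endpoint, parameter names, API key, and proxy address below are placeholders, not any specific provider’s API – check your provider’s documentation for the real ones:

import axios from 'axios'

// Scraper API style: you call the provider's endpoint and pass the target URL
// as a parameter (endpoint and parameters are made up for illustration).
const apiResp = await axios.get('https://api.example-scraper.com/v1/scrape', {
    params: { url: 'https://example.com', geo: 'us', device: 'desktop' },
    headers: { Authorization: 'Bearer YOUR_API_KEY' },
})
console.log(apiResp.data)

// Proxy API style: you keep requesting the target site directly, but reroute
// the traffic through the provider's proxy endpoint (credentials are made up).
const proxyResp = await axios.get('https://example.com', {
    proxy: {
        protocol: 'http',
        host: 'proxy.example-provider.com',
        port: 8000,
        auth: { username: 'USER', password: 'PASS' },
    },
})
console.log(proxyResp.data)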

Scraping Browsers

A scraping browser is a tool for automating web interactions and extracting data from websites. It uses browser engines, like the Chromium engine that powers Chrome, to navigate, interact with, and scrape websites, as well as to handle dynamic content and anti-scraping measures. Libraries like Puppeteer also have AI plugins, which can help you programmatically control a regular browser like Chrome or Firefox to perform sophisticated scraping tasks. An AI-powered scraping browser can mimic human actions – clicks, scrolls, filling out forms – and extract data without being detected by anti-bot measures. This is especially important if you’re aiming to scrape JavaScript-heavy websites with strong anti-scraping measures.

AI-Based Platforms

AI-based scraping platforms have a lot of features that make complex scraping tasks easier to handle. These tools help you write and execute scripts, manage scale, and control how the data is extracted and stored. They usually require a good technical understanding and the ability to write code, but they need less maintenance. They have ready-made functions, built-in parsers, and the ability to adapt to changes and adjust scraping scale and rules. Some AI-based platforms also have visual tools that make scraping accessible and less technically challenging.

Best Practices for AI Web Scraping

Websites don’t like being scraped. They want real humans to browse, engage, and make purchases. Scrapers, on the other hand, create unwanted traffic that can overload the servers and doesn’t bring any revenue. Nevertheless, web data collection isn’t illegal.

There are no laws that would prohibit you from scraping the web, but it’s essential to do it ethically and responsibly. Here are some tips on how to scrape ethically:

  • Respect robots.txt file: in simple terms, robots.txt is a file that websites use to instruct web crawlers and scrapers on what they can and cannot do. It helps websites ensure that some parts of the website remain private (a quick way to check it programmatically is sketched after this list).
  • Respect the Terms of Service: it goes without saying, but you should adhere to the rules given by the website owner. Some ToS might forbid automated data extraction, and you should respect that.
  • Scrape politely: when scraping, try to be as respectful to the website as possible – don’t overload the servers with too many requests, don’t access forbidden information, and respect the rules imposed by ToS and robots.txt files.
  • Respect personal data: scraping someone’s personal information without consent violates privacy laws and raises many ethical concerns. Always comply with personal data protection laws, such as the General Data Protection Regulation (GDPR) or The California Consumer Privacy Act (CCPA). Otherwise, you can hurt your business’s reputation and face legal implications.
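
For reference, here’s a minimal sketch of checking a site’s robots.txt before scraping. It assumes the axios library and uses example.com as a stand-in target – it simply prints the Disallow rules so you know which paths to avoid:

import axios from 'axios'

// Stand-in target; replace with the site you intend to scrape
const site = 'https://example.com'

const { data } = await axios.get(`${site}/robots.txt`)

// Print every Disallow rule so you know which paths to stay away from
data.split('\n')
    .filter(line => line.trim().toLowerCase().startsWith('disallow'))
    .forEach(line => console.log(line.trim()))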

Conclusion

AI and machine learning enhance data scraping by handling dynamic content, recognizing complex patterns, and adapting to structural changes. Intelligent features like CAPTCHA solving, automatic proxy management, and semantic content analysis improve the accuracy, speed, and flexibility of scraping. As a result, the data is more structured, easier to understand, and requires less manual work.

Isabel Rivera
Caffeine-powered sneaker enthusiast

The post What is AI Web Scraping: A Comprehensive Guide appeared first on Proxyway.

]]>
https://proxyway.com/guides/what-is-ai-web-scraping/feed 0
How to Manage Multiple Discord Accounts in 2024 https://proxyway.com/guides/how-to-manage-multiple-discord-accounts https://proxyway.com/guides/how-to-manage-multiple-discord-accounts#respond Wed, 17 Jul 2024 10:47:19 +0000 https://proxyway.com/?post_type=guides&p=24178 Learn the best methods for managing multiple Discord accounts.

The post How to Manage Multiple Discord Accounts in 2024 appeared first on Proxyway.

]]>

Guides

If you ever tried to manage hundreds of Discord accounts, you know the hassle of switching between different profiles. Not to mention the struggles of accessing the profiles via different IPs to avoid getting flagged for suspicious activity or simply being able to access the platform from the target location. Don’t worry, we’re here to help.

Best Ideas for How to Manage Multiple Discord Accounts

Discord is one of the most popular messaging platforms in the world. While primarily used among gamers, it’s quickly becoming a communication tool for businesses and building communities with certain shared interests such as sneaker copping and web scraping.

But with the growing popularity of Discord, the platform now has less leeway for connecting multiple profiles at the same time. So, what if you’re a social media manager who needs to manage Discord accounts for multiple clients? Well, it’s not exactly possible while using one IP address. Additionally, if the platform finds any of your actions unfit, you might get your IP banned, or even worse – all your client accounts blocked.

In this guide, we’ll introduce tools that will help to mask your digital identity – IP address and other information that allows websites to identify you – without compromising your clients’ accounts.

Can You Have Multiple Discord Accounts?

Yes, you can have more than one Discord account – it’s not against the rules. According to the official guidelines, you can have and manage up to five different accounts – all you need is a different email address or a phone number for each one.

Discord also rolled out an Account Switcher feature to help you switch between all your accounts seamlessly. You can access Account Switcher by clicking on your avatar icon, and selecting Switch Accounts. However, this feature only works on the Discord desktop app.

If you need to manage more than five Discord accounts on the same device, you will need some extra tools. We’ll get into it later.

What about Multiple Discord Accounts on Mobile?

You can switch accounts on mobile as well. Unfortunately, the process is more difficult as the Account Switcher is not available on the mobile Discord app. There are two ways you can go about managing multiple Discord accounts on mobile. First, you can manually log out and enter the credentials to access another account. Another option is to use an antidetect browser on your phone. It’s a tool specifically designed for managing multiple accounts. Bear in mind that very few providers offer antidetect browsers on mobile devices, and if they do, you’ll need an Android device.

Why People Create Multiple Discord Accounts

There might be several reasons for managing multiple Discord accounts. Here are the main motives why people choose to have more than one Discord account:

  • Separate personal and professional use. Some need Discord as a communication channel at work, while also using the app for gaming, chatting, or other non-work-related reasons. In this case, multiple Discord accounts help to separate work and personal lives.

  • Manage multiple communities. Maybe you have dozens of accounts because you manage Discord communities for a living or as a hobby. Keeping separate accounts with different digital fingerprints will prevent Discord from banning your IP for misuse, and making all accounts inaccessible. You’ll also be able to share the specific account with stakeholders without compromising other accounts under your control.

  • Privacy concerns. You might have a personal account for connecting with friends but also want to participate in discussions on certain servers without revealing yourself. Having a separate Discord account can help camouflage your real identity or create multiple online personas, as they’re not linked across servers.

  • Testing and development. Developers or server administrators might want to use multiple accounts to create and test bots, roles, or other server features without impacting their main Discord account.

How to Manage Multiple Discord Accounts

If you have more than five Discord accounts that you need to manage, the Account Switcher feature will not be of much use because it will simply not work as intended. Luckily, there are a couple of options that will help you out.

Firstly, it’s crucial to understand that your online activity is tracked to identify you. Websites check your IP address, saved cookies, and digital fingerprint (such as what browser, operating system, language settings you’re using, etc.) to ensure that you’re not a machine trying to do something malicious. 

What’s more, if you need to manage hundreds of accounts at the same time, you won’t be able to do so from a single device and IP address. So, an antidetect browser and proxies are your best bet – you’ll be able to assign a different IP address to each browser account and access all Discord accounts in one environment.

Use an Antidetect Browser

You can manage dozens of Discord accounts without revealing that it’s you behind all of them with an antidetect browser. It lets you create completely different profiles with unique fingerprints, and browse the web as usual from a single device.

While Chrome and other browsers also allow managing multiple profiles, an antidetect browser prevents profile crossover. You can create completely unique environments with separately saved cookies, different perceived browsers and their versions, time zones, locales, and other detectable aspects. Hence, each profile is isolated because the fingerprints (your identifiers) are unlinkable to each other.

With several unique profiles, you can manage more than five Discord accounts, and even have them open on multiple browser windows simultaneously – that’s how social media account managers work with multiple client profiles without getting banned. Also, it’s much easier to set up and manage an antidetect browser than to have multiple devices or virtual operating systems.

Use Proxy Services

You should also use proxies to hide your perceived IP address and location. In a very simplified way, think of browser fingerprints as your defining characteristics – the color of your hair, eyes, or what clothes you wear. An antidetect browser will mask these attributes. However, if all these different people are named John and they all live on Baker Street 221B, it looks suspicious. Proxies will help you get a new name and place to live. Namely, your IP address and geolocation.    

Some providers offer proxies optimized for social media management; they work well with Discord, too. Usually, Discord proxies are residential or mobile since they come from real devices with legitimate browsing histories. Hence, they are unlikely to get banned.

Combining proxy services with an antidetect browser will help you achieve the best results in spoofing your online identity.

Other Methods

Surely, you can get more devices or get someone to help you out with their own device, but it’ll cost a lot. Instead, here’s what you can do:

  • Use Discord’s desktop and mobile apps: Each device gives you the ability to access and manage five accounts. By using both mobile and desktop applications, you can manage up to 10 accounts at once.

  • Get Discord Canary or PTB: Canary (the Alpha version) and PTB (Public Test Build, or Beta) can be installed alongside the main Discord app. This way, you’ll be able to manage up to 15 accounts – five in each version.

  • Try app cloning: both Android and iOS support app cloning, so you can try cloning your Discord app. Unfortunately, this method might carry some security risks.

Discord and Discord Canary apps on Mac

Managing Multiple Accounts on Mobile Devices

Managing your Discord accounts on a mobile device is a bit trickier, but possible. There’s no function to quickly switch between accounts on the mobile app, but there are ways to make it easier – for example, using an antidetect browser with mobile support.

Providers like GoLogin or Kameleo offer applications for Android devices. This way, you’ll be able to create multiple profiles on your mobile device and manage those Discord accounts.

Best Practices for Managing Multiple Discord Accounts

Apart from following the terms of service and specific server rules, you should know a few other recommendations that will help you manage all your accounts.

  • Try to keep one IP per account for as long as possible. Keeping your IP as consistent as possible will ensure that your time on Discord runs smoothly. With the help of proxies and antidetect browsers, create and use separate profiles with unique IPs and fingerprints for each Discord account. Changing your IP too often for one account might lead to account suspension, as it signals suspicious activity to Discord.

  • Don’t use your accounts for black hat use cases. It goes without saying, but creating multiple Discord accounts for scamming, hacking, abuse, and other illegal activities is against Discord’s terms of service. Not only can you get banned from your favorite servers or receive account bans, but there might be legal implications as well.

  • Expect failed attempts. There’s no perfect way to manage multiple accounts, even with the help of an antidetect browser or proxies – your requests to run Discord might fail sometimes. Our suggestion: wait or change your IP address (you’ll find an option in your proxy service or antidetect browser’s dashboard), and try again.

The Bottom Line

While no option is perfect for managing more than five Discord accounts, antidetect browsers and proxies are your best bet. Both are relatively simple to set up and use, and require fewer resources compared to alternatives, like buying multiple devices.


Frequently Asked Questions About Multiple Discord Account Management

If you have 5 or fewer Discord accounts, you can easily switch between them with the Account Switcher feature on the desktop app. If you have more than five accounts, you can use antidetect browsers or proxies to manage multiple accounts.

Isabel Rivera
Caffeine-powered sneaker enthusiast

The post How to Manage Multiple Discord Accounts in 2024 appeared first on Proxyway.

]]>
https://proxyway.com/guides/how-to-manage-multiple-discord-accounts/feed 0
Cheerio Web Scraping: A Tutorial on How to Scrape Static Pages with Node.js  https://proxyway.com/guides/cheerio-web-scraping-a-tutorial https://proxyway.com/guides/cheerio-web-scraping-a-tutorial#respond Thu, 07 Mar 2024 20:08:51 +0000 https://proxyway.com/?post_type=guides&p=20004 This is a step-by-step example using Node.js libraries Cheerio and Axios.

The post Cheerio Web Scraping: A Tutorial on How to Scrape Static Pages with Node.js  appeared first on Proxyway.

]]>

Guides

Interested in gathering data from static web pages? Cheerio and Axios libraries are great for the job. Learn everything you need to start your web scraping project.

Cheerio web scraping

Node.js is a great runtime used in web scraping. It’s mostly known for its powerful headless browser libraries for dynamic websites. But you can also gather data from static web pages and clean it for further use. One very popular Node.js library is Cheerio.

In this guide, you’ll learn what web scraping with Cheerio is and what advantages it can bring to your web scraping project. We’ll also explain what other tools you need to successfully gather data from a static web page and how to build a web scraper with Cheerio and Axios. All the information will be wrapped in a real-life example, so you can practice your skills.

What is Web Scraping with Cheerio?

Web scraping with Cheerio involves several steps.

Cheerio isn’t a standalone library: it can only parse, or structure web content. You’ll first need to pick an HTTP client like Axios that will fetch the HTML from your target web page.

Once you have the HTML content, you can load it into Cheerio. Then, the library will let you use jQuery syntax to traverse and manipulate the HTML structure. Cheerio allows selecting page elements and extracting the data using CSS selectors. This can be text, links, images, or any other information. You can also filter elements based on attributes or values.

Some websites require you to handle pagination or scrape data nested in JavaScript. Cheerio can’t execute JavaScript; if this is the case, you’ll need a headless browser library like Playwright or Puppeteer.

Advantages of Web Scraping with Cheerio

If you’re familiar with jQuery, web scraping with Cheerio is pretty simple because it has a similar syntax. 

Cheerio is a pretty flexible tool. It’s built over parse5 for parsing HTML and htmlparser2 for XML documents. It combines their features; for example, parse5 ensures compliance with HTML standards. In essence, it can parse any HTML and XML document. What’s more, the library allows you to traverse the DOM structure, so you can select and manipulate specific elements. 

It also supports a range of CSS selectors, so you can easily pinpoint elements within the document.

Like most parsing libraries, Cheerio is light on resources. The tool runs on the server side and doesn’t require a full browser environment, so it can work through even large, complex pages without much overhead.

Web Scraping Static Pages Using Node.js Libraries Axios and Cheerio

In this step-by-step tutorial, we’ll scrape a list of books – their title, price, rating, stock, and URL – from books.toscrape.com. Even though Node.js has a built-in HTTP module, it isn’t easy to use, so few people choose it for fetching data. For this reason, we’ll be using Axios to fetch the page.

The main page of books.toscrape.com

Prerequisites

  • Node.js. Make sure you have the latest Node.js version installed on your system. You can get it from the official website.
  • Axios. You can add it by running npm install axios in your operating system’s terminal.
  • Cheerio. You can add it by running npm install cheerio.

Importing the Libraries

Step 1. First, let’s import the necessary libraries

1) Import Node.js HTTP client axios.

				
					import axios from 'axios'

				
			

2) Import Node.js parser Cheerio.

				
					import { load } from 'cheerio'

				
			

3) Import the built-in Node.js file system module for writing results into a CSV file.

				
					import fs from 'fs'

				
			

Downloading the Page

Step 1. Let’s download the target page.

				
					const start_url = "http://books.toscrape.com/"

				
			

Step 2. Create a list to store the data.

				
					const books_list = []

				
			

Step 3. Define the scrape() function.

				
					async function scrape(url) {

				
			

Step 4. Make an HTTP request and wait for the response.

				
					let resp = await axios.get(url)

				
			

Step 5. Extract HTML from the response.

				
					let resp_html = resp.data

				
			

Extracting the Data Points

Step 1.  Load the HTML into the Cheerio $ object.

				
					const $ = load(resp_html)

				
			

Step 2. Pass the Cheerio instance to the parse() function.

				
					parse($)
				
			

Step 3. Find the next page selector and the href attribute to scrape the next page.

				
try {
    let next_href = $('.next > a').attr("href")
    // In case the '/catalogue/' part of the URL is not found within
    // the href attribute value, add it to the href
    if (!next_href.includes('catalogue')){
        next_href = `catalogue/${next_href}`
    }
				
			

Step 4. Format the absolute URL of the next page we’re going to scrape.

				
					let next_url = start_url + next_href
console.log('Scrape: ' + next_url)
				
			

Step 5. Call the scrape() function again and pass it the next page’s URL. If there’s no next page, the catch block (shown in the full code at the end) simply ends the job.

				
await scrape(next_url)
				
			

Parsing the HTML

Step 1. Define the parsing function.

				
					function parse($){

				
			

Step 2. Now, we need to figure out where the data points are located. Let’s scrape four elements: the book title, price, rating, and availability. Right-click anywhere on the page and press Inspect. You can see that they’re all under a class called product_pod:

Inspecting the product_pod element.

Step 3. We can extract the whole class:

				
					$('.product_pod').map((i, element) => {

				
			
The printed product_pod content.

But the data you get will be messy, so let’s be more specific.

1) Extract the book title by finding the H3 tag within the element.

				
					const book_title = $(element).find('h3').text()
				
			

2) Then, extract the book price by getting rid of the pound sign.

				
					const book_price = $(element).find('.price_color').text().replace('£', '')
				
			

3) Now, get the book rating from the p tag with the classes star-rating and Num, where Num is the book’s rating. This part is a bit more complicated because the rating is stored in a class name that includes two words, and we only need one.

So, you first need to find the element with that class and get the value of the class attribute, which returns a string (for example, "star-rating One"). Then, split this string into a list of words using spaces as separators and grab the second word.

				
					const book_rating = $(element).find('p.star-rating').attr("class")
.split(' ')[1]
				
			

4) Extract book stock information by finding the element by the instock class and trim unnecessary whitespaces.

				
					const book_stock = $(element).find('.instock').text().trim()
				
			

5) Get the book URL by finding the a tag within the product_pod element and getting its href attribute which you’ll need to append to the start_url.

				
					const book_url = start_url + $(element).find('a').attr("href")
				
			

Step 4. Now, let’s append our data points to the list:

				
					books_list.push({
    "title": book_title,
    "price": book_price,
    "rating": book_rating,
    "stock": book_stock,
    "url": book_url
})
				
			

Step 5. End the iteration.

				
					})
    //console.log(books)
}
				
			

Saving the Output to a CSV File

Step 1. Now, let’s structure all our data by writing it to a CSV file.

				
					function write_to_csv(){
				
			

Step 2. Get the keys from the first books_list entry – these will become the header line of the CSV file.

				
					var csv = Object.keys(books_list[0]).join(', ') + '\n'
				
			

Step 3. Iterate through each book dictionary element.

				
					books_list.forEach(function(book) {
				
			

Step 4. Add a new line to the csv variable with the line break at the end.

				
					csv += `"${book['title']}", ${book['price']}, ${book['rating']}, ${book['stock']}, ${book['url']},\n`
})
//console.log(csv)
				
			

Step 5. Write the output to a CSV file.

				
					fs.writeFile('output.csv', csv, (err) => {
        if (err)
            console.log(err)
        else {
            console.log("Output written successfully")
        }
    })
}
				
			

Step 6. Then, pass the starting URL to the scrape() function and await it so that all of the scrapes finish before we move on to writing the output.

				
					await scrape(start_url)
				
			

Step 7. Call the function to write the output.

				
					write_to_csv()
				
			
The resulting output.csv file.
Here’s the full code:
				
					import axios from 'axios'
import { load } from 'cheerio'

// For writing into the output file
import fs from 'fs'

const start_url = "http://books.toscrape.com/"

const books_list = []

// Function for downloading the page; calls itself again for each next page
async function scrape(url) {
    // Requesting the page with the help of Axios and waiting for the response
    let resp = await axios.get(url)
    let resp_html = resp.data
    // Loading the html into Cheerio. $ - Cheerio object
    const $ = load(resp_html)
    // Passing the Cheerio instance to the parse() function
    parse($)

    try {
        // Try finding the  next page selector and
        // extract the href attribute for scraping the next page
        let next_href = $('.next > a').attr("href")
        // In case the '/catalogue/' part of the URL is not found within
        // the href attribute value, add it to the href
        if (!next_href.includes('catalogue')){
            next_href = `catalogue/${next_href}`
        }
        // Formatting the absolute URL of the next page we are going to scrape
        let next_url = start_url + next_href
        console.log('Scrape: ' + next_url)
        // Calling the scrape() function again and passing it the URL
        await scrape(next_url)

    } catch {
        // Next page selector not found, end job
        return
    }

}

// Function for parsing the html of the page.
function parse($){
    // The selector for each distinct book element on the page is an article
    // tag with the class of "product_pod". This line finds all such elements
    // and begins iterating through them.
    $('.product_pod').map((i, element) => {
        // To get the title, we find the h3 tag within the element and
        // extract its text.
        const book_title = $(element).find('h3').text()
        // Price is also simple, we just get rid of the pound sign
        const book_price = $(element).find('.price_color').text().replace('£', '')
        // The book ratings are easily scraped from the p tag with the classes
        // "star rating" and "Num" where "Num" is the rating the book has
        // received. To extract the rating, we first find the element with that
        // class, get the value of the "class" attribute which returns a string:
        // e.g. "star-rating One", split that string by whitespaces and assign
        // the second element of the resulting list to our variable.
        const book_rating = $(element).find('p.star-rating').attr("class")
            .split(' ')[1]
        // Simply finding the element by the "instock" class, extracting the
        // text and trimming the resulting string to strip away unnecessary
        // whitespaces.
        const book_stock = $(element).find('.instock').text().trim()
        // To extract the url of the book, we find the a tag within the
        // product_pod element and get its "href" attribute which we append to
        // the start_url
        const book_url = start_url + $(element).find('a').attr("href")
        // Appending the results dictionary to the books_list
        books_list.push({
            "title": book_title,
            "price": book_price,
            "rating": book_rating,
            "stock": book_stock,
            "url": book_url
        })
    })
    //console.log(books)
}

function write_to_csv(){
    // Getting the keys from the books object, this will become the first line of the csv file
    var csv = Object.keys(books_list[0]).join(', ') + '\n'
    // Iterating through each book dictionary element
    books_list.forEach(function(book) {
        // Adding a new line to the csv variable with the line break at the end
        csv += `"${book['title']}", ${book['price']}, ${book['rating']}, ${book['stock']}, ${book['url']},\n`
    })
    //console.log(csv)
    // Writing the output to a output.csv file
    fs.writeFile('output.csv', csv, (err) => {
        if (err)
            console.log(err)
        else {
            console.log("Output written successfully")
        }
    })
}

// Script starts here, we pass the URL we are going to start our scrape on to
// the scrape function and tell node for it to be awaited so that all of the
// scrapes finish before we move on to writing the output
await scrape(start_url)
// Calling the function to write the output
write_to_csv()
				
			

Web Scraping Dynamic Pages Using Node.js and Puppeteer

If you want to scrape dynamic pages, we have also prepared a step-by-step tutorial using a headless Node.js browser library, Puppeteer. The guide will walk you through extracting data from quotes.toscrape.com – specifically the quote text, author, and tag data points. We’ll also show you how to deal with delayed rendering, and where to integrate proxies in your script.
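
As a quick preview of the proxy part, here’s a minimal sketch of launching Puppeteer through a proxy. The proxy address and credentials below are placeholders, not a real endpoint:

import puppeteer from 'puppeteer'

// Placeholder proxy details – swap in your provider's host, port, and credentials
const browser = await puppeteer.launch({
    args: ['--proxy-server=http://proxy.example.com:8000'],
})
const page = await browser.newPage()
await page.authenticate({ username: 'USER', password: 'PASS' })

await page.goto('https://quotes.toscrape.com/js/')
console.log(await page.title())
await browser.close()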

This is a step-by-step guide to web scraping using the Node.js library Puppeteer.

The post Cheerio Web Scraping: A Tutorial on How to Scrape Static Pages with Node.js  appeared first on Proxyway.

]]>
https://proxyway.com/guides/cheerio-web-scraping-a-tutorial/feed 0
Cheerio vs Puppeteer for Web Scraping https://proxyway.com/guides/cheerio-vs-puppeteer-for-web-scraping https://proxyway.com/guides/cheerio-vs-puppeteer-for-web-scraping#respond Tue, 05 Mar 2024 08:21:16 +0000 https://proxyway.com/?post_type=guides&p=19884 Let’s see which Node.js library is better for your project.

The post Cheerio vs Puppeteer for Web Scraping appeared first on Proxyway.

]]>

Guides

Let’s see which Node.js library is better for your project. 

Cheerio vs Puppeteer

There is a plethora of tools to choose from when web scraping. The choice, by and large, depends on the target website: do you want to scrape a static page or one with elements like endless scrolling? Cheerio and Puppeteer are both Node.js libraries, yet they’re used for different purposes. Learn which tool better suits your needs with this guide.

What is Cheerio?

Cheerio is a data parsing library used to extract data from HTML or XML code. It isn’t a standalone tool – you first need to download the information with an HTTP client such as axios and only then load the HTML into Cheerio, find elements, and scrape the content.

The tool processes the HTML data and creates a structured document known as a DOM tree. This tree mirrors the HTML structure with every element, attribute, and text within the document.

Cheerio isn’t a web browser; it works in a server environment. The tool isn’t able to interpret results as a browser would, so it can’t deal with dynamic elements nested in JavaScript. 

What is Puppeteer?

Puppeteer is a library that allows you to control a headless browser programmatically. The tool was developed by Google in 2018.

The library automates browser interactions like moving the mouse, imitating keyboard inputs, filling out forms, taking screenshots, and other actions. Puppeteer first launches the browser, then opens a page, navigates to the URLs, interacts with the page, and extracts the data.

Until recently, Puppeteer only worked with Chrome and Chromium, but it’s now added experimental support for Firefox and Microsoft Edge.
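
To give you an idea of that flow, here’s a minimal sketch of Puppeteer in action – the target URL and selector are placeholders for illustration:

import puppeteer from 'puppeteer'

// Placeholder target and selector – adjust to the page you're automating
const browser = await puppeteer.launch()          // headless by default
const page = await browser.newPage()
await page.goto('https://example.com')

// Wait for the element to render, then pull its text out of the live DOM
await page.waitForSelector('h1')
const heading = await page.$eval('h1', el => el.textContent.trim())
console.log(heading)

await browser.close()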

Cheerio vs Puppeteer: A Comparison Table

                       Cheerio                     Puppeteer
Primary purpose        Web scraping (parsing)      Browser automation and testing
Difficulty setting up  Easy                        Difficult
Learning curve         Easy                        Difficult
JavaScript rendering   No                          Yes
Performance            Fast                        Slower
Proxy integration      Yes                         Yes
Community              Small                       Large
Best for               Parsing static web pages    Loading and parsing dynamic web pages

Cheerio vs Puppeteer for Web Scraping

Popularity

According to npmtrends.com data, Cheerio has always had more monthly downloads than Puppeteer. 

Monthly npm downloads: Cheerio vs Puppeteer (npmtrends.com).

GitHub data, on the other hand, shows otherwise – Puppeteer has more forks and stars. So, we could say that both Cheerio and Puppeteer are very popular tools in the web scraping community.

Cheerio: 1.7k forks, 27.5k stars

Puppeteer: 9.2k forks, 86.1k stars

Prerequisites and Installation

The installation process of Cheerio and Puppeteer is pretty much the same because both tools work with Node.js. You have to install the latest Node.js version from the official website.

To install the Node Package Manager (npm) package, simply open a terminal or command prompt and enter the following command:

If you’re using Puppeteer:

				
					npm install puppeteer

				
			

If you’re using Cheerio:

				
					npm install cheerio

				
			

Since Puppeteer can run on its own, you don’t need to install other dependencies. For Cheerio, you also need to install an HTTP client such as axios (or any other library you prefer):

				
					npm install axios

				
			

Features

Cheerio. It is built over parse5 for parsing HTML and htmlparser2 for XML documents. This makes the tool flexible, with robust parsing capabilities.

Cheerio allows you to use jQuery syntax to select and manipulate data within HTML documents. So, if you’re familiar with jQuery, using Cheerio is pretty simple.

The parser is faster than Puppeteer because it runs on the server side and uses less computing power.

Puppeteer. The library gives you absolute control over the browser. It runs in headless mode by default, but you can switch to headful. This will show you a visible browser user interface, so you can troubleshoot more easily while scraping.

A great feature of Puppeteer is that it offers plugins like puppeteer-extra-plugin-stealth for spoofing browser fingerprints. It reduces minute variations between a headless and a real browser. What’s more, since Puppeteer controls a browser, it can mimic user actions while on the page.

The library locates information by executing JavaScript. This allows Puppeteer to dynamically interact with web pages and locate elements in real time.

Performance

Cheerio. The tool is much faster than Puppeteer. The reason for that is that Cheerio doesn’t need to execute JavaScript – it simply uses a DOM parser and traverses raw HTML data. Additionally, Cheerio’s lightweight nature makes it ideal for simple scraping tasks where you need speed.

Puppeteer. Compared to other headless browser libraries like Selenium, Puppeteer is much faster because it uses less overhead. But Puppeteer deals with dynamic websites, which require more resources than static pages, so the library is slower than Cheerio.

Ease of Use

Cheerio. Web scraping newbies will find Cheerio much easier to use than Puppeteer. And predictably so, since Cheerio can only handle the HTML code; you won’t need to interact with pages based on dynamic elements such as lazy loading. What’s more, unlike Puppeteer, Cheerio was developed with web scraping in mind, so it has a less complex syntax.

Puppeteer. Using a headless browser library is more complicated overall. The HTML of modern websites changes a lot, so you need to maintain the scraper. Puppeteer uses a wait function (waitForSelector) to pause until the elements appear on the page. If the element doesn’t show up within the default 30 seconds, Puppeteer throws an error. So, poorly set delays or timeouts can hinder your web scraping efforts.
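
For example, here’s a small sketch of adjusting that wait – the URL and selector are placeholders:

import puppeteer from 'puppeteer'

const browser = await puppeteer.launch()
const page = await browser.newPage()
await page.goto('https://example.com')

try {
    // Override the default 30-second limit with a 10-second one
    await page.waitForSelector('.results', { timeout: 10_000 })
} catch (err) {
    console.log('Element did not appear in time:', err.message)
}

await browser.close()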

Community Support and Documentation

Cheerio. The tool has mediocre documentation. According to the creators, the website is still a work in progress. It covers basic and advanced tutorials but they aren’t as thorough as Puppeteer’s documentation. What’s more, the creators say that most guides were written with ChatGPT, which raises some questions about the quality of the content.

You can find the Cheerio community on GitHub and StackOverflow. Compared to Puppeteer, Cheerio has fewer questions on StackOverflow, but they are more often viewed. The community is helpful and responsive, making it easier to find answers to common issues.

Puppeteer. The library has a larger community than Cheerio. You’ll find thousands of discussions on StackOverflow.

Puppeteer has great documentation that includes examples, detailed explanations, and best practices of the tool. So, if you’re a first-time user, you should get the hang of the tool fast enough. Moreover, the official GitHub repository is actively maintained, with regular updates and bug fixes.

Alternatives to Cheerio and Puppeteer

If you want to try another parser, you could go with Python’s Beautiful Soup. It’s a popular option for beginners and developers. The tool is easy to use and can wrap several powerful HTML parsers, including Python’s built-in one. Besides, it has the ability to work with broken HTML pages.

A step-by-step guide to Beautiful Soup web scraping with a real-life example.

In terms of scraping dynamic websites, there are other tools that can handle JavaScript-rendered content. For example, you can use Playwright, which is very similar to Puppeteer – light on resources and fast. Or use Selenium – a tool that’s been out there for two decades. 

Can't decide between Playwright vs Selenium? We're here to help.

The post Cheerio vs Puppeteer for Web Scraping appeared first on Proxyway.

]]>
https://proxyway.com/guides/cheerio-vs-puppeteer-for-web-scraping/feed 0
Playwright Web Scraping: A Step-by-Step Tutorial with Node.js https://proxyway.com/guides/playwright-web-scraping-a-step-by-step-tutorial https://proxyway.com/guides/playwright-web-scraping-a-step-by-step-tutorial#respond Mon, 05 Feb 2024 09:58:00 +0000 https://proxyway.com/?post_type=guides&p=18685 Practice web scraping JavaScript-based websites with our step-by-step tutorial.

The post Playwright Web Scraping: A Step-by-Step Tutorial with Node.js appeared first on Proxyway.

]]>

Guides

Want to practise your skills with JavaScript-based websites? We’ll show you how. 

Playwright web scraping

Gathering data from websites that rely on dynamic elements or elaborate anti-bot systems isn’t an easy task. Before extracting the information, you need to render the entire page like a real user. A headless browser library, Playwright, is just right for the job. 

In this step-by-step guide, you’ll learn why Playwright is such a popular Node.js library used for web scraping JavaScript-rendered websites. You’ll also be able to practice your skills with a real-life example.

What is Web Scraping with Playwright?

Playwright's home page

Web scraping with Playwright is the process of gathering data from JavaScript-rendered websites. The tool works by programmatically controlling a headless browser – a browser without a user interface, such as a tab bar – so it doesn’t need to load visual elements. This saves a lot of resources when web scraping.

Playwright is a relatively new library, developed by Microsoft and first released in 2020. It’s used to automate actions on different browsers: emulate scrolling, clicking, downloading – all the actions you could do with a mouse. Playwright gives you full control over the browser in both headless and headful modes. And most importantly, it’s able to render JavaScript, which regular HTTP libraries can’t do. This makes Playwright a powerful tool for scraping dynamic content from modern websites.

Why Use Playwright for Web Scraping?

Playwright is used in web scraping for several reasons:

  • Cross-browser support. The library can emulate Chromium, Firefox, and WebKit (see the launch sketch after this list).
  • Cross-language support. Playwright supports JavaScript, Python, Java, TypeScript, and .NET.
  • Use it with any operating system. You can use Playwright with Windows, Linux, or macOS.
  • Supports asynchronous and synchronous approaches. Playwright is asynchronous by default; it allows you to make concurrent requests and scrape multiple pages in parallel. Or, you can make one request at a time to reduce complexity.
  • Good performer. The library uses a WebSocket connection that stays open while scraping. So you can send multiple requests in one go. This greatly improves the performance. 
  • Great for spoofing browser fingerprints. Playwright has packages like the playwright-extra for preventing bot detection. 
  • Good documentation. Even though Playwright is a newbie in the web scraping world, it has extensive documentation with many examples.
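
As a quick sketch of the cross-browser point, switching engines is a one-line change, and a proxy can be passed straight to launch(). The proxy details below are placeholders you’d replace with your own:

import playwright from 'playwright'

// Pick any of the three bundled engines: chromium, firefox, or webkit
const browser = await playwright.firefox.launch({
    headless: true,
    // Optional placeholder proxy – remove or replace with your provider's details
    proxy: { server: 'http://proxy.example.com:8000', username: 'USER', password: 'PASS' },
})

const page = await browser.newPage()
await page.goto('https://example.com')
console.log(await page.title())
await browser.close()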

Node.js and Playwright Web Scraping: A Step-by-Step Tutorial

This is a step-by-step guide with a real-life example. We’ll be gathering data with Playwright from a web scraping sandbox – scrapethissite.com.

In this tutorial, we’ll be scraping the film year, title, nominations, awards, and if a film has won the Best Picture awards. You’ll practice some web scraping skills like:

  • scraping a single page;
  • handling multiple pages;
  • waiting for elements to load;
  • loading dynamic content by clicking on buttons;
  • scraping tables;
  • handling errors;
  • writing the output to .json format.
The elements we’ll be scraping.

Prerequisites

To start using Playwright with Node.js, you need to:

  • Node.js. Make sure you have the latest Node.js version installed on your system. You can get it from the official website.
  • Playwright. You can add it by running npm install playwright in your operating system’s terminal. If the browser binaries aren’t downloaded automatically, run npx playwright install.

Importing the Libraries

Step 1. Import the necessary libraries: Playwright and the built-in file system module, which will let you write the output to a file later.

				
					import playwright from 'playwright'
import fs from 'fs'
				
			

Step 2. Then, enter the URL you want to scrape and create films_list that will hold the output.

				
					const url = 'https://www.scrapethissite.com/pages/ajax-javascript/'

var films_list = []
				
			

Preparing to Scrape

Step 1. In this web scraping example, we’ll be using the Chromium browser. To see what the browser is doing, we’ll use the headful mode by setting headless to false.

				
					async function prepare_browser() {
    const browser = await playwright.chromium.launch({
        headless: false, 
    })
    return browser
}

				
			

Step 2. Now, let’s write the main() function to create the browser. Then, we’ll use the created browser context to open a new page, and pass it to a function called get_page(). It will begin the scraping.

				
					async function main() {
    const browser = await prepare_browser() 
    const context = await browser.newContext()
    const page = await context.newPage()
    await get_page(page, url)
    await browser.close()
}

main()

				
			

Scraping a Single and Multiple Pages

Gathering Data from a Single Page

Step 1.  Now, let’s scrape one page: 

  • page.goto() tells the browser to go to the URL. 
  • To find and click the year buttons on the page, we need to create a year_btn_selector variable that contains the CSS selector of these elements. This will allow you to load the content. Inspect the page source by right-clicking anywhere on the page and selecting “Inspect”. 
The CSS selector of the year buttons.
				
					async function get_page(page, url) {
    await page.goto(url)
    const year_btn_selector = '.year-link'

				
			

Step 2. Then, tell Playwright to wait until at least one button element appears on the screen. Let’s set a timeout of 20 seconds. Once a single element has appeared, we can assume that the page has loaded and start scraping. If the wait times out, you can implement additional reload or retry functionality to address the issue (a minimal retry sketch follows the wait call in the snippet below).

				
					await page.locator(year_btn_selector).first().waitFor({'timeout': 20_000})
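// A possible extension (hypothetical helper, not part of the final tutorial code):
// retry the wait a couple of times and reload the page in between before giving up.
async function wait_with_retry(page, selector, retries = 2) {
    for (let attempt = 0; attempt <= retries; attempt++) {
        try {
            await page.locator(selector).first().waitFor({'timeout': 20_000})
            return
        } catch (err) {
            if (attempt === retries) throw err
            await page.reload()
        }
    }
}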
				
			

Gathering Data from Multiple Pages

Step 3.  Now, we can iterate through all the buttons and get their content: 

  • We’ll use the same variable year_btns.  It stores button information that we got using a CSS selector. Then, we’ll iterate through buttons to get their content.

To gather data from each table for every year (2010-2015), we’ll create a function called scrape_table. It’ll take the browser “page” and the “year” variables as parameters. In this context, the year variable corresponds to the button that represents a specific year.

				
					for (let year of year_btns) {
        await scrape_table(page, year)
    }
				
			

The whole function:

				
					async function get_page(page, url) {
    await page.goto(url)
    const year_btn_selector = '.year-link'
    await page.locator(year_btn_selector).first().waitFor({'timeout': 20_000})
    let year_btns = await page.locator(year_btn_selector).all()
    for (let year of year_btns) {
        await scrape_table(page, year)
    }
}

				
			

Scraping the Table

Step 1. First, get the year text (for example, “2015”) from the button and save it in the year_text variable.

Then, tell Playwright to click on the year button and wait until content appears. This time we’re locating a table.

				
					async function scrape_table(page, year){
    let year_text = await year.textContent()
    await year.click()
    const table_selector = 'table.table'
    await page.locator(table_selector).waitFor({'timeout': 20_000})
				
			
Inspecting the table to see the CSS selectors.

Step 2. When the table appears, we can scrape all the rows. The table row (<tr>) has the class of “film”, so we need to select it.

				
					let table_rows = await page.locator('.film').all()

				
			

Getting the Data

Step 1. Now, let’s iterate through the table rows and get movie information using the CSS selectors for each column. The information is stored in a film_info dictionary, including the year of release.

				
					for (let row of table_rows) {
        let film_info = {
            'film-year': year_text,
            'film-title': await row.locator('.film-title').textContent(),
            'film-nominations': await row.locator('.film-nominations').textContent(),
            'film-awards': await row.locator('.film-awards').textContent(),
        }
				
			

Step 2. Then, check if the icon (<i>) element is present in that row. If it is, the film has won the Best Picture award. So, we need to add another key, film_best_picture, to the dictionary and assign it true; otherwise, false.

				
					        if (await row.locator('i').count()>0){
            film_info['film-best-picture'] = true
        } else {
            film_info['film-best-picture'] = false
        }
				
			
The Best Picture icon.

Step 3. After that, print out the result in the console to see the output. The film_info dictionary gets pushed into the films_list list (the one we created at the very beginning).

				
					        console.log(film_info)
        films_list.push(film_info)
    }
				
			

The whole function:

				
					async function scrape_table(page, year){
    let year_text = await year.textContent()
    await year.click()
    const table_selector = 'table.table'
    await page.locator(table_selector).waitFor({'timeout': 20_000})
    let table_rows = await page.locator('.film').all()
    for (let row of table_rows) {
        let film_info = {
            'film-year': year_text,
            'film-title': await row.locator('.film-title').textContent(),
            'film-nominations': await row.locator('.film-nominations').textContent(),
            'film-awards': await row.locator('.film-awards').textContent(),
        }
        if (await row.locator('i').count()>0){
            film_info['film-best-picture'] = true
        } else {
            film_info['film-best-picture'] = false
        }
        console.log(film_info)
        films_list.push(film_info)
    }
}

				
			

JSON Output

Now that the scraping part is done, we can print out the list of objects (films_list) in a structured .json format.

The write_output() function will handle the task of writing the scraped data to a .json file. Here’s how it works:

				
					function write_output() {
    fs.writeFile('output.json', JSON.stringify(films_list, null, 2), (err) => {
        if (err) {
            console.log(err)
        } else {
            console.log("Output written successfully")
        }
    })
}

				
			
  • JSON.stringify(films_list, null, 2): converts films_list into a JSON-formatted string with an indentation of 2 spaces for better readability.
  • fs.writeFile('output.json', … ): writes the JSON-formatted string to a file named output.json.
  • (err) => { … }: a callback function that runs once the write finishes and receives any error that occurred.
  • console.log(err): logs the error if the write fails.
  • console.log("Output written successfully"): confirms the write completed without errors.
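If you prefer promises over callbacks, a roughly equivalent version is sketched below (assuming a reasonably recent Node.js where fs/promises is available; note the different import):

import { writeFile } from 'fs/promises'

async function write_output() {
    try {
        // Serialize the scraped films and write them to output.json
        await writeFile('output.json', JSON.stringify(films_list, null, 2))
        console.log("Output written successfully")
    } catch (err) {
        console.log(err)
    }
}

If you go this route, call it with await write_output() inside main().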

Full main function:

				
					async function main() {
    const browser = await prepare_browser() 
    const context = await browser.newContext()
    const page = await context.newPage()
    await get_page(page, url)
    await browser.close()
    write_output()
}

				
			
A fragment of the .json output.
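One optional tweak, not part of the tutorial itself: if a selector times out, main() above will throw and leave the browser open. Wrapping the scraping steps in try/finally guarantees a clean shutdown — a minimal sketch:

async function main() {
    const browser = await prepare_browser()
    try {
        const context = await browser.newContext()
        const page = await context.newPage()
        await get_page(page, url)
    } finally {
        // Runs whether scraping succeeded or threw an error
        await browser.close()
    }
    write_output()
}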
Here’s the full code:
				
					import playwright from 'playwright'
import fs from 'fs'

const url = 'https://www.scrapethissite.com/pages/ajax-javascript/'

var films_list = []

async function prepare_browser() {
    // Launch a visible Chromium window; switch headless to true to run in the background
    const browser = await playwright.chromium.launch({
        headless: false, 
    })
    return browser
}

async function scrape_table(page, year){
    // Read the year from the button, click it, and wait for the AJAX-loaded table
    let year_text = await year.textContent()
    await year.click()
    const table_selector = 'table.table'
    await page.locator(table_selector).waitFor({'timeout': 20_000})
    let table_rows = await page.locator('.film').all()
    for (let row of table_rows) {
        // Collect the column values for each film row
        let film_info = {
            'film-year': year_text,
            'film-title': await row.locator('.film-title').textContent(),
            'film-nominations': await row.locator('.film-nominations').textContent(),
            'film-awards': await row.locator('.film-awards').textContent(),
        }
        // The <i> icon only appears when the film won Best Picture
        if (await row.locator('i').count()>0){
            film_info['film-best-picture'] = true
        } else {
            film_info['film-best-picture'] = false
        }
        console.log(film_info)
        films_list.push(film_info)
    }
}

async function get_page(page, url) {
    // Open the page, wait for the year buttons to render, then scrape each year
    await page.goto(url)
    const year_btn_selector = '.year-link'
    await page.locator(year_btn_selector).first().waitFor({'timeout': 20_000})
    let year_btns = await page.locator(year_btn_selector).all()
    for (let year of year_btns) {
        await scrape_table(page, year)
    }
}

function write_output() {
    // Serialize the collected films and save them to output.json
    fs.writeFile('output.json', JSON.stringify(films_list, null, 2), (err) => {
        if (err) {
            console.log(err)
        } else {
            console.log("Output written successfully")
        }
    })
}

async function main() {
    const browser = await prepare_browser()
    const context = await browser.newContext()
    const page = await context.newPage()
    await get_page(page, url)
    await browser.close()
    write_output()
}

main()
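A quick note on running it: the script assumes the playwright package is installed (npm install playwright) and that a Chromium build has been downloaded (npx playwright install chromium). Because it uses import statements, Node.js needs to treat it as an ES module — add "type": "module" to your package.json or give the file an .mjs extension — and then launch it with node, for example node scraper.js (the file name here is just an example).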

				
			

Alternatives to Playwright

While Playwright is a powerful library for scraping dynamic elements, it has some tough competition. 

Puppeteer is another great tool for JavaScript-rendered websites. It’s backed by Google and has a larger, more active community.

If you’re torn between the two, we compare these Node.js headless browser libraries side by side so you can choose the better fit for web scraping.
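To give a feel for the syntax, here’s a rough Puppeteer sketch of this tutorial’s first step — loading the page and reading the year buttons. It’s an illustration rather than a drop-in replacement, but the selectors are the same ones used above:

import puppeteer from 'puppeteer'

const url = 'https://www.scrapethissite.com/pages/ajax-javascript/'

async function main() {
    // Launch a visible browser window, same as the Playwright example
    const browser = await puppeteer.launch({ headless: false })
    const page = await browser.newPage()
    await page.goto(url)
    // Wait for the year buttons, then read their text content
    await page.waitForSelector('.year-link', { timeout: 20000 })
    const years = await page.$$eval('.year-link', els => els.map(el => el.textContent))
    console.log(years)
    await browser.close()
}

main()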

Selenium is a veteran in the industry: it supports more programming languages, has a larger community for support, and works with practically any browser you can think of.

Look at how the two popular headless browser libraries compare next to each other.

If you want to try scraping static pages with Node.js, we’d recommend pairing axios with Cheerio. The former is one of the most popular Node.js HTTP clients and will fetch the page for you; the latter is a fast parser for extracting data from the downloaded HTML.

A step-by-step guide to web scraping with Node.js: two examples.
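Here’s a minimal sketch of how the two fit together (example.com stands in as a placeholder URL — swap in the static site you actually want to scrape, and verify the selectors in your browser’s dev tools):

import axios from 'axios'
import * as cheerio from 'cheerio'

async function scrape_static() {
    // Fetch the raw HTML with axios (example.com is just a placeholder)
    const response = await axios.get('https://example.com')
    // Load it into Cheerio and query it with the usual CSS selectors
    const $ = cheerio.load(response.data)
    console.log($('h1').text())
}

scrape_static()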

Isabel Rivera
Caffeine-powered sneaker enthusiast

The post Playwright Web Scraping: A Step-by-Step Tutorial with Node.js appeared first on Proxyway.
