Knowledge Base - Proxyway

How to find element by id using Selenium

A step-by-step guide on how to find element by id using Selenium.

Important: we’ll use a real-life example in this tutorial, so you’ll need the Selenium library and browser drivers installed.

Step 1. Write your first Selenium script. NOTE: We’ll be using Python and Chrome WebDriver. You can add the Chrome WebDriver to your PATH so Selenium can find it (a minimal setup sketch follows right after this paragraph). Step 2. Now let’s find an element by its id in a book listing by inspecting the page source. We’ll be using books.toscrape.com in this example.
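
For reference, here is a minimal sketch of what Step 1 can look like, assuming Selenium 4 and a locally installed ChromeDriver. The driver path below is a placeholder you would replace with your own (or drop the Service object entirely if ChromeDriver is already on your PATH):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Hypothetical driver location; adjust it to your system, or omit Service if the driver is on PATH
service = Service("/path/to/chromedriver")
driver = webdriver.Chrome(service=service)

driver.get("http://books.toscrape.com")
print(driver.title)

driver.quit()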

Firstly, you need to import the By selector module:

				
					from selenium.webdriver.common.by import By
				
			
TIP: See Selenium’s documentation on locating elements for more options. Step 3. Selenium provides multiple ways of finding an element: you can use CSS or XPath selectors, or the built-in By.ID locator. The selectors look like this:
				
					element_by_id = driver.find_element(By.ID, "product_description").text

element_by_css = driver.find_element(By.CSS_SELECTOR, "#product_description").text

element_by_xpath = driver.find_element(By.XPATH, "//*[@id='product_description']").text
				
			

This is the output of the script. It shows the elements you’ve just scraped.

Results:

Congratulations, you’ve just found and extracted the content of an element by its id using Selenium. Here’s the full script:

				
					from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()

driver.get("http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html")

element_by_id = driver.find_element(By.ID, "product_description").text
element_by_css = driver.find_element(By.CSS_SELECTOR, "#product_description").text
element_by_xpath = driver.find_element(By.XPATH, "//div[@id='product_description']").text
driver.quit()

print (element_by_id)
				
			

How to find element by text using Selenium

A step-by-step guide on how to find element by text using Selenium.

Important: we’ll use a real-life example in this tutorial, so you’ll need the Selenium library and browser drivers installed.

Step 1. Write your first Selenium script. NOTE: We’ll be using Python and Chrome WebDriver. You can add the Chrome WebDriver to your PATH. Step 2. Now you’ll need to import the By selector module.
				
					from selenium.webdriver.common.by import By
				
			
TIP: See Selenium’s documentation on locating elements for more options. Step 3. Let’s try to find the stock availability of a book. We’ll be using the books.toscrape.com website in this example. Now, inspect the page source. You can see that the word “stock” and the number of available books appear in the same element:
NOTE: We’ll use an XPath selector to locate the element, as XPath has a built-in text() function. TIP: If you need a refresher, look at an XPath cheatsheet. Step 4. Then use this selector:
				
					//*[contains (text(),'stock')]
				
			

It finds any element on the page whose text contains the string “stock”.

				
					element_by_text = driver.find_element(By.XPATH, "//*[contains (text(),'stock')]").text

print (element_by_text)
				
			

NOTE: We’re using the driver.find_element() function to only get the first element found by the selector. It’s also possible to use the driver.find_elements() function to get a list of all elements.
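
For illustration, here’s a hedged sketch of the find_elements() variant, assuming the same page is still loaded in driver; it prints the text of every matching element rather than just the first:

stock_elements = driver.find_elements(By.XPATH, "//*[contains(text(),'stock')]")
for element in stock_elements:
    print(element.text)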

This is the output of the script. It shows the elements you’ve just scraped.


Step 5. Now we can clean up the result by extracting the number from the text, so you can work with it as an integer. You can do that with a simple regular expression using Python’s re module.

Here we find the decimal number within the element_by_text string, assign it to a new variable and print it out separately:

				
					in_stock = re.findall(r'\d+', element_by_text)[0]

print (f'In stock: {in_stock}')
				
			

This is the output of the script. It shows the stock availability of a book you’ve just scraped.


Results: 

Congratulations, you’ve just extracted the stock availability of a book using Selenium.

				
					from selenium import webdriver
from selenium.webdriver.common.by import By
import re

driver = webdriver.Chrome()

driver.get("http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html")

element_by_text = driver.find_element(By.XPATH, "//*[contains (text(),'stock')]").text
driver.quit()

print (element_by_text)

in_stock = re.findall(r'\d+', element_by_text)[0]
print (f'In stock: {in_stock}')
				
			

How to find all URLs using Selenium

A step-by-step guide on how to find all URLs using Selenium.

Important: we’ll use a real-life example in this tutorial, so you’ll need the Selenium library and browser drivers installed.

Step 1. Write your first Selenium script. NOTE: We’ll be using Python and Chrome WebDriver. You can add the Chrome WebDriver to the Path. Step 2. Now you’ll need to import the By selector module.
				
					from selenium.webdriver.common.by import By
				
			
TIP: See Selenium’s documentation on locating elements for more options. Step 3. We’ll be using books.toscrape.com in this example. If you only need to find all links, selecting all <a> tags is enough to get their href attributes. Getting all the links on the page is simple using the By.TAG_NAME selector:
				
					link_elements = driver.find_elements(By.TAG_NAME, "a")

for link in link_elements:

    print (link.get_attribute("href"))
				
			

NOTE: In the code above, we’re getting all elements with <a> tag and iterating through them to print out their href attribute.

This is the output of the script. It shows the elements you’ve just scraped.


Full code so far:

				
					from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()

driver.get("http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html")

link_elements = driver.find_elements(By.TAG_NAME, "a")
for link in link_elements:

    print (link.get_attribute("href"))

driver.quit()
				
			

Find Links Using CSS or XPath Selectors

Step 1. We’ll use the Proxyway home page in this example to get the URLs of the reviews listed there.

Inspect the page source. You can see that all of these links have one thing in common – a reviews/ string present in their href attributes. That’s what you need to select.


Step 2. You can write the CSS and XPath selectors in this way:

				
					elements_by_css = driver.find_elements(By.CSS_SELECTOR, "a[href*='reviews/']")

elements_by_xpath = driver.find_elements(By.XPATH, "//a[contains(@href,'reviews/')]")
				
			

NOTE: Both of them look for <a> tags within the document whose href attribute contains a reviews/ substring.

Step 3. You can add a couple more lines of code to print everything out. The whole script looks like this:

				
					from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()

driver.get("https://proxyway.com")

elements_by_css = driver.find_elements(By.CSS_SELECTOR, "a[href*='reviews/']")
elements_by_xpath = driver.find_elements(By.XPATH, "//a[contains(@href,'reviews/')]")

print ("By CSS:")   
for link in elements_by_css:
    print (link.get_attribute("href"))
print ("By XPath:")   
for link in elements_by_xpath:
    print (link.get_attribute("href"))
driver.quit()
				
			

This is the output of the script. It shows the links you’ve just scraped.


NOTE: You can see that both selectors return the same links. Duplicates can easily be filtered out by collecting only unique links, as sketched below.
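
For example, a minimal deduplication sketch, assuming it runs before driver.quit() and reuses the elements_by_css list from the script above; a set keeps only unique href values:

unique_links = set()
for link in elements_by_css:
    unique_links.add(link.get_attribute("href"))

for url in sorted(unique_links):
    print(url)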

How to Parse XML with LXML

A step-by-step guide on how to parse XML with LXML.

Important: the article assumes that you are familiar with the XML data structure. Refer to the W3Schools XML tutorial if you need a refresher.

Step 1. Install LXML using pip.

				
					pip install lxml
				
			
TIP: See the LXML documentation for details. If you’re using LXML with Python, import the etree module.
				
					from lxml import etree
				
			
Step 2. Load the XML file you’ll be working with. There are two ways to do this: 1) from an .xml file on your system; 2) by making an HTTP request to get XML content from the Internet. TIP: The parsing is slightly different for each method; see LXML’s parsing documentation for other parsing options. 1. From an .xml file on your system:
				
					filename = "file/location.xml"
parser = etree.XMLParser()
tree = etree.parse(filename, parser)
				
			

2. By making an HTTP request to get XML content from the Internet:

				
import requests

r = requests.get('https://www.w3schools.com/xml/simple.xml')
tree = etree.XML(r.content)
				
			
NOTE: In both cases, the result is parsed into an ElementTree object and stored in the tree variable. Step 3. You’ll need to understand the LXML ElementTree class and XPath selectors for the following steps. Have a look at an LXML tutorial and an XPath tutorial if you need a refresher. Step 4. Let’s continue with the code example you’ve been working on. We’ll get the names of each food item contained in the XML sample. XML data:
				
					<breakfast_menu>
<food>
<name>Belgian Waffles</name>
<price>$5.95</price>
<description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
<calories>650</calories>
</food>
<food>
<name>Strawberry Belgian Waffles</name>
<price>$7.95</price>
<description>Light Belgian waffles covered with strawberries and whipped cream</description>
<calories>900</calories>
</food>
<food>
<name>Berry-Berry Belgian Waffles</name>
<price>$8.95</price>
<description>Light Belgian waffles covered with an assortment of fresh berries and whipped cream</description>
<calories>900</calories>
</food>
<food>
<name>French Toast</name>
<price>$4.50</price>
<description>Thick slices made from our homemade sourdough bread</description>
<calories>600</calories>
</food>
<food>
<name>Homestyle Breakfast</name>
<price>$6.95</price>
<description>Two eggs, bacon or sausage, toast, and our ever-popular hash browns</description>
<calories>950</calories>
</food>
</breakfast_menu>
				
			


Step 5. To get the names, you’ll first need to find a <name> element for each <food> node and get the text data from it. This can be done by the following line of code:

				
					foods = tree.xpath(".//food/name/text()")
				
			
  1. .//food – finds and selects the <food> elements anywhere within the XML
  2. /name – selects the <name> child
  3. /text() – gets the text that is contained within the <name> </name> tags.


NOTE:
 The foods variable is going to contain a list of all food names found in the XML document.

Step 6. Let’s check if the script works by printing its output into the terminal window.

				
					for food in foods:
    print (food)
				
			

This is the output of the script. It shows the names you’ve just scraped.

				
					python lxml_get_text.py
Belgian Waffles
Strawberry Belgian Waffles
Berry-Berry Belgian Waffles
French Toast
Homestyle Breakfast
				
			

Results:
Congratulations, you’ve just learned how to parse XML with LXML. Here’s the full script:

				
					from lxml import etree
import requests
r=requests.get('https://www.w3schools.com/xml/simple.xml')
tree = etree.XML(r.content)
foods = tree.xpath(".//food/name/text()")
for food in foods:
    print (food)
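
If you need more than one field per item, you can also iterate over the <food> elements themselves instead of pulling a single list of names. A minimal sketch, assuming the same tree object as above:

for food in tree.xpath(".//food"):
    # findtext() returns the text of the first matching child element
    name = food.findtext("name")
    price = food.findtext("price")
    print(f"{name}: {price}")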
				
			

How to Get Text Using LXML

A step-by-step guide on how to get text using LXML.

You’ll need to use an XPath selector to get data. Refer to the XPath Tutorial if you need a refresher.

Step 1. Install LXML using pip.

				
					pip install lxml
				
			
TIP: See the LXML documentation for details. If you’re using LXML with Python, import the etree module and the requests library.
				
					from lxml import etree
import requests
				
			

Example 1

Step 2. Let’s start by inspecting the source code of your target page. We’ll be using our The Best Residential Proxy Providers page in this example. You can find providers’ names in divs with the brand class.

Step 3. Then, make an HTTP request and assign the response to the r variable to scrape the site.

				
					r=requests.get("https://proxyway.com/best/residential-proxies")
				
			

Step 4. Parse the HTML response content using etree.HTML() parser provided by LXML.

				
					tree = etree.HTML(r.text)
				
			

Step 5. Select the div elements containing the class brand and get the text element.

				
					divs = tree.xpath(".//div[@class='brand']/text()")
				
			

Let’s have a closer look at the code:

  1. .//div – select all divs within the HTML document.
  2. .//div[@class=’brand’] – select all divs that have a class of brand.
  3. /text() – get the text that is contained in the div.


NOTE: The result is a list of text strings extracted by text().

Step 6. Let’s print out the divs list. You can see that it also contains blank spaces and non-breaking spaces (\xa0 elements) that we don’t need:

You can clean up the results and assign them to a new brand_names list:

				
					brand_names = []
for div in divs:
    if len(div.strip()) > 0:
        brand_names.append(div.strip())
				
			

This is the output of the script. It shows provider names you’ve just scraped.
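
The same cleanup can also be written as a list comprehension; an equivalent sketch:

# Keep only non-empty strings, stripped of surrounding whitespace
brand_names = [div.strip() for div in divs if div.strip()]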

Results: Congratulations, you’ve extracted the provider names. Here’s the full script:

				
					from lxml import etree
import requests
r=requests.get("https://proxyway.com/best/residential-proxies")

tree = etree.HTML(r.text)

divs = tree.xpath(".//div[@class='brand']/text()")

brand_names = []
for div in divs:
    if len(div.strip()) > 0:
        brand_names.append(div.strip())
print (brand_names)
				
			

Example 2

Step 2. Let’s scrape a book title and its description. We’ll be using a product page on the books.toscrape.com website in this example. Step 3. Then, make an HTTP request and assign the response to the r variable to scrape the site.

r=requests.get("https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html")
				
			

Step 4. Parse the HTML response content using etree.HTML() parser provided by LXML.

				
					tree = etree.HTML(r.text)
				
			

Step 5. Now let’s get the book title by inspecting the code. The title can be found within an h1 tag in a div with a product_main class:


The XPath could look like this:

title = tree.xpath("//div[@class='product_main']/h1/text()")[0]

NOTE: However, that won’t work, because the div has another class as well – col-sm-6 – so an XPath that matches only product_main as the full @class value won’t find this exact div.

Step 6. Let’s specify both classes so that the XPath works:

title = tree.xpath("//div[@class='col-sm-6 product_main']/h1/text()")[0]
				
			

You can also use the contains() method as an alternative:

				
					title = tree.xpath("//div[contains(@class,'product_main')]/h1/text()")[0]
print (f'Title: {title}')
				
			

Let’s have a closer look at the code:

  1. //div[contains(@class,'product_main')] – selects the div whose class attribute contains product_main.
  2. /h1 – the title text is not in the div itself but in its <h1> child element.
  3. /h1/text() – gets the text from within the <h1> tags.
  4. [0] – since the tree.xpath() method returns a list and we want the text itself, we simply grab the first element of that list.


Step 7. Now let’s get the book description. It can be found in a <p> tag without any descriptive attributes, right below the div that contains its heading.


One way to get the description text while avoiding any other <p> elements that aren’t relevant looks like this:

				
					description = tree.xpath("//div[@id='product_description']/following-sibling::p/text()")[0]
print (f'Description: {description}')
				
			
Let’s have a closer look at the code:
  1. //div[@id='product_description'] – we select the div with an id of product_description so that we don’t pick up the wrong element.
  2. /following-sibling::p – selects the next <p> sibling of the div we selected before. You can read more about XPath axes for details.
  3. /text() – gets the text within the <p> tags.
This is the output of the script. It shows the book title and description you’ve just scraped.

Results: Congratulations, you’ve extracted the book name and description. Here’s the full script:

				
					from lxml import etree
import requests

r=requests.get('https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html')

tree = etree.HTML(r.text)

title = tree.xpath("//div[contains(@class,'product_main')]/h1/text()")[0]
print (f'Title: {title}')

description = tree.xpath("//div[@id='product_description']/following-sibling::p/text()")[0]
print (f'Description: {description}')
				
			

How to find all ‘href’ attributes using Beautifulsoup

A step-by-step guide on how to extract all the URL elements.
 

Important: we will use a real-life example in this tutorial, so you will need requests and Beautifulsoup libraries installed.

Step 1. Let’s start by importing the Beautifulsoup library.

				
					from bs4 import BeautifulSoup
				
			

Step 2. Then, import requests library.

				
					import requests
				
			

Step 3. Get the source code of your target landing page. We will be using our homepage in this example.

				
					r=requests.get("https://proxyway.com/")
				
			

A universal version of the code might look like this:

				
					r=requests.get("Your URL")
				
			

Step 4. Parse HTML code into a Beautifulsoup object named soup.

				
					soup=BeautifulSoup(r.content,"html.parser")
				
			

Step 5. Then, find all the links with an href attribute. We will be using this tag as an example:

				
					link_elements = soup.find_all("a", href=True)
				
			

NOTE: You can also specify a class.

				
					link_elements = soup.find_all("a", class_=”some_class”, href=True)
				
			

Step 6. Put all the links you’ve found into a dictionary to keep track of them.

				
					dict_of_links = {}
				
			

NOTE: Each link’s URL will be stored under the text found in its a tag, which becomes the dictionary key.

Step 7. Iterate through all the link_elements you’ve scraped and, if a string exists for a particular element, put it into the dictionary.

				
for element in link_elements:
    if element.string:
        dict_of_links[element.string] = element['href']
				
			

Step 8. Let’s check if our code works by printing it out.

				
					print (dict_of_links)
				
			

Results:

Congratulations, you’ve extracted all the URLs. Here’s the full script:

				
					from bs4 import BeautifulSoup
import requests
r = requests.get("https://proxyway.com")
soup = BeautifulSoup(r.content, "html.parser")
link_elements = soup.find_all("a", href=True)

dict_of_links = {}

for element in link_elements:
    if element.string:
        dict_of_links[element.string] = element['href']
    
print (dict_of_links)
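
One thing to keep in mind: some href values may be relative rather than absolute. A hedged sketch of normalizing them with Python’s standard urljoin, assuming the same link_elements list as above and the homepage as the base URL:

from urllib.parse import urljoin

base_url = "https://proxyway.com/"
for element in link_elements:
    # urljoin resolves relative links against base_url and leaves absolute URLs untouched
    print(urljoin(base_url, element["href"]))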
				
			

How to get ‘src’ attribute from ‘img’ tag using Beautifulsoup

A step-by-step guide on how to find image source using Beautifulsoup.


Step 1. Let’s start by importing the Beautifulsoup library.

				
					from bs4 import BeautifulSoup
				
			

Step 2. Then, import requests library.

				
					import requests
				
			

Step 3. Get the source code of your target landing page. We’ll be using books.toscrape.com in this example.

				
					r=requests.get("https://books.toscrape.com/")
				
			

Step 4. Convert the HTML code into a Beautifulsoup object named soup.

				
					soup=BeautifulSoup(r.content,"html.parser")
				
			

Step 5. Inspect the page to find the image object you would like to extract.


You can select all of these image elements like this:

				
					thumbnail_elements = soup.find_all("img", class_ = "thumbnail")
				
			

NOTE: For this website, you can find images by looking for img tags that have a thumbnail class.

Step 6. Let’s check if our code works by printing it out.

				
					print(thumbnail_elements)
				
			

Step 7. Now you need to get the src attribute from each element.

				
					for element in thumbnail_elements:
    print (element['src'])
				
			

Results:
Congratulations, you’ve found and extracted the content of an image source using Beautifulsoup. Here’s the full script:

				
					from bs4 import BeautifulSoup
import requests
r = requests.get("https://books.toscrape.com/")
soup = BeautifulSoup(r.content, "html.parser")
thumbnail_elements = soup.find_all("img", class_ = "thumbnail")

print(thumbnail_elements)


for element in thumbnail_elements:
    print (element['src'])
    
    
#for element in thumbnail_elements:
#    print ("https://books.toscrape.com/" + element['src'])
				
			
If you rebuild the full URL, you can access the image.
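
For instance, here’s a hedged sketch that rebuilds the absolute URL of the first thumbnail and saves it to disk; the output filename is just an assumption:

from urllib.parse import urljoin

first_src = thumbnail_elements[0]["src"]
image_url = urljoin("https://books.toscrape.com/", first_src)

# Download the image and write the raw bytes to a local file (hypothetical filename)
image = requests.get(image_url)
with open("thumbnail.jpg", "wb") as f:
    f.write(image.content)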

How to scrape multiple pages using Beautifulsoup

A step-by-step guide on how to scrape multiple pages using Beautifulsoup.

Important: we’ll use a real-life example in this tutorial, so you’ll need requests and Beautifulsoup libraries installed.

Step 1. Let’s start by importing the Beautifulsoup library.

				
					from bs4 import BeautifulSoup
				
			

Step 2. Then, import requests library.

				
					import requests
				
			

Step 3. Get the source code of your target landing page. We will be using our Guides page in this example.

				
					r=requests.get("https://proxyway.com/guides/")
				
			

Universally applicable code would look like this:

				
					r=requests.get("Your URL")
				
			

Step 4. Get the link to the next page by finding the a tag with the class of next. Then, you only need to get the href from this element and perform a new request.


Step 5. You can put the entire code into a single function:

				
					def scrape_page(url):
    print ("URL: " + url)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    get_data(soup)
    next_page_link = soup.find("a", class_="next")
    if next_page_link is not None:
        href = next_page_link.get("href")
        scrape_page(href)
    else:
        print ("Done")
				
			

This is the output of the script. It shows the page URL you’ve just scraped.


Let’s look at the code step-by-step:

1. Performing a request to get the page.

				
					r = requests.get(url)
				
			

2. Parsing the page and turning into a BeautifulSoup object.

				
					soup = BeautifulSoup(r.content, "html.parser")
				
			

3. Passing the soup object into a different function where you could scrape page data before moving on to the next page.

				
					get_data(soup)
				
			

4. Finding the next link element.

				
					next_page_link = soup.find("a", class_="next")
				
			

5. If such an element exists, then there is another page you can scrape; if not, you’re done.

				
					if next_page_link is not None:
				
			

6. Getting the href attribute from the link element. This is the URL of the next page you’re scraping.

				
					href = next_page_link.get("href")
				
			

7. Calling the same function again and passing it the new URL to scrape the next page.

				
					scrape_page(href)
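
NOTE: Because scrape_page() calls itself once per page, very deep pagination could eventually run into Python’s recursion limit. If that’s a concern, an equivalent iterative sketch (assuming the same get_data() helper) could look like this:

def scrape_pages(url):
    # Follow "next" links in a loop instead of recursing
    while url is not None:
        print("URL: " + url)
        r = requests.get(url)
        soup = BeautifulSoup(r.content, "html.parser")
        get_data(soup)
        next_page_link = soup.find("a", class_="next")
        url = next_page_link.get("href") if next_page_link else None
    print("Done")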
				
			

Results:
Congratulations, you’ve scraped multiple pages using Beautifulsoup. Here’s the full script:

				
					from bs4 import BeautifulSoup
import requests

start_url = "https://proxyway.com/guides"

def scrape_page(url):
    print ("URL: " + url)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    get_data(soup)
    next_page_link = soup.find("a", class_="next")
    if next_page_link is not None:
        href = next_page_link.get("href")
        scrape_page(href)
    else:
        print ("Done")


def get_data(content):
    #we could do some scraping of web content here
    pass

def main():
    scrape_page(start_url)
    

if __name__ == "__main__":
    main()
				
			

How to scrape a table using Beautifulsoup

A step-by-step guide on how to scrape a table using Beautifulsoup.

Important: we’ll use a real-life example in this tutorial, so you’ll need requests and Beautifulsoup libraries installed.

Step 1. Let’s start by importing the Beautifulsoup library.

				
					from bs4 import BeautifulSoup
				
			

Step 2. Then, import requests library.

				
					import requests
				
			

Step 3. Get the source code of your target landing page. We’ll be using Yahoo in this example.

				
					r=requests.get("https://finance.yahoo.com/cryptocurrencies/")
				
			

Universally applicable code would look like this:

				
					r=requests.get("Your URL")
				
			

Step 4. Convert HTML code into a Beautifulsoup object named soup.

				
					soup=BeautifulSoup(r.content,"html.parser")
				
			

Step 5. Then, inspect the page source. You can see that the table has a class of W(100%).

NOTE: A class can be used to pinpoint the table you want to scrape in case there are multiple tables on the same page.


Step 6. Parse the page content with BeautifulSoup, find the table in the HTML content and assign the whole table element to the table_element variable.

				
					soup = BeautifulSoup(r.content, "html.parser")
table_element = soup.find("table", class_="W(100%)")
				
			

NOTE: The goal is to scrape all the rows from the target table.

Step 7. Initialize a new list variable to save the data into.

				
					output_list = []
				
			

Step 8. Search for all tr tags in the table to get all the rows from the table_element that was saved earlier. You’ll also get the header row along with the data rows.

				
					table_rows = table_element.find_all("tr")
				
			

NOTE: In this case, it’s also possible to get specific column values by referring to the aria-label attributes since they are present, but that won’t always be the case, so stick with a universal approach.


Step 9. The following for loop will iterate through all rows you got from the table and get all the children for each row. Each child is a td element in the table. After getting the children, iterate through the row_children list and append the text values of each element into a row_data list to keep it simple.

				
					for row in table_rows:
        row_children = row.children
        row_data = []
        for child in row_children:
            row_data.append(child.get_text())
        output_list.append(row_data)
				
			

Step 10. Let’s display the results.

				
					for row in output_list:
        print (row)
				
			

What you got is a list of lists, each containing 12 elements that correspond to table columns. The first row contains the table headers.

NOTE: This makes it easy to format the output as CSV or JSON and write the results to an output file, or to convert it to a pandas DataFrame for analysis, as sketched below.
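
For example, a minimal sketch of the CSV option using Python’s standard csv module, assuming the output_list built above; the filename is hypothetical:

import csv

with open("crypto_table.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerows(output_list)  # the first row written is the header row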

Results:

Congratulations, you’ve learned how to scrape a table using Beautifulsoup. Here’s the full script:

				
					from bs4 import BeautifulSoup
import requests

r = requests.get("https://finance.yahoo.com/cryptocurrencies/")

soup = BeautifulSoup(r.content, "html.parser")
table_element = soup.find("table", class_="W(100%)")

output_list = []

table_rows = table_element.find_all("tr")

for row in table_rows:
    row_children = row.children
    row_data = []
    for child in row_children:
        row_data.append(child.get_text())
    output_list.append(row_data)

for row in output_list:
    print (row)
				
			

How to find element by class using Beautifulsoup

A step-by-step guide on how to find elements by class using Beautifulsoup.

Important: we will use a real-life example in this tutorial, so you will need the requests and Beautifulsoup libraries installed.

Step 1. Let’s start by importing the Beautifulsoup library.

				
					from bs4 import BeautifulSoup
				
			

Step 2. Then, import requests library.

				
					import requests
				
			

Step 3. Get the source code of your target landing page. We will be looking for guide titles on our homepage in this example.

				
					r=requests.get("https://proxyway.com/")
				
			

Universally applicable code would look like this:

				
					r=requests.get("Your URL")
				
			

Step 4. Convert HTML code into a Beautifulsoup object named soup.

				
					soup=BeautifulSoup(r.content,"html.parser")
				
			

Step 5.  Inspect the page to find a class you would like to extract.


The code to find all elements with this class looks like this:

				
					elements_by_class = soup.find_all(class_ = "archive-list__title")
				
			

NOTE: Since we want to find all the titles instead of one, we use soup.find_all() instead of soup.find().

Step 6. Let’s check if the script works by printing its output into the terminal window.

				
					print(elements_by_class)
				
			

NOTE: If you want to display only the titles, you can get the string attribute of each scraped element.

				
for element in elements_by_class:
    print (element.string)
				
			

Results:

Congratulations, you’ve found and extracted the content of class using Beautifulsoup. Here’s the full script:

				
					from bs4 import BeautifulSoup
import requests
r = requests.get("https://proxyway.com/")
soup = BeautifulSoup(r.content, "html.parser")
elements_by_class = soup.find_all(class_ = "archive-list__title")

print(elements_by_class)

for element in elements_by_class:
    print (element.string)
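
NOTE: element.string returns None when a tag contains nested tags rather than plain text. If some titles come back as None, a hedged alternative is get_text(), which collects all text inside the tag:

for element in elements_by_class:
    print(element.get_text(strip=True))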
				
			

How to Prevent Web Scraping By Blocking Proxies Using IP Geolocation

A guest article by Razvan Popescu, Head of Marketing at Abstract API.

If you’re scraping websites, you might already use a proxy server to collect data reliably and anonymously. What about the other side of the scrape, though – what if you want to block proxies from scraping your site? This article will describe how web scraping and proxies work, and how an IP geolocation API can be used to prevent web scraping with proxies.

What Is Web Scraping?

Web scraping is the process of taking unstructured data and turning it into a structured format. For example, you might use Python to scrape Google Search results. Another common use case is to scrape up-to-date stock data from a stock market website, structure that data into a CSV, and pull those values from the CSV to calculate your stock market returns in a Python program.

There is nothing illegal about doing this, but when it begins burdening a company’s web servers, they may block your IP address. Always check a website’s robots.txt file for their expected scraping behavior and etiquette.

What Is a Proxy?

When an IP address is blocked by a website, the scraper might work around the block by using a proxy server. So, what is a proxy? It’s a third-party server that routes your connections through a different IP address. Remember that IP addresses identify where a connection takes place – for example, the router in your house. A proxy makes that connection appear to be coming from another device in another place.

You may have encountered proxies when bypassing your school’s Internet filters back in the day, or using a VPN to stream region–restricted Eurovision song contests. We aren’t condoning these activities, but they use the idea of rerouting an IP address through a third party connection.

Elements of Successful Web Scraping

A little Python code, some Python libraries (like Beautiful Soup), and an Internet connection are all you need to start basic web scraping. But there are important factors in making your scraping efficient, reliable, and anonymous – that is, successful.

One of the most important factors in web scraping is using a high-quality proxy, or even multiple proxies in a proxy pool to scale up your scraping operation. A high-quality proxy can take your web scraping projects to the next level:

  • If you’re scraping without a proxy, when one site blocks your IP, you have to go find another site with the same information.
  • Proxies increase scraping reliability and volume.
  • Proxies allow you to view content as it appears if accessed from other places in the world. If you’re scraping location-dependent data, this is very important.
  • Proxies protect your identity by substituting one of their IPs for one of your own. Think of it as similar to how APIs allow authenticated users to exchange data through an interface while remaining anonymous to each other. That said, you can provide your contact info in a third party proxy, if you want businesses you are scraping to be able to contact you.

Why Blocking Proxies Is Key to Preventing Web Scraping

As stated above, scraping without proxies is inefficient, unsafe, and doesn’t scale. If someone is serious about web scraping, they’re surely using a high-quality proxy pool.

Proxy servers are a powerful tool. And while collecting public web data isn’t bad in itself, reckless web scraping can cause a lot of damage to websites.

So, if we look at the other end of the process, at the website that is being scraped, what’s the best way for us to protect our resources from bad traffic? We can use proxy detection and IP geolocation to root out users scraping with proxies and block them.

What Is Proxy Detection?

Proxy detection is – you guessed it – a set of ways for the website owner to identify a proxy connection. The website can check the IP address it receives against a list of flagged addresses and block the traffic. If the scraper uses a limited number of IPs, proxy detectors learn to block them, but proxy services will just change IP ranges again.

You can also check the headers for common proxy entries like x-forwarded-for, but this only removes the most basic proxies, and we’re trying to block professionals.

How to Block Proxies Using IP Geolocation

To detect a proxy using IP geolocation, remember that IP addresses carry location information with them, announcing where a connection takes place. A proxy server makes that connection appear to be coming from a different geographic location.

So, if we are trying to identify a proxy server, we could use the free IP geolocation API from Abstract to test this. You can test it for free as soon as you sign up.

Let’s try testing a request in the browser:

				
					https://ipgeolocation.abstractapi.com/v1/?api_key={YOUR API KEY}

				
			

It will return our IP, our geographic location, and a lot of other interesting data:

				
					{
    "ip_address": "174.49.204.134",
    "city": "York",
    "city_geoname_id": 4562407,
    "region": "Pennsylvania",
    "region_iso_code": "PA",
    "region_geoname_id": 6254927,
    "postal_code": "17402",
    "country": "United States",
    "country_code": "US",
    "country_geoname_id": 6252001,
    "country_is_eu": false,
    "continent": "North America",
    "continent_code": "NA",
    "continent_geoname_id": 6255149,
    "longitude": -76.6653,
    "latitude": 39.9552,
    "security": {
        "is_vpn": false
    }

				
			

If we engage a VPN and try the same test request, we get different results. VPNs aren’t the same thing as proxies, but they provide a similar outcome.

				
					{
    "ip_address": "23.105.165.55",
    "city": "Manassas",
    "city_geoname_id": 4771401,
    "region": "Virginia",
    "region_iso_code": "VA",
    "region_geoname_id": 6254928,
    "postal_code": "20110",
    "country": "United States",
    "country_code": "US",
    "country_geoname_id": 6252001,
    "country_is_eu": false,
    "continent": "North America",
    "continent_code": "NA",
    "continent_geoname_id": 6255149,
    "longitude": -77.4918,
    "latitude": 38.7493,
    "security": {
        "is_vpn": false
    }
				
			

Now, we can use this IP geolocation API to see where incoming traffic is coming from and make blocking decisions based on that information (a minimal sketch of such a check follows the list below). Some strategic considerations:

  • We might block IPs coming from countries with high fraud activity.
  • We might block requests geographically outside of our usual customer base.
  • We might take this data and find the proxy traffic isn’t doing anything suspicious or resource-consuming.
  • We might use this data to geo-target our ad campaigns. (This company in that city is disrupting everything!)
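
Here’s a minimal sketch of such a check in Python, using the Abstract endpoint shown above. The ip_address query parameter, the blocked-country set, and the decision logic are assumptions for illustration only:

import requests

API_KEY = "{YOUR API KEY}"  # placeholder, as in the example request above
BLOCKED_COUNTRIES = {"XX", "YY"}  # hypothetical ISO country codes you choose to block

def should_block(visitor_ip):
    # Look up the visitor's IP (assumes the API accepts an ip_address parameter)
    r = requests.get(
        "https://ipgeolocation.abstractapi.com/v1/",
        params={"api_key": API_KEY, "ip_address": visitor_ip},
    )
    data = r.json()
    # Block known VPN/proxy exits, then fall back to a country-based rule
    if data.get("security", {}).get("is_vpn"):
        return True
    return data.get("country_code") in BLOCKED_COUNTRIES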

Can All Proxies Be Detected and Blocked?

The proxy cat-and-mouse game has been going on for a long time, and probably won’t stop. Proxies aren’t illegal, but a lot of the discussion around them makes them sound like only credit card scammers and Anonymous use them. They can be used to responsibly anonymize traffic online, but as with any tool, they sometimes fall into the hands of bad actors.

Considering that bad bot activity now accounts for 39% of internet traffic, it’s a good time to know who is accessing your infrastructure and whether it’s impacting your customers. IP geolocation data is a great tool for understanding that traffic and acting on it.

How to get text from DIV using Beautifulsoup

A step-by-step guide on how to extract the content of a div tag using Beautifulsoup.

Important: we will use a real-life example in this tutorial, so you will need requests and Beautifulsoup libraries installed.

Step 1. First, import the Beautifulsoup library.

				
					from bs4 import BeautifulSoup
				
			

Step 2. Then, import requests library.

				
					import requests
				
			

Step 3. Get your preferred landing page source code. We will use our homepage in this example.

				
					r=requests.get("https://proxyway.com/")
				
			

Universally applicable code would look like this:

				
					r=requests.get("Your URL")
				
			

Step 4. Convert HTML code into a Beautifulsoup object named soup.

				
					soup=BeautifulSoup(r.content,"html.parser")
				
			

Step 5. Find the div whose content you would like to extract. We will be using this tag as an example:


The code to extract it looks like this:

				
					div_text=soup.find("div",{"class":"intro__small-text"}).get_text()
				
			

Step 6. Let’s check if our code works by printing it out.

				
					print(div_text)
				
			

Results:

Congratulations, you’ve found and extracted the content of a div using Beautifulsoup. Here’s the full script:

				
					from bs4 import BeautifulSoup
import requests
r=requests.get("https://proxyway.com/")
soup=BeautifulSoup(r.content,"html.parser")
div_text=soup.find("div",{"class":"intro__small-text"}).get_text()
print(div_text)
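
NOTE: soup.find() returns None when no matching div exists, and calling .get_text() on None raises an AttributeError. A hedged, more defensive variant of Step 5:

# Guard against the div not being found on the page
div = soup.find("div", {"class": "intro__small-text"})
div_text = div.get_text() if div is not None else ""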
				
			
