Knowledge Base - Proxyway

How to find element by id using Selenium

A step-by-step guide on how to find element by id using Selenium.

Important: we’ll use a real-life example in this tutorial, so you’ll need the Selenium library and browser drivers installed.

Step 1. Write your first Selenium script. NOTE: We’ll be using Python and Chrome WebDriver. You can add the Chrome WebDriver to your PATH so Selenium can find it (a minimal setup sketch follows right after this paragraph). Step 2. Now let’s find an element by its id in a book listing by inspecting the page source. We’ll be using books.toscrape.com in this example.
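
For reference, here is a minimal sketch of what Step 1 can look like, assuming Selenium 4 and a locally installed ChromeDriver. The driver path below is a placeholder you would replace with your own (or drop the Service object entirely if ChromeDriver is already on your PATH):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Hypothetical driver location; adjust it to your system, or omit Service if the driver is on PATH
service = Service("/path/to/chromedriver")
driver = webdriver.Chrome(service=service)

driver.get("http://books.toscrape.com")
print(driver.title)

driver.quit()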

Firstly, you need to import the By selector module:

				
					from selenium.webdriver.common.by import By
				
			
TIP: See Selenium’s documentation on locating elements for more options. Step 3. Selenium provides multiple ways of finding an element: you can use CSS or XPath selectors, or the built-in By.ID locator. The selectors look like this:
				
					element_by_id = driver.find_element(By.ID, "product_description").text

element_by_css = driver.find_element(By.CSS_SELECTOR, "#product_description").text

element_by_xpath = driver.find_element(By.XPATH, "//*[@id='product_description']").text
				
			

This is the output of the script. It shows the elements you’ve just scraped.

Results:

Congratulations, you’ve just found and extracted the content of an element by its id using Selenium. Here’s the full script:

				
					from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()

driver.get("http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html")

element_by_id = driver.find_element(By.ID, "product_description").text
element_by_css = driver.find_element(By.CSS_SELECTOR, "#product_description").text
element_by_xpath = driver.find_element(By.XPATH, "//div[@id='product_description']").text
driver.quit()

print (element_by_id)
				
			

How to find element by text using Selenium

A step-by-step guide on how to find element by text using Selenium.

Important: we’ll use a real-life example in this tutorial, so you’ll need the Selenium library and browser drivers installed.

Step 1. Write your first Selenium script. NOTE: We’ll be using Python and Chrome WebDriver. You can add the Chrome WebDriver to your PATH. Step 2. Now you’ll need to import the By selector module.
				
					from selenium.webdriver.common.by import By
				
			
TIP: See Selenium’s documentation on locating elements for more options. Step 3. Let’s try to find the stock availability of a book. We’ll be using the books.toscrape.com website in this example. Now, inspect the page source. You can see that the word “stock” and the number of available books appear in the same element:
NOTE: We’ll use an XPath selector to locate the element, as XPath has a built-in text() function. TIP: If you need a refresher, look at an XPath cheatsheet. Step 4. Then use this selector:
				
					//*[contains (text(),'stock')]
				
			

It finds any element on the page whose text contains the string “stock”.

				
					element_by_text = driver.find_element(By.XPATH, "//*[contains (text(),'stock')]").text

print (element_by_text)
				
			

NOTE: We’re using the driver.find_element() function to only get the first element found by the selector. It’s also possible to use the driver.find_elements() function to get a list of all elements.
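
For illustration, here’s a hedged sketch of the find_elements() variant, assuming the same page is still loaded in driver; it prints the text of every matching element rather than just the first:

stock_elements = driver.find_elements(By.XPATH, "//*[contains(text(),'stock')]")
for element in stock_elements:
    print(element.text)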

This is the output of the script. It shows the elements you’ve just scraped.


Step 5. Now we can clean up the result by extracting the number from the text, so you can work with it as an integer. You can do that with a simple regular expression using Python’s re module.

Here we find the decimal number within the element_by_text string, assign it to a new variable and print it out separately:

				
					in_stock = re.findall(r'\d+', element_by_text)[0]

print (f'In stock: {in_stock}')
				
			

This is the output of the script. It shows the stock availability of a book you’ve just scraped.


Results: 

Congratulations, you’ve just extracted the stock availability of a book using Selenium.

				
					from selenium import webdriver
from selenium.webdriver.common.by import By
import re

driver = webdriver.Chrome()

driver.get("http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html")

element_by_text = driver.find_element(By.XPATH, "//*[contains (text(),'stock')]").text
driver.quit()

print (element_by_text)

in_stock = re.findall(r'\d+', element_by_text)[0]
print (f'In stock: {in_stock}')
				
			

How to find all URLs using Selenium

A step-by-step guide on how to find all URLs using Selenium.

Important: we’ll use a real-life example in this tutorial, so you’ll need the Selenium library and browser drivers installed.

Step 1. Write your first Selenium script. NOTE: We’ll be using Python and Chrome WebDriver. You can add the Chrome WebDriver to the Path. Step 2. Now you’ll need to import the By selector module.
				
					from selenium.webdriver.common.by import By
				
			
TIP: See Selenium’s documentation on locating elements for more options. Step 3. We’ll be using books.toscrape.com in this example. If you only need to find all links, selecting all <a> tags is enough to get their href attributes. Getting all the links on the page is simple using the By.TAG_NAME selector:
				
					link_elements = driver.find_elements(By.TAG_NAME, "a")

for link in link_elements:

    print (link.get_attribute("href"))
				
			

NOTE: In the code above, we’re getting all elements with <a> tag and iterating through them to print out their href attribute.

This is the output of the script. It shows the elements you’ve just scraped.


Full code so far:

				
					from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()

driver.get("http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html")

link_elements = driver.find_elements(By.TAG_NAME, "a")
for link in link_elements:

    print (link.get_attribute("href"))

driver.quit()
				
			

Find Links Using CSS or XPath Selectors

Step 1. We’ll use the Proxyway home page in this example to get the URLs of the reviews listed there.

Inspect the page source. You can see that all of these links have one thing in common – a reviews/ string present in their href attributes. That’s what you need to select.


Step 2. You can write the CSS and XPath selectors in this way:

				
					elements_by_css = driver.find_elements(By.CSS_SELECTOR, "a[href*='reviews/']")

elements_by_xpath = driver.find_elements(By.XPATH, "//a[contains(@href,'reviews/')]")
				
			

NOTE: Both of them look for <a> tags within the document whose href attribute contains a reviews/ substring.

Step 3. You can add a couple more lines of code to print everything out. The whole script looks like this:

				
					from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()

driver.get("https://proxyway.com")

elements_by_css = driver.find_elements(By.CSS_SELECTOR, "a[href*='reviews/']")
elements_by_xpath = driver.find_elements(By.XPATH, "//a[contains(@href,'reviews/')]")

print ("By CSS:")   
for link in elements_by_css:
    print (link.get_attribute("href"))
print ("By XPath:")   
for link in elements_by_xpath:
    print (link.get_attribute("href"))
driver.quit()
				
			

This is the output of the script. It shows the links you’ve just scraped.


NOTE: You can see that both selectors return the same links. Duplicates can easily be filtered out by collecting only unique links, as sketched below.
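
For example, a minimal deduplication sketch, assuming it runs before driver.quit() and reuses the elements_by_css list from the script above; a set keeps only unique href values:

unique_links = set()
for link in elements_by_css:
    unique_links.add(link.get_attribute("href"))

for url in sorted(unique_links):
    print(url)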

How to Parse XML with LXML

A step-by-step guide on how to parse XML with LXML.

Important: the article assumes that you are familiar with the XML data structure. Refer to the W3Schools XML tutorial if you need a refresher.

Step 1. Install LXML using pip.

				
					pip install lxml
				
			
TIP: See the LXML documentation for details. If you’re using LXML with Python, import the etree module.
				
					from lxml import etree
				
			
Step 2. Load the XML file you’ll be working with. There are two ways to do this: 1) from an .xml file on your system; 2) by making an HTTP request to get XML content from the Internet. TIP: The parsing is slightly different for each method; see LXML’s parsing documentation for other parsing options. 1. From an .xml file on your system:
				
					filename = "file/location.xml"
parser = etree.XMLParser()
tree = etree.parse(filename, parser)
				
			

2. By making an HTTP request to get XML content from the Internet:

				
import requests

r = requests.get('https://www.w3schools.com/xml/simple.xml')
tree = etree.XML(r.content)
				
			
NOTE: In both cases, the result is parsed into an ElementTree object and stored in the tree variable. Step 3. You’ll need to understand the LXML ElementTree class and XPath selectors for the following steps. Have a look at an LXML tutorial and an XPath tutorial if you need a refresher. Step 4. Let’s continue with the code example you’ve been working on. We’ll get the names of each food item contained in the XML sample. XML data:
				
					<breakfast_menu>
<food>
<name>Belgian Waffles</name>
<price>$5.95</price>
<description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
<calories>650</calories>
</food>
<food>
<name>Strawberry Belgian Waffles</name>
<price>$7.95</price>
<description>Light Belgian waffles covered with strawberries and whipped cream</description>
<calories>900</calories>
</food>
<food>
<name>Berry-Berry Belgian Waffles</name>
<price>$8.95</price>
<description>Light Belgian waffles covered with an assortment of fresh berries and whipped cream</description>
<calories>900</calories>
</food>
<food>
<name>French Toast</name>
<price>$4.50</price>
<description>Thick slices made from our homemade sourdough bread</description>
<calories>600</calories>
</food>
<food>
<name>Homestyle Breakfast</name>
<price>$6.95</price>
<description>Two eggs, bacon or sausage, toast, and our ever-popular hash browns</description>
<calories>950</calories>
</food>
</breakfast_menu>
				
			


Step 5. To get the names, you’ll first need to find a <name> element for each <food> node and get the text data from it. This can be done by the following line of code:

				
					foods = tree.xpath(".//food/name/text()")
				
			
  1. .//food – finds and selects the <food> elements anywhere within the XML
  2. /name – selects the <name> child
  3. /text() – gets the text that is contained within the <name> </name> tags.


NOTE:
 The foods variable is going to contain a list of all food names found in the XML document.

Step 6. Let’s check if the script works by printing its output into the terminal window.

				
					for food in foods:
    print (food)
				
			

This is the output of the script. It shows the names you’ve just scraped.

				
					python lxml_get_text.py
Belgian Waffles
Strawberry Belgian Waffles
Berry-Berry Belgian Waffles
French Toast
Homestyle Breakfast
				
			

Results:
Congratulations, you’ve just learned how to parse XML with LXML. Here’s the full script:

				
					from lxml import etree
import requests
r=requests.get('https://www.w3schools.com/xml/simple.xml')
tree = etree.XML(r.content)
foods = tree.xpath(".//food/name/text()")
for food in foods:
    print (food)
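
If you need more than one field per item, you can also iterate over the <food> elements themselves instead of pulling a single list of names. A minimal sketch, assuming the same tree object as above:

for food in tree.xpath(".//food"):
    # findtext() returns the text of the first matching child element
    name = food.findtext("name")
    price = food.findtext("price")
    print(f"{name}: {price}")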
				
			

How to Get Text Using LXML

A step-by-step guide on how to get text using LXML.

You’ll need to use an XPath selector to get data. Refer to the XPath Tutorial if you need a refresher.

Step 1. Install LXML using pip.

				
					pip install lxml
				
			
TIP: See the LXML documentation for details. If you’re using LXML with Python, import the etree module and the requests library.
				
					from lxml import etree
import requests
				
			

Example 1

Step 2. Let’s start by inspecting the source code of your target page. We’ll be using our The Best Residential Proxy Providers page in this example. You can find providers’ names in divs with the brand class.

Step 3. Then, make an HTTP request and assign the response to the r variable to scrape the site.

				
					r=requests.get("https://proxyway.com/best/residential-proxies")
				
			

Step 4. Parse the HTML response content using etree.HTML() parser provided by LXML.

				
					tree = etree.HTML(r.text)
				
			

Step 5. Select the div elements containing the class brand and get the text element.

				
					divs = tree.xpath(".//div[@class='brand']/text()")
				
			

Let’s have a closer look at the code:

  1. .//div – select all divs within the HTML document.
  2. .//div[@class=’brand’] – select all divs that have a class of brand.
  3. /text() – get the text that is contained in the div.


NOTE: The result is a list of text strings extracted by text().

Step 6. Let’s print out the divs list. You can see that it also contains blank spaces and non-breaking spaces (\xa0 elements) that we don’t need:

You can clean up the results and assign them to a new brand_names list:

				
					brand_names = []
for div in divs:
    if len(div.strip()) > 0:
        brand_names.append(div.strip())
				
			

This is the output of the script. It shows provider names you’ve just scraped.
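
The same cleanup can also be written as a list comprehension; an equivalent sketch:

# Keep only non-empty strings, stripped of surrounding whitespace
brand_names = [div.strip() for div in divs if div.strip()]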

Results: Congratulations, you’ve extracted the provider names. Here’s the full script:

				
					from lxml import etree
import requests
r=requests.get("https://proxyway.com/best/residential-proxies")

tree = etree.HTML(r.text)

divs = tree.xpath(".//div[@class='brand']/text()")

brand_names = []
for div in divs:
    if len(div.strip()) > 0:
        brand_names.append(div.strip())
print (brand_names)
				
			

Example 2

Step 2. Let’s scrape a book title and its description. We’ll be using a product page on the books.toscrape.com website in this example. Step 3. Then, make an HTTP request and assign the response to the r variable to scrape the site.

r=requests.get("https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html")
				
			

Step 4. Parse the HTML response content using etree.HTML() parser provided by LXML.

				
					tree = etree.HTML(r.text)
				
			

Step 5. Now let’s get the book title by inspecting the code. The title can be found within an h1 tag in a div with a product_main class:


The XPath could look like this:

title = tree.xpath("//div[@class='product_main']/h1/text()")[0]

NOTE: However, that won’t work, because the div has another class as well – col-sm-6 – so an XPath that matches only product_main as the full @class value won’t find this exact div.

Step 6. Let’s specify both classes so that the XPath works:

title = tree.xpath("//div[@class='col-sm-6 product_main']/h1/text()")[0]
				
			

You can also use the contains() method as an alternative:

				
					title = tree.xpath("//div[contains(@class,'product_main')]/h1/text()")[0]
print (f'Title: {title}')
				
			

Let’s have a closer look at the code:

  1. //div[contains(@class,'product_main')] – selects the div whose class attribute contains product_main.
  2. /h1 – the title text is not in the div itself but in its <h1> child element.
  3. /h1/text() – gets the text from within the <h1> tags.
  4. [0] – since the tree.xpath() method returns a list and we want the text itself, we simply grab the first element of that list.


Step 7. Now let’s get the book description. It can be found in a <p> tag without any descriptive attributes, right below the div that contains its heading.


One way to get the description text while avoiding any other <p> elements that aren’t relevant looks like this:

				
					description = tree.xpath("//div[@id='product_description']/following-sibling::p/text()")[0]
print (f'Description: {description}')
				
			
Let’s have a closer look at the code:
  1. //div[@id='product_description'] – we select the div with an id of product_description so that we don’t pick up the wrong element.
  2. /following-sibling::p – selects the next <p> sibling of the div we selected before. You can read more about XPath axes for details.
  3. /text() – gets the text within the <p> tags.
This is the output of the script. It shows the book title and description you’ve just scraped.

Results: Congratulations, you’ve extracted the book name and description. Here’s the full script:

				
					from lxml import etree
import requests

r=requests.get('https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html')

tree = etree.HTML(r.text)

title = tree.xpath("//div[contains(@class,'product_main')]/h1/text()")[0]
print (f'Title: {title}')

description = tree.xpath("//div[@id='product_description']/following-sibling::p/text()")[0]
print (f'Description: {description}')
				
			

How to find all ‘href’ attributes using Beautifulsoup

A step-by-step guide on how to extract all the URL elements.
 

Important: we will use a real-life example in this tutorial, so you will need requests and Beautifulsoup libraries installed.

Step 1. Let’s start by importing the Beautifulsoup library.

				
					from bs4 import BeautifulSoup
				
			

Step 2. Then, import requests library.

				
					import requests
				
			

Step 3. Get the source code of your target landing page. We will be using our homepage in this example.

				
					r=requests.get("https://proxyway.com/")
				
			

A universal version of the code might look like this:

				
					r=requests.get("Your URL")
				
			

Step 4. Parse HTML code into a Beautifulsoup object named soup.

				
					soup=BeautifulSoup(r.content,"html.parser")
				
			

Step 5. Then, find all the links with an href attribute. We will be using this tag as an example:

				
					link_elements = soup.find_all("a", href=True)
				
			

NOTE: You can also specify a class.

				
					link_elements = soup.find_all("a", class_=”some_class”, href=True)
				
			

Step 6. Put all the links you’ve found into a dictionary to keep track of them.

				
					dict_of_links = {}
				
			

NOTE: Each link’s URL will be stored under the text found in its a tag, which becomes the dictionary key.

Step 7. Iterate through all the link_elements you’ve scraped and, if a string exists for a particular element, put it into the dictionary.

				
for element in link_elements:
    if element.string:
        dict_of_links[element.string] = element['href']
				
			

Step 8. Let’s check if our code works by printing it out.

				
					print (dict_of_links)
				
			

Results:

Congratulations, you’ve extracted all the URLs. Here’s the full script:

				
					from bs4 import BeautifulSoup
import requests
r = requests.get("https://proxyway.com")
soup = BeautifulSoup(r.content, "html.parser")
link_elements = soup.find_all("a", href=True)

dict_of_links = {}

for element in link_elements:
    if element.string:
        dict_of_links[element.string] = element['href']
    
print (dict_of_links)
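
One thing to keep in mind: some href values may be relative rather than absolute. A hedged sketch of normalizing them with Python’s standard urljoin, assuming the same link_elements list as above and the homepage as the base URL:

from urllib.parse import urljoin

base_url = "https://proxyway.com/"
for element in link_elements:
    # urljoin resolves relative links against base_url and leaves absolute URLs untouched
    print(urljoin(base_url, element["href"]))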
				
			

How to get ‘src’ attribute from ‘img’ tag using Beautifulsoup

A step-by-step guide on how to find image source using Beautifulsoup.


Step 1. Let’s start by importing the Beautifulsoup library.

				
					from bs4 import BeautifulSoup
				
			

Step 2. Then, import requests library.

				
					import requests
				
			

Step 3. Get the source code of your target landing page. We’ll be using books.toscrape.com in this example.

				
					r=requests.get("https://books.toscrape.com/")
				
			

Step 4. Convert the HTML code into a Beautifulsoup object named soup.

				
					soup=BeautifulSoup(r.content,"html.parser")
				
			

Step 5. Inspect the page to find the image object you would like to extract.


You can select all of these image elements like this:

				
					thumbnail_elements = soup.find_all("img", class_ = "thumbnail")
				
			

NOTE: For this website, you can find images by looking for img tags that have a thumbnail class.

Step 6. Let’s check if our code works by printing it out.

				
					print(thumbnail_elements)
				
			

Step 7. Now you need to get the src attribute from each element.

				
					for element in thumbnail_elements:
    print (element['src'])
				
			

Results:
Congratulations, you’ve found and extracted the content of an image source using Beautifulsoup. Here’s the full script:

				
					from bs4 import BeautifulSoup
import requests
r = requests.get("https://books.toscrape.com/")
soup = BeautifulSoup(r.content, "html.parser")
thumbnail_elements = soup.find_all("img", class_ = "thumbnail")

print(thumbnail_elements)


for element in thumbnail_elements:
    print (element['src'])
    
    
#for element in thumbnail_elements:
#    print ("https://books.toscrape.com/" + element['src'])
				
			
If you rebuild the full URL, you can access the image.
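
For instance, here’s a hedged sketch that rebuilds the absolute URL of the first thumbnail and saves it to disk; the output filename is just an assumption:

from urllib.parse import urljoin

first_src = thumbnail_elements[0]["src"]
image_url = urljoin("https://books.toscrape.com/", first_src)

# Download the image and write the raw bytes to a local file (hypothetical filename)
image = requests.get(image_url)
with open("thumbnail.jpg", "wb") as f:
    f.write(image.content)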

How to scrape multiple pages using Beautifulsoup

A step-by-step guide on how to scrape multiple pages using Beautifulsoup.

Important: we’ll use a real-life example in this tutorial, so you’ll need requests and Beautifulsoup libraries installed.

Step 1. Let’s start by importing the Beautifulsoup library.

				
					from bs4 import BeautifulSoup
				
			

Step 2. Then, import requests library.

				
					import requests
				
			

Step 3. Get the source code of your target landing page. We will be using our Guides page in this example.

				
					r=requests.get("https://proxyway.com/guides/")
				
			

Universally applicable code would look like this:

				
					r=requests.get("Your URL")
				
			

Step 4. Get the link to the next page by finding the a tag with the class of next. Then, you only need to get the href from this element and perform a new request.


Step 5. You can put the entire code into a single function:

				
					def scrape_page(url):
    print ("URL: " + url)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    get_data(soup)
    next_page_link = soup.find("a", class_="next")
    if next_page_link is not None:
        href = next_page_link.get("href")
        scrape_page(href)
    else:
        print ("Done")
				
			

This is the output of the script. It shows the page URL you’ve just scraped.


Let’s look at the code step-by-step:

1. Performing a request to get the page.

				
					r = requests.get(url)
				
			

2. Parsing the page and turning into a BeautifulSoup object.

				
					soup = BeautifulSoup(r.content, "html.parser")
				
			

3. Passing the soup object into a different function where you could scrape page data before moving on to the next page.

				
					get_data(soup)
				
			

4. Finding the next link element.

				
					next_page_link = soup.find("a", class_="next")
				
			

5. If such an element exists, then there is another page you can scrape; if not, you’re done.

				
					if next_page_link is not None:
				
			

6. Getting the href attribute from the link element. This is the URL of the next page you’re scraping.

				
					href = next_page_link.get("href")
				
			

7. Calling the same function again and passing it the new URL to scrape the next page.

				
					scrape_page(href)
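
NOTE: Because scrape_page() calls itself once per page, very deep pagination could eventually run into Python’s recursion limit. If that’s a concern, an equivalent iterative sketch (assuming the same get_data() helper) could look like this:

def scrape_pages(url):
    # Follow "next" links in a loop instead of recursing
    while url is not None:
        print("URL: " + url)
        r = requests.get(url)
        soup = BeautifulSoup(r.content, "html.parser")
        get_data(soup)
        next_page_link = soup.find("a", class_="next")
        url = next_page_link.get("href") if next_page_link else None
    print("Done")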
				
			

Results:
Congratulations, you’ve scraped multiple pages using Beautifulsoup. Here’s the full script:

				
					from bs4 import BeautifulSoup
import requests

start_url = "https://proxyway.com/guides"

def scrape_page(url):
    print ("URL: " + url)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    get_data(soup)
    next_page_link = soup.find("a", class_="next")
    if next_page_link is not None:
        href = next_page_link.get("href")
        scrape_page(href)
    else:
        print ("Done")


def get_data(content):
    #we could do some scraping of web content here
    pass

def main():
    scrape_page(start_url)
    

if __name__ == "__main__":
    main()
				
			

How to scrape a table using Beautifulsoup

A step-by-step guide on how to scrape a table using Beautifulsoup.

Important: we’ll use a real-life example in this tutorial, so you’ll need requests and Beautifulsoup libraries installed.

Step 1. Let’s start by importing the Beautifulsoup library.

				
					from bs4 import BeautifulSoup
				
			

Step 2. Then, import requests library.

				
					import requests
				
			

Step 3. Get the source code of your target landing page. We’ll be using Yahoo in this example.

				
					r=requests.get("https://finance.yahoo.com/cryptocurrencies/")
				
			

Universally applicable code would look like this:

				
					r=requests.get("Your URL")
				
			

Step 4. Convert HTML code into a Beautifulsoup object named soup.

				
					soup=BeautifulSoup(r.content,"html.parser")
				
			

Step 5. Then, inspect the page source. You can see that the table has a class of W(100%).

NOTE: A class can be used to pinpoint the table you want to scrape in case there are multiple tables on the same page.


Step 6. Parse the page content with BeautifulSoup, find the table in the HTML content and assign the whole table element to the table_element variable.

				
					soup = BeautifulSoup(r.content, "html.parser")
table_element = soup.find("table", class_="W(100%)")
				
			

NOTE: The goal is to scrape all the rows from the target table.

Step 7. Initialize a new list variable to save the data into.

				
					output_list = []
				
			

Step 8. Search for all tr tags in the table to get all the rows from the table_element that was saved earlier. You’ll also get the header row along with the data rows.

				
					table_rows = table_element.find_all("tr")
				
			

NOTE: In this case, it’s also possible to get specific column values by referring to the aria-label attributes since they are present, but that won’t always be the case, so stick with a universal approach.


Step 9. The following for loop will iterate through all rows you got from the table and get all the children for each row. Each child is a td element in the table. After getting the children, iterate through the row_children list and append the text values of each element into a row_data list to keep it simple.

				
					for row in table_rows:
        row_children = row.children
        row_data = []
        for child in row_children:
            row_data.append(child.get_text())
        output_list.append(row_data)
				
			

Step 10. Let’s display the results.

				
					for row in output_list:
        print (row)
				
			

What you got is a list of lists, each containing 12 elements that correspond to table columns. The first row contains the table headers.

NOTE: This makes it easy to format the output as CSV or JSON and write the results to an output file, or to convert it to a pandas DataFrame for analysis, as sketched below.
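
For example, a minimal sketch of the CSV option using Python’s standard csv module, assuming the output_list built above; the filename is hypothetical:

import csv

with open("crypto_table.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerows(output_list)  # the first row written is the header row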

Results:

Congratulations, you’ve learned how to scrape a table using Beautifulsoup. Here’s the full script:

				
					from bs4 import BeautifulSoup
import requests

r = requests.get("https://finance.yahoo.com/cryptocurrencies/")

soup = BeautifulSoup(r.content, "html.parser")
table_element = soup.find("table", class_="W(100%)")

output_list = []

table_rows = table_element.find_all("tr")

for row in table_rows:
    row_children = row.children
    row_data = []
    for child in row_children:
        row_data.append(child.get_text())
    output_list.append(row_data)

for row in output_list:
    print (row)
				
			

How to find element by class using Beautifulsoup

A step-by-step guide on how to find elements by class using Beautifulsoup.

Important: we will use a real-life example in this tutorial, so you will need the requests and Beautifulsoup libraries installed.

Step 1. Let’s start by importing the Beautifulsoup library.

				
					from bs4 import BeautifulSoup
				
			

Step 2. Then, import requests library.

				
					import requests
				
			

Step 3. Get the source code of your target landing page. We will be looking for guide titles on our homepage in this example.

				
					r=requests.get("https://proxyway.com/")
				
			

Universally applicable code would look like this:

				
					r=requests.get("Your URL")
				
			

Step 4. Convert HTML code into a Beautifulsoup object named soup.

				
					soup=BeautifulSoup(r.content,"html.parser")
				
			

Step 5.  Inspect the page to find a class you would like to extract.


The code to find all elements with this class looks like this:

				
					elements_by_class = soup.find_all(class_ = "archive-list__title")
				
			

NOTE: Since we want to find all the titles instead of one, we use soup.find_all() instead of soup.find().

Step 6. Let’s check if the script works by printing its output into the terminal window.

				
					print(elements_by_class)
				
			

NOTE: If you want to display only the titles, you can get the string attribute of each scraped element.

				
for element in elements_by_class:
    print (element.string)
				
			

Results:

Congratulations, you’ve found and extracted the content of class using Beautifulsoup. Here’s the full script:

				
					from bs4 import BeautifulSoup
import requests
r = requests.get("https://proxyway.com/")
soup = BeautifulSoup(r.content, "html.parser")
elements_by_class = soup.find_all(class_ = "archive-list__title")

print(elements_by_class)

for element in elements_by_class:
    print (element.string)
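
NOTE: element.string returns None when a tag contains nested tags rather than plain text. If some titles come back as None, a hedged alternative is get_text(), which collects all text inside the tag:

for element in elements_by_class:
    print(element.get_text(strip=True))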
				
			

How to Prevent Web Scraping By Blocking Proxies Using IP Geolocation

A guest article by Razvan Popescu, Head of Marketing at Abstract API.

If you’re scraping websites, you might already use a proxy server to collect data reliably and anonymously. What about the other side of the scrape, though – what if you want to block proxies from scraping your site? This article will describe how web scraping and proxies work, and how an IP geolocation API can be used to prevent web scraping with proxies.

What Is Web Scraping?

Web scraping is the process of taking unstructured data and turning it into a structured format. For example, you might use Python to scrape Google Search results. Another common use case is to scrape up-to-date stock data from a stock market website, structure that data into a CSV, and pull those values from the CSV to calculate your stock market returns in a Python program.

There is nothing illegal about doing this, but when it begins burdening a company’s web servers, they may block your IP address. Always check a website’s robots.txt file for their expected scraping behavior and etiquette.

What Is a Proxy?

When an IP address is blocked by a website, the scraper might work around the block by using a proxy server. So, what is a proxy? It’s a third-party server that routes your connections through a different IP address. Remember that IP addresses identify where a connection takes place – for example, the router in your house. A proxy makes that connection appear to be coming from another device in another place.

You may have encountered proxies when bypassing your school’s Internet filters back in the day, or using a VPN to stream region–restricted Eurovision song contests. We aren’t condoning these activities, but they use the idea of rerouting an IP address through a third party connection.

Elements of Successful Web Scraping

A little Python code, some Python libraries (like Beautiful Soup), and an Internet connection are all you need to start basic web scraping. But there are important factors in making your scraping efficient, reliable, and anonymous – that is, successful.

One of the most important factors in web scraping is using a high-quality proxy, or even multiple proxies in a proxy pool to scale up your scraping operation. A high-quality proxy can take your web scraping projects to the next level:

  • If you’re scraping without a proxy, when one site blocks your IP, you have to go find another site with the same information.
  • Proxies increase scraping reliability and volume.
  • Proxies allow you to view content as it appears if accessed from other places in the world. If you’re scraping location-dependent data, this is very important.
  • Proxies protect your identity by substituting one of their IPs for one of your own. Think of it as similar to how APIs allow authenticated users to exchange data through an interface while remaining anonymous to each other. That said, you can provide your contact info in a third party proxy, if you want businesses you are scraping to be able to contact you.

Why Blocking Proxies Is Key to Preventing Web Scraping

As stated above, scraping without proxies is inefficient, unsafe, and doesn’t scale. If someone is serious about web scraping, they’re surely using a high-quality proxy pool.

Proxy servers are a powerful tool. And while collecting public web data isn’t bad in itself, reckless web scraping can cause a lot of damage to websites.

So, if we look at the other end of the process, at the website that is being scraped, what’s the best way for us to protect our resources from bad traffic? We can use proxy detection and IP geolocation to root out users scraping with proxies and block them.

What Is Proxy Detection?

Proxy detection is – you guessed it – a set of ways for the website owner to identify a proxy connection. The website can check the IP address it receives against a list of flagged addresses and block the traffic. If the scraper uses a limited number of IPs, proxy detectors learn to block them, but proxy services will just change IP ranges again.

You can also check the headers for common proxy entries like x-forwarded-for, but this only removes the most basic proxies, and we’re trying to block professionals.

How to Block Proxies Using IP Geolocation

To detect a proxy using IP geolocation, remember that IP addresses carry location information with them, announcing where a connection takes place. A proxy server makes that connection appear to be coming from a different geographic location.

So, if we are trying to identify a proxy server, we could use the free IP geolocation API from Abstract to test this. You can test it for free as soon as you sign up.

Let’s try testing a request in the browser:

				
					https://ipgeolocation.abstractapi.com/v1/?api_key={YOUR API KEY}

				
			

It will return our IP, our geographic location, and a lot of other interesting data:

				
					{
    "ip_address": "174.49.204.134",
    "city": "York",
    "city_geoname_id": 4562407,
    "region": "Pennsylvania",
    "region_iso_code": "PA",
    "region_geoname_id": 6254927,
    "postal_code": "17402",
    "country": "United States",
    "country_code": "US",
    "country_geoname_id": 6252001,
    "country_is_eu": false,
    "continent": "North America",
    "continent_code": "NA",
    "continent_geoname_id": 6255149,
    "longitude": -76.6653,
    "latitude": 39.9552,
    "security": {
        "is_vpn": false
    }

				
			

If we engage a VPN and try the same test request, we get different results. VPNs aren’t the same thing as proxies, but they provide a similar outcome.

				
					{
    "ip_address": "23.105.165.55",
    "city": "Manassas",
    "city_geoname_id": 4771401,
    "region": "Virginia",
    "region_iso_code": "VA",
    "region_geoname_id": 6254928,
    "postal_code": "20110",
    "country": "United States",
    "country_code": "US",
    "country_geoname_id": 6252001,
    "country_is_eu": false,
    "continent": "North America",
    "continent_code": "NA",
    "continent_geoname_id": 6255149,
    "longitude": -77.4918,
    "latitude": 38.7493,
    "security": {
        "is_vpn": false
    }
				
			

Now, we can use this IP geolocation API to see where incoming traffic is coming from and make blocking decisions based on that information (a minimal sketch of such a check follows the list below). Some strategic considerations:

  • We might block IPs coming from countries with high fraud activity.
  • We might block requests geographically outside of our usual customer base.
  • We might take this data and find the proxy traffic isn’t doing anything suspicious or resource-consuming.
  • We might use this data to geo-target our ad campaigns. (This company in that city is disrupting everything!)
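
Here’s a minimal sketch of such a check in Python, using the Abstract endpoint shown above. The ip_address query parameter, the blocked-country set, and the decision logic are assumptions for illustration only:

import requests

API_KEY = "{YOUR API KEY}"  # placeholder, as in the example request above
BLOCKED_COUNTRIES = {"XX", "YY"}  # hypothetical ISO country codes you choose to block

def should_block(visitor_ip):
    # Look up the visitor's IP (assumes the API accepts an ip_address parameter)
    r = requests.get(
        "https://ipgeolocation.abstractapi.com/v1/",
        params={"api_key": API_KEY, "ip_address": visitor_ip},
    )
    data = r.json()
    # Block known VPN/proxy exits, then fall back to a country-based rule
    if data.get("security", {}).get("is_vpn"):
        return True
    return data.get("country_code") in BLOCKED_COUNTRIES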

Can All Proxies Be Detected and Blocked?

The proxy cat-and-mouse game has been going on for a long time, and probably won’t stop. Proxies aren’t illegal, but a lot of the discussion around them makes them sound like only credit card scammers and Anonymous use them. They can be used to responsibly anonymize traffic online, but as with any tool, they sometimes fall into the hands of bad actors.

Considering that bad bot activity now accounts for 39% of internet traffic, it’s a good time to know who is accessing your infrastructure and whether it’s impacting your customers. IP geolocation data is a great tool for understanding that traffic and acting on it.

How to get text from DIV using Beautifulsoup

A step-by-step guide on how to extract the content of a div tag using Beautifulsoup.

Important: we will use a real-life example in this tutorial, so you will need requests and Beautifulsoup libraries installed.

Step 1. First, import the Beautifulsoup library.

				
					from bs4 import BeautifulSoup
				
			

Step 2. Then, import requests library.

				
					import requests
				
			

Step 3. Get your preferred landing page source code. We will use our homepage in this example.

				
					r=requests.get("https://proxyway.com/")
				
			

Universally applicable code would look like this:

				
					r=requests.get("Your URL")
				
			

Step 4. Convert HTML code into a Beautifulsoup object named soup.

				
					soup=BeautifulSoup(r.content,"html.parser")
				
			

Step 5. Find the div whose content you would like to extract. We will be using this tag as an example:


The code to extract it looks like this:

				
					div_text=soup.find("div",{"class":"intro__small-text"}).get_text()
				
			

Step 6. Let’s check if our code works by printing it out.

				
					print(div_text)
				
			

Results:

Congratulations, you’ve found and extracted the content of a div using Beautifulsoup. Here’s the full script:

				
					from bs4 import BeautifulSoup
import requests
r=requests.get("https://proxyway.com/")
soup=BeautifulSoup(r.content,"html.parser")
div_text=soup.find("div",{"class":"intro__small-text"}).get_text()
print(div_text)
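
NOTE: soup.find() returns None when no matching div exists, and calling .get_text() on None raises an AttributeError. A hedged, more defensive variant of Step 5:

# Guard against the div not being found on the page
div = soup.find("div", {"class": "intro__small-text"})
div_text = div.get_text() if div is not None else ""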
				
			
