We use affiliate links. They let us sustain ourselves at no cost to you.

How to Parse XML with LXML

A step-by-step guide on how to parse XML with LXML.

Important: the article assumes that you are familiar with the XML data structure. Refer to the W3Schools XML tutorial if you need a refresher.

Step 1. Install LXML using pip.

				
					pip install lxml
				
			

TIPS: LXML Documentation.

If you’re using LXML with Python, import the etree module.

				
					from lxml import etree
				
			

Step 2. Load the XML file you’ll be working with. There are two ways to do this: 1) from the .xml file on your system; 2) making an HTTP request to get XML content from the Internet.

TIPS: The parsing will be slightly different for both methods: parsing documentationother parsing options.

1. From the .xml file on your system

				
					filename = "file/location.xml"
parser = etree.XMLParser()
tree = etree.parse(filename, parser)
				
			

2. Making an HTTP request to get XML content from the Internet.

				
					r=requests.get('https://www.w3schools.com/xml/simple.xml')
tree = etree.XML(r.content)
				
			

NOTE: In both cases, the result is parsed and saved in an ElementTree object and saved in the tree variable.

Step 3. You’ll need to understand the LXML ElementTree class and XPath selector for the following steps.  Have a look at some tutorials: LXML TutorialXPath Tutorial.

Step 4. Let’s continue with the code example you’ve been working on. We’ll get the names of each food item contained in the XML sample.

XML data:

				
					<breakfast_menu>
<food>
<name>Belgian Waffles</name>
<price>$5.95</price>
<description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
<calories>650</calories>
</food>
<food>
<name>Strawberry Belgian Waffles</name>
<price>$7.95</price>
<description>Light Belgian waffles covered with strawberries and whipped cream</description>
<calories>900</calories>
</food>
<food>
<name>Berry-Berry Belgian Waffles</name>
<price>$8.95</price>
<description>Light Belgian waffles covered with an assortment of fresh berries and whipped cream</description>
<calories>900</calories>
</food>
<food>
<name>French Toast</name>
<price>$4.50</price>
<description>Thick slices made from our homemade sourdough bread</description>
<calories>600</calories>
</food>
<food>
<name>Homestyle Breakfast</name>
<price>$6.95</price>
<description>Two eggs, bacon or sausage, toast, and our ever-popular hash browns</description>
<calories>950</calories>
</food>
</breakfast_menu>
				
			

Let’s take a look at the XML tree of the sample:

Step 5. To get the names, you’ll first need to find a <name> element for each <food> node and get the text data from it. This can be done by the following line of code:

				
					foods = tree.xpath(".//food/name/text()")
				
			
  1. .//food – finds and selects the <food> elements anywhere within the XML
  2. /name – selects the <name> child
  3. /text() – gets the text that is contained within the <name> </name> tags.


NOTE:
 The foods variable is going to contain a list of all food names found in the XML document.

Step 6. Let’s check if the script works by printing its output into the terminal window.

				
					for food in foods:
    print (food)
				
			

This is the output of the script. It shows the names you’ve just scraped.

				
					python lxml_get_text.py
Belgian Waffles
Strawberry Belgian Waffles
Berry-Berry Belgian Waffles
French Toast
Homestyle Breakfast
				
			

Results:
Congratulations, you’ve just learned how to parse XML with LXML. Here’s the full script:

				
					from lxml import etree
import requests
r=requests.get('https://www.w3schools.com/xml/simple.xml')
tree = etree.XML(r.content)
foods = tree.xpath(".//food/name/text()")
for food in foods:
    print (food)
				
			

Join Smartproxy’s webinar about ready-made scrapers on May 7, 10AM EST. Save your seat>