How to Parse XML with LXML

A step-by-step guide on how to parse XML with LXML.

Important: this article assumes that you are familiar with the structure of XML documents. Refer to the W3Schools XML tutorial if you need a refresher.

Step 1. Install LXML using pip.

pip install lxml

TIP: See the LXML documentation for details.

LXML is a Python library, so import its etree module at the top of your script.

from lxml import etree

Step 2. Load the XML file you’ll be working with. There are two ways to do this: 1) read an .xml file from your system; 2) make an HTTP request to get XML content from the Internet.

TIP: Parsing differs slightly between the two methods. See the LXML parsing documentation for details and for other parsing options.

1. From the .xml file on your system

filename = "file/location.xml"
parser = etree.XMLParser()
tree = etree.parse(filename, parser)

2. Making an HTTP request to get XML content from the Internet.

import requests

r = requests.get('https://www.w3schools.com/xml/simple.xml')
tree = etree.XML(r.content)

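NOTE: In both cases, the parsed result is saved in the tree variable. etree.parse() returns an ElementTree object, while etree.XML() returns the root Element. Both support the xpath() method used in the following steps.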
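The two loading methods return slightly different objects, which can be checked directly. The following is a minimal sketch, assuming a one-item inline sample written to a temporary file in place of a real file path or HTTP response. The names SAMPLE, tree_from_file, and root_from_bytes are hypothetical and used only for illustration.

```python
import os
import tempfile

from lxml import etree

# Hypothetical one-item sample standing in for a real file or HTTP response.
SAMPLE = b"<breakfast_menu><food><name>French Toast</name></food></breakfast_menu>"

# Method 1: etree.parse() reads from a file and returns an ElementTree.
with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as handle:
    handle.write(SAMPLE)
    path = handle.name
try:
    tree_from_file = etree.parse(path, etree.XMLParser())
finally:
    os.remove(path)

# Method 2: etree.XML() parses bytes in memory and returns the root Element.
root_from_bytes = etree.XML(SAMPLE)

# getroot() bridges the two: it returns the root Element of an ElementTree.
root_from_file = tree_from_file.getroot()

print(type(tree_from_file).__name__)
print(type(root_from_bytes).__name__)
print(root_from_file.tag == root_from_bytes.tag)
```

Because both objects expose xpath(), the rest of the guide works the same way whichever method you choose.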

Step 3. You’ll need to understand the LXML ElementTree class and XPath selectors for the following steps. Have a look at some tutorials: the LXML Tutorial and the XPath Tutorial.

Step 4. Let’s continue with the code example you’ve been working on. We’ll get the names of each food item contained in the XML sample.

XML data:

<breakfast_menu>
<food>
<name>Belgian Waffles</name>
<price>$5.95</price>
<description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
<calories>650</calories>
</food>
<food>
<name>Strawberry Belgian Waffles</name>
<price>$7.95</price>
<description>Light Belgian waffles covered with strawberries and whipped cream</description>
<calories>900</calories>
</food>
<food>
<name>Berry-Berry Belgian Waffles</name>
<price>$8.95</price>
<description>Light Belgian waffles covered with an assortment of fresh berries and whipped cream</description>
<calories>900</calories>
</food>
<food>
<name>French Toast</name>
<price>$4.50</price>
<description>Thick slices made from our homemade sourdough bread</description>
<calories>600</calories>
</food>
<food>
<name>Homestyle Breakfast</name>
<price>$6.95</price>
<description>Two eggs, bacon or sausage, toast, and our ever-popular hash browns</description>
<calories>950</calories>
</food>
</breakfast_menu>

Let’s take a look at the XML tree of the sample:

[Image: XML tree of the sample]
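The nesting of the sample can also be printed as an indented outline. The following is a minimal sketch, assuming a shortened two-item version of the sample; the helper name outline is hypothetical and not part of LXML.

```python
from lxml import etree

# Shortened, hypothetical version of the breakfast menu sample.
SAMPLE = b"""<breakfast_menu>
  <food>
    <name>Belgian Waffles</name>
    <price>$5.95</price>
  </food>
  <food>
    <name>French Toast</name>
    <price>$4.50</price>
  </food>
</breakfast_menu>"""


def outline(element, depth=0):
    """Return one indented line per element, visiting children depth-first."""
    text = (element.text or "").strip()
    label = f"{element.tag}: {text}" if text else element.tag
    lines = ["  " * depth + label]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = etree.XML(SAMPLE)
print("\n".join(outline(root)))
```

Each level of indentation corresponds to one level of nesting: <breakfast_menu> is the root, every <food> is its child, and <name> and <price> are children of <food>.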

Step 5. To get the names, you’ll first need to find the <name> element of each <food> node and get the text data from it. The following line of code does this:

foods = tree.xpath(".//food/name/text()")
  1. .//food – finds and selects the <food> elements anywhere within the XML
  2. /name – selects the <name> child
  3. /text() – gets the text that is contained within the <name> </name> tags.

NOTE: The foods variable contains a list of all food names found in the XML document.
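To see how the three parts of the expression fit together, each one can be applied separately. The following is a minimal sketch, assuming a shortened two-item version of the sample. The intermediate variable names are hypothetical.

```python
from lxml import etree

# Shortened, hypothetical version of the breakfast menu sample.
SAMPLE = b"""<breakfast_menu>
  <food><name>Belgian Waffles</name><price>$5.95</price></food>
  <food><name>French Toast</name><price>$4.50</price></food>
</breakfast_menu>"""

root = etree.XML(SAMPLE)

# Part 1: .//food selects every <food> element below the context node.
food_elements = root.xpath(".//food")

# Part 2: name selects the <name> child of each <food> element.
name_elements = [food.xpath("name")[0] for food in food_elements]

# Part 3: the .text attribute holds the string that text() selects in XPath.
names_stepwise = [name.text for name in name_elements]

# The one-line expression from Step 5 produces the same list.
names_direct = root.xpath(".//food/name/text()")

print(names_stepwise)
print(names_direct == names_stepwise)
```

The one-line version is shorter, while the stepwise version is handy when you also need other children of each <food> element.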

Step 6. Let’s check that the script works by printing its output to the terminal window.

for food in foods:
    print(food)

This is the output of the script. It shows the names you’ve just scraped.

python lxml_get_text.py
Belgian Waffles
Strawberry Belgian Waffles
Berry-Berry Belgian Waffles
French Toast
Homestyle Breakfast


Results:
Congratulations, you’ve just learned how to parse XML with LXML. Here’s the full script:

from lxml import etree
import requests

r = requests.get('https://www.w3schools.com/xml/simple.xml')
tree = etree.XML(r.content)
foods = tree.xpath(".//food/name/text()")
for food in foods:
    print(food)
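The script above assumes that the content it receives is well-formed XML. As a hedged variation, the sketch below wraps the parsing step in a function that catches etree.XMLSyntaxError. The function name food_names, the inline samples, and the choice to return an empty list on malformed input are all assumptions made for illustration.

```python
from lxml import etree


def food_names(xml_bytes):
    """Return the text of every <food>/<name>, or [] if the XML is malformed."""
    try:
        root = etree.XML(xml_bytes)
    except etree.XMLSyntaxError:
        # Assumption: callers prefer an empty result over an exception.
        return []
    return [str(name) for name in root.xpath(".//food/name/text()")]


# Hypothetical inputs: one well-formed document and one with a missing tag.
GOOD = b"<breakfast_menu><food><name>French Toast</name></food></breakfast_menu>"
BAD = b"<breakfast_menu><food><name>French Toast</food>"

print(food_names(GOOD))
print(food_names(BAD))
```

In the full script you could pass r.content to such a function instead of calling etree.XML() directly. Whether to return an empty list or re-raise the error depends on how your script should react to bad input.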