How to Parse XML with LXML
Important: the article assumes that you are familiar with the XML data structure. Refer to the W3Schools XML tutorial if you need a refresher.
Step 1. Install LXML using pip.
pip install lxml
from lxml import etree
filename = "file/location.xml"
parser = etree.XMLParser()
tree = etree.parse(filename, parser)
2. Making an HTTP request to get XML content from the Internet.
r=requests.get('https://www.w3schools.com/xml/simple.xml')
tree = etree.XML(r.content)
Belgian Waffles
$5.95
Two of our famous Belgian Waffles with plenty of real maple syrup
650
Strawberry Belgian Waffles
$7.95
Light Belgian waffles covered with strawberries and whipped cream
900
Berry-Berry Belgian Waffles
$8.95
Light Belgian waffles covered with an assortment of fresh berries and whipped cream
900
French Toast
$4.50
Thick slices made from our homemade sourdough bread
600
Homestyle Breakfast
$6.95
Two eggs, bacon or sausage, toast, and our ever-popular hash browns
950
Let’s take a look at the XML tree of the sample:
Step 5. To get the names, you’ll first need to find a <name> element for each <food> node and get the text data from it. This can be done by the following line of code:
foods = tree.xpath(".//food/name/text()")
- .//food – finds and selects the <food> elements anywhere within the XML
- /name – selects the <name> child
- /text() – gets the text that is contained within the <name> </name> tags.
NOTE: The foods variable is going to contain a list of all food names found in the XML document.
Step 6. Let’s check if the script works by printing its output into the terminal window.
for food in foods:
print (food)
This is the output of the script. It shows the names you’ve just scraped.
python lxml_get_text.py
Belgian Waffles
Strawberry Belgian Waffles
Berry-Berry Belgian Waffles
French Toast
Homestyle Breakfast
Results:
Congratulations, you’ve just learned how to parse XML with LXML. Here’s the full script:
from lxml import etree
import requests
r=requests.get('https://www.w3schools.com/xml/simple.xml')
tree = etree.XML(r.content)
foods = tree.xpath(".//food/name/text()")
for food in foods:
print (food)