Python XML Parsing — Core Concepts
Python provides multiple XML parsing approaches in the standard library. The most practical for everyday use is ElementTree, which represents XML as a tree of elements you can navigate and query.
ElementTree: The Standard Approach
Parsing XML
import xml.etree.ElementTree as ET
# From a file
tree = ET.parse("data.xml")
root = tree.getroot()
# From a string
root = ET.fromstring("""
<catalog>
<book id="1">
<title>Python Basics</title>
<price currency="USD">29.99</price>
</book>
<book id="2">
<title>Advanced Python</title>
<price currency="EUR">39.99</price>
</book>
</catalog>
""")
Navigating the Tree
# Tag name
print(root.tag) # "catalog"
# Direct children
for child in root:
print(child.tag, child.attrib)
# "book", {"id": "1"}
# "book", {"id": "2"}
# Text content
for book in root:
title = book.find("title").text
print(title)
Finding Elements
find() returns the first match. findall() returns all matches. Both use a subset of XPath:
# Find first book
first_book = root.find("book")
# Find all books
all_books = root.findall("book")
# Find nested elements
all_titles = root.findall(".//title") # .// means any depth
# Find by attribute
book_1 = root.find("book[@id='1']")
Extracting Data
# Build a list of dictionaries from XML
books = []
for book in root.findall("book"):
books.append({
"id": book.get("id"),
"title": book.find("title").text,
"price": float(book.find("price").text),
"currency": book.find("price").get("currency"),
})
Handling Namespaces
Many real-world XML documents use namespaces:
<feed xmlns="http://www.w3.org/2005/Atom">
<entry>
<title>Blog Post</title>
</entry>
</feed>
ElementTree requires the full namespace in queries:
ns = {"atom": "http://www.w3.org/2005/Atom"}
root = ET.fromstring(xml_string)
entries = root.findall("atom:entry", ns)
for entry in entries:
title = entry.find("atom:title", ns).text
This is verbose but explicit. Define namespace mappings once and reuse them.
Writing XML
Building a Document
root = ET.Element("catalog")
book = ET.SubElement(root, "book", id="1")
title = ET.SubElement(book, "title")
title.text = "Python Basics"
# Convert to string
xml_string = ET.tostring(root, encoding="unicode", xml_declaration=True)
Modifying Existing XML
tree = ET.parse("config.xml")
root = tree.getroot()
# Change a value
root.find(".//port").text = "8080"
# Add a new element
new_elem = ET.SubElement(root, "feature")
new_elem.text = "caching"
# Save
tree.write("config.xml", encoding="utf-8", xml_declaration=True)
lxml: The Power Tool
For complex XML work, lxml is significantly more capable:
pip install lxml
from lxml import etree
root = etree.fromstring(xml_bytes)
# Full XPath support
titles = root.xpath("//book[price > 30]/title/text()")
# CSS selectors (install cssselect)
books = root.cssselect("book[id]")
When to use lxml over ElementTree:
| Feature | ElementTree | lxml |
|---|---|---|
| XPath | Basic subset | Full XPath 1.0 |
| Speed | Moderate | Fast (C-based) |
| Validation (XSD/DTD) | No | Yes |
| CSS selectors | No | Yes (with cssselect) |
| XSLT transforms | No | Yes |
| Installation | Built-in | Requires install |
Security Warning
Never use xml.etree.ElementTree with untrusted XML without precautions.
XML parsers are vulnerable to several attacks:
- Billion Laughs: Exponential entity expansion eats all memory
- External Entity Injection (XXE): Reads local files via entity references
Use defusedxml for untrusted input:
pip install defusedxml
from defusedxml.ElementTree import parse, fromstring
# Safe against XXE and entity expansion attacks
root = fromstring(untrusted_xml_string)
Common Misconception
“XML is dead, replaced by JSON.” XML remains essential in enterprise systems (banking, healthcare, government), document formats (DOCX, SVG, EPUB), and syndication (RSS/Atom). If you work with APIs from large organizations, you’ll encounter XML regularly.
One Thing to Remember
Use ElementTree for simple XML tasks, lxml when you need full XPath or validation, and always use defusedxml when parsing untrusted input to prevent entity expansion attacks.
See Also
- Python Fuzzy Matching Fuzzywuzzy Find out how Python's FuzzyWuzzy library matches messy, misspelled text — like a friend who understands you even when you mumble.
- Python Regex Lookahead Lookbehind Learn how Python regex can peek ahead or behind without grabbing text — like checking what's next in line without stepping forward.
- Python Regex Named Groups Learn how Python regex named groups let you label the pieces you capture — like putting name tags on your search results.
- Python Regex Patterns Discover how Python regex patterns work like a secret code for finding hidden text treasures in any document.
- Python Regular Expressions Learn how Python can find tricky text patterns fast, like spotting every phone number hidden in a messy page.