Python XML Parsing — Core Concepts

Learn the two main approaches to XML parsing in Python — ElementTree for simplicity and lxml for power — with practical extraction patterns.

Python provides multiple XML parsing approaches in the standard library. The most practical for everyday use is ElementTree, which represents XML as a tree of elements you can navigate and query.

ElementTree: The Standard Approach

Parsing XML

import xml.etree.ElementTree as ET

# From a file
tree = ET.parse("data.xml")
root = tree.getroot()

# From a string
root = ET.fromstring("""
<catalog>
  <book id="1">
    <title>Python Basics</title>
    <price currency="USD">29.99</price>
  </book>
  <book id="2">
    <title>Advanced Python</title>
    <price currency="EUR">39.99</price>
  </book>
</catalog>
""")

Navigating the Tree

# Tag name
print(root.tag)  # "catalog"

# Direct children
for child in root:
    print(child.tag, child.attrib)
    # "book", {"id": "1"}
    # "book", {"id": "2"}

# Text content
for book in root:
    title = book.find("title").text
    print(title)

Finding Elements

find() returns the first match. findall() returns all matches. Both use a subset of XPath:

# Find first book
first_book = root.find("book")

# Find all books
all_books = root.findall("book")

# Find nested elements
all_titles = root.findall(".//title")  # .// means any depth

# Find by attribute
book_1 = root.find("book[@id='1']")

Extracting Data

# Build a list of dictionaries from XML
books = []
for book in root.findall("book"):
    books.append({
        "id": book.get("id"),
        "title": book.find("title").text,
        "price": float(book.find("price").text),
        "currency": book.find("price").get("currency"),
    })

Handling Namespaces

Many real-world XML documents use namespaces:

<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Blog Post</title>
  </entry>
</feed>

ElementTree requires the full namespace in queries:

ns = {"atom": "http://www.w3.org/2005/Atom"}

root = ET.fromstring(xml_string)
entries = root.findall("atom:entry", ns)
for entry in entries:
    title = entry.find("atom:title", ns).text

This is verbose but explicit. Define namespace mappings once and reuse them.

Writing XML

Building a Document

root = ET.Element("catalog")
book = ET.SubElement(root, "book", id="1")
title = ET.SubElement(book, "title")
title.text = "Python Basics"

# Convert to string
xml_string = ET.tostring(root, encoding="unicode", xml_declaration=True)

Modifying Existing XML

tree = ET.parse("config.xml")
root = tree.getroot()

# Change a value
root.find(".//port").text = "8080"

# Add a new element
new_elem = ET.SubElement(root, "feature")
new_elem.text = "caching"

# Save
tree.write("config.xml", encoding="utf-8", xml_declaration=True)

lxml: The Power Tool

For complex XML work, lxml is significantly more capable:

pip install lxml

from lxml import etree

root = etree.fromstring(xml_bytes)

# Full XPath support
titles = root.xpath("//book[price > 30]/title/text()")

# CSS selectors (install cssselect)
books = root.cssselect("book[id]")

When to use lxml over ElementTree:

Feature	ElementTree	lxml
XPath	Basic subset	Full XPath 1.0
Speed	Moderate	Fast (C-based)
Validation (XSD/DTD)	No	Yes
CSS selectors	No	Yes (with cssselect)
XSLT transforms	No	Yes
Installation	Built-in	Requires install

Security Warning

Never use xml.etree.ElementTree with untrusted XML without precautions.

XML parsers are vulnerable to several attacks:

Billion Laughs: Exponential entity expansion eats all memory
External Entity Injection (XXE): Reads local files via entity references

Use defusedxml for untrusted input:

pip install defusedxml

from defusedxml.ElementTree import parse, fromstring

# Safe against XXE and entity expansion attacks
root = fromstring(untrusted_xml_string)

Common Misconception

“XML is dead, replaced by JSON.” XML remains essential in enterprise systems (banking, healthcare, government), document formats (DOCX, SVG, EPUB), and syndication (RSS/Atom). If you work with APIs from large organizations, you’ll encounter XML regularly.

One Thing to Remember

Use ElementTree for simple XML tasks, lxml when you need full XPath or validation, and always use defusedxml when parsing untrusted input to prevent entity expansion attacks.

pythonxmlparsingtext-processingdata-extraction