Python XML Parsing — Core Concepts

Python provides multiple XML parsing approaches in the standard library. The most practical for everyday use is ElementTree, which represents XML as a tree of elements you can navigate and query.

ElementTree: The Standard Approach

Parsing XML

import xml.etree.ElementTree as ET

# From a file
tree = ET.parse("data.xml")
root = tree.getroot()

# From a string
root = ET.fromstring("""
<catalog>
  <book id="1">
    <title>Python Basics</title>
    <price currency="USD">29.99</price>
  </book>
  <book id="2">
    <title>Advanced Python</title>
    <price currency="EUR">39.99</price>
  </book>
</catalog>
""")
# Tag name
print(root.tag)  # "catalog"

# Direct children
for child in root:
    print(child.tag, child.attrib)
    # "book", {"id": "1"}
    # "book", {"id": "2"}

# Text content
for book in root:
    title = book.find("title").text
    print(title)

Finding Elements

find() returns the first match. findall() returns all matches. Both use a subset of XPath:

# Find first book
first_book = root.find("book")

# Find all books
all_books = root.findall("book")

# Find nested elements
all_titles = root.findall(".//title")  # .// means any depth

# Find by attribute
book_1 = root.find("book[@id='1']")

Extracting Data

# Build a list of dictionaries from XML
books = []
for book in root.findall("book"):
    books.append({
        "id": book.get("id"),
        "title": book.find("title").text,
        "price": float(book.find("price").text),
        "currency": book.find("price").get("currency"),
    })

Handling Namespaces

Many real-world XML documents use namespaces:

<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Blog Post</title>
  </entry>
</feed>

ElementTree requires the full namespace in queries:

ns = {"atom": "http://www.w3.org/2005/Atom"}

root = ET.fromstring(xml_string)
entries = root.findall("atom:entry", ns)
for entry in entries:
    title = entry.find("atom:title", ns).text

This is verbose but explicit. Define namespace mappings once and reuse them.

Writing XML

Building a Document

root = ET.Element("catalog")
book = ET.SubElement(root, "book", id="1")
title = ET.SubElement(book, "title")
title.text = "Python Basics"

# Convert to string
xml_string = ET.tostring(root, encoding="unicode", xml_declaration=True)

Modifying Existing XML

tree = ET.parse("config.xml")
root = tree.getroot()

# Change a value
root.find(".//port").text = "8080"

# Add a new element
new_elem = ET.SubElement(root, "feature")
new_elem.text = "caching"

# Save
tree.write("config.xml", encoding="utf-8", xml_declaration=True)

lxml: The Power Tool

For complex XML work, lxml is significantly more capable:

pip install lxml
from lxml import etree

root = etree.fromstring(xml_bytes)

# Full XPath support
titles = root.xpath("//book[price > 30]/title/text()")

# CSS selectors (install cssselect)
books = root.cssselect("book[id]")

When to use lxml over ElementTree:

FeatureElementTreelxml
XPathBasic subsetFull XPath 1.0
SpeedModerateFast (C-based)
Validation (XSD/DTD)NoYes
CSS selectorsNoYes (with cssselect)
XSLT transformsNoYes
InstallationBuilt-inRequires install

Security Warning

Never use xml.etree.ElementTree with untrusted XML without precautions.

XML parsers are vulnerable to several attacks:

  • Billion Laughs: Exponential entity expansion eats all memory
  • External Entity Injection (XXE): Reads local files via entity references

Use defusedxml for untrusted input:

pip install defusedxml
from defusedxml.ElementTree import parse, fromstring

# Safe against XXE and entity expansion attacks
root = fromstring(untrusted_xml_string)

Common Misconception

“XML is dead, replaced by JSON.” XML remains essential in enterprise systems (banking, healthcare, government), document formats (DOCX, SVG, EPUB), and syndication (RSS/Atom). If you work with APIs from large organizations, you’ll encounter XML regularly.

One Thing to Remember

Use ElementTree for simple XML tasks, lxml when you need full XPath or validation, and always use defusedxml when parsing untrusted input to prevent entity expansion attacks.

pythonxmlparsingtext-processingdata-extraction

See Also

  • Python Fuzzy Matching Fuzzywuzzy Find out how Python's FuzzyWuzzy library matches messy, misspelled text — like a friend who understands you even when you mumble.
  • Python Regex Lookahead Lookbehind Learn how Python regex can peek ahead or behind without grabbing text — like checking what's next in line without stepping forward.
  • Python Regex Named Groups Learn how Python regex named groups let you label the pieces you capture — like putting name tags on your search results.
  • Python Regex Patterns Discover how Python regex patterns work like a secret code for finding hidden text treasures in any document.
  • Python Regular Expressions Learn how Python can find tricky text patterns fast, like spotting every phone number hidden in a messy page.