Python XML Parsing — Deep Dive
XML processing in Python ranges from simple tree navigation to streaming multi-gigabyte files, validating against schemas, and transforming with XSLT. This deep dive covers these techniques with production systems in mind.
Parsing Models: DOM vs. SAX vs. Pull
Python offers three distinct XML parsing approaches:
DOM (Document Object Model) — ElementTree
Loads the entire document into a tree in memory. Best for small to medium files where you need random access.
import xml.etree.ElementTree as ET
tree = ET.parse("catalog.xml")
root = tree.getroot()
# Entire tree is in memory — navigate freely
Memory: Proportional to document size (typically 3-5x the file size)
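As a small illustration of that random access, the sketch below parses an inline document and navigates it in a few ways. The catalog layout (`book`, `title`, `price`) is an assumption made for the example, not the contents of a real catalog.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data standing in for catalog.xml
SAMPLE = """<catalog>
  <book id="1"><title>Learning Python</title><price>45.00</price></book>
  <book id="2"><title>XML Basics</title><price>19.99</price></book>
</catalog>"""

root = ET.fromstring(SAMPLE)

# Direct children with a given tag
books = root.findall("book")

# Attribute lookup and child text
ids = [book.get("id") for book in books]
titles = [book.findtext("title") for book in books]

# Jump straight to one element by attribute value
second = root.find("./book[@id='2']")

# Visit every <price> element at any depth
total = sum(float(price.text) for price in root.iter("price"))
```

Because the whole tree is resident, these lookups can be made in any order and repeated as often as needed.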
SAX (Simple API for XML) — Event-Based
Fires events as the parser encounters elements. Never loads the full tree.
import xml.sax

class BookHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.current = ""
        self.books = []
        self.book = {}

    def startElement(self, tag, attrs):
        self.current = tag
        if tag == "book":
            self.book = {"id": attrs.get("id")}

    def characters(self, content):
        # characters() may fire several times per text node, so append
        if self.current in ("title", "author"):
            self.book[self.current] = self.book.get(self.current, "") + content

    def endElement(self, tag):
        if tag == "book":
            self.books.append(self.book)
        self.current = ""

handler = BookHandler()
xml.sax.parse("catalog.xml", handler)
print(handler.books)
Memory: Independent of file size for the parser itself. Only the handler's accumulated state grows (here, the books list gains one entry per book).
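To keep the handler's state small as well, each record can be handed to a callback instead of being stored. The following is a minimal sketch of that idea; `StreamingBookHandler` and `on_book` are hypothetical names, and the inline sample document is assumed for illustration.

```python
import xml.sax

class StreamingBookHandler(xml.sax.ContentHandler):
    """Calls on_book(book_dict) for each <book> instead of storing them all."""

    def __init__(self, on_book):
        super().__init__()
        self.on_book = on_book  # hypothetical per-record callback
        self.current = ""
        self.book = {}

    def startElement(self, tag, attrs):
        self.current = tag
        if tag == "book":
            self.book = {"id": attrs.get("id")}

    def characters(self, content):
        if self.current in ("title", "author"):
            self.book[self.current] = self.book.get(self.current, "") + content

    def endElement(self, tag):
        if tag == "book":
            self.on_book(self.book)
            self.book = {}  # drop the reference once it has been handed off
        self.current = ""

# Hypothetical sample data
SAMPLE = b"""<catalog>
  <book id="1"><title>Learning Python</title><author>A. Writer</author></book>
  <book id="2"><title>XML Basics</title><author>B. Author</author></book>
</catalog>"""

seen = []  # collected here only to show the result; a real callback might write to a database
xml.sax.parseString(SAMPLE, StreamingBookHandler(seen.append))
```

With a callback that does not retain the records, memory use stays flat regardless of how many books the file contains.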
iterparse — Pull-Based Streaming
ElementTree’s iterparse combines tree-style access with streaming efficiency:
import xml.etree.ElementTree as ET

def parse_large_xml(path):
    """Stream-parse a large XML file, yielding one book at a time."""
    context = ET.iterparse(path, events=("end",))
    for event, elem in context:
        if elem.tag == "book":
            yield {
                "id": elem.get("id"),
                "title": elem.findtext("title"),
                "author": elem.findtext("author"),
            }
            # Critical: clear the book only after reading it, to free memory.
            # Clearing on every "end" event would wipe <title> and <author>
            # before the enclosing <book> is processed.
            elem.clear()

# process() stands in for your own per-book handling
for book in parse_large_xml("huge_catalog.xml"):
    process(book)
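The elem.clear() call is essential: without it, every parsed element stays attached to the tree and memory grows with the file. Even with it, the root element keeps a reference to each cleared (now empty) book element, which can add up for very large files.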
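A common way to release those leftover references is to grab the root from the first "start" event and clear it after each record. The sketch below assumes the same hypothetical catalog layout; `iter_books` is an illustrative name.

```python
import io
import xml.etree.ElementTree as ET

def iter_books(source):
    """Yield one dict per <book>, releasing each element after use."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "book":
            yield {
                "id": elem.get("id"),
                "title": elem.findtext("title"),
            }
            # Remove processed children from the root so they can be freed
            root.clear()

# Hypothetical sample data; a file path works in place of the BytesIO object
sample = io.BytesIO(
    b'<catalog>'
    b'<book id="1"><title>Learning Python</title></book>'
    b'<book id="2"><title>XML Basics</title></book>'
    b'</catalog>'
)
books = list(iter_books(sample))
```

Note that root.clear() also discards the root's own attributes and text, so read anything you need from the root before the loop starts.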
Choosing a Parsing Model
| Model | Memory | Speed | Ease of Use | Best For |
|---|---|---|---|---|
| DOM (ElementTree) | High | Moderate | Easy | Files < 100 MB |
| SAX | Constant | Fast | Hard | Huge files, simple extraction |
| iterparse | Low | Fast | Moderate | Huge files, element-level access |
| lxml.etree.iterparse | Low | Fastest | Moderate | Huge files, need XPath |
Advanced XPath with lxml
lxml supports full XPath 1.0, which is far more powerful than ElementTree’s subset:
from lxml import etree
tree = etree.parse("catalog.xml")
# Predicates
expensive = tree.xpath("//book[price > 30]")
# Functions
count = tree.xpath("count(//book)")
# Axes
following = tree.xpath("//book[@id='1']/following-sibling::book")
# String functions
python_books = tree.xpath("//book[contains(title, 'Python')]")
# Namespace-aware
ns = {"atom": "http://www.w3.org/2005/Atom"}
entries = tree.xpath("//atom:entry/atom:title/text()", namespaces=ns)
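For comparison, the sketch below shows what ElementTree's XPath subset can and cannot express, using only the standard library. The sample data is assumed for illustration. Attribute predicates and exact child-text predicates work, while numeric comparisons and functions such as contains() have to be done in Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data
SAMPLE = """<catalog>
  <book id="1"><title>Learning Python</title><price>45.00</price></book>
  <book id="2"><title>XML Basics</title><price>19.99</price></book>
  <book id="3"><title>Python Cookbook</title><price>39.50</price></book>
</catalog>"""
root = ET.fromstring(SAMPLE)

# Supported by ElementTree: attribute and exact child-text predicates
book_two = root.find(".//book[@id='2']")
exact = root.findall(".//book[title='XML Basics']")

# Not supported by ElementTree: comparisons such as price > 30 or
# functions such as contains(), so filter in Python instead
expensive = [b for b in root.iter("book") if float(b.findtext("price")) > 30]
python_books = [b for b in root.iter("book") if "Python" in b.findtext("title")]
```

If most of your queries end up as Python list comprehensions like the last two, that is usually a sign that lxml's full XPath support would pay off.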
XPath Variables
Prevent injection by using variables instead of string formatting:
# DANGEROUS — XPath injection
tree.xpath(f"//user[@name='{user_input}']")
# SAFE — parameterized XPath
tree.xpath("//user[@name=$name]", name=user_input)
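To see why the formatted version is dangerous, it helps to look at the expression an attacker can produce. The sketch below only builds the query string; `build_query_unsafe` is a hypothetical helper that mirrors the string-formatting pattern above.

```python
def build_query_unsafe(user_input: str) -> str:
    """Hypothetical helper showing the vulnerable string-formatting pattern."""
    return f"//user[@name='{user_input}']"

# An attacker-controlled value that closes the quote and adds a condition
payload = "nobody' or '1'='1"
query = build_query_unsafe(payload)

# query is now: //user[@name='nobody' or '1'='1']
# The predicate is always true, so it matches every <user> element.
```

With the parameterized form, the same payload is treated as a literal attribute value to compare against, so it cannot change the structure of the expression.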
Schema Validation
XML Schema (XSD)
from lxml import etree

# Load schema
with open("catalog.xsd", "rb") as f:
    schema_doc = etree.parse(f)
schema = etree.XMLSchema(schema_doc)

# Validate a document
doc = etree.parse("catalog.xml")
is_valid = schema.validate(doc)
if not is_valid:
    for error in schema.error_log:
        print(f"Line {error.line}: {error.message}")
RelaxNG
from lxml import etree
relaxng_doc = etree.parse("catalog.rng")
relaxng = etree.RelaxNG(relaxng_doc)
is_valid = relaxng.validate(doc)  # doc is the tree parsed in the XSD example; returns True or False
Schematron (Business Rules)
from lxml import etree
from lxml.isoschematron import Schematron
schematron_doc = etree.parse("rules.sch")
schematron = Schematron(schematron_doc)
is_valid = schematron.validate(doc)  # doc is the tree parsed in the XSD example; returns True or False
XSLT Transforms
Transform XML documents using XSLT stylesheets:
from lxml import etree
# XSLT stylesheet that converts catalog XML to HTML
xslt_text = """
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/catalog">
    <html>
      <body>
        <table>
          <tr><th>Title</th><th>Price</th></tr>
          <xsl:for-each select="book">
            <tr>
              <td><xsl:value-of select="title"/></td>
              <td><xsl:value-of select="price"/></td>
            </tr>
          </xsl:for-each>
        </table>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
"""
xslt = etree.XSLT(etree.fromstring(xslt_text.encode()))
result = xslt(etree.parse("catalog.xml"))
html = str(result)
Security: Hardening XML Processing
The Threat Landscape
XML parsers are historically vulnerable because the XML specification includes features designed for flexibility that attackers exploit:
<!-- Billion Laughs: exponential entity expansion -->
<!DOCTYPE bomb [
<!ENTITY a "lol">
<!ENTITY b "&a;&a;&a;&a;&a;&a;&a;&a;&a;&a;">
<!ENTITY c "&b;&b;&b;&b;&b;&b;&b;&b;&b;&b;">
<!-- ...and so on: each further level multiplies the size by 10, so nine such levels above "a" yield 10^9 (one billion) "lol" strings -->
]>
<data>&c;</data>
<!-- XXE: External Entity Injection reads files -->
<!DOCTYPE foo [
<!ENTITY xxe SYSTEM "file:///etc/passwd">
]>
<data>&xxe;</data>
defusedxml: The Safe Choice
import defusedxml.ElementTree as DET
import defusedxml.minidom as Dminidom
from defusedxml.lxml import parse as safe_parse  # deprecated in recent defusedxml releases
# Drop-in replacements
root = DET.fromstring(untrusted_xml)
tree = DET.parse("untrusted.xml")
# lxml safe parsing (for new code, prefer a hardened lxml parser as shown in the next section)
tree = safe_parse("untrusted.xml")
defusedxml blocks:
- Entity expansion (billion laughs)
- External entity resolution (XXE)
- External DTD loading
- Quadratic blowup (repeated expansion of one large entity)
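The general idea behind this kind of blocking can be sketched with the standard library's expat bindings: register a handler that fires on any entity declaration and abort the parse. This is an illustration of the technique under that assumption, not defusedxml's actual code; `EntityDeclarationError` and `reject_entity_declarations` are hypothetical names.

```python
import xml.parsers.expat

class EntityDeclarationError(ValueError):
    """Hypothetical exception for documents that declare entities."""

def reject_entity_declarations(xml_bytes: bytes) -> None:
    """Raise EntityDeclarationError if the document declares any entity."""
    parser = xml.parsers.expat.ParserCreate()

    def on_entity_decl(name, is_parameter, value, base,
                       system_id, public_id, notation):
        # Fires for internal and external entity declarations alike
        raise EntityDeclarationError(f"entity declaration not allowed: {name!r}")

    parser.EntityDeclHandler = on_entity_decl
    parser.Parse(xml_bytes, True)  # True marks this as the final chunk

# Hypothetical inputs modelled on the attack samples above
bomb = b'<!DOCTYPE bomb [<!ENTITY a "lol"><!ENTITY b "&a;&a;">]><data>&b;</data>'
xxe = b'<!DOCTYPE foo [<!ENTITY xxe SYSTEM "file:///etc/passwd">]><data>&xxe;</data>'

def is_rejected(data: bytes) -> bool:
    try:
        reject_entity_declarations(data)
    except EntityDeclarationError:
        return True
    return False
```

Because the parse stops at the first declaration, no entity is ever expanded. Malformed input still raises expat's own ExpatError, which callers would need to handle separately.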
Hardening lxml Directly
If you need lxml features without defusedxml:
from lxml import etree

# Create a parser that blocks dangerous features
parser = etree.XMLParser(
    resolve_entities=False,
    no_network=True,
    dtd_validation=False,
    load_dtd=False,
)

# Use the hardened parser
tree = etree.parse("data.xml", parser)
root = etree.fromstring(xml_bytes, parser)
Production Patterns
RSS/Atom Feed Parser
from lxml import etree

# Feeds are untrusted input, so parse them with a hardened parser
SAFE_PARSER = etree.XMLParser(resolve_entities=False, no_network=True, load_dtd=False)

def parse_rss(url_or_path: str) -> list[dict]:
    """Parse an RSS 2.0 feed into a list of articles."""
    tree = etree.parse(url_or_path, SAFE_PARSER)
    items = []
    for item in tree.xpath("//item"):
        items.append({
            "title": item.findtext("title", ""),
            "link": item.findtext("link", ""),
            "description": item.findtext("description", ""),
            "pub_date": item.findtext("pubDate", ""),
        })
    return items

def parse_atom(url_or_path: str) -> list[dict]:
    """Parse an Atom feed."""
    ns = {"atom": "http://www.w3.org/2005/Atom"}
    tree = etree.parse(url_or_path, SAFE_PARSER)
    entries = []
    for entry in tree.xpath("//atom:entry", namespaces=ns):
        link = entry.find("atom:link", ns)  # None when the entry has no <link>
        entries.append({
            "title": entry.findtext("atom:title", "", ns),
            "link": link.get("href", "") if link is not None else "",
            "summary": entry.findtext("atom:summary", "", ns),
            "updated": entry.findtext("atom:updated", "", ns),
        })
    return entries
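Both parsers return dates as raw strings. A minimal sketch of converting them to datetime objects with the standard library follows; `parse_rss_date` and `parse_atom_date` are hypothetical helper names, and the handling of malformed values (returning None) is an assumption about what a caller would want.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

def parse_rss_date(value: str) -> Optional[datetime]:
    """Parse an RSS pubDate (RFC 822 style); return None if empty or invalid."""
    if not value:
        return None
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):  # the exception type varies by Python version
        return None

def parse_atom_date(value: str) -> Optional[datetime]:
    """Parse an Atom timestamp (RFC 3339 style); return None if empty or invalid."""
    if not value:
        return None
    value = value.strip()
    if value.endswith("Z"):
        # Older Python versions do not accept the "Z" suffix in fromisoformat()
        value = value[:-1] + "+00:00"
    try:
        return datetime.fromisoformat(value)
    except ValueError:
        return None

rss_dt = parse_rss_date("Mon, 06 Sep 2021 12:00:00 +0000")
atom_dt = parse_atom_date("2021-09-06T12:00:00Z")
```

Real feeds often contain dates that follow neither format strictly, so treating unparseable values as missing tends to be more robust than letting one bad entry fail the whole feed.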
XML to Dict Converter
from lxml import etree

def xml_to_dict(element) -> dict | str:
    """Convert an lxml element tree to nested Python dicts."""
    result = {}
    # Add attributes
    if element.attrib:
        result["@attrs"] = dict(element.attrib)
    # Process children
    children = {}
    for child in element:
        if not isinstance(child.tag, str):
            continue  # skip comments and processing instructions
        tag = child.tag
        value = xml_to_dict(child)
        if tag in children:
            # Multiple children with same tag → list
            if not isinstance(children[tag], list):
                children[tag] = [children[tag]]
            children[tag].append(value)
        else:
            children[tag] = value
    if children:
        result.update(children)
    # Text content
    if element.text and element.text.strip():
        if result:
            result["#text"] = element.text.strip()
        else:
            return element.text.strip()
    return result
High-Performance Batch Processing
from lxml import etree
from concurrent.futures import ProcessPoolExecutor
import os

def process_xml_chunk(xml_strings: list[bytes]) -> list[dict]:
    """Process a batch of XML documents in a worker process."""
    results = []
    parser = etree.XMLParser(resolve_entities=False, no_network=True)
    for xml_bytes in xml_strings:
        root = etree.fromstring(xml_bytes, parser)
        # extract_data() is your own function mapping a root element to a dict
        results.append(extract_data(root))
    return results

def parallel_xml_processing(directory: str, workers: int = 4):
    """Process many XML files in parallel.

    Call this from under `if __name__ == "__main__":` so that worker
    processes can import the module safely on platforms that spawn them.
    """
    files = [os.path.join(directory, f)
             for f in os.listdir(directory) if f.endswith(".xml")]
    # Read all files (or chunk for huge directories)
    xml_data = []
    for path in files:
        with open(path, "rb") as fh:  # close each file promptly
            xml_data.append(fh.read())
    # Split into chunks per worker
    chunk_size = len(xml_data) // workers + 1
    chunks = [xml_data[i:i+chunk_size]
              for i in range(0, len(xml_data), chunk_size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(process_xml_chunk, chunks))
    # Flatten results
    return [item for chunk in results for item in chunk]
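The chunk-size arithmetic is worth checking on small numbers. The sketch below isolates it in two hypothetical helpers: one using the `len // workers + 1` formula from the code above, and one using ceiling division as a possible alternative.

```python
def chunk_floor_plus_one(items: list, workers: int) -> list:
    """Split items using the len // workers + 1 chunk size from the code above."""
    chunk_size = len(items) // workers + 1
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

def chunk_ceiling(items: list, workers: int) -> list:
    """Split items using ceiling division, so the work spreads more evenly."""
    chunk_size = max(1, -(-len(items) // workers))  # ceil(len / workers), at least 1
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

eight = list(range(8))
sizes_floor = [len(c) for c in chunk_floor_plus_one(eight, workers=4)]
sizes_ceiling = [len(c) for c in chunk_ceiling(eight, workers=4)]
```

With 8 documents and 4 workers, the first formula gives a chunk size of 3 and therefore only three chunks (3, 3, 2), leaving one worker idle, while ceiling division gives four chunks of 2. When the count is not evenly divisible, for example 10 documents, both produce chunks of 3, 3, 3, and 1.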
Performance Comparison
Indicative figures for parsing a 50 MB XML file (actual results vary with hardware, Python version, and document structure):
| Library | Parse Time | Memory | XPath Support |
|---|---|---|---|
| ElementTree | ~2.5s | ~250 MB | Basic |
| lxml | ~0.8s | ~200 MB | Full |
| SAX | ~1.5s | ~2 MB | None |
| iterparse (ET) | ~3.0s | ~10 MB | Per-element |
| iterparse (lxml) | ~1.2s | ~10 MB | Per-element |
lxml is generally the fastest option for both parsing and querying because it is built on the C library libxml2.
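Since such figures depend heavily on the environment, it is worth measuring on your own data. The following is a minimal timing sketch using only the standard library; the synthetic catalog produced by `make_catalog` is hypothetical test data, and the same pattern could be applied to lxml by swapping the import.

```python
import io
import time
import xml.etree.ElementTree as ET

def make_catalog(n_books: int) -> bytes:
    """Build a synthetic catalog document (hypothetical test data)."""
    parts = ["<catalog>"]
    for i in range(n_books):
        parts.append(
            f'<book id="{i}"><title>Book {i}</title><price>{i % 50}</price></book>'
        )
    parts.append("</catalog>")
    return "".join(parts).encode("utf-8")

def time_full_parse(data: bytes) -> tuple:
    """Parse the whole document into a tree, then count the books."""
    start = time.perf_counter()
    root = ET.fromstring(data)
    count = len(root.findall("book"))
    return count, time.perf_counter() - start

def time_iterparse(data: bytes) -> tuple:
    """Stream the document, counting and clearing each book."""
    start = time.perf_counter()
    count = 0
    for _, elem in ET.iterparse(io.BytesIO(data), events=("end",)):
        if elem.tag == "book":
            count += 1
            elem.clear()
    return count, time.perf_counter() - start

data = make_catalog(10_000)
full_count, full_seconds = time_full_parse(data)
stream_count, stream_seconds = time_iterparse(data)
print(f"full parse: {full_count} books in {full_seconds:.3f}s")
print(f"iterparse:  {stream_count} books in {stream_seconds:.3f}s")
```

A single run like this is noisy. Repeating each measurement several times and using a document that resembles production data gives more trustworthy numbers.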
One Thing to Remember
Choose your XML parsing model based on file size and query complexity — ElementTree for simple tasks, lxml for XPath and validation, iterparse for large files, and always use defusedxml or hardened parsers when processing untrusted XML to prevent billion-laughs and XXE attacks.