Python XML Parsing — Deep Dive

XML processing in Python ranges from simple tree navigation to streaming multi-gigabyte files, validating against schemas, and transforming with XSLT. This deep dive covers the techniques that matter most in production systems.

Parsing Models: DOM vs. SAX vs. Pull

Python offers three distinct XML parsing approaches:

DOM (Document Object Model) — ElementTree

Loads the entire document into a tree in memory. Best for small to medium files where you need random access.

import xml.etree.ElementTree as ET

tree = ET.parse("catalog.xml")
root = tree.getroot()
# Entire tree is in memory — navigate freely

Memory: Proportional to document size (typically 3-5x the file size)
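As a minimal sketch of the random access a fully loaded tree allows, the snippet below parses a small inline document and reads it in arbitrary order. The catalog layout and element names are assumed for illustration, since the original catalog.xml is not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog layout, assumed for illustration only.
CATALOG = """
<catalog>
  <book id="1"><title>Learning XML</title><author>Ray</author><price>25.00</price></book>
  <book id="2"><title>Python Tricks</title><author>Bader</author><price>35.50</price></book>
</catalog>
"""

root = ET.fromstring(CATALOG)

# Random access: jump straight to the last book, then back to the first.
books = root.findall("book")
last_title = books[-1].findtext("title")
first_id = books[0].get("id")

# Walk every element in document order.
all_tags = [elem.tag for elem in root.iter()]

# The tree is mutable, so in-place edits are possible before re-serialising.
books[0].find("price").text = "19.99"
updated = ET.tostring(root, encoding="unicode")
```

Because the whole tree is resident, any of these lookups can run in any order, which is exactly what costs the memory noted above.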

SAX (Simple API for XML) — Event-Based

Fires events as the parser encounters elements. Never loads the full tree.

import xml.sax

class BookHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()  # let ContentHandler set up its own state
        self.current = ""   # tag currently being read
        self.books = []
        self.book = {}
    
    def startElement(self, tag, attrs):
        self.current = tag
        if tag == "book":
            self.book = {"id": attrs.get("id")}
    
    def characters(self, content):
        if self.current in ("title", "author"):
            self.book[self.current] = self.book.get(self.current, "") + content
    
    def endElement(self, tag):
        if tag == "book":
            self.books.append(self.book)
        self.current = ""

handler = BookHandler()
xml.sax.parse("catalog.xml", handler)
print(handler.books)

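Memory: Constant for the parser itself; total usage depends on what the handler keeps. The handler above appends every book to self.books, so its memory still grows with the number of books.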
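To keep memory flat, a handler can pass each finished record to a callback instead of storing it. The following is a rough sketch of that variation. The class name StreamingBookHandler, the on_book callback, and the sample document are illustrative assumptions rather than part of the original example.

```python
import xml.sax

class StreamingBookHandler(xml.sax.ContentHandler):
    """Hands each finished book to a callback instead of storing it."""

    def __init__(self, on_book):
        super().__init__()
        self.on_book = on_book  # hypothetical callback: receives one dict per book
        self.current = ""
        self.book = {}

    def startElement(self, tag, attrs):
        self.current = tag
        if tag == "book":
            self.book = {"id": attrs.get("id")}

    def characters(self, content):
        # characters() may fire several times per text node, so append.
        if self.current in ("title", "author"):
            self.book[self.current] = self.book.get(self.current, "") + content

    def endElement(self, tag):
        if tag == "book":
            self.on_book(self.book)
            self.book = {}  # drop the reference so the record can be freed
        self.current = ""

# Small inline sample, assumed for illustration.
SAMPLE = b"""<catalog>
  <book id="1"><title>Learning XML</title><author>Ray</author></book>
  <book id="2"><title>Python Tricks</title><author>Bader</author></book>
</catalog>"""

seen_titles = []
handler = StreamingBookHandler(lambda book: seen_titles.append(book["title"]))
xml.sax.parseString(SAMPLE, handler)
```

In a real pipeline the callback would typically write to a database or queue; the list here exists only to make the result visible.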

iterparse — Pull-Based Streaming

ElementTree’s iterparse combines tree-style access with streaming efficiency:

import xml.etree.ElementTree as ET

def parse_large_xml(path):
    """Stream-parse a large XML file, yielding one book at a time."""
    context = ET.iterparse(path, events=("end",))
    
    for event, elem in context:
        if elem.tag == "book":
            yield {
                "id": elem.get("id"),
                "title": elem.findtext("title"),
                "author": elem.findtext("author"),
            }
            # Critical: clear processed elements to free memory
            elem.clear()

for book in parse_large_xml("huge_catalog.xml"):
    process(book)

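The elem.clear() call is essential: without it, every parsed element stays attached to the tree and memory grows with the file. Note that clear() empties an element but leaves the now-empty element attached to its parent, so for very large files the root's child list should be released as well.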
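One common way to release the leftover empty elements is to keep a reference to the root and clear it after each record. The sketch below assumes that book elements are direct children of the root and uses an inline sample in place of a real file; the function name iter_books is illustrative.

```python
import io
import xml.etree.ElementTree as ET

def iter_books(source):
    """Yield one dict per <book>, releasing each element afterwards."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element

    for event, elem in context:
        if event == "end" and elem.tag == "book":
            yield {
                "id": elem.get("id"),
                "title": elem.findtext("title"),
                "author": elem.findtext("author"),
            }
            # Clearing the root also removes the finished <book> from its
            # child list, so no empty shells accumulate.
            root.clear()

# Small inline sample, assumed for illustration.
SAMPLE = b"""<catalog>
  <book id="1"><title>Learning XML</title><author>Ray</author></book>
  <book id="2"><title>Python Tricks</title><author>Bader</author></book>
</catalog>"""

books = list(iter_books(io.BytesIO(SAMPLE)))
```

The generator builds each dict before clearing, so the yielded data never depends on an element that has already been emptied.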

Choosing a Parsing Model

| Model | Memory | Speed | Ease of Use | Best For |
| --- | --- | --- | --- | --- |
| DOM (ElementTree) | High | Moderate | Easy | Files < 100 MB |
| SAX | Constant | Fast | Hard | Huge files, simple extraction |
| iterparse | Low | Fast | Moderate | Huge files, element-level access |
| lxml.iterparse | Low | Fastest | Moderate | Huge files, need XPath |

Advanced XPath with lxml

lxml supports full XPath 1.0, which is far more powerful than ElementTree’s subset:

from lxml import etree

tree = etree.parse("catalog.xml")

# Predicates
expensive = tree.xpath("//book[price > 30]")

# Functions
count = tree.xpath("count(//book)")

# Axes
following = tree.xpath("//book[@id='1']/following-sibling::book")

# String functions
python_books = tree.xpath("//book[contains(title, 'Python')]")

# Namespace-aware
ns = {"atom": "http://www.w3.org/2005/Atom"}
entries = tree.xpath("//atom:entry/atom:title/text()", namespaces=ns)
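For comparison, here is a rough sketch of where ElementTree's XPath subset stops. Equality predicates work, but numeric comparisons and functions such as count() have to be done in Python. The sample catalog is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog layout, assumed for illustration only.
CATALOG = """
<catalog>
  <book id="1"><title>Learning XML</title><author>Ray</author><price>25.00</price></book>
  <book id="2"><title>Python Tricks</title><author>Bader</author><price>35.50</price></book>
</catalog>
"""

root = ET.fromstring(CATALOG)

# Supported by ElementTree: attribute and child-text equality predicates.
by_id = root.findall(".//book[@id='2']")
by_author = root.findall(".//book[author='Ray']")

# Not supported by ElementTree: numeric comparisons such as //book[price > 30].
# The comparison has to happen in Python instead.
expensive_titles = [
    book.findtext("title")
    for book in root.iter("book")
    if float(book.findtext("price", "0")) > 30
]

# count(//book) has no ElementTree equivalent either; len() does the job.
book_count = len(root.findall(".//book"))
```

When filters like these start piling up in Python code, that is usually the point at which switching to lxml's full XPath pays off.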

XPath Variables

Prevent injection by using variables instead of string formatting:

# DANGEROUS — XPath injection
tree.xpath(f"//user[@name='{user_input}']")

# SAFE — parameterized XPath
tree.xpath("//user[@name=$name]", name=user_input)
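To see the difference in practice, the following sketch runs both forms against a tiny in-memory document. It requires lxml, and the user records and the hostile input value are made up for illustration.

```python
from lxml import etree

# Hypothetical user list, assumed for illustration only.
doc = etree.fromstring(
    "<users>"
    "<user name='alice' role='admin'/>"
    "<user name='bob' role='guest'/>"
    "</users>"
)

# A hostile value that closes the quote and adds an always-true clause.
user_input = "nobody' or '1'='1"

# String formatting: the payload rewrites the query, which becomes
# //user[@name='nobody' or '1'='1'] and matches every user.
injected = doc.xpath(f"//user[@name='{user_input}']")

# Variable binding: the payload is compared as one literal string
# and matches nothing.
bound = doc.xpath("//user[@name=$name]", name=user_input)
```

With the bound variable the input never becomes part of the expression, so no quoting or escaping logic is needed.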

Schema Validation

XML Schema (XSD)

from lxml import etree

# Load schema
with open("catalog.xsd", "rb") as f:
    schema_doc = etree.parse(f)
schema = etree.XMLSchema(schema_doc)

# Validate a document
doc = etree.parse("catalog.xml")
is_valid = schema.validate(doc)

if not is_valid:
    for error in schema.error_log:
        print(f"Line {error.line}: {error.message}")
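A self-contained sketch of the same flow, using an inline schema instead of catalog.xsd, may help when experimenting. It requires lxml. The schema below is an assumption about what a catalog schema might look like, not the original file.

```python
from lxml import etree

# Hypothetical schema, assumed for illustration only.
XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="catalog">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="book" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="title" type="xs:string"/>
              <xs:element name="price" type="xs:decimal"/>
            </xs:sequence>
            <xs:attribute name="id" type="xs:string" use="required"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

schema = etree.XMLSchema(etree.fromstring(XSD))

good = etree.fromstring(
    b'<catalog><book id="1"><title>A</title><price>9.99</price></book></catalog>'
)
bad = etree.fromstring(
    b'<catalog><book id="2"><title>B</title><price>cheap</price></book></catalog>'
)

good_ok = schema.validate(good)
bad_ok = schema.validate(bad)

# error_log holds the errors from the most recent validation.
messages = [f"Line {e.line}: {e.message}" for e in schema.error_log]

# assertValid() raises instead of returning False, which suits pipelines
# that should stop on the first invalid document.
try:
    schema.assertValid(bad)
    raised = False
except etree.DocumentInvalid:
    raised = True
```

Compiling the schema once and reusing the XMLSchema object across documents avoids re-parsing the XSD for every file.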

RelaxNG

from lxml import etree

relaxng_doc = etree.parse("catalog.rng")
relaxng = etree.RelaxNG(relaxng_doc)

doc = etree.parse("catalog.xml")
is_valid = relaxng.validate(doc)  # True or False; details in relaxng.error_log

Schematron (Business Rules)

from lxml import etree
from lxml.isoschematron import Schematron

schematron_doc = etree.parse("rules.sch")
schematron = Schematron(schematron_doc)

doc = etree.parse("catalog.xml")
is_valid = schematron.validate(doc)  # True or False

XSLT Transforms

Transform XML documents using XSLT stylesheets:

from lxml import etree

# XSLT stylesheet that converts catalog XML to HTML
xslt_text = """
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/catalog">
    <html>
      <body>
        <table>
          <tr><th>Title</th><th>Price</th></tr>
          <xsl:for-each select="book">
            <tr>
              <td><xsl:value-of select="title"/></td>
              <td><xsl:value-of select="price"/></td>
            </tr>
          </xsl:for-each>
        </table>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
"""

xslt = etree.XSLT(etree.fromstring(xslt_text.encode()))
result = xslt(etree.parse("catalog.xml"))
html = str(result)
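Stylesheets often need runtime input. As a hedged sketch, the example below passes a parameter into a transform as a keyword argument. It requires lxml, and the min_price parameter, the text output, and the sample catalog are assumptions made for illustration. Keyword values are evaluated as XPath expressions, so string values are best wrapped with etree.XSLT.strparam().

```python
from lxml import etree

# Hypothetical stylesheet with one parameter, assumed for illustration only.
XSLT_SOURCE = b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:param name="min_price" select="0"/>
  <xsl:template match="/catalog">
    <xsl:for-each select="book[price &gt;= $min_price]">
      <xsl:value-of select="title"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>"""

CATALOG = b"""<catalog>
  <book id="1"><title>Learning XML</title><price>25.00</price></book>
  <book id="2"><title>Python Tricks</title><price>35.50</price></book>
</catalog>"""

transform = etree.XSLT(etree.fromstring(XSLT_SOURCE))
doc = etree.fromstring(CATALOG)

# "30" is parsed as an XPath number, not a string.
filtered = str(transform(doc, min_price="30"))

# Without the argument, the stylesheet's default (0) applies.
everything = str(transform(doc))
```

The compiled transform object is reusable, so it can be built once and applied to many documents.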

Security: Hardening XML Processing

The Threat Landscape

XML parsers are historically vulnerable because the XML specification includes features designed for flexibility that attackers exploit:

<!-- Billion Laughs: exponential entity expansion -->
<!DOCTYPE bomb [
  <!ENTITY a "lol">
  <!ENTITY b "&a;&a;&a;&a;&a;&a;&a;&a;&a;&a;">
  <!ENTITY c "&b;&b;&b;&b;&b;&b;&b;&b;&b;&b;">
  <!-- Continued for nine levels, the last entity expands to 10^9 (one billion) "lol" strings -->
]>
<data>&c;</data>

<!-- XXE: External Entity Injection reads files -->
<!DOCTYPE foo [
  <!ENTITY xxe SYSTEM "file:///etc/passwd">
]>
<data>&xxe;</data>
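The growth behind the entity-expansion attack is plain exponentiation. A small sketch of the arithmetic follows; the helper name expanded_size is illustrative.

```python
def expanded_size(base: str, fanout: int, levels: int) -> int:
    """Bytes produced when `base` is wrapped in `levels` layers,
    each layer referencing the previous one `fanout` times."""
    return len(base.encode("utf-8")) * fanout ** levels

# Entity c above sits two levels over "lol": 3 bytes * 10^2.
two_levels = expanded_size("lol", 10, 2)

# The classic attack nests nine levels: 3 bytes * 10^9.
nine_levels = expanded_size("lol", 10, 9)

nine_levels_gb = nine_levels / 1_000_000_000
```

A payload of well under a kilobyte therefore asks the parser to materialise about 3 GB of text, which is why parsers need an expansion limit or must refuse entity declarations outright.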

defusedxml: The Safe Choice

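import defusedxml.ElementTree as DET
import defusedxml.minidom as Dminidom  # same API as xml.dom.minidom
from defusedxml.lxml import parse as safe_parse  # deprecated in recent defusedxml releases

# Drop-in replacements for the standard-library parsers
root = DET.fromstring(untrusted_xml)
tree = DET.parse("untrusted.xml")

# lxml safe parsing (for new code, prefer a hardened lxml parser)
tree = safe_parse("untrusted.xml")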

defusedxml blocks:

  • Entity expansion (billion laughs)
  • External entity resolution (XXE)
  • External DTD loading
  • Quadratic blowup (one large entity referenced many times)
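To make the idea of refusing dangerous features concrete, here is a standard-library sketch that rejects any document containing a DOCTYPE declaration. This is not defusedxml's implementation, only an illustration of the principle. The names DoctypeForbidden and parse_rejecting_doctype are hypothetical, and the sketch ignores namespaces and other details a real parser wrapper would handle.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat

class DoctypeForbidden(ValueError):
    """Raised when a document declares a DOCTYPE (hypothetical name)."""

def parse_rejecting_doctype(data: bytes) -> ET.Element:
    """Build an ElementTree element, refusing any DOCTYPE declaration."""
    builder = ET.TreeBuilder()
    parser = expat.ParserCreate()

    def forbid_doctype(name, sysid, pubid, has_internal_subset):
        # Entities can only be declared inside a DOCTYPE, so refusing it
        # rules out both entity expansion and external entities.
        raise DoctypeForbidden(f"DOCTYPE {name!r} is not allowed")

    parser.StartDoctypeDeclHandler = forbid_doctype
    parser.StartElementHandler = builder.start
    parser.EndElementHandler = builder.end
    parser.CharacterDataHandler = builder.data
    parser.Parse(data, True)
    return builder.close()

SAFE = b"<data>hello</data>"
XXE = b'<!DOCTYPE foo [<!ENTITY xxe SYSTEM "file:///etc/passwd">]><data>&xxe;</data>'

safe_root = parse_rejecting_doctype(SAFE)

try:
    parse_rejecting_doctype(XXE)
    rejected = False
except DoctypeForbidden:
    rejected = True
```

For production use, defusedxml remains the better choice because it covers more cases than this sketch does.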

Hardening lxml Directly

If you need lxml features without defusedxml:

from lxml import etree

# Create a parser that blocks dangerous features
parser = etree.XMLParser(
    resolve_entities=False,
    no_network=True,
    dtd_validation=False,
    load_dtd=False,
)

# Use the hardened parser
tree = etree.parse("data.xml", parser)
root = etree.fromstring(xml_bytes, parser)

Production Patterns

RSS/Atom Feed Parser

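from lxml import etree

def parse_rss(url_or_path: str) -> list[dict]:
    """Parse an RSS 2.0 feed into a list of articles.

    Depending on the libxml2 build, lxml may fetch plain http:// URLs itself;
    https:// is not supported, so download such feeds first.
    """
    tree = etree.parse(url_or_path)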
    items = []
    
    for item in tree.xpath("//item"):
        items.append({
            "title": item.findtext("title", ""),
            "link": item.findtext("link", ""),
            "description": item.findtext("description", ""),
            "pub_date": item.findtext("pubDate", ""),
        })
    
    return items

def parse_atom(url_or_path: str) -> list[dict]:
    """Parse an Atom feed."""
    ns = {"atom": "http://www.w3.org/2005/Atom"}
    tree = etree.parse(url_or_path)
    entries = []
    
    for entry in tree.xpath("//atom:entry", namespaces=ns):
        link = entry.find("atom:link", ns)  # None when the entry has no link
        entries.append({
            "title": entry.findtext("atom:title", "", ns),
            "link": link.get("href", "") if link is not None else "",
            "summary": entry.findtext("atom:summary", "", ns),
            "updated": entry.findtext("atom:updated", "", ns),
        })
    
    return entries

XML to Dict Converter

from lxml import etree

def xml_to_dict(element) -> dict | str:
    """Convert an lxml element tree to nested Python dicts."""
    result = {}
    
    # Add attributes
    if element.attrib:
        result["@attrs"] = dict(element.attrib)
    
    # Process children
    children = {}
    for child in element:
        tag = child.tag
        value = xml_to_dict(child)
        
        if tag in children:
            # Multiple children with same tag → list
            if not isinstance(children[tag], list):
                children[tag] = [children[tag]]
            children[tag].append(value)
        else:
            children[tag] = value
    
    if children:
        result.update(children)
    
    # Text content
    if element.text and element.text.strip():
        if result:
            result["#text"] = element.text.strip()
        else:
            return element.text.strip()
    
    return result

High-Performance Batch Processing

from lxml import etree
from concurrent.futures import ProcessPoolExecutor
import os

def process_xml_chunk(xml_strings: list[bytes]) -> list[dict]:
    """Process a batch of XML documents in a worker process."""
    results = []
    parser = etree.XMLParser(resolve_entities=False, no_network=True)
    
    for xml_bytes in xml_strings:
        root = etree.fromstring(xml_bytes, parser)
        # extract_data is an application-specific function (not shown here)
        results.append(extract_data(root))
    
    return results

def parallel_xml_processing(directory: str, workers: int = 4):
    """Process many XML files in parallel."""
    files = [os.path.join(directory, f) 
             for f in os.listdir(directory) if f.endswith(".xml")]
    
    # Read all files (or chunk for huge directories); close each handle promptly
    xml_data = []
    for path in files:
        with open(path, "rb") as fh:
            xml_data.append(fh.read())
    
    # Split into chunks per worker
    chunk_size = len(xml_data) // workers + 1
    chunks = [xml_data[i:i+chunk_size] 
              for i in range(0, len(xml_data), chunk_size)]
    
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(process_xml_chunk, chunks))
    
    # Flatten results
    return [item for chunk in results for item in chunk]

Performance Comparison

Parsing a 50 MB XML file:

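| Library | Parse Time | Memory | XPath Support |
| --- | --- | --- | --- |
| ElementTree | ~2.5s | ~250 MB | Basic |
| lxml | ~0.8s | ~200 MB | Full |
| SAX | ~1.5s | ~2 MB | None |
| iterparse (ET) | ~3.0s | ~10 MB | Per-element |
| iterparse (lxml) | ~1.2s | ~10 MB | Per-element |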

In these measurements lxml is the fastest option for both full-tree parsing and streaming with iterparse, owing to its C foundation (libxml2).
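Figures like these vary with hardware, library versions, and document shape, so it is worth measuring on your own data. Below is a rough standard-library harness that compares a full ElementTree parse with iterparse on a synthetic document. The helper names and the document size are arbitrary choices for illustration. tracemalloc only sees allocations made through Python's allocator, so it would undercount memory used inside libxml2 if the harness were adapted for lxml.

```python
import io
import time
import tracemalloc
import xml.etree.ElementTree as ET

def measure(label, func):
    """Run func once, recording wall time and peak traced memory."""
    tracemalloc.start()
    start = time.perf_counter()
    result = func()
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"label": label, "seconds": elapsed, "peak_bytes": peak, "result": result}

def make_catalog(n):
    """Build a synthetic catalog with n books (illustrative data only)."""
    rows = "".join(f'<book id="{i}"><title>Title {i}</title></book>' for i in range(n))
    return f"<catalog>{rows}</catalog>".encode()

DATA = make_catalog(20_000)

def full_parse():
    return len(ET.fromstring(DATA).findall("book"))

def streaming_parse():
    count = 0
    context = ET.iterparse(io.BytesIO(DATA), events=("start", "end"))
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == "book":
            count += 1
            root.clear()  # release finished elements as we go
    return count

runs = [
    measure("ElementTree (full tree)", full_parse),
    measure("iterparse + clear", streaming_parse),
]

for run in runs:
    print(f'{run["label"]}: {run["seconds"]:.3f}s, peak {run["peak_bytes"] / 1e6:.1f} MB')
```

Running each candidate several times and on a file that resembles production input gives a more trustworthy picture than a single pass.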

One Thing to Remember

Choose your XML parsing model based on file size and query complexity: ElementTree for simple tasks, lxml for XPath and validation, and iterparse for large files. Always use defusedxml or a hardened parser when processing untrusted XML to prevent billion-laughs and XXE attacks.

