Abstract Syntax Trees — Deep Dive

From Source to Tree: The Parsing Pipeline

When CPython processes source code, the flow is: source text → tokenizer → parser → AST → compiler → bytecode. The AST stage sits right in the middle. Since Python 3.9, CPython uses a PEG parser (replacing the older LL(1) parser) that directly produces AST nodes defined in Parser/Python.asdl. This ASDL (Abstract Syntax Description Language) file is the authoritative schema for every node type.

You can inspect the ASDL definitions to understand exactly what attributes each node carries:

import ast

tree = ast.parse("x = 2 + 3")
assign = tree.body[0]

# assign is an ast.Assign node
# assign.targets = [ast.Name(id='x', ctx=ast.Store())]
# assign.value = ast.BinOp(left=ast.Constant(value=2),
#                           op=ast.Add(),
#                           right=ast.Constant(value=3))

Every expression node has a ctx attribute (Load, Store, Del) that indicates how the name is used. This is critical for correct compilation — the same name x produces different bytecode depending on whether it appears on the left side of an assignment or in an expression.

Walking and Visiting

The ast module provides two traversal strategies:

ast.walk(node) yields every node in the tree in breadth-first order. Useful for flat searches (“find all calls to eval”) where you do not care about nesting.

ast.NodeVisitor gives you depth-first traversal with method dispatch. Define visit_FunctionDef, visit_Call, etc., and the visitor calls the right method for each node type. Call self.generic_visit(node) to continue into child nodes.

class EvalFinder(ast.NodeVisitor):
    def __init__(self):
        self.calls = []

    def visit_Call(self, node):
        if isinstance(node.func, ast.Name) and node.func.id == 'eval':
            self.calls.append(node.lineno)
        self.generic_visit(node)

finder = EvalFinder()
finder.visit(ast.parse(open('suspect.py').read()))
print(f"eval() calls at lines: {finder.calls}")

Transforming the Tree

ast.NodeTransformer extends NodeVisitor — each visit_* method returns a node (the original, a replacement, or None to delete). After transformation, call ast.fix_missing_locations(tree) to propagate line numbers to new nodes, then compile(tree, '<string>', 'exec') to get a code object.

A practical example — automatically wrapping all division operations with a zero-check:

class SafeDivision(ast.NodeTransformer):
    def visit_BinOp(self, node):
        self.generic_visit(node)
        if isinstance(node.op, ast.Div):
            # Replace a / b with (a / b if b != 0 else float('inf'))
            guard = ast.IfExp(
                test=ast.Compare(
                    left=node.right,
                    ops=[ast.NotEq()],
                    comparators=[ast.Constant(value=0)]
                ),
                body=node,
                orelse=ast.Call(
                    func=ast.Name(id='float', ctx=ast.Load()),
                    args=[ast.Constant(value='inf')],
                    keywords=[]
                )
            )
            return guard
        return node

source = "result = a / b"
tree = ast.parse(source)
tree = SafeDivision().visit(tree)
ast.fix_missing_locations(tree)
code = compile(tree, '<safe>', 'exec')

Building Nodes from Scratch

Sometimes you need to construct AST nodes programmatically rather than parsing source. Every node type is a regular class:

func_node = ast.FunctionDef(
    name='greet',
    args=ast.arguments(
        posonlyargs=[], args=[ast.arg(arg='name')],
        vararg=None, kwonlyargs=[], kw_defaults=[],
        kwarg=None, defaults=[]
    ),
    body=[
        ast.Return(value=ast.JoinedStr(values=[
            ast.Constant(value='Hello, '),
            ast.FormattedValue(
                value=ast.Name(id='name', ctx=ast.Load()),
                conversion=-1, format_spec=None
            )
        ]))
    ],
    decorator_list=[],
    returns=None,
    lineno=1, col_offset=0
)

This is verbose but fully explicit. For complex generation, consider using ast.parse() on template strings and then surgically replacing nodes.

ast.unparse() and Round-Tripping

Since Python 3.9, ast.unparse(node) converts an AST back to source code. The output is semantically equivalent but not stylistically identical to the original — parentheses, formatting, and comments are lost. For production-grade source-to-source tools, libcst (Concrete Syntax Tree) preserves formatting and comments, making it better for codemods that must produce human-readable diffs.

Compile-Time Hooks: ast.PyCF_ONLY_AST

You can intercept compilation at the AST stage using the compile() built-in with the ast.PyCF_ONLY_AST flag, or more commonly by overriding the compilation in import hooks:

import importlib.abc, importlib.machinery

class ASTRewritingLoader(importlib.abc.SourceLoader):
    def source_to_code(self, data, path, *, _optimize=-1):
        tree = ast.parse(data, filename=path)
        tree = MyTransformer().visit(tree)
        ast.fix_missing_locations(tree)
        return compile(tree, path, 'exec', dont_inherit=True,
                       optimize=_optimize)

    def get_data(self, path):
        with open(path, 'rb') as f:
            return f.read()

This pattern is how tools like pytest rewrite assert statements to produce rich failure messages — they intercept the AST before compilation and inject comparison-reporting code.

Performance Considerations

Parsing into an AST is fast for typical file sizes (milliseconds for a few thousand lines). However, walking large trees repeatedly can add up in CI pipelines that analyze entire codebases. Strategies to mitigate this:

  • Cache parsed trees per file (keyed by content hash)
  • Use ast.walk() for simple searches instead of full visitor classes
  • Filter files before parsing (skip test data, generated code)
  • Run analysis in parallel across files with concurrent.futures

The AST itself is a pure Python object graph, so memory usage scales with code complexity. A 10,000-line file might produce tens of thousands of nodes, each a small Python object. For truly massive codebases, consider incremental analysis tools that only re-parse changed files.

Real-World AST Tools

Bandit (security linter) walks the AST looking for patterns like eval(), pickle.loads(), hardcoded passwords, and SQL string formatting. Each check is a visitor plugin.

MonkeyType records runtime types and patches function annotations by modifying the AST.

Hypothesis and pytest both use AST rewriting — pytest rewrites assert for better messages, Hypothesis rewrites test functions to inject its fuzzing machinery.

mypy and pyright build their own enriched ASTs (with type information attached to nodes) for static type checking, but they start from the same parser output.

Tradeoffs: AST vs. CST vs. Bytecode

LevelPreserves formatting?Preserves comments?Executable?Best for
CST (libcst)YesYesNoCodemods, formatting
AST (ast)NoNoVia compile()Analysis, simple transforms
Bytecode (dis)NoNoYesRuntime optimization, debugging

Choose AST when you need structural understanding without formatting concerns. Choose CST when the output must look like human-written code. Choose bytecode when you need to understand or modify what actually executes.

One thing to remember: The AST is the seam between human-readable code and machine-executable instructions. Mastering it gives you the ability to build linters, security scanners, code generators, and transformation tools — all grounded in the actual structure of Python programs rather than fragile text pattern matching.

pythoncompiler-internalslanguage-implementation

See Also

  • Python Bytecode Manipulation How Python secretly translates your code into tiny instructions — and how you can peek at and change those instructions yourself.
  • Python Code Objects Internals What Python actually creates when it reads your function — the hidden blueprint that tells the computer what to do.
  • Python Compiler Pipeline The journey your Python code takes from text file to running program — explained like an assembly line.
  • Python Frame Objects Why Python keeps a notepad for every running function — and how it remembers where it left off.
  • Python Peephole Optimizer How Python quietly tidies up your code behind the scenes — making it faster without you lifting a finger.