Python ast Module for Code Analysis — Deep Dive

Build production-grade code analysis tools using Python's ast module — from security scanners and complexity analyzers to source-to-source transformations.

How CPython builds the AST

When CPython runs source code, it goes through several stages: tokenization (splitting the source into tokens), parsing (applying the grammar to the token stream and building an abstract syntax tree), compilation (building symbol tables and generating bytecode), and execution. The ast module exposes the tree produced by the parsing stage.

Since Python 3.9, CPython uses a PEG parser in place of the older LL(1) parser. The old parser produced a concrete syntax tree that a separate pass converted to AST nodes; the PEG parser builds the ast node hierarchy directly from actions attached to the grammar rules. Either way, ast.parse() runs the full parser, not just a tokenizer, which is why it rejects any input that contains a syntax error.
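
The difference between tokenizing and parsing can be seen in a short sketch. The input string is an arbitrary illustration chosen for this example: it tokenizes without complaint but fails as soon as the full parser runs.

```python
import ast
import io
import tokenize

bad_source = "x = = 1\n"

# The tokenizer only splits text into tokens, so this succeeds.
tokens = list(tokenize.generate_tokens(io.StringIO(bad_source).readline))
print([tok.string for tok in tokens if tok.string.strip()])

# ast.parse() runs the full parser, so the same text is rejected.
try:
    ast.parse(bad_source)
    parse_failed = False
except SyntaxError as exc:
    parse_failed = True
    print(f"SyntaxError on line {exc.lineno}: {exc.msg}")

# Valid input yields a tree that ast.dump() can display.
tree = ast.parse("x = 1 + 2")
print(ast.dump(tree, indent=2))  # the indent argument needs Python 3.9+
```

The exact wording of the SyntaxError message varies between Python versions, so tools should not match on it.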

Advanced parsing modes

ast.parse() accepts a mode parameter that controls what grammar rule is used:

import ast

# Parse a module (default) — expects statements
mod = ast.parse("x = 1\ny = 2", mode="exec")

# Parse a single expression — expects one expression, returns ast.Expression
expr = ast.parse("2 + 3", mode="eval")

# Parse a single interactive statement — like the REPL
inter = ast.parse("x = 1", mode="single")

# Parse type comments (Python 3.8+)
typed = ast.parse(
    "x = []  # type: List[int]",
    type_comments=True
)

# Parse with feature_version to target older Python syntax
old = ast.parse("match x:\n  case 1: pass", feature_version=(3, 10))

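The feature_version parameter (Python 3.8+) asks the parser to make a best-effort attempt to accept only the grammar of the given (major, minor) version, so syntax introduced later is reported as a SyntaxError. This is useful for linters that must check code against several Python versions, but it does not guarantee that the result matches what that version's own parser would accept.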

Building a security scanner

The scanner below walks the tree, records import aliases, and flags calls to functions that are commonly dangerous:

import ast
from dataclasses import dataclass

@dataclass
class SecurityIssue:
    filename: str
    line: int
    col: int
    severity: str
    rule: str
    message: str

class SecurityScanner(ast.NodeVisitor):
    DANGEROUS_CALLS = {
        "eval": "Arbitrary code execution via eval()",
        "exec": "Arbitrary code execution via exec()",
        "compile": "Dynamic code compilation",
        "__import__": "Dynamic import can load arbitrary modules",
    }

    DANGEROUS_ATTRS = {
        ("os", "system"): "Shell command injection risk",
        ("subprocess", "call"): "Use subprocess.run with check=True instead",
        ("pickle", "loads"): "Pickle deserialization of untrusted data",
        ("yaml", "load"): "Use yaml.safe_load instead",
    }

    def __init__(self, filename):
        self.filename = filename
        self.issues: list[SecurityIssue] = []
        self._imports: dict[str, str] = {}

    def visit_Import(self, node):
        for alias in node.names:
            name = alias.asname or alias.name
            self._imports[name] = alias.name
        self.generic_visit(node)

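    def visit_ImportFrom(self, node):
        # node.module is None for relative imports such as "from . import x"
        module = node.module or ""
        for alias in node.names:
            name = alias.asname or alias.name
            self._imports[name] = f"{module}.{alias.name}" if module else alias.name
        self.generic_visit(node)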

    def visit_Call(self, node):
        # Check direct dangerous calls: eval(), exec()
        if isinstance(node.func, ast.Name):
            if node.func.id in self.DANGEROUS_CALLS:
                self.issues.append(SecurityIssue(
                    filename=self.filename,
                    line=node.lineno,
                    col=node.col_offset,
                    severity="HIGH",
                    rule="dangerous-call",
                    message=self.DANGEROUS_CALLS[node.func.id],
                ))

        # Check dangerous attribute calls: os.system(), pickle.loads()
        if isinstance(node.func, ast.Attribute):
            if isinstance(node.func.value, ast.Name):
                module = self._imports.get(
                    node.func.value.id, node.func.value.id
                )
                key = (module, node.func.attr)
                if key in self.DANGEROUS_ATTRS:
                    self.issues.append(SecurityIssue(
                        filename=self.filename,
                        line=node.lineno,
                        col=node.col_offset,
                        severity="MEDIUM",
                        rule="dangerous-attr-call",
                        message=self.DANGEROUS_ATTRS[key],
                    ))

        self.generic_visit(node)

def scan_file(filepath):
    # Read bytes so ast.parse() can honor a PEP 263 encoding declaration
    with open(filepath, "rb") as f:
        source = f.read()
    tree = ast.parse(source, filename=filepath)
    scanner = SecurityScanner(filepath)
    scanner.visit(tree)
    return scanner.issues
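
One gap worth noting: the scanner above resolves aliases only for attribute calls such as pk.loads(), so a function brought in with "from os import system" and called as a bare name is not matched against DANGEROUS_ATTRS. A minimal sketch of one way to resolve both forms to a dotted name follows. The helper names collect_imports and qualified_call_name are assumptions made for this example, not part of the scanner.

```python
import ast

def collect_imports(tree):
    """Map each locally bound name to the dotted name it was imported from."""
    imports = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.asname:
                    # "import a.b as c" binds c to a.b
                    imports[alias.asname] = alias.name
                else:
                    # "import a.b" binds only the top-level name a
                    top = alias.name.split(".")[0]
                    imports[top] = top
        elif isinstance(node, ast.ImportFrom) and node.module:
            for alias in node.names:
                imports[alias.asname or alias.name] = f"{node.module}.{alias.name}"
    return imports

def qualified_call_name(call, imports):
    """Return the dotted name a call refers to, or None if it is not a name chain."""
    parts = []
    target = call.func
    while isinstance(target, ast.Attribute):
        parts.append(target.attr)
        target = target.value
    if not isinstance(target, ast.Name):
        return None  # e.g. get_module().system(...)
    parts.append(imports.get(target.id, target.id))
    return ".".join(reversed(parts))

source = """
import pickle as pk
from os import system
import subprocess

pk.loads(data)
system("ls")
subprocess.call(["ls"])
"""
tree = ast.parse(source)
imports = collect_imports(tree)
calls = [
    qualified_call_name(node, imports)
    for node in ast.walk(tree)
    if isinstance(node, ast.Call)
]
print(calls)  # ['pickle.loads', 'os.system', 'subprocess.call']
```

This is still purely syntactic: reassignments such as run = os.system, or imports inside functions that shadow module-level names, are not tracked.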

Cyclomatic complexity calculator

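Cyclomatic complexity counts the number of linearly independent paths through a piece of code. A function starts at 1, and each decision point adds 1. Tools differ on the edge cases: counting with and assert, as the visitor below does, is a convention some analyzers use rather than part of the strict definition: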

import ast

class ComplexityVisitor(ast.NodeVisitor):
    BRANCHING_NODES = (
        ast.If, ast.For, ast.While, ast.ExceptHandler,
        ast.With, ast.Assert, ast.BoolOp,
    )

    def __init__(self):
        self.functions = {}
        self._current = None
        self._complexity = 0

    def visit_FunctionDef(self, node):
        old_name, old_complexity = self._current, self._complexity
        self._current = node.name
        self._complexity = 1  # Base complexity
        self.generic_visit(node)
        self.functions[self._current] = self._complexity
        self._current, self._complexity = old_name, old_complexity

    visit_AsyncFunctionDef = visit_FunctionDef

    def generic_visit(self, node):
        if isinstance(node, ast.BoolOp):
            # "a and b and c" has 3 operands but adds only 2 paths,
            # so count operands minus one (and nothing more)
            self._complexity += len(node.values) - 1
        elif isinstance(node, self.BRANCHING_NODES):
            self._complexity += 1
        super().generic_visit(node)

def calculate_complexity(source):
    tree = ast.parse(source)
    visitor = ComplexityVisitor()
    visitor.visit(tree)
    return visitor.functions
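
As a worked check of the counting rule, the sketch below tallies decision points for one small function using ast.walk(). The sample function classify is made up for this illustration, and the count deliberately uses only if, for, while, except, and boolean operators.

```python
import ast

source = '''
def classify(n):
    if n < 0 and n % 2:
        return "neg-odd"
    for i in range(n):
        if i > 10:
            break
    return "ok"
'''

func = ast.parse(source).body[0]

complexity = 1  # one path through a function with no branches
for node in ast.walk(func):
    if isinstance(node, (ast.If, ast.For, ast.While, ast.ExceptHandler)):
        complexity += 1
    elif isinstance(node, ast.BoolOp):
        # n operands joined by and/or add n - 1 paths
        complexity += len(node.values) - 1

print(complexity)  # 5
```

The total of 5 is the base path plus two if statements, one for loop, and one and operator.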

Source-to-source transformation

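A practical example: automatically inserting runtime isinstance() checks for annotated function parameters. This version assumes every annotation is a plain class such as str or int; parameterized types such as list[int] and string annotations cannot be passed to isinstance() and would fail at call time: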

import ast

class AddTypeChecks(ast.NodeTransformer):
    def visit_FunctionDef(self, node):
        self.generic_visit(node)
        checks = []
        for arg in node.args.args:
            if arg.annotation:
                check = ast.parse(
                    f"if not isinstance({arg.arg}, {ast.unparse(arg.annotation)}):"
                    f" raise TypeError('{arg.arg} must be "
                    f"{ast.unparse(arg.annotation)}')"
                ).body[0]
                checks.append(check)
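        # Keep a leading docstring first so it is still treated as the docstring
        split = 1 if ast.get_docstring(node, clean=False) is not None else 0
        node.body = node.body[:split] + checks + node.body[split:]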
        ast.fix_missing_locations(node)
        return node

# Usage
source = '''
def greet(name: str, times: int):
    for _ in range(times):
        print(f"Hello, {name}!")
'''

tree = ast.parse(source)
transformed = AddTypeChecks().visit(tree)
print(ast.unparse(transformed))
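
Because isinstance() only accepts classes, a transformer like this usually needs a filter that decides which annotations are safe to check. The sketch below is one possible filter, with is_plain_class_reference as a hypothetical helper name: it accepts bare and dotted names and skips everything else.

```python
import ast

def is_plain_class_reference(annotation):
    """True for `int` or `decimal.Decimal`; False for `list[int]`, `'Foo'`, `int | None`."""
    node = annotation
    while isinstance(node, ast.Attribute):
        node = node.value
    return isinstance(node, ast.Name)

func = ast.parse(
    "def f(a: int, b: list[int], c: 'Foo', d: decimal.Decimal, e): pass"
).body[0]

checkable = [
    arg.arg
    for arg in func.args.args
    if arg.annotation is not None and is_plain_class_reference(arg.annotation)
]
print(checkable)  # ['a', 'd']
```

The filter is syntactic, so a name such as Any or a type alias still passes it even though isinstance() would reject it at run time.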

ast.unparse() — from tree back to source

Python 3.9 added ast.unparse() which converts an AST back to valid Python source code. The output is semantically equivalent but may not match the original formatting:

import ast

source = "x   =   2  +  3  # comment"
tree = ast.parse(source)
print(ast.unparse(tree))  # "x = 2 + 3"  (no comment, normalized spacing)

Comments and whitespace are lost. For formatting-preserving transformations, use libcst instead.
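
A quick way to confirm that only formatting was lost is to parse the regenerated source and compare the two trees. This is a small sketch reusing the snippet above.

```python
import ast

source = "x   =   2  +  3  # comment"
tree = ast.parse(source)
regenerated = ast.unparse(tree)  # Python 3.9+

# ast.dump() omits line and column attributes by default,
# so this compares structure only.
same_structure = ast.dump(ast.parse(regenerated)) == ast.dump(tree)

print(regenerated)               # x = 2 + 3
print(same_structure)            # True
print("comment" in regenerated)  # False
```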

Performance considerations

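Parsing with ast.parse() is fast for individual files: a 1000-line file typically parses in a matter of milliseconds, though the exact figure depends on the code and the machine, so measure before optimizing. For large codebases:

Parse files in parallel using concurrent.futures.ProcessPoolExecutor; parsing is CPU-bound, so threads gain little under the GIL
Cache parsed ASTs if you need multiple analysis passes (AST nodes are picklable), but check that unpickling is actually faster than re-parsing
Use ast.walk() instead of NodeVisitor for simple pattern matching; it skips the per-node visit_* method lookup
For very large files, consider incremental parsing with tree-sitter-python (third-party)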
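
Since these numbers vary by machine and by code, it helps to measure them. The sketch below times ast.parse() against unpickling a cached tree for a synthetic 1000-line module. The generated source is an assumption for illustration only, and real code may behave differently.

```python
import ast
import pickle
import timeit

# Synthetic 1000-line module; real code will parse at a different speed.
source = "\n".join(
    f"value_{i} = compute(arg_{i}, {i}) + {i}" for i in range(1000)
)

runs = 20
parse_seconds = timeit.timeit(lambda: ast.parse(source), number=runs) / runs

tree = ast.parse(source)
blob = pickle.dumps(tree)
unpickle_seconds = timeit.timeit(lambda: pickle.loads(blob), number=runs) / runs

print(f"parse:    {parse_seconds * 1000:.2f} ms")
print(f"unpickle: {unpickle_seconds * 1000:.2f} ms")

# The unpickled tree has the same structure as the original.
restored = pickle.loads(blob)
round_trip_ok = ast.dump(restored) == ast.dump(tree)
print(round_trip_ok)
```

If unpickling turns out to be no faster than parsing, caching the analysis results rather than the trees is usually the better trade.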

Pattern matching with ast (Python 3.10+)

AST node classes define __match_args__, so structural pattern matching can destructure them directly:

import ast

def find_string_concats(tree):
    """Find string concatenation that could use f-strings."""
    results = []
    for node in ast.walk(tree):
        match node:
            case ast.BinOp(
                left=ast.Constant(value=str()),
                op=ast.Add(),
                right=_
            ):
                results.append((node.lineno, ast.unparse(node)))
            case ast.BinOp(
                left=_,
                op=ast.Add(),
                right=ast.Constant(value=str())
            ):
                results.append((node.lineno, ast.unparse(node)))
    return results

Comparison with alternatives

Tool	Use case	Preserves formatting	Performance
`ast` (stdlib)	Analysis, simple transforms	No	Fast
`libcst`	Formatting-preserving transforms	Yes	Moderate
`tree-sitter`	Incremental parsing, multi-language	Yes (CST)	Very fast
`parso`	Error-tolerant parsing (Jedi)	Yes	Moderate
`tokenize`	Token-level analysis	Yes (tokens)	Fast

Edge cases and gotchas

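f-strings before Python 3.12: an f-string was tokenized as a single string token and its contents were parsed by a separate routine. The AST still contained a JoinedStr node with Constant and FormattedValue children, but the source positions of nodes inside the f-string could be imprecise. Python 3.12 (PEP 701) moved f-strings into the main tokenizer and grammar, which gives their parts accurate positions.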

Type comments: pass type_comments=True to ast.parse() to capture the # type: comments that PEP 484 defined for code that also had to run on Python 2. They are stored in the type_comment attribute of the affected nodes. Without this flag, they are silently ignored.

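Compile and exec round-trip: You can compile() a modified AST and exec() the result. compile() takes the filename as a required argument, and whatever you pass is what tracebacks show, so a placeholder such as "<string>" leaves errors with no useful file attribution. Pass the real path, as in compile(tree, "original.py", "exec"), and call ast.fix_missing_locations() first if the transform created new nodes.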
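
A minimal sketch of the round trip follows. ScaleInts is a hypothetical transformer invented for this example, and "original.py" is just a label here; no such file needs to exist.

```python
import ast
import traceback

class ScaleInts(ast.NodeTransformer):
    """Hypothetical transformer: multiply every integer literal by 10."""

    def visit_Constant(self, node):
        if isinstance(node.value, int) and not isinstance(node.value, bool):
            new_node = ast.Constant(value=node.value * 10)
            return ast.copy_location(new_node, node)
        return node

source = "def f(x):\n    return 10 / x\n"
tree = ScaleInts().visit(ast.parse(source, filename="original.py"))
ast.fix_missing_locations(tree)  # fill in positions for any new nodes

code = compile(tree, "original.py", "exec")
namespace = {}
exec(code, namespace)
print(namespace["f"](5))  # 20.0

# The filename passed to compile() is what the traceback reports.
try:
    namespace["f"](0)
except ZeroDivisionError:
    tb_text = traceback.format_exc()
print(tb_text)
```

The traceback names original.py and line 2, but it can only display the source line itself if a file with that name is readable at the time of the error.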

The one thing to remember: The ast module’s real power emerges when you combine NodeVisitor for analysis and NodeTransformer for rewriting — together they let you build linters, security scanners, and refactoring tools that work on code structure rather than fragile text patterns.
