Python ast Module for Code Analysis — Deep Dive
How CPython builds the AST
When Python processes source code, it goes through several stages: tokenization (breaking source into tokens), parsing (matching the tokens against the grammar and building an abstract syntax tree), compilation (generating bytecode from the AST), and execution. The ast module exposes the tree produced by the parsing stage.
Internally, CPython has used a PEG parser since Python 3.9, replacing the older LL(1) parser. The LL(1) parser produced a concrete syntax tree that a separate pass converted into AST nodes; the PEG parser builds AST nodes directly from actions attached to the grammar rules. Either way, ast.parse() runs the full parser, not just a tokenizer, which is why it raises SyntaxError on invalid code.
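The stages can be walked through by hand with standard-library calls. This is a rough sketch, not a view into CPython's internals: the tokenize module stands in for the C tokenizer, and the names source, tokens, and namespace are only illustrative.

```python
import ast
import io
import tokenize

source = "total = 1 + 2\n"

# Stage 1: tokenization (tokenize approximates what the C tokenizer sees)
tokens = [
    tok.string
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
    if tok.string.strip()  # drop NEWLINE and ENDMARKER tokens
]
print(tokens)

# Stage 2: parsing into an AST
tree = ast.parse(source)
print(ast.dump(tree, indent=2))

# Stage 3: compiling the AST to a code object (bytecode)
code = compile(tree, "<example>", "exec")

# Stage 4: executing the code object
namespace = {}
exec(code, namespace)
print(namespace["total"])
```

The same compile() call accepts either source text or an AST, which is what makes AST-level rewriting possible between stages 2 and 3.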
Advanced parsing modes
ast.parse() accepts a mode parameter that controls what grammar rule is used:
import ast
# Parse a module (default) — expects statements
mod = ast.parse("x = 1\ny = 2", mode="exec")
# Parse a single expression — expects one expression, returns ast.Expression
expr = ast.parse("2 + 3", mode="eval")
# Parse a single interactive statement — like the REPL
inter = ast.parse("x = 1", mode="single")
# Parse type comments (Python 3.8+)
typed = ast.parse(
    "x = [] # type: List[int]",
    type_comments=True
)
# Parse with feature_version to target older Python syntax
old = ast.parse("match x:\n case 1: pass", feature_version=(3, 10))
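The feature_version parameter (Python 3.8+) asks the parser to make a best-effort attempt to parse code with the grammar of a specific Python version, which is useful for linters that need to support multiple Python versions. With feature_version=(3, 10) the match statement above is accepted; an earlier version such as (3, 9) would attempt to reject it.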
Building a security scanner
A practical security scanner that detects dangerous patterns:
import ast
from dataclasses import dataclass


@dataclass
class SecurityIssue:
    filename: str
    line: int
    col: int
    severity: str
    rule: str
    message: str


class SecurityScanner(ast.NodeVisitor):
    DANGEROUS_CALLS = {
        "eval": "Arbitrary code execution via eval()",
        "exec": "Arbitrary code execution via exec()",
        "compile": "Dynamic code compilation",
        "__import__": "Dynamic import can load arbitrary modules",
    }
    DANGEROUS_ATTRS = {
        ("os", "system"): "Shell command injection risk",
        ("subprocess", "call"): "Use subprocess.run with check=True instead",
        ("pickle", "loads"): "Pickle deserialization of untrusted data",
        ("yaml", "load"): "Use yaml.safe_load instead",
    }

    def __init__(self, filename):
        self.filename = filename
        self.issues: list[SecurityIssue] = []
        # Maps each locally bound name to the dotted name it was imported from
        self._imports: dict[str, str] = {}

    def _report(self, node, severity, rule, message):
        self.issues.append(SecurityIssue(
            filename=self.filename,
            line=node.lineno,
            col=node.col_offset,
            severity=severity,
            rule=rule,
            message=message,
        ))

    def visit_Import(self, node):
        for alias in node.names:
            if alias.asname:
                # "import subprocess as sp" binds sp to subprocess
                self._imports[alias.asname] = alias.name
            else:
                # "import os.path" binds only the top-level name os
                top = alias.name.split(".")[0]
                self._imports[top] = top
        self.generic_visit(node)

    def visit_ImportFrom(self, node):
        # node.module is None for relative imports such as "from . import x"
        module = node.module or ""
        for alias in node.names:
            name = alias.asname or alias.name
            self._imports[name] = f"{module}.{alias.name}" if module else alias.name
        self.generic_visit(node)

    def visit_Call(self, node):
        func = node.func
        if isinstance(func, ast.Name):
            # Check direct dangerous calls: eval(), exec()
            if func.id in self.DANGEROUS_CALLS:
                self._report(node, "HIGH", "dangerous-call",
                             self.DANGEROUS_CALLS[func.id])
            # Check names bound by "from os import system"
            module, _, attr = self._imports.get(func.id, "").rpartition(".")
            if (module, attr) in self.DANGEROUS_ATTRS:
                self._report(node, "MEDIUM", "dangerous-attr-call",
                             self.DANGEROUS_ATTRS[(module, attr)])
        # Check dangerous attribute calls: os.system(), pickle.loads()
        elif isinstance(func, ast.Attribute) and isinstance(func.value, ast.Name):
            module = self._imports.get(func.value.id, func.value.id)
            key = (module, func.attr)
            if key in self.DANGEROUS_ATTRS:
                self._report(node, "MEDIUM", "dangerous-attr-call",
                             self.DANGEROUS_ATTRS[key])
        self.generic_visit(node)


def scan_file(filepath):
    with open(filepath, encoding="utf-8") as f:
        source = f.read()
    tree = ast.parse(source, filename=filepath)
    scanner = SecurityScanner(filepath)
    scanner.visit(tree)
    return scanner.issues
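The scanner keeps an import map because the AST records only the names that appear in the source, not what they refer to. A small sketch of what the tree contains for an aliased import shows why (the snippet in source is made up for illustration):

```python
import ast

source = "import subprocess as sp\nsp.call('ls')\n"
tree = ast.parse(source)
import_node, call_stmt = tree.body

# The Import node carries both the real module name and the alias
alias = import_node.names[0]
print(alias.name, alias.asname)            # subprocess sp

# The Call node only knows the alias that was written in the source
call = call_stmt.value                     # ast.Expr wraps the ast.Call
print(call.func.value.id, call.func.attr)  # sp call
```

Resolving sp back to subprocess is therefore the visitor's job. Name-based resolution like this stays approximate: it does not follow reassignments such as f = eval or lookups through getattr().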
Cyclomatic complexity calculator
Cyclomatic complexity counts the number of linearly independent paths through code. Starting from a base of 1, each branching construct adds 1, and each extra operand in an and/or chain adds 1:
import ast


class ComplexityVisitor(ast.NodeVisitor):
    # Each of these node types adds one decision point. Counting With and
    # Assert is a convention choice; some tools leave them out.
    BRANCHING_NODES = (
        ast.If, ast.For, ast.While, ast.ExceptHandler,
        ast.With, ast.Assert,
    )

    def __init__(self):
        self.functions = {}
        self._current = None
        self._complexity = 0

    def visit_FunctionDef(self, node):
        # Save the enclosing function's state so nested functions are scored separately
        old_name, old_complexity = self._current, self._complexity
        self._current = node.name
        self._complexity = 1  # Base complexity
        self.generic_visit(node)
        self.functions[self._current] = self._complexity
        self._current, self._complexity = old_name, old_complexity

    visit_AsyncFunctionDef = visit_FunctionDef

    def generic_visit(self, node):
        if isinstance(node, self.BRANCHING_NODES):
            self._complexity += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" has three values and adds two paths
            self._complexity += len(node.values) - 1
        super().generic_visit(node)


def calculate_complexity(source):
    tree = ast.parse(source)
    visitor = ComplexityVisitor()
    visitor.visit(tree)
    return visitor.functions
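A worked count may help make the arithmetic concrete. The sketch below is a deliberately simplified variant, not the visitor above: it scores a whole snippet with ast.walk() and counts only if, for, while, except, and boolean operators. The names DECISION_NODES, snippet_complexity, and classify are illustrative.

```python
import ast

DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler)


def snippet_complexity(source):
    """Return 1 plus the number of decision points found in source."""
    score = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, DECISION_NODES):
            score += 1
        elif isinstance(node, ast.BoolOp):
            # n operands joined by and/or contribute n - 1 extra paths
            score += len(node.values) - 1
    return score


example = '''
def classify(n):
    if n < 0 and n != -1:
        return "negative"
    for i in range(n):
        if i % 2:
            return "has odd"
    return "other"
'''

# base 1 + first if 1 + and 1 + for 1 + second if 1 = 5
score = snippet_complexity(example)
print(score)  # 5
```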
Source-to-source transformation
A practical example: automatically adding type-checking assertions to function parameters:
import ast


class AddTypeChecks(ast.NodeTransformer):
    def visit_FunctionDef(self, node):
        self.generic_visit(node)
        checks = []
        for arg in node.args.args:
            # Only plain class annotations (str, int, pkg.Type) work with
            # isinstance(); subscripted or string annotations such as
            # list[int] are skipped because isinstance() rejects them
            if isinstance(arg.annotation, (ast.Name, ast.Attribute)):
                type_src = ast.unparse(arg.annotation)
                check = ast.parse(
                    f"if not isinstance({arg.arg}, {type_src}):"
                    f" raise TypeError('{arg.arg} must be {type_src}')"
                ).body[0]
                checks.append(check)
        node.body = checks + node.body
        ast.fix_missing_locations(node)
        return node


# Usage
source = '''
def greet(name: str, times: int):
    for _ in range(times):
        print(f"Hello, {name}!")
'''
tree = ast.parse(source)
transformed = AddTypeChecks().visit(tree)
print(ast.unparse(transformed))
ast.unparse() — from tree back to source
Python 3.9 added ast.unparse() which converts an AST back to valid Python source code. The output is semantically equivalent but may not match the original formatting:
import ast
source = "x = 2 + 3 # comment"
tree = ast.parse(source)
print(ast.unparse(tree)) # "x = 2 + 3" (no comment, normalized spacing)
Comments and whitespace are lost. For formatting-preserving transformations, use libcst instead.
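If only the comments matter, the standard-library tokenize module still sees them, because comments are dropped after tokenization. A minimal sketch (extract_comments is a hypothetical helper name):

```python
import io
import tokenize


def extract_comments(source):
    """Return (line number, comment text) pairs found in source."""
    comments = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            comments.append((tok.start[0], tok.string))
    return comments


found = extract_comments("x = 2 + 3  # comment\n")
print(found)  # [(1, '# comment')]
```

This recovers the comment text and position, but reattaching comments to a rewritten tree is a separate problem, which is where a CST-based tool is the better fit.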
Performance considerations
Parsing with ast.parse() is fast for individual files (typically on the order of milliseconds for a 1000-line file, depending on the machine and the code). For large codebases:
- Parse files in parallel using concurrent.futures (a process pool suits this CPU-bound work better than threads)
- Cache parsed ASTs if you need multiple analysis passes (AST nodes are picklable)
- Use ast.walk() instead of NodeVisitor for simple pattern matching; it avoids method dispatch overhead
- For very large files, consider incremental parsing with tree-sitter-python (third-party)
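The first and third suggestions can be combined in a short sketch. The function names count_functions and analyze_paths, and the executor_cls parameter, are assumptions made for this example rather than an established API.

```python
import ast
from concurrent.futures import ProcessPoolExecutor


def count_functions(path):
    """Parse one file and return (path, number of function definitions)."""
    with open(path, encoding="utf-8") as f:
        tree = ast.parse(f.read(), filename=path)
    # ast.walk() is enough here: no per-node-type dispatch is needed
    total = sum(
        isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        for node in ast.walk(tree)
    )
    return path, total


def analyze_paths(paths, executor_cls=ProcessPoolExecutor):
    """Run count_functions over paths with the given executor class."""
    with executor_cls() as executor:
        return dict(executor.map(count_functions, paths))
```

Each worker returns a small summary instead of the tree itself, which keeps the data sent between processes small. On platforms that spawn worker processes, call analyze_paths() from under an if __name__ == "__main__": guard.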
Pattern matching with ast (Python 3.10+)
Structural pattern matching works beautifully with AST nodes:
import ast


def find_string_concats(tree):
    """Find string concatenation that could use f-strings."""
    results = []
    for node in ast.walk(tree):
        match node:
            case ast.BinOp(
                left=ast.Constant(value=str()),
                op=ast.Add(),
                right=_
            ):
                results.append((node.lineno, ast.unparse(node)))
            case ast.BinOp(
                left=_,
                op=ast.Add(),
                right=ast.Constant(value=str())
            ):
                results.append((node.lineno, ast.unparse(node)))
    return results
Comparison with alternatives
| Tool | Use case | Preserves formatting | Performance |
|---|---|---|---|
| ast (stdlib) | Analysis, simple transforms | No | Fast |
| libcst | Formatting-preserving transforms | Yes | Moderate |
| tree-sitter | Incremental parsing, multi-language | Yes (CST) | Very fast |
| parso | Error-tolerant parsing (Jedi) | Yes | Moderate |
| tokenize | Token-level analysis | Yes (tokens) | Fast |
Edge cases and gotchas
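f-strings before Python 3.12: an f-string was tokenized as a single string token and its embedded expressions were parsed in a separate pass. The AST still contained a JoinedStr node with a mix of Constant and FormattedValue children, but source positions for nodes inside the f-string could be imprecise. Python 3.12 moved f-string parsing into the main grammar (PEP 701), which improved this.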
Type comments: pass type_comments=True to ast.parse() to capture the # type: comments used for Python 2-compatible type annotations; they are stored in the type_comment attribute of the affected nodes. Without this flag, they are silently ignored.
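Compile and exec round-trip: You can compile() a modified AST and exec() it. Call ast.fix_missing_locations() first if the transformation added nodes, since compile() needs line information on every node. The filename you pass to compile() is what tracebacks report, so a placeholder such as "<string>" hides where the code came from; pass the original path, as in compile(tree, "original.py", "exec"), to preserve filename attribution.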
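A short sketch of the round-trip shows the filename surfacing in a traceback. The path original.py and the function fail are placeholders, and the transformation step is left out.

```python
import ast
import traceback

source = "def fail():\n    raise ValueError('boom')\n"
tree = ast.parse(source, filename="original.py")

# A NodeTransformer pass would go here; new nodes need location info
ast.fix_missing_locations(tree)

code = compile(tree, filename="original.py", mode="exec")
namespace = {}
exec(code, namespace)

try:
    namespace["fail"]()
except ValueError:
    tb_text = traceback.format_exc()

print(tb_text)  # the frame for fail() is attributed to original.py, line 2
```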
The one thing to remember: The ast module’s real power emerges when you combine NodeVisitor for analysis and NodeTransformer for rewriting — together they let you build linters, security scanners, and refactoring tools that work on code structure rather than fragile text patterns.