Source-to-Source Transformers — Deep Dive

CST vs AST: The Technical Difference

A Concrete Syntax Tree preserves every token from the source, including whitespace, comments, parentheses, and formatting choices. An Abstract Syntax Tree discards these, keeping only semantic information.

# Source code:
result = (
    first_value +   # add them
    second_value
)

AST representation (ast module): Assign(targets=[Name('result')], value=BinOp(Name('first_value'), Add(), Name('second_value'))). The parentheses, the comment, and the original whitespace are gone.
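A quick way to see the loss is to parse the snippet and regenerate it with the standard library alone. This is a minimal sketch that assumes Python 3.9 or later, where ast.unparse is available:

```python
import ast

SOURCE = """\
result = (
    first_value +   # add them
    second_value
)
"""

tree = ast.parse(SOURCE)

# Regenerate source text from the tree; only the semantics survive.
regenerated = ast.unparse(tree)
print(regenerated)
```

The regenerated text is a single line with no comment, which is why an AST round trip produces noisy diffs in a codemod.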

CST representation (libcst): Every space, the comment, the parentheses, and the line breaks are represented as nodes in the tree. When you modify first_value to x, everything else stays exactly as written.

This is why CST tools are essential for codemods — you want minimal diffs that only show the intended change.

libcst Architecture

libcst represents Python source as an immutable tree of typed nodes. Each node has children that are either other nodes, sequences, or sentinel values. Key concepts:

Nodes are immutable. You cannot modify a node in place. Instead, use .with_changes() to create a new node with updated attributes:

import libcst as cst

name = cst.Name("old")
new_name = name.with_changes(value="new")

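Visitors and Transformers. CSTVisitor walks the tree without modifying it. CSTTransformer walks and replaces nodes. Both provide visit_* (entering a node) and leave_* (exiting a node) methods. In a transformer, leave_* receives the original node and an updated copy whose children have already been transformed, and whatever it returns takes the node's place: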

class RemovePrintStatements(cst.CSTTransformer):
    def leave_Expr(self, original, updated):
        if isinstance(updated.value, cst.Call):
            if isinstance(updated.value.func, cst.Name):
                if updated.value.func.value == "print":
                    return cst.RemoveFromParent()
        return updated

RemoveFromParent() is a special sentinel that tells the parent node to remove this child entirely.

Matchers: Declarative Pattern Matching

libcst’s matchers module provides a powerful pattern-matching DSL:

import libcst as cst
import libcst.matchers as m

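class UpgradeFormatStrings(cst.CSTTransformer):
    """Convert 'hello %s' % name  →  f'hello {name}'"""

    def leave_BinaryOperation(self, original, updated):
        if m.matches(updated, m.BinaryOperation(
            left=m.SimpleString(),
            operator=m.Modulo(),
            right=m.Name()
        )):
            literal = updated.left
            template = literal.raw_value  # text between the quotes, as written
            # Simple case only: an unprefixed string with exactly one %s,
            # no other % directives, and no braces that would need escaping.
            # Note: '%s' % name unpacks name if it is a tuple; an f-string does not.
            if (literal.prefix == "" and template.count('%') == 1
                    and template.count('%s') == 1
                    and '{' not in template and '}' not in template):
                before, after = template.split('%s')
                return cst.FormattedString(
                    parts=[
                        cst.FormattedStringText(value=before),
                        # A real expression node, not "{name}" pasted into text
                        cst.FormattedStringExpression(expression=updated.right),
                        cst.FormattedStringText(value=after),
                    ],
                    start='f' + literal.quote,
                    end=literal.quote,
                )
        return updated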

Matchers support wildcards (m.DoNotCare()), alternatives (m.OneOf()), negation (m.DoesNotMatch()), and nested patterns. They are more readable than manual isinstance chains and less error-prone.

Metadata Providers: Adding Semantic Information

Raw CST nodes lack semantic context — a Name("open") node could refer to the built-in, a local variable, or an attribute. libcst’s metadata system adds this information through providers:

import libcst as cst
from libcst.metadata import (
    MetadataWrapper, QualifiedNameProvider, ScopeProvider
)

class SafeOpenTransformer(cst.CSTTransformer):
    METADATA_DEPENDENCIES = (QualifiedNameProvider, ScopeProvider)

    def leave_Call(self, original, updated):
        # Metadata is keyed by nodes of the original tree, so look up
        # original.func; updated.func is a rebuilt node with no metadata.
        qualified_names = self.get_metadata(
            QualifiedNameProvider, original.func, set()
        )
        for qn in qualified_names:
            if qn.name == "builtins.open":
                # This is definitely the built-in open()
                # Transform to use pathlib or add encoding param
                pass
        return updated

with open("example.py", encoding="utf-8") as f:
    source = f.read()
tree = cst.parse_module(source)
wrapper = MetadataWrapper(tree)
modified = wrapper.visit(SafeOpenTransformer())

Available providers include:

  • QualifiedNameProvider — Fully qualified names for references
  • ScopeProvider — Variable scope information
  • PositionProvider — Line/column positions
  • ParentNodeProvider — Parent node access
  • ExpressionContextProvider — Whether an expression is Load, Store, or Del

Building a Complete Codemod

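Here is a codemod that migrates Optional[X] to X | None (Python 3.10+ syntax). It covers names brought in with from typing import Optional, including aliased imports; it does not rewrite typing.Optional[X] after a plain import typing, and it leaves the import statement in place: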

import libcst as cst
import libcst.matchers as m
from libcst.metadata import MetadataWrapper

class OptionalToUnion(cst.CSTTransformer):
    # Imports are tracked by hand below, so no metadata providers are declared.

    def __init__(self):
        super().__init__()
        self.typing_imported = False
        self.optional_alias = "Optional"

    def visit_ImportFrom(self, node):
        # Only "from typing import ..." counts; submodules such as typing.re do not.
        if m.matches(node, m.ImportFrom(module=m.Name("typing"))):
            for name in node.names if isinstance(node.names, tuple) else []:
                if isinstance(name, cst.ImportAlias):
                    if m.matches(name.name, m.Name("Optional")):
                        self.typing_imported = True
                        if name.asname:
                            alias = name.asname
                            if isinstance(alias, cst.AsName):
                                if isinstance(alias.name, cst.Name):
                                    self.optional_alias = alias.name.value

    def leave_Subscript(self, original, updated):
        if not self.typing_imported:
            return updated

        if m.matches(updated.value, m.Name(self.optional_alias)):
            # Optional[X] → X | None
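            if len(updated.slice) == 1:
                inner = updated.slice[0].slice
                if not isinstance(inner, cst.Index):
                    return updated  # a slice such as a:b is not a type argument
                inner_type = inner.value
                if isinstance(inner_type, cst.SimpleString):
                    # "Foo" | None fails at runtime, so leave forward references alone
                    return updated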
                return cst.BinaryOperation(
                    left=inner_type,
                    operator=cst.BitOr(
                        whitespace_before=cst.SimpleWhitespace(" "),
                        whitespace_after=cst.SimpleWhitespace(" "),
                    ),
                    right=cst.Name("None"),
                )
        return updated

Large-Scale Codemod Execution

For running codemods across large codebases, libcst provides libcst.codemod:

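import libcst as cst
import libcst.matchers as m
from libcst.codemod import VisitorBasedCodemodCommand
from libcst.codemod.visitors import AddImportsVisitor

class MyCodemod(VisitorBasedCodemodCommand):
    DESCRIPTION = "Migrate deprecated API calls"

    def leave_Call(self, original, updated):
        # Only touch calls to the deprecated function
        # ("old_function" is a placeholder name for this example)
        if m.matches(updated.func, m.Name("old_function")):
            # Schedule "from new_module import new_function" for this file;
            # doing this unconditionally would add the import to every file
            AddImportsVisitor.add_needed_import(
                self.context, "new_module", "new_function"
            )
            return updated.with_changes(func=cst.Name("new_function"))
        return updated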

Run it across a project:

python -m libcst.tool codemod my_codemods.MyCodemod src/

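This handles parallelization, error reporting, and file discovery automatically. The runner resolves my_codemods.MyCodemod against the modules listed in a .libcst.codemod.yaml file, which python -m libcst.tool initialize . creates in the repository root; add the package that contains my_codemods to its modules list.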

Bowler: High-Level Codemod API

Bowler (built on fissix, a fork of lib2to3, rather than on libcst) provides a fluent API for common transformations:

from bowler import Query

(Query("src/")
    .select_function("old_api_call")
    .rename("new_api_call")
    .execute(write=True))

For more complex transformations:

(Query("src/")
    .select_method("Response.json")
    .add_argument("encoding", '"utf-8"')
    .execute(write=True))

Rope: Refactoring Library

Rope is a Python refactoring library that provides IDE-level refactoring operations:

import rope.base.project
import rope.refactor.rename

project = rope.base.project.Project("myproject")
resource = project.get_resource("src/module.py")

# Rename a variable across the entire project
renamer = rope.refactor.rename.Rename(
    project, resource, offset=42  # character offset of the name
)
changes = renamer.get_changes("new_name")
project.do(changes)

Rope understands Python scoping rules, so renames are semantically correct — it only renames the specific variable, not other variables with the same name in different scopes.
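The offset=42 above is a placeholder. In practice the offset usually comes from an editor position, so a small helper is handy. This sketch assumes the offset Rope expects is a 0-based index into the file's text; offset_of is a hypothetical helper, not part of Rope:

```python
def offset_of(source: str, line: int, column: int) -> int:
    """Convert a 1-based line and 0-based column into a 0-based character offset."""
    lines = source.splitlines(keepends=True)
    if not 1 <= line <= len(lines):
        raise ValueError(f"line {line} is out of range")
    # Sum the full length of every preceding line, newline included.
    return sum(len(text) for text in lines[: line - 1]) + column

SOURCE = "import os\n\ncounter = 0\ncounter += 1\n"
offset = offset_of(SOURCE, line=3, column=0)
print(offset, SOURCE[offset:offset + 7])
```

The resulting number can be passed as the offset argument when constructing the Rename object.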

Testing Codemods

Codemods need thorough testing because they modify code at scale:

def test_optional_migration():
    input_code = """
from typing import Optional

def greet(name: Optional[str] = None) -> Optional[str]:
    return f"Hello, {name}" if name else None
"""
    expected = """
from typing import Optional

def greet(name: str | None = None) -> str | None:
    return f"Hello, {name}" if name else None
"""
    tree = cst.parse_module(input_code)
    wrapper = MetadataWrapper(tree)
    modified = wrapper.visit(OptionalToUnion())
    assert modified.code == expected

Test categories:

  • Positive cases — Code that should be transformed
  • Negative cases — Similar-looking code that should NOT be transformed
  • Edge cases — Nested types, aliased imports, star imports
  • Idempotency — Running the codemod twice produces the same result
  • Comment preservation — Comments near transformed code survive

Performance for Large Codebases

Parsing every file with libcst is slower than a simple grep. For codebases with thousands of files, filter first:

# Find candidate files with grep, then run codemod only on matches
grep -rl "Optional\[" src/ | xargs python -m libcst.tool codemod ...
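The same prefilter can be written in Python, which sidesteps shell quoting and file names containing spaces. A minimal sketch using only the standard library; candidate_files and its default pattern are illustrative choices:

```python
import re
from pathlib import Path

def candidate_files(root, pattern=r"\bOptional\["):
    """Yield .py files under root whose text matches pattern."""
    regex = re.compile(pattern)
    for path in sorted(Path(root).rglob("*.py")):
        try:
            text = path.read_text(encoding="utf-8")
        except (OSError, UnicodeDecodeError):
            continue   # unreadable files are skipped, not treated as fatal
        if regex.search(text):
            yield path
```

Only the files this yields need to be parsed, so the cheap regex scan pays for itself whenever most of the codebase does not mention the construct being migrated.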

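libcst’s codemod runner processes files in parallel, and --jobs sets the number of worker processes. Because each file is transformed independently, wall-clock time falls roughly in proportion to the number of cores: on a 10,000-file codebase, a run that takes 30 minutes in a single process can plausibly finish in under 5 minutes on eight cores.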

Real-World Codemod Tools

pyupgrade — Automatically upgrades Python syntax to newer versions (f-strings, type union syntax, modern super() calls).

django-upgrade — Migrates Django code to newer Django versions.

com2ann — Converts Python 2-style type comments to Python 3 annotations.

autoflake — Removes unused imports and variables.

flynt — Converts old-style string formatting to f-strings.

All of these are source-to-source transformers that parse, modify, and rewrite Python source while preserving formatting and intent.

One thing to remember: Source-to-source transformers powered by CST libraries like libcst let you make precise, large-scale code modifications that respect formatting, comments, and language structure. They are the industrial tool for codebase migrations, API updates, and systematic refactoring, turning weeks of manual work into automated, testable, repeatable transformations.

