Python Mypyc Compilation — Deep Dive

Master mypyc's IR generation, native class compilation, interop boundaries, incremental builds, and real-world optimization strategies used by mypy itself.

Compilation architecture

Mypyc’s compilation pipeline has four stages:

Stage 1: Mypy type analysis

Mypyc runs mypy’s full type-checking pass on your code. This produces a type-annotated AST where every expression has a known (or inferred) type. If mypy finds type errors, compilation fails — your code must be type-correct.

Stage 2: IR generation

The typed AST is lowered into mypyc’s intermediate representation (IR). This IR uses:

Registers — typed temporary values (like CPU registers but at a higher level).
Operations — primitive ops like int_add, list_get, call_native.
Blocks — basic blocks connected by branches, forming a control flow graph.

# Source
def sum_list(nums: list[int]) -> int:
    total: int = 0
    for n in nums:
        total += n
    return total

# Simplified IR
block_entry:
    r0 = 0              # total: int
    r1 = get_iter(nums) # iterator
    goto block_loop

block_loop:
    r2 = next(r1)       # StopIteration check
    if exhausted goto block_return
    r0 = int_add(r0, r2)  # native int addition
    goto block_loop

block_return:
    return box(r0)       # convert native int back to PyObject

The int_add operation compiles to a single machine instruction plus an overflow check — no PyNumber_Add dispatch, no PyObject allocation for intermediate results.

Stage 3: C code generation

The IR is translated to C code that uses CPython’s C API for Python interop but native C operations where types are known:

static PyObject *sum_list_impl(PyObject *nums) {
    CPyTagged total = 0;  // Native tagged integer
    PyObject *iter = PyObject_GetIter(nums);
    
    while (1) {
        PyObject *item = PyIter_Next(iter);
        if (item == NULL) break;
        
        CPyTagged n = CPyTagged_FromObject(item);
        Py_DECREF(item);
        total = CPyTagged_Add(total, n);  // Fast path for small ints
    }
    
    Py_DECREF(iter);
    return CPyTagged_AsObject(total);
}

Stage 4: C compilation

Standard C compiler (GCC/Clang/MSVC) compiles the generated C into a shared library.

Native classes

When mypyc compiles a class, it replaces the Python __dict__-based attribute storage with a C struct:

# Source
class Point:
    def __init__(self, x: float, y: float) -> None:
        self.x = x
        self.y = y

    def distance(self, other: 'Point') -> float:
        dx = self.x - other.x
        dy = self.y - other.y
        return (dx * dx + dy * dy) ** 0.5

Compiled, Point instances use a fixed C struct:

typedef struct {
    PyObject_HEAD
    double x;  // Direct double, not PyObject*
    double y;
} PointObject;

Accessing self.x compiles to a direct struct field access — a single memory load — instead of a dictionary lookup. This makes attribute-heavy code significantly faster.

The tradeoff: you cannot dynamically add attributes to compiled class instances. point.z = 3 raises AttributeError. This is similar to __slots__ but enforced at the C level.

Tagged integers

Mypyc uses a clever representation called tagged integers for int values:

Small integers (fits in 63 bits on 64-bit systems) are stored as C long values with the low bit set to 0.
Large integers (arbitrary precision) are stored as regular PyObject* pointers with the low bit set to 1.

// Tagged integer representation
typedef intptr_t CPyTagged;

// Small int: value << 1 (low bit = 0)
// Big int: PyObject* | 1 (low bit = 1)

static inline CPyTagged CPyTagged_Add(CPyTagged a, CPyTagged b) {
    if (likely(CPyTagged_IsShort(a) && CPyTagged_IsShort(b))) {
        // Fast path: native addition with overflow check
        CPyTagged result;
        if (likely(!__builtin_add_overflow(a, b, &result))) {
            return result;
        }
    }
    // Slow path: fall back to PyLong operations
    return CPyTagged_Add_slow(a, b);
}

This gives near-C-speed arithmetic for typical integer values while correctly handling Python’s unlimited-precision integers when needed.

Interop boundaries

Compiled and interpreted code can call each other, but there is a cost at the boundary:

# compiled_module.py (compiled with mypyc)
def fast_compute(data: list[int]) -> int:
    # Runs at native speed
    return sum(x * x for x in data)

# main.py (regular Python)
from compiled_module import fast_compute
result = fast_compute([1, 2, 3, 4, 5])  # Boundary crossing

At the boundary, mypyc generates wrapper functions that:

Convert Python objects to native representations (unboxing).
Call the native compiled function.
Convert the result back to Python objects (boxing).

For hot paths, keep the entire call chain within compiled modules to avoid repeated boxing/unboxing.

Incremental compilation

For large projects, mypyc supports incremental compilation through setuptools:

# setup.py
from mypyc.build import mypycify

# Group related modules for optimal cross-module optimization
setup(
    ext_modules=mypycify(
        [
            # Group 1: core (compiled together, can cross-optimize)
            ["mypackage/core.py", "mypackage/types.py"],
            # Group 2: utils (separate compilation unit)
            ["mypackage/utils.py"],
            # Group 3: API (depends on core, but compiled independently)
            ["mypackage/api.py"],
        ],
        opt_level="3",       # Maximum optimization
        multi_file=True,     # Separate .c files per module (faster incremental)
    ),
)

Modules in the same group can call each other with direct C function calls (no wrapper overhead). Modules in different groups go through the Python-level wrapper.

Real-world case study: mypy itself

Mypy uses mypyc to compile itself — the type checker is compiled with its own compiler. Results from the mypy project:

4x faster overall type-checking speed.
Critical path speedup: type analysis loops that were 10x slower in pure Python.
~40 modules compiled out of ~200 total (only performance-critical ones).
No code changes beyond ensuring type annotations were complete and correct.

The mypy team’s strategy:

Profile to find hot modules.
Ensure those modules pass mypy --strict.
Add them to the mypyc build list.
Benchmark before/after.
Iterate.

Optimization techniques

Technique 1: Final declarations

from typing import Final

MAX_RETRIES: Final = 3  # Compiled as a C constant
BUFFER_SIZE: Final = 8192

class Config:
    MAX_WORKERS: Final = 4  # Class constant, not instance attribute

Final values are inlined at compile time — no dictionary lookup, no attribute access.

Technique 2: Precise container types

# Slower — mypyc treats as generic dict
data: dict = {"a": 1}

# Faster — mypyc knows key and value types
data: dict[str, int] = {"a": 1}

Technique 3: Avoid dynamic patterns in compiled modules

# Bad for mypyc — dynamic attribute access
value = getattr(obj, attr_name)

# Good for mypyc — direct attribute access
value = obj.known_attribute

# Bad for mypyc — **kwargs
def func(**kwargs: Any) -> None: ...

# Good for mypyc — explicit parameters
def func(x: int, y: int, z: int = 0) -> None: ...

Technique 4: Use native operations

# Slower — generic Python built-in
total = sum(numbers)

# Faster in mypyc — explicit loop with typed accumulator
total: int = 0
for n in numbers:
    total += n

Mypyc can optimize the explicit loop with native integer operations, while sum() is a call to a Python built-in that returns a generic PyObject.

Testing compiled modules

Compiled modules should produce identical results to interpreted ones. Test both:

# conftest.py
import importlib
import pytest

@pytest.fixture(params=["compiled", "interpreted"])
def module_mode(request, monkeypatch):
    if request.param == "interpreted":
        # Force Python fallback by removing the .so
        import mypackage.core as mod
        monkeypatch.setattr(mod, '__file__', mod.__file__)
        # Or: test against a pure-Python copy
    return request.param

In CI, run tests both with and without compilation:

- name: Test interpreted
  run: pytest tests/
  
- name: Build compiled
  run: pip install -e .  # triggers mypyc
  
- name: Test compiled
  run: pytest tests/

Limitations in practice

No __init_subclass__ or __set_name__ in compiled classes.
No class decorators that modify the class significantly.
Generator functions compile but with limited optimization.
Closures work but capture variables by reference, not value (same as Python).
Error messages in compiled code may show C-level details in tracebacks.

For most projects, the practical approach is selective compilation — compile the 20% of modules responsible for 80% of runtime, leave the rest as regular Python.

One thing to remember: Mypyc’s magic is turning Python’s type system from a static analysis tool into a compilation strategy — tagged integers, native structs, and direct C calls emerge naturally from the type annotations you already write, delivering real speedups where it matters most.

pythonmypyccompilationtype-hintsperformance