Python Basics — Deep Dive

Under the hood: how CPython actually executes your code, what bytecode is, the GIL, and why Python is slow — but also why that often doesn't matter.

CPython: The Python Most People Use

When someone says “Python,” they almost always mean CPython — the reference implementation written in C, maintained at python.org. There are other implementations (PyPy, Jython, MicroPython), but CPython is the standard.

Understanding CPython’s execution model explains both Python’s limitations and why so many performance workarounds exist.

From Source Code to Execution

When you run python3 myfile.py, three things happen:

1. Lexing and Parsing

CPython reads your .py file and tokenizes it — breaking the source text into tokens (keywords, identifiers, operators, literals). These tokens are parsed into an Abstract Syntax Tree (AST), a tree structure representing the grammatical structure of your code.
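The tokenization step can be observed from Python with the standard tokenize module. This is a minimal sketch: tokenize is a Python-level view of the token stream and is not guaranteed to be the exact code path the interpreter uses internally. The source string is just an illustration.

```python
import io
import tokenize

source = "x = 1 + 2\n"

# generate_tokens expects a readline callable that yields str lines.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

for tok in tokens:
    # tok_name maps the numeric token type to a readable name such as NAME or OP.
    print(tokenize.tok_name[tok.type], repr(tok.string))

token_strings = [tok.string for tok in tokens]
```

The stream ends with NEWLINE and ENDMARKER tokens, which mark the end of the logical line and of the input.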

You can see the AST yourself:

import ast
source = "x = 1 + 2"
tree = ast.parse(source)
print(ast.dump(tree, indent=2))

This produces a tree showing that the source is an assignment where the target is x and the value is a BinOp (binary operation) adding 1 and 2.

2. Compilation to Bytecode

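The AST is compiled to bytecode, a set of instructions for the Python virtual machine (PVM). These are not machine instructions; they’re abstract operations like LOAD_FAST, BINARY_ADD, and CALL_FUNCTION. The exact opcode names vary by CPython version: since 3.11, for example, the arithmetic opcodes such as BINARY_ADD have been folded into a generic BINARY_OP, and CALL_FUNCTION has been replaced by CALL.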

import dis
def add(a, b):
    return a + b

dis.dis(add)

Output (simplified, as printed by CPython 3.10 and earlier; newer versions print different opcode names):

  2           0 LOAD_FAST                0 (a)
              2 LOAD_FAST                1 (b)
              4 BINARY_ADD
              6 RETURN_VALUE
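Because opcode names differ between CPython releases, it can help to list them programmatically for whichever interpreter you are running. The sketch below uses dis.get_instructions; the add function mirrors the one above, and the names printed depend on your version.

```python
import dis

def add(a, b):
    return a + b

# get_instructions yields one Instruction object per bytecode operation.
opnames = [ins.opname for ins in dis.get_instructions(add)]
print(opnames)
```

On older versions the list contains BINARY_ADD; on newer ones, BINARY_OP, possibly alongside other instructions that did not exist before.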

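Bytecode for imported modules is cached in .pyc files (inside __pycache__/). If the source file hasn’t changed, Python skips recompilation on later imports. The script you run directly is compiled in memory on every run and is not cached.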
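To make the compile and cache steps concrete, here is a small sketch. It compiles a string to a code object with the built-in compile(), then asks importlib where the cached bytecode for a module file would live. The file name myfile.py is only an example, and no file is read or written.

```python
import importlib.util

source = "x = 1 + 2"

# compile() runs the parse and compile steps and returns a code object.
code = compile(source, "<example>", "exec")
print(type(code).__name__)   # code
print(type(code.co_code))    # the raw bytecode is a bytes object

# Where CPython would cache the bytecode if myfile.py were imported as a module.
cache_path = importlib.util.cache_from_source("myfile.py")
print(cache_path)
```

The cache file name includes the interpreter tag (for example, a cpython-3xx component), so different Python versions do not overwrite each other’s caches.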

3. Execution by the PVM

The Python Virtual Machine is a loop that reads bytecode instructions and executes them. It’s a stack machine — most operations push to or pop from an evaluation stack.

For a + b:

LOAD_FAST 0 pushes a onto the stack
LOAD_FAST 1 pushes b onto the stack
BINARY_ADD pops both, adds them, pushes the result
RETURN_VALUE pops the result and returns it
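The four steps above can be mimicked with a toy interpreter. This is an illustrative model only: the run function and its tuple-based instruction format are invented for this example and bear no relation to how CPython’s C evaluation loop is written.

```python
def run(instructions, local_vars):
    """Interpret a tiny, made-up subset of stack instructions."""
    stack = []
    for opname, arg in instructions:
        if opname == "LOAD_FAST":
            stack.append(local_vars[arg])      # push a local variable
        elif opname == "BINARY_ADD":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)         # pop two values, push the sum
        elif opname == "RETURN_VALUE":
            return stack.pop()                 # pop the result and return it
        else:
            raise ValueError(f"unknown instruction: {opname}")
    raise RuntimeError("program ended without RETURN_VALUE")

program = [
    ("LOAD_FAST", 0),
    ("LOAD_FAST", 1),
    ("BINARY_ADD", None),
    ("RETURN_VALUE", None),
]

result = run(program, [2, 3])
print(result)  # 5
```

Even in this toy version, one addition costs several branches, list operations, and a dynamically dispatched +, which hints at where the real interpreter’s overhead comes from.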

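This interpretation loop is a large part of why Python is slow. Each bytecode instruction goes through dispatch, dynamic type checks, and reference count updates, and often allocates a new object for its result. A simple integer addition that compiles to a single machine instruction in C might cost on the order of 50-100 cycles in CPython, depending on the version and the workload.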

The Global Interpreter Lock (GIL)

The GIL is CPython’s most discussed limitation. It’s a mutex that prevents more than one thread from executing Python bytecode at a time.

Why It Exists

Python uses reference counting for memory management. Every object has a counter tracking how many references point to it. When the count hits zero, the object is freed.
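Reference counts can be observed with sys.getrefcount. The sketch below is specific to CPython, and the absolute numbers are an implementation detail; what matters is how the count changes as references are added and removed.

```python
import sys

obj = object()

# getrefcount reports one extra reference: the temporary one held by its argument.
before = sys.getrefcount(obj)

alias = obj                      # a second name bound to the same object
after_alias = sys.getrefcount(obj)

del alias                        # drop that reference again
after_del = sys.getrefcount(obj)

print(before, after_alias, after_del)
```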

Reference count updates are not atomic: two threads changing the same count at once can race and corrupt it, which leaks memory or frees an object that is still in use. Rather than adding fine-grained locking (expensive and complex), Guido added one big lock: the GIL.

The Consequence

Python threads cannot run Python code in parallel on multiple CPU cores. A CPU-bound Python program using 4 threads will not go 4x faster — it might even be slower due to lock contention.

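import threading
import time

def count():
    for _ in range(100_000_000):
        pass

# Run the same work on two threads and time the whole thing.
threads = [threading.Thread(target=count) for _ in range(2)]
start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"two threads: {time.perf_counter() - start:.1f}s")

# Illustrative timings: single thread ~4 s, two threads ~5-6 s (worse, not better).
# Exact numbers depend on the machine and the Python version.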

The Workaround

The GIL only guards the execution of Python bytecode. When Python calls into a C extension (like NumPy), the extension can release the GIL and run in parallel. CPython itself also releases the GIL around blocking I/O calls.

For true CPU parallelism in Python, use multiprocessing instead of threading — each process gets its own Python interpreter with its own GIL.

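from multiprocessing import Pool

def compute(n):
    return sum(range(n))

# The guard is required where workers are spawned (the default on Windows and macOS).
if __name__ == "__main__":
    with Pool(4) as p:
        results = p.map(compute, [10**7] * 4)
    print(results)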

Python 3.13 note: CPython 3.13 (released October 2024) adds an experimental “free-threaded” build (PEP 703) that can run without the GIL. It is a separate build, not the default, and the trajectory points toward the GIL becoming optional, though the timeline is not settled.
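If you are unsure which kind of interpreter you are running, the following sketch checks both the build type and the runtime state. It assumes CPython: Py_GIL_DISABLED is a build configuration variable, and sys._is_gil_enabled() is an underscore-prefixed function available from 3.13 on, so the code falls back to "GIL enabled" when it is missing.

```python
import sys
import sysconfig

# Py_GIL_DISABLED is 1 in free-threaded builds and unset or 0 otherwise.
free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

# sys._is_gil_enabled() exists from 3.13 on; older versions always have a GIL.
gil_check = getattr(sys, "_is_gil_enabled", None)
gil_enabled = gil_check() if gil_check is not None else True

print(f"free-threaded build: {free_threaded_build}, GIL enabled: {gil_enabled}")
```

The two values can differ: a free-threaded build may still run with the GIL turned on at runtime.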

Object Model: Everything Is an Object

In Python, everything is an object — integers, strings, functions, classes, modules, None. Every object has:

Type (type(x))
Identity (id(x), which in CPython is the object’s memory address)
Value
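A quick way to see this uniformity is to ask very different things for the same three properties. The greet function below is a placeholder defined only for this example.

```python
def greet():
    return "hello"

samples = (42, "text", greet, int, None)

for obj in samples:
    # Every value, including functions and classes, has a type and an identity.
    print(type(obj).__name__, id(obj), repr(obj))

type_names = [type(obj).__name__ for obj in samples]
```

The id values differ from run to run; the type names do not.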

This uniformity makes Python flexible but adds overhead. A Python integer is not a CPU integer — it’s a heap-allocated struct containing a reference count, a pointer to the type, and the actual value. Small integers (-5 to 256) are cached as singletons, which is why a = 256; b = 256; a is b is True, but a = 257; b = 257; a is b may be False.

a = 256
b = 256
print(a is b)   # True — same cached object

a = 257
b = 257
print(a is b)   # False when entered line by line in the REPL; may be True in a
                # script, where the compiler reuses the constant. Use == for values.
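The per-object overhead described above can be measured with sys.getsizeof. This sketch assumes CPython; the exact byte counts vary by version and platform, so none are shown here. For comparison, a 64-bit integer in C occupies 8 bytes.

```python
import sys

small = 1
large = 10**100

# getsizeof reports the size of the object itself, in bytes.
small_size = sys.getsizeof(small)
large_size = sys.getsizeof(large)

print(small_size, large_size)
```

Python integers have arbitrary precision, so the larger value needs more storage for its digits on top of the fixed object header.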

Memory Management

CPython uses two mechanisms:

Reference counting: fast and deterministic, with immediate cleanup when the count hits zero.

Cyclic garbage collector: handles reference cycles (object A points to B and B points to A, so their counts never hit zero). The collector runs automatically when allocation counts cross configurable thresholds, and the gc module lets you inspect and control it.
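The difference between the two mechanisms shows up in a small experiment. The Node class is hypothetical and exists only to build a cycle; a weak reference is used to check whether the object is still alive without keeping it alive. The sketch assumes CPython’s reference-counting behavior.

```python
import gc
import weakref

class Node:
    def __init__(self):
        self.other = None

gc.disable()                 # keep the automatic collector out of the demo

a, b = Node(), Node()
a.other, b.other = b, a      # a and b now reference each other
probe = weakref.ref(a)       # observes a without adding a strong reference

del a, b
alive_before = probe() is not None   # the cycle keeps both objects alive

gc.collect()                 # the cyclic collector finds and frees the pair
alive_after = probe() is not None

gc.enable()
print(alive_before, alive_after)
```

Reference counting alone never frees the pair, because each object still holds a reference to the other after both names are deleted.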

You can inspect and control the GC:

import gc
gc.collect()        # manually trigger
gc.disable()        # turn off (be careful)
print(gc.get_count())  # (gen0, gen1, gen2) counts

Python Startup Overhead

Starting a Python process isn’t free. A bare python3 -c "pass" typically takes tens of milliseconds on a modern machine, depending on hardware, operating system, and installed site packages: the interpreter initializes its runtime, sets up the memory allocator, imports a set of core modules, and runs startup code.
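You can measure this on your own machine. The sketch below launches the current interpreter several times and reports the fastest run; the startup_seconds helper is made up for this example, and the figure includes operating-system process creation, not only interpreter initialization.

```python
import subprocess
import sys
import time

def startup_seconds(runs=5):
    """Return the fastest wall-clock time to start and exit the interpreter."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run([sys.executable, "-c", "pass"], check=True)
        timings.append(time.perf_counter() - start)
    return min(timings)

best = startup_seconds()
print(f"best of 5: {best * 1000:.1f} ms")
```

Taking the minimum rather than the mean filters out runs that were slowed by unrelated system activity.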

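This matters for short-lived scripts (running Python thousands of times in a loop in a shell script is slow). Solutions: keep the process alive (server model), batch the work into a single invocation, trim startup imports (python -X importtime shows where import time goes), or use a different tool entirely. PyPy is not a fix here: its JIT pays off in long-running programs, and its startup is generally no faster than CPython’s.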

When Python’s Speed Doesn’t Matter

For I/O-bound work (reading files, making HTTP requests, querying databases), Python is perfectly fast. Your program spends most of its time waiting for network or disk; the interpreter overhead is noise.
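Waiting is also where threads work well despite the GIL, because a blocked thread releases it. In the sketch below, time.sleep stands in for a network or disk wait, and fake_request is a hypothetical placeholder for a real I/O call.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_request(delay):
    # time.sleep simulates waiting on I/O; the GIL is released while sleeping.
    time.sleep(delay)
    return delay

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fake_request, [0.2] * 4))
elapsed = time.perf_counter() - start

# The four 0.2 s waits overlap, so the total is close to 0.2 s rather than 0.8 s.
print(f"{elapsed:.2f}s for {len(results)} simulated requests")
```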

The benchmarks that show “Python is 100x slower than C” are measuring CPU-bound operations in tight loops. Most real applications aren’t doing that.

And when they are: call into NumPy (C), use Cython (compiles Python-like code to C), or write a C extension. The ecosystem makes dropping down to C easy when you need it.
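The same idea can be seen without any third-party package. As a stand-in for NumPy or Cython, this sketch compares a hand-written loop with the built-in sum, whose loop runs in C. The function names are invented for the example, and timings vary by machine, so none are quoted.

```python
import timeit

def python_loop(n):
    total = 0
    for i in range(n):      # every iteration executes several bytecode instructions
        total += i
    return total

def c_loop(n):
    return sum(range(n))    # the loop runs inside the C implementation of sum

n = 100_000
loop_time = timeit.timeit(lambda: python_loop(n), number=20)
builtin_time = timeit.timeit(lambda: c_loop(n), number=20)

print(f"Python loop: {loop_time:.3f}s, built-in sum: {builtin_time:.3f}s")
```

Both functions return the same value; the built-in version is typically faster because the per-iteration interpreter overhead disappears.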

One Thing to Remember

CPython compiles your code to bytecode and runs it in a virtual machine loop. That design is why it’s slow on pure computation, and once you understand the tradeoffs being made, the GIL, the object model, and the rich C-extension ecosystem all make sense.
