Python Memory Layout Optimization — Deep Dive
Memory layout optimization in Python operates at the intersection of CPython’s object model and hardware cache architecture. Understanding both lets you make data structure choices that yield 5-50× performance improvements in data-intensive code.
CPython object memory anatomy
The PyObject header
Every CPython object starts with a common header:
// CPython 3.12+ object header
typedef struct {
Py_ssize_t ob_refcnt; // 8 bytes: reference count
PyTypeObject *ob_type; // 8 bytes: pointer to type object
} PyObject;
That’s 16 bytes before any actual data. Variable-size objects add another field:
typedef struct {
PyObject ob_base; // 16 bytes
Py_ssize_t ob_size; // 8 bytes: number of items
} PyVarObject;
Size breakdown of common types
import sys
print(sys.getsizeof(0)) # 28 bytes (int, small)
print(sys.getsizeof(2**30)) # 32 bytes (int, medium)
print(sys.getsizeof(2**60)) # 36 bytes (int, large)
print(sys.getsizeof(1.0)) # 24 bytes (float)
print(sys.getsizeof(True)) # 28 bytes (bool)
print(sys.getsizeof('')) # 49 bytes (empty str)
print(sys.getsizeof('hello')) # 54 bytes (5-char str)
print(sys.getsizeof(())) # 40 bytes (empty tuple)
print(sys.getsizeof([])) # 56 bytes (empty list)
print(sys.getsizeof({})) # 64 bytes (empty dict)
A Python float uses 24 bytes for 8 bytes of actual data — 3× overhead. This matters at scale.
The dict cost
Regular class instances store attributes in a dictionary. Since Python 3.12, CPython uses a “shared keys” optimization where instances of the same class share the keys portion of their dict, but each still has a values array plus the dict header:
class Regular:
def __init__(self, x, y):
self.x = x
self.y = y
class Slotted:
__slots__ = ('x', 'y')
def __init__(self, x, y):
self.x = x
self.y = y
# Measure total including referenced objects
import pympler.asizeof
print(pympler.asizeof.asizeof(Regular(1.0, 2.0))) # ~280 bytes
print(pympler.asizeof.asizeof(Slotted(1.0, 2.0))) # ~104 bytes
Struct-of-Arrays vs Array-of-Structs
This classic data layout decision has dramatic performance implications:
Array-of-Structs (AoS)
Each entity is a struct; you have an array of entities:
# AoS: natural Python approach
particles = [
{'x': 1.0, 'y': 2.0, 'mass': 1.5},
{'x': 3.0, 'y': 4.0, 'mass': 2.0},
# ... millions more
]
# Memory layout: [dict_header, x_ptr, y_ptr, mass_ptr, dict_header, x_ptr, ...]
# Every access to x jumps over y, mass, and dict overhead
Struct-of-Arrays (SoA)
Each attribute is its own array:
import numpy as np
# SoA: NumPy approach
particles_x = np.array([1.0, 3.0, ...], dtype=np.float64)
particles_y = np.array([2.0, 4.0, ...], dtype=np.float64)
particles_mass = np.array([1.5, 2.0, ...], dtype=np.float64)
# Memory layout: [x0, x1, x2, ...] [y0, y1, y2, ...] [m0, m1, m2, ...]
# Iterating over x accesses contiguous memory
Performance comparison
import numpy as np
import time
N = 1_000_000
# AoS: list of dicts
particles_aos = [{'x': float(i), 'y': float(i), 'mass': 1.0} for i in range(N)]
# SoA: separate arrays
x = np.arange(N, dtype=np.float64)
y = np.arange(N, dtype=np.float64)
mass = np.ones(N, dtype=np.float64)
# Operation: compute kinetic energy = 0.5 * mass * (x^2 + y^2)
# AoS
t0 = time.perf_counter()
result_aos = [0.5 * p['mass'] * (p['x']**2 + p['y']**2) for p in particles_aos]
t_aos = time.perf_counter() - t0
# SoA
t0 = time.perf_counter()
result_soa = 0.5 * mass * (x**2 + y**2)
t_soa = time.perf_counter() - t0
# Typical result: SoA is 50-100× faster
The speedup comes from three factors: contiguous memory (cache efficiency), vectorized operations (SIMD), and no interpreter overhead per element.
Python 3.12+ compact dict implementation
CPython 3.6+ uses a compact dictionary layout that preserves insertion order:
Sparse index table: [-, 0, -, -, 1, -, 2, -] (hash → dense index)
Dense entries: [(hash0, key0, val0), (hash1, key1, val1), ...]
This means dict iteration is fast (traverses dense array sequentially) but random key lookup still requires hash computation plus potential collision resolution.
For memory-critical cases, consider alternatives:
from collections import namedtuple
# namedtuple: ~72 bytes per instance (no __dict__)
Point = namedtuple('Point', ['x', 'y'])
p = Point(1.0, 2.0)
# Regular dict: ~232 bytes per instance
d = {'x': 1.0, 'y': 2.0}
Memory-mapped arrays
For datasets larger than RAM:
import numpy as np
# Create memory-mapped file
mmap_array = np.memmap('data.bin', dtype=np.float64,
mode='w+', shape=(100_000_000,))
# Use like a regular array
mmap_array[0:1000] = np.random.randn(1000)
# Only accessed pages are loaded into RAM
# OS handles paging transparently
Memory-mapped files let you work with 100GB datasets on a 16GB machine, with the OS handling page faults. Sequential access patterns work well; random access causes heavy page thrashing.
Buffer protocol and zero-copy
Python’s buffer protocol enables zero-copy data sharing between compatible objects:
import numpy as np
data = np.zeros(1000, dtype=np.float64)
# memoryview: zero-copy slice
view = memoryview(data)
slice_view = view[100:200] # no data copied
# Pass to C extensions without copying
# struct.unpack_from, socket.send, etc. accept buffer protocol objects
Sharing between processes
from multiprocessing import shared_memory
import numpy as np
# Create shared memory block
shm = shared_memory.SharedMemory(create=True, size=8_000_000)
arr = np.ndarray((1_000_000,), dtype=np.float64, buffer=shm.buf)
# In another process: attach to existing block
shm2 = shared_memory.SharedMemory(name=shm.name)
arr2 = np.ndarray((1_000_000,), dtype=np.float64, buffer=shm2.buf)
# arr and arr2 share the same physical memory — no serialization
Cache-line optimization
A CPU cache line is typically 64 bytes. When you access one byte, the entire 64-byte line is loaded.
import numpy as np
# float64 = 8 bytes, so 8 elements per cache line
# Accessing every element: 100% cache utilization
data = np.arange(1_000_000, dtype=np.float64)
total = data.sum() # Sequential access, perfect cache use
# Accessing every 8th element: ~12.5% cache utilization
total = data[::8].sum() # Loads same cache lines, uses 1/8 of data
# Accessing random elements: worst case
indices = np.random.permutation(1_000_000)
total = data[indices].sum() # Every access potentially misses cache
For multi-dimensional arrays, iteration order matters:
# C-contiguous (row-major): rows are contiguous
arr = np.zeros((1000, 1000), dtype=np.float64, order='C')
# FAST: iterate over last axis (contiguous in memory)
row_sums = arr.sum(axis=1)
# SLOW: iterate over first axis (stride 8000 bytes between elements)
col_sums = arr.sum(axis=0)
# Fortran-contiguous (column-major): columns are contiguous
arr_f = np.asfortranarray(arr)
# Now column sums are fast, row sums are slow
Measuring memory layout impact
import tracemalloc
import gc
tracemalloc.start()
# Allocate your data structure
gc.collect()
snapshot1 = tracemalloc.take_snapshot()
data = create_large_dataset()
gc.collect()
snapshot2 = tracemalloc.take_snapshot()
stats = snapshot2.compare_to(snapshot1, 'lineno')
for stat in stats[:10]:
print(stat)
# Also check total
current, peak = tracemalloc.get_traced_memory()
print(f"Current: {current / 1024 / 1024:.1f}MB, Peak: {peak / 1024 / 1024:.1f}MB")
Decision framework
Is your data >100K elements?
├── No → Use whatever's readable (dicts, classes, lists)
└── Yes
├── All same numeric type?
│ ├── Yes → NumPy array (SoA layout)
│ └── No → Structured NumPy array or Arrow table
├── Need per-element attribute access?
│ ├── Yes → @dataclass(slots=True) or namedtuple
│ └── No → Columnar format (SoA)
├── Larger than RAM?
│ ├── Yes → Memory-mapped files or chunked processing
│ └── No → In-memory arrays
└── Shared between processes?
├── Yes → shared_memory + NumPy
└── No → Regular NumPy
The one thing to remember: CPython’s per-object overhead (16-100+ bytes) and pointer-based layout cause cache thrashing at scale — switch to contiguous, typed containers (NumPy, __slots__, struct) for data-intensive workloads where layout determines whether your code is 10× or 100× slower than it needs to be.
See Also
- Python Algorithmic Complexity Understand Algorithmic Complexity through a practical analogy so your Python decisions become faster and clearer.
- Python Async Performance Tuning Making your async Python faster is like organizing a busy restaurant kitchen — it's all about flow.
- Python Benchmark Methodology Why timing Python code once means nothing, and how fair testing works like a science experiment.
- Python C Extension Performance How Python borrows C's speed for the hard parts — like hiring a specialist for the toughest job on the worksite.
- Python Caching Strategies Understand Python caching strategies with a shortcut-road analogy so your app gets faster without taking wrong turns.