Python Memory Layout Optimization — Deep Dive

Memory layout optimization in Python operates at the intersection of CPython’s object model and hardware cache architecture. Understanding both lets you make data structure choices that yield 5-50× performance improvements in data-intensive code.

CPython object memory anatomy

The PyObject header

Every CPython object starts with a common header:

// CPython 3.12+ object header
typedef struct {
    Py_ssize_t ob_refcnt;    // 8 bytes: reference count
    PyTypeObject *ob_type;    // 8 bytes: pointer to type object
} PyObject;

That’s 16 bytes before any actual data. Variable-size objects add another field:

typedef struct {
    PyObject ob_base;         // 16 bytes
    Py_ssize_t ob_size;       // 8 bytes: number of items
} PyVarObject;

Size breakdown of common types

import sys

print(sys.getsizeof(0))          # 28 bytes (int, small)
print(sys.getsizeof(2**30))      # 32 bytes (int, medium)
print(sys.getsizeof(2**60))      # 36 bytes (int, large)
print(sys.getsizeof(1.0))        # 24 bytes (float)
print(sys.getsizeof(True))       # 28 bytes (bool)
print(sys.getsizeof(''))         # 49 bytes (empty str)
print(sys.getsizeof('hello'))    # 54 bytes (5-char str)
print(sys.getsizeof(()))         # 40 bytes (empty tuple)
print(sys.getsizeof([]))         # 56 bytes (empty list)
print(sys.getsizeof({}))         # 64 bytes (empty dict)

A Python float uses 24 bytes for 8 bytes of actual data — 3× overhead. This matters at scale.

The dict cost

Regular class instances store attributes in a dictionary. Since Python 3.12, CPython uses a “shared keys” optimization where instances of the same class share the keys portion of their dict, but each still has a values array plus the dict header:

class Regular:
    def __init__(self, x, y):
        self.x = x
        self.y = y

class Slotted:
    __slots__ = ('x', 'y')
    def __init__(self, x, y):
        self.x = x
        self.y = y

# Measure total including referenced objects
import pympler.asizeof
print(pympler.asizeof.asizeof(Regular(1.0, 2.0)))  # ~280 bytes
print(pympler.asizeof.asizeof(Slotted(1.0, 2.0)))  # ~104 bytes

Struct-of-Arrays vs Array-of-Structs

This classic data layout decision has dramatic performance implications:

Array-of-Structs (AoS)

Each entity is a struct; you have an array of entities:

# AoS: natural Python approach
particles = [
    {'x': 1.0, 'y': 2.0, 'mass': 1.5},
    {'x': 3.0, 'y': 4.0, 'mass': 2.0},
    # ... millions more
]

# Memory layout: [dict_header, x_ptr, y_ptr, mass_ptr, dict_header, x_ptr, ...]
# Every access to x jumps over y, mass, and dict overhead

Struct-of-Arrays (SoA)

Each attribute is its own array:

import numpy as np

# SoA: NumPy approach
particles_x = np.array([1.0, 3.0, ...], dtype=np.float64)
particles_y = np.array([2.0, 4.0, ...], dtype=np.float64)
particles_mass = np.array([1.5, 2.0, ...], dtype=np.float64)

# Memory layout: [x0, x1, x2, ...] [y0, y1, y2, ...] [m0, m1, m2, ...]
# Iterating over x accesses contiguous memory

Performance comparison

import numpy as np
import time

N = 1_000_000

# AoS: list of dicts
particles_aos = [{'x': float(i), 'y': float(i), 'mass': 1.0} for i in range(N)]

# SoA: separate arrays
x = np.arange(N, dtype=np.float64)
y = np.arange(N, dtype=np.float64)
mass = np.ones(N, dtype=np.float64)

# Operation: compute kinetic energy = 0.5 * mass * (x^2 + y^2)
# AoS
t0 = time.perf_counter()
result_aos = [0.5 * p['mass'] * (p['x']**2 + p['y']**2) for p in particles_aos]
t_aos = time.perf_counter() - t0

# SoA
t0 = time.perf_counter()
result_soa = 0.5 * mass * (x**2 + y**2)
t_soa = time.perf_counter() - t0

# Typical result: SoA is 50-100× faster

The speedup comes from three factors: contiguous memory (cache efficiency), vectorized operations (SIMD), and no interpreter overhead per element.

Python 3.12+ compact dict implementation

CPython 3.6+ uses a compact dictionary layout that preserves insertion order:

Sparse index table:  [-, 0, -, -, 1, -, 2, -]  (hash → dense index)
Dense entries:       [(hash0, key0, val0), (hash1, key1, val1), ...]

This means dict iteration is fast (traverses dense array sequentially) but random key lookup still requires hash computation plus potential collision resolution.

For memory-critical cases, consider alternatives:

from collections import namedtuple

# namedtuple: ~72 bytes per instance (no __dict__)
Point = namedtuple('Point', ['x', 'y'])
p = Point(1.0, 2.0)

# Regular dict: ~232 bytes per instance
d = {'x': 1.0, 'y': 2.0}

Memory-mapped arrays

For datasets larger than RAM:

import numpy as np

# Create memory-mapped file
mmap_array = np.memmap('data.bin', dtype=np.float64,
                        mode='w+', shape=(100_000_000,))

# Use like a regular array
mmap_array[0:1000] = np.random.randn(1000)

# Only accessed pages are loaded into RAM
# OS handles paging transparently

Memory-mapped files let you work with 100GB datasets on a 16GB machine, with the OS handling page faults. Sequential access patterns work well; random access causes heavy page thrashing.

Buffer protocol and zero-copy

Python’s buffer protocol enables zero-copy data sharing between compatible objects:

import numpy as np

data = np.zeros(1000, dtype=np.float64)

# memoryview: zero-copy slice
view = memoryview(data)
slice_view = view[100:200]  # no data copied

# Pass to C extensions without copying
# struct.unpack_from, socket.send, etc. accept buffer protocol objects

Sharing between processes

from multiprocessing import shared_memory
import numpy as np

# Create shared memory block
shm = shared_memory.SharedMemory(create=True, size=8_000_000)
arr = np.ndarray((1_000_000,), dtype=np.float64, buffer=shm.buf)

# In another process: attach to existing block
shm2 = shared_memory.SharedMemory(name=shm.name)
arr2 = np.ndarray((1_000_000,), dtype=np.float64, buffer=shm2.buf)
# arr and arr2 share the same physical memory — no serialization

Cache-line optimization

A CPU cache line is typically 64 bytes. When you access one byte, the entire 64-byte line is loaded.

import numpy as np

# float64 = 8 bytes, so 8 elements per cache line
# Accessing every element: 100% cache utilization
data = np.arange(1_000_000, dtype=np.float64)
total = data.sum()  # Sequential access, perfect cache use

# Accessing every 8th element: ~12.5% cache utilization
total = data[::8].sum()  # Loads same cache lines, uses 1/8 of data

# Accessing random elements: worst case
indices = np.random.permutation(1_000_000)
total = data[indices].sum()  # Every access potentially misses cache

For multi-dimensional arrays, iteration order matters:

# C-contiguous (row-major): rows are contiguous
arr = np.zeros((1000, 1000), dtype=np.float64, order='C')

# FAST: iterate over last axis (contiguous in memory)
row_sums = arr.sum(axis=1)

# SLOW: iterate over first axis (stride 8000 bytes between elements)
col_sums = arr.sum(axis=0)

# Fortran-contiguous (column-major): columns are contiguous
arr_f = np.asfortranarray(arr)
# Now column sums are fast, row sums are slow

Measuring memory layout impact

import tracemalloc
import gc

tracemalloc.start()

# Allocate your data structure
gc.collect()
snapshot1 = tracemalloc.take_snapshot()

data = create_large_dataset()

gc.collect()
snapshot2 = tracemalloc.take_snapshot()

stats = snapshot2.compare_to(snapshot1, 'lineno')
for stat in stats[:10]:
    print(stat)

# Also check total
current, peak = tracemalloc.get_traced_memory()
print(f"Current: {current / 1024 / 1024:.1f}MB, Peak: {peak / 1024 / 1024:.1f}MB")

Decision framework

Is your data >100K elements?
├── No → Use whatever's readable (dicts, classes, lists)
└── Yes
    ├── All same numeric type?
    │   ├── Yes → NumPy array (SoA layout)
    │   └── No → Structured NumPy array or Arrow table
    ├── Need per-element attribute access?
    │   ├── Yes → @dataclass(slots=True) or namedtuple
    │   └── No → Columnar format (SoA)
    ├── Larger than RAM?
    │   ├── Yes → Memory-mapped files or chunked processing
    │   └── No → In-memory arrays
    └── Shared between processes?
        ├── Yes → shared_memory + NumPy
        └── No → Regular NumPy

The one thing to remember: CPython’s per-object overhead (16-100+ bytes) and pointer-based layout cause cache thrashing at scale — switch to contiguous, typed containers (NumPy, __slots__, struct) for data-intensive workloads where layout determines whether your code is 10× or 100× slower than it needs to be.

pythonperformanceinternals

See Also