Python Data Structures — Deep Dive

CPython's list as a dynamic array, dict as a compact hash table, set internals, memory layout, and when to reach for collections, array, or numpy instead.

Lists: Dynamic Arrays Under the Hood

A Python list is implemented as a dynamic array — a contiguous block of pointers to PyObject structs in memory.

// Simplified CPython list struct
typedef struct {
    PyObject_HEAD
    Py_ssize_t ob_size;      // current number of items
    PyObject **ob_item;      // pointer to array of pointers
    Py_ssize_t allocated;    // allocated slots (capacity)
} PyListObject;

The list stores pointers to objects, not the objects themselves. This is why lists can hold mixed types: the array is homogeneous (all pointers are the same size), but the objects being pointed to can be anything.

Growth Strategy

When you append to a full list, CPython reallocates the internal array with more space. The growth pattern is approximately: 0, 4, 8, 16, 25, 35, 46, 58, 72, 88…

The formula targets roughly 1.125x growth for large lists — this amortizes the reallocation cost to O(1) per append on average. You can see the allocated capacity:

import sys
lst = []
for i in range(10):
    lst.append(i)
    print(f"len={len(lst)}, allocated≈{sys.getsizeof(lst)}")

sys.getsizeof returns the size of the list object itself (not the objects it points to). A list of 1 million integers uses about 8MB for the pointer array, plus memory for each integer object.

Memory Implications

Lists of Python integers are memory-inefficient for large numerical data:

Each integer is a separate heap-allocated object (28 bytes minimum for small ints)
The list stores 8-byte pointers to each

A list of 1 million floats uses ~28MB for the float objects + 8MB for the pointer array = 36MB.

array.array('d', data) stores raw C doubles — 8 bytes each = 8MB total.

numpy.array(data, dtype=float64) is similar but adds ndarray overhead and vectorized operations.

Dictionaries: The Compact Hash Table

CPython’s dict evolved significantly in Python 3.6. The current implementation is a compact hash table designed by Raymond Hettinger.

Classic Hash Table vs. Compact Table

Classic open-addressing hash tables store (hash, key, value) at each slot. With a load factor of 2/3, 1/3 of slots are empty — wasting space.

CPython’s compact dict separates the hash index from the entry storage:

indices: [empty, 0, empty, 1, empty, empty, 2, empty]
entries: [(hash0, key0, val0), (hash1, key1, val1), (hash2, key2, val2)]

The indices array is small (one int per slot), and entries are stored compactly in insertion order. This gives:

Preserved insertion order (free, as a side effect)
Better cache performance (entries are contiguous)
~20-25% memory savings vs. the old layout

Lookup Process

For d[key]:

Compute hash(key) — calls key.__hash__()
Index into the indices array: slot = hash % len(indices)
If indices[slot] is empty → KeyError
Otherwise, i = indices[slot]; fetch entries[i]
If entries[i].hash == hash and entries[i].key == key → return value
Otherwise → probe to next slot (linear or quadratic probing)

Lookup is O(1) average case. Worst case O(n) if every key collides — but in practice, Python’s hash function makes this extremely rare for typical data.

Hash Collisions and Denial-of-Service

Before Python 3.3, string hash values were deterministic. An attacker could craft inputs that all map to the same dict slot, degrading O(1) operations to O(n) — a hash flooding attack.

Python 3.3+ uses hash randomization (SipHash-1-3 for strings/bytes, with a random seed per process). The seed changes every run, making targeted collision attacks infeasible.

# This changes every run
print(hash("hello"))   # e.g., 3713081631934410656 today, different tomorrow

You can disable this with PYTHONHASHSEED=0 for reproducible hashing (e.g., in tests).

Keys Must Be Hashable

Dict keys must be hashable — they must implement __hash__ and __eq__. The invariant: if a == b, then hash(a) == hash(b).

Built-in hashable types: int, float, str, tuple (of hashable items), frozenset. Unhashable: list, dict, set — mutable containers can change after being used as a key, breaking the invariant.

d = {}
d[(1, 2)] = "point"     # tuple key — OK
d[[1, 2]] = "list"      # TypeError: unhashable type: 'list'

Set Implementation

Sets are essentially dicts with no values — just the (hash, key) pairs. They use the same compact hash table internally, with the same O(1) average operations.

Set operations use C-level bit manipulation on the internal tables:

a = {1, 2, 3, 4, 5}
b = {3, 4, 5, 6, 7}

# These are implemented as hash table walks, O(min(len(a), len(b)))
intersection = a & b    # {3, 4, 5}
union = a | b           # {1, 2, 3, 4, 5, 6, 7}

issubset and issuperset can be faster: if len(a) > len(b), a can’t be a subset of b without even checking the hash tables.

Tuple Storage: Compact and Fast

Tuples, being immutable, have a simpler layout than lists:

typedef struct {
    PyObject_HEAD
    Py_ssize_t ob_size;
    PyObject *ob_item[1];   // Flexible array member
} PyTupleObject;

The items are stored in the object itself (not a separate allocation). This makes tuples faster to create and access, and better for cache locality.

CPython also caches small empty and single-element tuples as singletons, and maintains a free list of recently-used tuple objects up to length 20 to avoid repeated allocation/deallocation overhead.

When to Use collections Module

The standard collections module provides specialized data structures:

collections.deque

A doubly-linked list of fixed-size blocks. O(1) at both ends:

from collections import deque
q = deque(maxlen=5)   # Fixed-size circular buffer
q.appendleft(1)       # O(1) — lists are O(n) for this
q.popleft()           # O(1)

Use deque when you need fast appends/pops from both ends, or a fixed-size queue.

collections.OrderedDict

Largely superseded by regular dicts in Python 3.7+, but has move_to_end() and compares order-sensitively:

from collections import OrderedDict
d = OrderedDict()
d['a'] = 1
d.move_to_end('a')   # Move to front or back

collections.defaultdict

from collections import defaultdict
graph = defaultdict(list)
graph['A'].append('B')   # No KeyError for new keys

Internally, defaultdict overrides __missing__ — called when a key is absent — to create the default value.

collections.Counter

Built on dict, adds .most_common(), arithmetic, and element counting:

from collections import Counter
c = Counter("aababcabcd")
print(c)              # Counter({'a': 4, 'b': 3, 'c': 2, 'd': 1})
print(c.most_common(2))  # [('a', 4), ('b', 3)]
c2 = Counter("abcd")
print(c - c2)         # Counter({'a': 3, 'b': 2, 'c': 1})

Memory-Efficient Alternatives

For large numerical datasets, Python’s built-ins aren’t the right tool:

Use case	Tool	Memory vs. list
Array of uniform numbers	`array.array`	~4-8x smaller
Numerical computation	`numpy.ndarray`	~4-8x smaller + vectorization
Large integers	`gmpy2.mpz`	Faster than Python’s int for huge numbers
Memory-mapped data	`mmap`	Zero copy, backed by file

The Python array module is often overlooked:

import array
floats = array.array('d', [1.0, 2.0, 3.0])   # 'd' = C double
# ~8 bytes per element vs ~28 bytes per float object in a list

One Thing to Remember

Python’s dict and set are O(1) for insertion, deletion, and lookup because they’re hash tables — and since Python 3.6, the compact hash table design gives you insertion-order preservation essentially for free. When you’re hitting memory or performance limits, that’s when you reach for array, numpy, or collections.

pythondata-structureshash-tabledynamic-arraycpythoninternalscollections