Python Data Structures — Deep Dive
Lists: Dynamic Arrays Under the Hood
A Python list is implemented as a dynamic array — a contiguous block of pointers to PyObject structs in memory.
// Simplified CPython list struct
typedef struct {
PyObject_HEAD
Py_ssize_t ob_size; // current number of items
PyObject **ob_item; // pointer to array of pointers
Py_ssize_t allocated; // allocated slots (capacity)
} PyListObject;
The list stores pointers to objects, not the objects themselves. This is why lists can hold mixed types: the array is homogeneous (all pointers are the same size), but the objects being pointed to can be anything.
Growth Strategy
When you append to a full list, CPython reallocates the internal array with more space. The growth pattern is approximately: 0, 4, 8, 16, 25, 35, 46, 58, 72, 88…
The formula targets roughly 1.125x growth for large lists — this amortizes the reallocation cost to O(1) per append on average. You can see the allocated capacity:
import sys
lst = []
for i in range(10):
lst.append(i)
print(f"len={len(lst)}, allocated≈{sys.getsizeof(lst)}")
sys.getsizeof returns the size of the list object itself (not the objects it points to). A list of 1 million integers uses about 8MB for the pointer array, plus memory for each integer object.
Memory Implications
Lists of Python integers are memory-inefficient for large numerical data:
- Each integer is a separate heap-allocated object (28 bytes minimum for small ints)
- The list stores 8-byte pointers to each
A list of 1 million floats uses ~28MB for the float objects + 8MB for the pointer array = 36MB.
array.array('d', data) stores raw C doubles — 8 bytes each = 8MB total.
numpy.array(data, dtype=float64) is similar but adds ndarray overhead and vectorized operations.
Dictionaries: The Compact Hash Table
CPython’s dict evolved significantly in Python 3.6. The current implementation is a compact hash table designed by Raymond Hettinger.
Classic Hash Table vs. Compact Table
Classic open-addressing hash tables store (hash, key, value) at each slot. With a load factor of 2/3, 1/3 of slots are empty — wasting space.
CPython’s compact dict separates the hash index from the entry storage:
indices: [empty, 0, empty, 1, empty, empty, 2, empty]
entries: [(hash0, key0, val0), (hash1, key1, val1), (hash2, key2, val2)]
The indices array is small (one int per slot), and entries are stored compactly in insertion order. This gives:
- Preserved insertion order (free, as a side effect)
- Better cache performance (entries are contiguous)
- ~20-25% memory savings vs. the old layout
Lookup Process
For d[key]:
- Compute
hash(key)— callskey.__hash__() - Index into the indices array:
slot = hash % len(indices) - If
indices[slot]is empty → KeyError - Otherwise,
i = indices[slot]; fetchentries[i] - If
entries[i].hash == hash and entries[i].key == key→ return value - Otherwise → probe to next slot (linear or quadratic probing)
Lookup is O(1) average case. Worst case O(n) if every key collides — but in practice, Python’s hash function makes this extremely rare for typical data.
Hash Collisions and Denial-of-Service
Before Python 3.3, string hash values were deterministic. An attacker could craft inputs that all map to the same dict slot, degrading O(1) operations to O(n) — a hash flooding attack.
Python 3.3+ uses hash randomization (SipHash-1-3 for strings/bytes, with a random seed per process). The seed changes every run, making targeted collision attacks infeasible.
# This changes every run
print(hash("hello")) # e.g., 3713081631934410656 today, different tomorrow
You can disable this with PYTHONHASHSEED=0 for reproducible hashing (e.g., in tests).
Keys Must Be Hashable
Dict keys must be hashable — they must implement __hash__ and __eq__. The invariant: if a == b, then hash(a) == hash(b).
Built-in hashable types: int, float, str, tuple (of hashable items), frozenset.
Unhashable: list, dict, set — mutable containers can change after being used as a key, breaking the invariant.
d = {}
d[(1, 2)] = "point" # tuple key — OK
d[[1, 2]] = "list" # TypeError: unhashable type: 'list'
Set Implementation
Sets are essentially dicts with no values — just the (hash, key) pairs. They use the same compact hash table internally, with the same O(1) average operations.
Set operations use C-level bit manipulation on the internal tables:
a = {1, 2, 3, 4, 5}
b = {3, 4, 5, 6, 7}
# These are implemented as hash table walks, O(min(len(a), len(b)))
intersection = a & b # {3, 4, 5}
union = a | b # {1, 2, 3, 4, 5, 6, 7}
issubset and issuperset can be faster: if len(a) > len(b), a can’t be a subset of b without even checking the hash tables.
Tuple Storage: Compact and Fast
Tuples, being immutable, have a simpler layout than lists:
typedef struct {
PyObject_HEAD
Py_ssize_t ob_size;
PyObject *ob_item[1]; // Flexible array member
} PyTupleObject;
The items are stored in the object itself (not a separate allocation). This makes tuples faster to create and access, and better for cache locality.
CPython also caches small empty and single-element tuples as singletons, and maintains a free list of recently-used tuple objects up to length 20 to avoid repeated allocation/deallocation overhead.
When to Use collections Module
The standard collections module provides specialized data structures:
collections.deque
A doubly-linked list of fixed-size blocks. O(1) at both ends:
from collections import deque
q = deque(maxlen=5) # Fixed-size circular buffer
q.appendleft(1) # O(1) — lists are O(n) for this
q.popleft() # O(1)
Use deque when you need fast appends/pops from both ends, or a fixed-size queue.
collections.OrderedDict
Largely superseded by regular dicts in Python 3.7+, but has move_to_end() and compares order-sensitively:
from collections import OrderedDict
d = OrderedDict()
d['a'] = 1
d.move_to_end('a') # Move to front or back
collections.defaultdict
from collections import defaultdict
graph = defaultdict(list)
graph['A'].append('B') # No KeyError for new keys
Internally, defaultdict overrides __missing__ — called when a key is absent — to create the default value.
collections.Counter
Built on dict, adds .most_common(), arithmetic, and element counting:
from collections import Counter
c = Counter("aababcabcd")
print(c) # Counter({'a': 4, 'b': 3, 'c': 2, 'd': 1})
print(c.most_common(2)) # [('a', 4), ('b', 3)]
c2 = Counter("abcd")
print(c - c2) # Counter({'a': 3, 'b': 2, 'c': 1})
Memory-Efficient Alternatives
For large numerical datasets, Python’s built-ins aren’t the right tool:
| Use case | Tool | Memory vs. list |
|---|---|---|
| Array of uniform numbers | array.array | ~4-8x smaller |
| Numerical computation | numpy.ndarray | ~4-8x smaller + vectorization |
| Large integers | gmpy2.mpz | Faster than Python’s int for huge numbers |
| Memory-mapped data | mmap | Zero copy, backed by file |
The Python array module is often overlooked:
import array
floats = array.array('d', [1.0, 2.0, 3.0]) # 'd' = C double
# ~8 bytes per element vs ~28 bytes per float object in a list
One Thing to Remember
Python’s dict and set are O(1) for insertion, deletion, and lookup because they’re hash tables — and since Python 3.6, the compact hash table design gives you insertion-order preservation essentially for free. When you’re hitting memory or performance limits, that’s when you reach for
array,numpy, orcollections.
See Also
- Python Async Await Async/await helps one Python program juggle many waiting jobs at once, like a chef who keeps multiple pots moving without standing still.
- Python Basics Python is the programming language that reads like plain English — here's why millions of beginners (and experts) choose it first.
- Python Booleans Make Booleans click with one clear analogy you can reuse whenever Python feels confusing.
- Python Break Continue Make Break Continue click with one clear analogy you can reuse whenever Python feels confusing.
- Python Closures See how Python functions can remember private information, even after the outer function has already finished.