Hash Tables Internals — Deep Dive
Technical framing
CPython’s dict is one of the most performance-critical data structures in the interpreter. Module globals, object attributes (__dict__), keyword arguments, and class definitions all use dictionaries. The implementation has evolved significantly: Python 3.6 introduced the compact+ordered layout (PEP 468/509), and subsequent releases have continued to optimize memory and speed.
Understanding the C-level implementation helps you predict performance, avoid pathological cases, and write more cache-friendly code.
CPython’s compact dict layout
Since Python 3.6, a dict consists of two arrays:
1. The index table (sparse)
A compact array of indices, where each entry is either an index into the entries array or a sentinel value (-1 for empty, -2 for deleted). The size of each index entry depends on the table size:
- Tables ≤ 128 entries: 1 byte per index (int8)
- Tables ≤ 32,768: 2 bytes per index (int16)
- Tables ≤ 2,147,483,648: 4 bytes per index (int32)
- Larger: 8 bytes per index (int64)
2. The entries array (dense)
A contiguous array of (hash, key, value) tuples stored in insertion order. No gaps.
Index table: [ 1, -1, -1, 0, -1, 2, -1, -1]
Entries array: [(hash_a, "a", 1), (hash_b, "b", 2), (hash_c, "c", 3)]
To look up key “a”: compute hash("a") % 8 = 3, read index table slot 3, get 0, look up entries[0]. Done.
Memory savings
The old layout stored (hash, key, value) directly in the sparse table, wasting space for empty slots (which are 8-byte pointers × 3 per slot). The new layout wastes only 1-8 bytes per empty slot in the index table.
For a dict with 100,000 entries, memory usage dropped approximately 25% compared to Python 3.5.
Hash functions in CPython
String hashing
CPython uses SipHash-1-3 (as of Python 3.4+) for strings. SipHash is a keyed hash function that resists hash-flooding attacks while remaining fast:
hash = SipHash_1_3(key=random_secret, data=string_bytes)
The secret key is generated randomly at interpreter startup (PYTHONHASHSEED). This means hash("hello") returns different values across Python processes.
Integer hashing
Small integers hash to themselves: hash(42) == 42. For larger integers, Python uses the formula:
hash(n) = n % MODULUS # where MODULUS = 2^61 - 1 on 64-bit systems
This is a Mersenne prime, chosen for efficient modular arithmetic.
Custom object hashing
When you define __hash__ on a custom class, follow these rules:
class Point:
def __init__(self, x, y):
self.x = x
self.y = y
def __eq__(self, other):
return isinstance(other, Point) and self.x == other.x and self.y == other.y
def __hash__(self):
return hash((self.x, self.y)) # delegate to tuple hash
Critical rules:
- If
a == b, thenhash(a) == hash(b)(mandatory). - If you override
__eq__, you must override__hash__. - Objects used as dict keys must be immutable (or at least their hash-relevant attributes must not change).
Collision resolution: open addressing details
CPython uses open addressing with a perturbation-based probe sequence:
perturb = hash;
i = hash & mask; // mask = table_size - 1
while (table[i] is occupied and table[i].key != target_key) {
i = (5 * i + 1 + perturb) & mask;
perturb >>= PERTURB_SHIFT; // PERTURB_SHIFT = 5
}
Why this works well:
- The
5 * i + 1part generates a full permutation of table indices (sincegcd(5, 2^n) = 1). - The
perturbcomponent uses higher bits of the hash to break up clusters that form when only the low bits matter. - As
perturbshifts to zero, the probe degrades to5 * i + 1— still a complete permutation.
Deletion: tombstones
When an entry is deleted, its index slot is marked as “deleted” (tombstone) rather than “empty.” This is necessary because probing chains would break if a deleted slot looked empty. During resize, tombstones are cleaned up.
Too many tombstones waste probe steps. Python triggers a resize (which cleans tombstones) when used + deleted > 2/3 * table_size.
Hash randomization and security
Before Python 3.3, string hash values were deterministic. This allowed hash-flooding attacks: an attacker could send specially crafted keys that all hash to the same slot, degrading O(1) lookups to O(n²) and causing denial of service.
Python 3.3+ randomizes hash seeds at startup via PYTHONHASHSEED. To get deterministic hashes (for testing):
PYTHONHASHSEED=0 python my_script.py
Key-sharing dictionaries
When multiple instances of the same class have the same set of attribute names (common pattern), CPython shares the key table across instances:
class User:
def __init__(self, name, age):
self.name = name
self.age = age
# All User instances share the same keys table ["name", "age"]
# Each instance only stores its own values array
This saves significant memory when creating many instances of the same class. The optimization is transparent and automatic.
Performance edge cases
Pathological hash collisions
If all keys hash to the same value, every lookup degrades to O(n):
# Demonstration: custom class with bad hash
class BadHash:
def __init__(self, value):
self.value = value
def __hash__(self):
return 1 # all instances collide
def __eq__(self, other):
return isinstance(other, BadHash) and self.value == other.value
# This dict has O(n) lookups:
d = {BadHash(i): i for i in range(10000)}
Resize storms
Rapidly adding and deleting entries can trigger frequent resizes. If you know the final size, pre-allocating with dict.fromkeys() can help:
# Pre-allocate a dict with known keys
keys = range(100000)
d = dict.fromkeys(keys) # single allocation + resize
for k in keys:
d[k] = compute_value(k)
Integer key optimization
Since small integers hash to themselves and hash(n) % table_size for sequential integers produces sequential indices, dictionaries with sequential integer keys have excellent cache locality. This makes dict surprisingly fast for sparse arrays indexed by integers.
Implementing a hash table from scratch
class HashTable:
EMPTY = object()
DELETED = object()
def __init__(self, capacity=8):
self.capacity = capacity
self.size = 0
self.keys = [self.EMPTY] * capacity
self.values = [self.EMPTY] * capacity
def _probe(self, key):
h = hash(key)
idx = h % self.capacity
perturb = h
while True:
if self.keys[idx] is self.EMPTY:
return idx, False
if self.keys[idx] is not self.DELETED and self.keys[idx] == key:
return idx, True
idx = (5 * idx + 1 + perturb) % self.capacity
perturb >>= 5
def _resize(self):
old_keys, old_values = self.keys, self.values
self.capacity *= 2
self.keys = [self.EMPTY] * self.capacity
self.values = [self.EMPTY] * self.capacity
self.size = 0
for k, v in zip(old_keys, old_values):
if k is not self.EMPTY and k is not self.DELETED:
self[k] = v
def __setitem__(self, key, value):
if self.size >= self.capacity * 2 // 3:
self._resize()
idx, found = self._probe(key)
if not found:
self.size += 1
self.keys[idx] = key
self.values[idx] = value
def __getitem__(self, key):
idx, found = self._probe(key)
if not found:
raise KeyError(key)
return self.values[idx]
def __delitem__(self, key):
idx, found = self._probe(key)
if not found:
raise KeyError(key)
self.keys[idx] = self.DELETED
self.values[idx] = self.EMPTY
self.size -= 1
def __contains__(self, key):
_, found = self._probe(key)
return found
This implementation mirrors CPython’s approach: open addressing, perturbation probing, tombstone deletion, and 2/3 load factor resizing.
Benchmarks: dict vs. alternatives
| Operation (100K entries) | dict | list (linear scan) | sorted list + bisect | B-tree (sortedcontainers) |
|---|---|---|---|---|
| Lookup | ~50 ns | ~2 ms | ~2 μs | ~500 ns |
| Insert | ~60 ns | ~50 ns (append) | ~500 μs (shift) | ~1 μs |
| Delete | ~60 ns | ~1 ms (search+shift) | ~500 μs | ~1 μs |
| Memory per entry | ~90 bytes | ~8 bytes | ~8 bytes | ~50 bytes |
The dict trades memory for O(1) speed. For data that needs ordering, sorted containers are the alternative.
One thing to remember: Python’s dict is not just a hash table — it is a carefully engineered data structure with insertion ordering, key sharing, hash randomization, and compact memory layout. Respect what it does for you, and understand its edge cases so you never accidentally fight against it.
See Also
- Python Backtracking Algorithms How computers solve puzzles by trying every path, backing up when stuck, and never giving up until they find the answer.
- Python Big O Complexity Analysis Why some programs finish in a blink while others take forever — explained with pizza delivery and toy cleanup.
- Python Binary Search Implementation Find anything in a sorted list insanely fast using the same trick you already use with dictionaries and phone books.
- Python Dynamic Programming The clever trick of remembering answers you already figured out so you never solve the same puzzle twice.
- Python Graph Algorithms How computers navigate maps, friendships, and connections using dots and lines — explained without any math.