Hash Tables Internals — Deep Dive

Explore CPython's dict implementation at the C level — compact layout, collision resolution, hash randomization, and performance edge cases.

Technical framing

CPython’s dict is one of the most performance-critical data structures in the interpreter. Module globals, object attributes (__dict__), keyword arguments, and class definitions all use dictionaries. The implementation has evolved significantly: Python 3.6 introduced the compact+ordered layout (PEP 468/509), and subsequent releases have continued to optimize memory and speed.

Understanding the C-level implementation helps you predict performance, avoid pathological cases, and write more cache-friendly code.

CPython’s compact dict layout

Since Python 3.6, a dict consists of two arrays:

1. The index table (sparse)

A compact array of indices, where each entry is either an index into the entries array or a sentinel value (-1 for empty, -2 for deleted). The size of each index entry depends on the table size:

Tables ≤ 128 entries: 1 byte per index (int8)
Tables ≤ 32,768: 2 bytes per index (int16)
Tables ≤ 2,147,483,648: 4 bytes per index (int32)
Larger: 8 bytes per index (int64)

2. The entries array (dense)

A contiguous array of (hash, key, value) tuples stored in insertion order. No gaps.

Index table:     [ 1, -1, -1,  0, -1,  2, -1, -1]
Entries array:   [(hash_a, "a", 1), (hash_b, "b", 2), (hash_c, "c", 3)]

To look up key “a”: compute hash("a") % 8 = 3, read index table slot 3, get 0, look up entries[0]. Done.

Memory savings

The old layout stored (hash, key, value) directly in the sparse table, wasting space for empty slots (which are 8-byte pointers × 3 per slot). The new layout wastes only 1-8 bytes per empty slot in the index table.

For a dict with 100,000 entries, memory usage dropped approximately 25% compared to Python 3.5.

Hash functions in CPython

String hashing

CPython uses SipHash-1-3 (as of Python 3.4+) for strings. SipHash is a keyed hash function that resists hash-flooding attacks while remaining fast:

hash = SipHash_1_3(key=random_secret, data=string_bytes)

The secret key is generated randomly at interpreter startup (PYTHONHASHSEED). This means hash("hello") returns different values across Python processes.

Integer hashing

Small integers hash to themselves: hash(42) == 42. For larger integers, Python uses the formula:

hash(n) = n % MODULUS  # where MODULUS = 2^61 - 1 on 64-bit systems

This is a Mersenne prime, chosen for efficient modular arithmetic.

Custom object hashing

When you define __hash__ on a custom class, follow these rules:

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y
    
    def __eq__(self, other):
        return isinstance(other, Point) and self.x == other.x and self.y == other.y
    
    def __hash__(self):
        return hash((self.x, self.y))  # delegate to tuple hash

Critical rules:

If a == b, then hash(a) == hash(b) (mandatory).
If you override __eq__, you must override __hash__.
Objects used as dict keys must be immutable (or at least their hash-relevant attributes must not change).

Collision resolution: open addressing details

CPython uses open addressing with a perturbation-based probe sequence:

perturb = hash;
i = hash & mask;  // mask = table_size - 1

while (table[i] is occupied and table[i].key != target_key) {
    i = (5 * i + 1 + perturb) & mask;
    perturb >>= PERTURB_SHIFT;  // PERTURB_SHIFT = 5
}

Why this works well:

The 5 * i + 1 part generates a full permutation of table indices (since gcd(5, 2^n) = 1).
The perturb component uses higher bits of the hash to break up clusters that form when only the low bits matter.
As perturb shifts to zero, the probe degrades to 5 * i + 1 — still a complete permutation.

Deletion: tombstones

When an entry is deleted, its index slot is marked as “deleted” (tombstone) rather than “empty.” This is necessary because probing chains would break if a deleted slot looked empty. During resize, tombstones are cleaned up.

Too many tombstones waste probe steps. Python triggers a resize (which cleans tombstones) when used + deleted > 2/3 * table_size.

Hash randomization and security

Before Python 3.3, string hash values were deterministic. This allowed hash-flooding attacks: an attacker could send specially crafted keys that all hash to the same slot, degrading O(1) lookups to O(n²) and causing denial of service.

Python 3.3+ randomizes hash seeds at startup via PYTHONHASHSEED. To get deterministic hashes (for testing):

PYTHONHASHSEED=0 python my_script.py

When multiple instances of the same class have the same set of attribute names (common pattern), CPython shares the key table across instances:

class User:
    def __init__(self, name, age):
        self.name = name
        self.age = age

# All User instances share the same keys table ["name", "age"]
# Each instance only stores its own values array

This saves significant memory when creating many instances of the same class. The optimization is transparent and automatic.

Performance edge cases

Pathological hash collisions

If all keys hash to the same value, every lookup degrades to O(n):

# Demonstration: custom class with bad hash
class BadHash:
    def __init__(self, value):
        self.value = value
    def __hash__(self):
        return 1  # all instances collide
    def __eq__(self, other):
        return isinstance(other, BadHash) and self.value == other.value

# This dict has O(n) lookups:
d = {BadHash(i): i for i in range(10000)}

Resize storms

Rapidly adding and deleting entries can trigger frequent resizes. If you know the final size, pre-allocating with dict.fromkeys() can help:

# Pre-allocate a dict with known keys
keys = range(100000)
d = dict.fromkeys(keys)  # single allocation + resize
for k in keys:
    d[k] = compute_value(k)

Integer key optimization

Since small integers hash to themselves and hash(n) % table_size for sequential integers produces sequential indices, dictionaries with sequential integer keys have excellent cache locality. This makes dict surprisingly fast for sparse arrays indexed by integers.

Implementing a hash table from scratch

class HashTable:
    EMPTY = object()
    DELETED = object()
    
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.size = 0
        self.keys = [self.EMPTY] * capacity
        self.values = [self.EMPTY] * capacity
    
    def _probe(self, key):
        h = hash(key)
        idx = h % self.capacity
        perturb = h
        
        while True:
            if self.keys[idx] is self.EMPTY:
                return idx, False
            if self.keys[idx] is not self.DELETED and self.keys[idx] == key:
                return idx, True
            idx = (5 * idx + 1 + perturb) % self.capacity
            perturb >>= 5
    
    def _resize(self):
        old_keys, old_values = self.keys, self.values
        self.capacity *= 2
        self.keys = [self.EMPTY] * self.capacity
        self.values = [self.EMPTY] * self.capacity
        self.size = 0
        
        for k, v in zip(old_keys, old_values):
            if k is not self.EMPTY and k is not self.DELETED:
                self[k] = v
    
    def __setitem__(self, key, value):
        if self.size >= self.capacity * 2 // 3:
            self._resize()
        idx, found = self._probe(key)
        if not found:
            self.size += 1
        self.keys[idx] = key
        self.values[idx] = value
    
    def __getitem__(self, key):
        idx, found = self._probe(key)
        if not found:
            raise KeyError(key)
        return self.values[idx]
    
    def __delitem__(self, key):
        idx, found = self._probe(key)
        if not found:
            raise KeyError(key)
        self.keys[idx] = self.DELETED
        self.values[idx] = self.EMPTY
        self.size -= 1
    
    def __contains__(self, key):
        _, found = self._probe(key)
        return found

This implementation mirrors CPython’s approach: open addressing, perturbation probing, tombstone deletion, and 2/3 load factor resizing.

Benchmarks: dict vs. alternatives

Operation (100K entries)	dict	list (linear scan)	sorted list + bisect	B-tree (sortedcontainers)
Lookup	~50 ns	~2 ms	~2 μs	~500 ns
Insert	~60 ns	~50 ns (append)	~500 μs (shift)	~1 μs
Delete	~60 ns	~1 ms (search+shift)	~500 μs	~1 μs
Memory per entry	~90 bytes	~8 bytes	~8 bytes	~50 bytes

The dict trades memory for O(1) speed. For data that needs ordering, sorted containers are the alternative.

One thing to remember: Python’s dict is not just a hash table — it is a carefully engineered data structure with insertion ordering, key sharing, hash randomization, and compact memory layout. Respect what it does for you, and understand its edge cases so you never accidentally fight against it.

pythondata-structuresdictionaries