Hash Tables Internals — Core Concepts

Why hash table internals matter

Python dictionaries and sets are built on hash tables. They are among the most-used data structures in the language — function keyword arguments, class attributes, module namespaces, and JSON parsing all rely on dict. Understanding how hash tables work helps you write faster code, debug unexpected behavior, and know when to reach for alternatives.

How hashing works

A hash function converts a key into an integer (the hash value). Python uses the built-in hash() function:

hash("hello")     # e.g., 2534051032171071756
hash(42)          # 42 (integers hash to themselves for small values)
hash((1, 2, 3))   # tuples are hashable
# hash([1, 2, 3]) # TypeError: unhashable type: 'list'

Rules for hash functions:

  • Deterministic: the same key always produces the same hash.
  • Fast: hashing should be O(1) for typical keys.
  • Uniform: hash values should spread evenly across the table.
  • Consistent with equality: if a == b, then hash(a) == hash(b).

Only immutable types are hashable in Python: strings, numbers, tuples (of hashable elements), frozensets. This is why lists and dicts cannot be dictionary keys.

From hash to index

The hash value is converted to a table index using modular arithmetic:

index = hash(key) % table_size

If the table has 8 slots and hash("hello") = 2534051032171071756, the index is 2534051032171071756 % 8 = 4. The key-value pair goes in slot 4.

Collision resolution

When two keys hash to the same index, you have a collision. Python uses open addressing with probing — it looks for the next available slot using a probe sequence.

CPython’s probe formula:

next_index = (5 * current_index + 1 + perturb) % table_size
perturb >>= 5

The perturb value starts as the full hash and is shifted right each step. This creates a pseudo-random probe sequence that avoids clustering (where collisions bunch together).

Load factor and resizing

The load factor is the ratio of filled slots to total slots. When the table becomes too full, lookups slow down because more probing is needed.

Python resizes the table when it reaches about 2/3 full:

  • Tables start with 8 slots.
  • When resized, the table grows to the next power of 2 that keeps the load factor below 2/3.
  • All existing entries are re-inserted (rehashed) into the new table.

Resizing is O(n), but it happens infrequently enough that the amortized cost per insertion remains O(1).

Dictionary ordering

Since Python 3.7, dictionaries maintain insertion order. This is implemented using a compact layout:

  1. A hash table (sparse array) stores indices into a dense array.
  2. A dense array stores entries in insertion order: [(hash, key, value), ...].

This design uses less memory than the old layout (which stored entries directly in the sparse table with many empty slots) and preserves order as a free bonus.

Sets vs. dictionaries

A set is essentially a dict without values — it uses the same hash table mechanism. This is why sets have O(1) average-case membership testing:

my_set = {1, 2, 3, 4, 5}
3 in my_set  # O(1) — uses hash table lookup

Performance characteristics

OperationAverage caseWorst case
Lookup (d[key])O(1)O(n)
Insert (d[key] = val)O(1)O(n)
Delete (del d[key])O(1)O(n)
IterationO(n)O(n)

The O(n) worst case happens when many keys collide — extremely rare with Python’s hash function for typical data.

Common misconception

“Dictionaries are slow because they use more memory.” Dictionaries trade memory for speed. A list lookup by value is O(n); a dict lookup by key is O(1). For anything beyond a handful of items, the speed advantage of dictionaries far outweighs the memory cost. Python’s compact dict implementation also reduced memory usage by 20-25% compared to Python 3.5.

One thing to remember: When you need fast lookups by key, reach for a dictionary. When you need fast membership checks, reach for a set. Both are hash tables underneath, and they are among the most optimized data structures in CPython.

pythondata-structuresdictionaries

See Also