Python heapq & Priority Queues — Deep Dive

Heap internals: array-backed binary tree

Python’s heapq stores a min-heap in a flat list. For an element at index i:

  • Left child: 2*i + 1
  • Right child: 2*i + 2
  • Parent: (i - 1) // 2

The heap invariant is: heap[parent] <= heap[child] for every node. This guarantees heap[0] is always the minimum.

heapify in O(n)

A naive approach—inserting elements one by one—costs O(n log n). The heapify() function uses a bottom-up sift-down approach that runs in O(n). It starts from the last non-leaf node and sifts each node down. Since most nodes are near the bottom and need minimal sifting, the total work is bounded by a convergent geometric series.

import heapq

data = [5, 3, 8, 1, 2, 7]
heapq.heapify(data)
# data is now [1, 2, 7, 5, 3, 8] — a valid heap, NOT sorted

Push and pop mechanics

heappush appends the item to the end of the list, then “sifts up” — swapping with its parent until the invariant holds. heappop swaps heap[0] with the last element, pops the last, then “sifts down” from the root. Both are O(log n).

heappushpop(heap, item) is optimized: if the new item is smaller than the current root, it’s returned immediately without modifying the heap. Otherwise it replaces the root and sifts down once. This avoids the overhead of a push followed by a separate pop.

Building a max-heap

Python only provides a min-heap. Three approaches for max-heap behavior:

1. Negate values:

heapq.heappush(heap, -value)
max_val = -heapq.heappop(heap)

Simple and fast, but only works for numeric values.

2. Wrapper class with reversed comparison:

from dataclasses import dataclass, field
from typing import Any

@dataclass(order=True)
class MaxItem:
    priority: int = field(compare=True)
    item: Any = field(compare=False)

    def __lt__(self, other):
        return self.priority > other.priority  # reversed

3. Use nlargest for one-shot queries:

heapq.nlargest(5, data, key=lambda x: x.score)

For repeated push/pop with max semantics, negation is the standard practice.

Handling removal from the middle

Heaps don’t support efficient arbitrary removal. Two production patterns address this:

Lazy deletion

Mark items as deleted instead of removing them. When they surface to the top via heappop, discard them.

import heapq
import itertools

class PriorityQueue:
    def __init__(self):
        self._heap = []
        self._entry_finder = {}
        self._counter = itertools.count()
        self._REMOVED = object()

    def add(self, task, priority=0):
        if task in self._entry_finder:
            self.remove(task)
        count = next(self._counter)
        entry = [priority, count, task]
        self._entry_finder[task] = entry
        heapq.heappush(self._heap, entry)

    def remove(self, task):
        entry = self._entry_finder.pop(task)
        entry[-1] = self._REMOVED

    def pop(self):
        while self._heap:
            priority, count, task = heapq.heappop(self._heap)
            if task is not self._REMOVED:
                del self._entry_finder[task]
                return task
        raise KeyError("pop from empty priority queue")

This is the pattern from the official Python docs. The trade-off: the heap may accumulate stale entries, so memory usage grows if you remove heavily without popping. Periodic rebuilding via heapify can mitigate this.

Decrease-key via re-insertion

For algorithms like Dijkstra’s, you often need to update a node’s priority. Rather than implementing decrease-key (which requires index tracking), re-insert with the new priority and use lazy deletion. CPython’s heapq is implemented in C, so the constant factors are low enough that this approach outperforms more complex alternatives in practice.

Dijkstra’s shortest path with heapq

import heapq

def dijkstra(graph, start):
    dist = {start: 0}
    heap = [(0, start)]

    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float('inf')):
            continue  # stale entry
        for v, weight in graph[u]:
            new_dist = d + weight
            if new_dist < dist.get(v, float('inf')):
                dist[v] = new_dist
                heapq.heappush(heap, (new_dist, v))

    return dist

The if d > dist[u] check is the lazy deletion in action — stale entries with outdated distances are skipped.

Merging sorted streams

heapq.merge(*iterables) returns a lazy iterator that yields items in sorted order. It maintains a heap of size k (number of iterables) and advances the iterator that yielded the smallest value.

Real-world use: merging sorted log files from 100 servers. Each file can be gigabytes, but merge only holds one element per file in memory.

import heapq

def merge_log_files(file_paths):
    openers = [open(p) for p in file_paths]
    for line in heapq.merge(*openers, key=lambda l: l.split()[0]):
        yield line
    for f in openers:
        f.close()

Performance comparison

Benchmarks on a list of 1,000,000 random integers (CPython 3.12, M2 Mac):

OperationTime
heapify then 1000 heappop~3 ms
sorted() then slice first 1000~180 ms
nsmallest(1000, data)~85 ms
1000 × min() with removal~12,000 ms

For repeated extraction, heapq is orders of magnitude faster than re-sorting or scanning.

When nsmallest/nlargest wins

If k is tiny relative to n and you only need the result once, nsmallest uses a fixed-size heap of k and makes a single pass. If k approaches n, sorted() is faster because Timsort’s constant factors are lower.

Thread safety

heapq functions are not thread-safe. If multiple threads push and pop, wrap access in a threading.Lock or use queue.PriorityQueue, which adds locking internally. queue.PriorityQueue uses heapq under the hood — it’s the same data structure with synchronization bolted on.

Alternative: sortedcontainers.SortedList

If you need O(log n) insertion and O(log n) arbitrary removal and iteration in sorted order, consider the third-party sortedcontainers.SortedList. It uses a B-tree-like structure with excellent constant factors. The trade-off is an external dependency and slightly slower push/pop compared to heapq’s C implementation.

Pitfalls recap

  1. Iterating a heap doesn’t give sorted order. Only heap[0] is guaranteed smallest.
  2. Comparing non-comparable items raises TypeError. Use a counter for tie-breaking.
  3. No built-in decrease-key. Use lazy deletion.
  4. nsmallest(1, data) is slower than min(data). Use the right tool.
  5. Memory accumulation with lazy deletion. Rebuild periodically if removal is frequent.

The one thing to remember: heapq is CPython’s workhorse for priority queues — learn lazy deletion and the counter-based tie-breaking pattern, and you can handle scheduling, graph algorithms, and stream merging without reaching for external libraries.

pythonstandard-librarydata-structures

See Also

  • Python Atexit How Python's atexit module lets your program clean up after itself right before it shuts down.
  • Python Bisect Sorted Lists How Python's bisect module finds things in sorted lists the way you'd find a word in a dictionary — by jumping to the middle.
  • Python Contextlib How Python's contextlib module makes the 'with' statement work for anything, not just files.
  • Python Copy Module Why copying data in Python isn't as simple as it sounds, and how the copy module prevents sneaky bugs.
  • Python Dataclass Field Metadata How Python dataclass fields can carry hidden notes — like sticky notes on a filing cabinet that tools read automatically.