Python Audio Fingerprinting — Deep Dive

Build a production audio fingerprinting system in Python — from spectrogram math and landmark hashing to database design, matching algorithms, and scaling to millions of tracks.

The Shazam algorithm in depth

The foundational paper is Avery Wang’s “An Industrial-Strength Audio Search Algorithm” (2003). The core insight: spectrogram peaks form a constellation that is unique to each recording and survives severe degradation (noise, compression, reverberation).

Step 1: Preprocessing

Convert to mono, resample to a standard rate (e.g., 8 000 Hz for efficiency or 11 025 Hz for better frequency resolution), and apply a high-pass filter to remove DC offset and low-frequency rumble.

import librosa
import numpy as np

y, sr = librosa.load("song.mp3", sr=11025, mono=True)
# High-pass filter at 100 Hz
from scipy.signal import butter, sosfilt
sos = butter(5, 100, btype='high', fs=sr, output='sos')
y = sosfilt(sos, y)

Step 2: Spectrogram computation

from scipy.signal import spectrogram

nperseg = 1024  # ~93 ms window at 11025 Hz
noverlap = nperseg // 2
f, t, Sxx = spectrogram(y, fs=sr, nperseg=nperseg, noverlap=noverlap)
Sxx_db = 10 * np.log10(Sxx + 1e-10)

The choice of window size affects the time-frequency tradeoff. Larger windows give better frequency resolution (important for distinguishing nearby harmonics) but worse time resolution. For fingerprinting, 1024 samples at 11 025 Hz is a good balance.

Step 3: Peak detection (constellation map)

from scipy.ndimage import maximum_filter, minimum_filter

def detect_peaks(Sxx_db, amp_min=10, neighborhood_size=20):
    """Find local maxima in the spectrogram."""
    local_max = maximum_filter(Sxx_db, size=neighborhood_size)
    
    # Point is a peak if it equals the local max and exceeds threshold
    detected = (Sxx_db == local_max) & (Sxx_db > amp_min)
    
    # Also require it stands out from local minimum
    local_min = minimum_filter(Sxx_db, size=neighborhood_size)
    detected = detected & ((Sxx_db - local_min) > amp_min)
    
    freq_idx, time_idx = np.where(detected)
    return list(zip(time_idx, freq_idx, Sxx_db[freq_idx, time_idx]))

The neighborhood_size parameter controls peak density. Too small: thousands of spurious peaks per second. Too large: you miss genuine features. Shazam reportedly uses an adaptive threshold per frequency band.

Step 4: Combinatorial hashing with target zones

For each peak (the “anchor”), define a target zone: a rectangular region offset in time from the anchor (e.g., 1–10 frames ahead, any frequency). Pair the anchor with every peak in its target zone.

def generate_hashes(peaks, fan_out=15, time_delta_min=1, time_delta_max=10):
    """Generate (hash, time_offset) pairs from constellation peaks."""
    # Sort by time
    peaks.sort(key=lambda p: p[0])
    
    hashes = []
    for i, (t1, f1, _) in enumerate(peaks):
        targets = 0
        for j in range(i + 1, len(peaks)):
            t2, f2, _ = peaks[j]
            dt = t2 - t1
            if dt < time_delta_min:
                continue
            if dt > time_delta_max:
                break
            
            # Hash: (freq1, freq2, delta_time) → 32-bit integer
            h = (f1 & 0x3FF) << 20 | (f2 & 0x3FF) << 10 | (dt & 0x3FF)
            hashes.append((h, t1))
            
            targets += 1
            if targets >= fan_out:
                break
    
    return hashes

The fan_out controls how many pairs per anchor. More pairs mean better recall at the cost of database size and query time. 15–20 is typical.

Step 5: Database storage

CREATE TABLE fingerprints (
    hash       INTEGER NOT NULL,
    song_id    INTEGER NOT NULL,
    time_offset INTEGER NOT NULL,
    INDEX idx_hash (hash)
);

For each song, insert all (hash, song_id, time_offset) tuples. A 3-minute song at 11 025 Hz typically produces 10 000–50 000 fingerprint entries depending on peak density and fan-out.

Step 6: Query and matching

from collections import defaultdict, Counter

def identify(query_hashes, db_cursor):
    """Match query hashes against the database."""
    # Collect all candidate matches
    candidates = defaultdict(list)
    
    for q_hash, q_time in query_hashes:
        db_cursor.execute("SELECT song_id, time_offset FROM fingerprints WHERE hash = %s", (q_hash,))
        for song_id, db_time in db_cursor.fetchall():
            # Time alignment: the difference should be constant for true matches
            delta = db_time - q_time
            candidates[song_id].append(delta)
    
    # Score: count how many hashes align at the same time offset
    scores = {}
    for song_id, deltas in candidates.items():
        offset_counts = Counter(deltas)
        best_offset, best_count = offset_counts.most_common(1)[0]
        scores[song_id] = best_count
    
    if not scores:
        return None
    
    best_song = max(scores, key=scores.get)
    return best_song, scores[best_song]

The key insight: for a true match, all matching hashes will share the same time offset (the difference between the query’s position in the clip and the song’s absolute time). A histogram of offsets with a sharp peak indicates a confident match.

Scaling to millions of tracks

Database choice

Scale	Recommended store	Notes
< 10K songs	SQLite	Simple, single file, fast enough
10K–100K songs	PostgreSQL with hash index	B-tree on hash column, connection pooling
100K–1M songs	Redis (hash → sorted set)	In-memory, sub-millisecond lookups
1M+ songs	Custom sharded store	Partition by hash prefix, distributed

Optimization techniques

Hash quantization: Reduce hash space by binning frequencies into 512 or 1024 bands (10-bit each) — keeps hashes in 32-bit integers.
Batch queries: Instead of one SQL query per hash, batch lookup all query hashes in a single WHERE hash IN (...) statement.
Bloom filter pre-screening: Before hitting the database, check a Bloom filter to skip hashes with zero matches.
Inverted index: Store fingerprints as hash → [(song_id, offset), ...] in memory-mapped files for zero-copy access.

Memory math

At 30K fingerprints per song × 12 bytes per entry (4 hash + 4 song_id + 4 offset), 1M songs requires ~360 GB raw. With compression and smarter encoding, practical systems fit 1M songs in 50–100 GB.

Robustness analysis

What survives fingerprinting

Lossy compression (MP3 128 kbps+)
Background noise up to ~0 dB SNR
Reverberation (moderate room echo)
Mono/stereo conversion
Volume changes
FM radio processing (compression, EQ)

What breaks fingerprints

Time stretching (changes delta_time in hashes)
Pitch shifting (changes frequency values in hashes)
Extreme low bitrate (< 64 kbps destroys spectral peaks)
Heavy remix or mashup (different peak constellations)

For pitch/tempo-invariant matching, research approaches use chroma fingerprints or self-similarity matrices instead of spectral peaks — at significant computational cost.

Advanced: chromaprint and AcoustID

The open-source alternative to Shazam’s approach is Chromaprint (used by AcoustID/MusicBrainz):

import chromaprint

duration, fingerprint = chromaprint.decode_audio_file("song.mp3")
# fingerprint is a list of 32-bit integers

Chromaprint uses chroma features instead of raw spectral peaks, making it more robust to audio quality variations but less precise for exact-recording matching.

# Using the acoustid Python library
import acoustid

for score, recording_id, title, artist in acoustid.match("API_KEY", "unknown.mp3"):
    print(f"{score:.0%} {artist} - {title}")

Building a complete system

class AudioFingerprinter:
    def __init__(self, db_path="fingerprints.db"):
        self.conn = sqlite3.connect(db_path)
        self._create_tables()
    
    def ingest(self, audio_path: str, song_name: str):
        y, sr = librosa.load(audio_path, sr=11025, mono=True)
        peaks = detect_peaks(compute_spectrogram(y, sr))
        hashes = generate_hashes(peaks)
        
        song_id = self._insert_song(song_name)
        self._insert_fingerprints(song_id, hashes)
        return len(hashes)
    
    def identify(self, audio_path: str, duration: float = 10.0):
        y, sr = librosa.load(audio_path, sr=11025, mono=True, duration=duration)
        peaks = detect_peaks(compute_spectrogram(y, sr))
        hashes = generate_hashes(peaks)
        
        return self._match(hashes)
    
    def identify_from_mic(self, seconds: float = 10.0):
        import sounddevice as sd
        recording = sd.rec(int(seconds * 11025), samplerate=11025, channels=1)
        sd.wait()
        peaks = detect_peaks(compute_spectrogram(recording.flatten(), 11025))
        hashes = generate_hashes(peaks)
        return self._match(hashes)

Tradeoffs

Approach	Latency	Accuracy	Robustness	Scale
Spectral landmarks (Shazam-style)	Fast	Exact match only	High (noise-resistant)	Millions of songs
Chromaprint (AcoustID)	Fast	Near-exact	Moderate	Open ecosystem
Neural embedding (VGGish, CLAP)	Slower	Similarity-based	Very high	Requires GPU

One thing to remember: A production audio fingerprinting system combines spectrogram peak extraction, combinatorial landmark hashing, and time-offset histogram matching — a pipeline that’s mathematically elegant, noise-robust, and scales from a hobby project to a billion-query service.

pythonaudiofingerprintingrecognitionshazamdsp