Python Audio Fingerprinting — Deep Dive

The Shazam algorithm in depth

The foundational paper is Avery Wang’s “An Industrial-Strength Audio Search Algorithm” (2003). The core insight: spectrogram peaks form a constellation that is unique to each recording and survives severe degradation (noise, compression, reverberation).

Step 1: Preprocessing

Convert to mono, resample to a standard rate (e.g., 8 000 Hz for efficiency or 11 025 Hz for better frequency resolution), and apply a high-pass filter to remove DC offset and low-frequency rumble.

import librosa
import numpy as np

y, sr = librosa.load("song.mp3", sr=11025, mono=True)
# High-pass filter at 100 Hz
from scipy.signal import butter, sosfilt
sos = butter(5, 100, btype='high', fs=sr, output='sos')
y = sosfilt(sos, y)

Step 2: Spectrogram computation

from scipy.signal import spectrogram

nperseg = 1024  # ~93 ms window at 11025 Hz
noverlap = nperseg // 2
f, t, Sxx = spectrogram(y, fs=sr, nperseg=nperseg, noverlap=noverlap)
Sxx_db = 10 * np.log10(Sxx + 1e-10)

The choice of window size affects the time-frequency tradeoff. Larger windows give better frequency resolution (important for distinguishing nearby harmonics) but worse time resolution. For fingerprinting, 1024 samples at 11 025 Hz is a good balance.

Step 3: Peak detection (constellation map)

from scipy.ndimage import maximum_filter, minimum_filter

def detect_peaks(Sxx_db, amp_min=10, neighborhood_size=20):
    """Find local maxima in the spectrogram."""
    local_max = maximum_filter(Sxx_db, size=neighborhood_size)
    
    # Point is a peak if it equals the local max and exceeds threshold
    detected = (Sxx_db == local_max) & (Sxx_db > amp_min)
    
    # Also require it stands out from local minimum
    local_min = minimum_filter(Sxx_db, size=neighborhood_size)
    detected = detected & ((Sxx_db - local_min) > amp_min)
    
    freq_idx, time_idx = np.where(detected)
    return list(zip(time_idx, freq_idx, Sxx_db[freq_idx, time_idx]))

The neighborhood_size parameter controls peak density. Too small: thousands of spurious peaks per second. Too large: you miss genuine features. Shazam reportedly uses an adaptive threshold per frequency band.

Step 4: Combinatorial hashing with target zones

For each peak (the “anchor”), define a target zone: a rectangular region offset in time from the anchor (e.g., 1–10 frames ahead, any frequency). Pair the anchor with every peak in its target zone.

def generate_hashes(peaks, fan_out=15, time_delta_min=1, time_delta_max=10):
    """Generate (hash, time_offset) pairs from constellation peaks."""
    # Sort by time
    peaks.sort(key=lambda p: p[0])
    
    hashes = []
    for i, (t1, f1, _) in enumerate(peaks):
        targets = 0
        for j in range(i + 1, len(peaks)):
            t2, f2, _ = peaks[j]
            dt = t2 - t1
            if dt < time_delta_min:
                continue
            if dt > time_delta_max:
                break
            
            # Hash: (freq1, freq2, delta_time) → 32-bit integer
            h = (f1 & 0x3FF) << 20 | (f2 & 0x3FF) << 10 | (dt & 0x3FF)
            hashes.append((h, t1))
            
            targets += 1
            if targets >= fan_out:
                break
    
    return hashes

The fan_out controls how many pairs per anchor. More pairs mean better recall at the cost of database size and query time. 15–20 is typical.

Step 5: Database storage

CREATE TABLE fingerprints (
    hash       INTEGER NOT NULL,
    song_id    INTEGER NOT NULL,
    time_offset INTEGER NOT NULL,
    INDEX idx_hash (hash)
);

For each song, insert all (hash, song_id, time_offset) tuples. A 3-minute song at 11 025 Hz typically produces 10 000–50 000 fingerprint entries depending on peak density and fan-out.

Step 6: Query and matching

from collections import defaultdict, Counter

def identify(query_hashes, db_cursor):
    """Match query hashes against the database."""
    # Collect all candidate matches
    candidates = defaultdict(list)
    
    for q_hash, q_time in query_hashes:
        db_cursor.execute("SELECT song_id, time_offset FROM fingerprints WHERE hash = %s", (q_hash,))
        for song_id, db_time in db_cursor.fetchall():
            # Time alignment: the difference should be constant for true matches
            delta = db_time - q_time
            candidates[song_id].append(delta)
    
    # Score: count how many hashes align at the same time offset
    scores = {}
    for song_id, deltas in candidates.items():
        offset_counts = Counter(deltas)
        best_offset, best_count = offset_counts.most_common(1)[0]
        scores[song_id] = best_count
    
    if not scores:
        return None
    
    best_song = max(scores, key=scores.get)
    return best_song, scores[best_song]

The key insight: for a true match, all matching hashes will share the same time offset (the difference between the query’s position in the clip and the song’s absolute time). A histogram of offsets with a sharp peak indicates a confident match.

Scaling to millions of tracks

Database choice

ScaleRecommended storeNotes
< 10K songsSQLiteSimple, single file, fast enough
10K–100K songsPostgreSQL with hash indexB-tree on hash column, connection pooling
100K–1M songsRedis (hash → sorted set)In-memory, sub-millisecond lookups
1M+ songsCustom sharded storePartition by hash prefix, distributed

Optimization techniques

  1. Hash quantization: Reduce hash space by binning frequencies into 512 or 1024 bands (10-bit each) — keeps hashes in 32-bit integers.
  2. Batch queries: Instead of one SQL query per hash, batch lookup all query hashes in a single WHERE hash IN (...) statement.
  3. Bloom filter pre-screening: Before hitting the database, check a Bloom filter to skip hashes with zero matches.
  4. Inverted index: Store fingerprints as hash → [(song_id, offset), ...] in memory-mapped files for zero-copy access.

Memory math

At 30K fingerprints per song × 12 bytes per entry (4 hash + 4 song_id + 4 offset), 1M songs requires ~360 GB raw. With compression and smarter encoding, practical systems fit 1M songs in 50–100 GB.

Robustness analysis

What survives fingerprinting

  • Lossy compression (MP3 128 kbps+)
  • Background noise up to ~0 dB SNR
  • Reverberation (moderate room echo)
  • Mono/stereo conversion
  • Volume changes
  • FM radio processing (compression, EQ)

What breaks fingerprints

  • Time stretching (changes delta_time in hashes)
  • Pitch shifting (changes frequency values in hashes)
  • Extreme low bitrate (< 64 kbps destroys spectral peaks)
  • Heavy remix or mashup (different peak constellations)

For pitch/tempo-invariant matching, research approaches use chroma fingerprints or self-similarity matrices instead of spectral peaks — at significant computational cost.

Advanced: chromaprint and AcoustID

The open-source alternative to Shazam’s approach is Chromaprint (used by AcoustID/MusicBrainz):

import chromaprint

duration, fingerprint = chromaprint.decode_audio_file("song.mp3")
# fingerprint is a list of 32-bit integers

Chromaprint uses chroma features instead of raw spectral peaks, making it more robust to audio quality variations but less precise for exact-recording matching.

# Using the acoustid Python library
import acoustid

for score, recording_id, title, artist in acoustid.match("API_KEY", "unknown.mp3"):
    print(f"{score:.0%} {artist} - {title}")

Building a complete system

class AudioFingerprinter:
    def __init__(self, db_path="fingerprints.db"):
        self.conn = sqlite3.connect(db_path)
        self._create_tables()
    
    def ingest(self, audio_path: str, song_name: str):
        y, sr = librosa.load(audio_path, sr=11025, mono=True)
        peaks = detect_peaks(compute_spectrogram(y, sr))
        hashes = generate_hashes(peaks)
        
        song_id = self._insert_song(song_name)
        self._insert_fingerprints(song_id, hashes)
        return len(hashes)
    
    def identify(self, audio_path: str, duration: float = 10.0):
        y, sr = librosa.load(audio_path, sr=11025, mono=True, duration=duration)
        peaks = detect_peaks(compute_spectrogram(y, sr))
        hashes = generate_hashes(peaks)
        
        return self._match(hashes)
    
    def identify_from_mic(self, seconds: float = 10.0):
        import sounddevice as sd
        recording = sd.rec(int(seconds * 11025), samplerate=11025, channels=1)
        sd.wait()
        peaks = detect_peaks(compute_spectrogram(recording.flatten(), 11025))
        hashes = generate_hashes(peaks)
        return self._match(hashes)

Tradeoffs

ApproachLatencyAccuracyRobustnessScale
Spectral landmarks (Shazam-style)FastExact match onlyHigh (noise-resistant)Millions of songs
Chromaprint (AcoustID)FastNear-exactModerateOpen ecosystem
Neural embedding (VGGish, CLAP)SlowerSimilarity-basedVery highRequires GPU

One thing to remember: A production audio fingerprinting system combines spectrogram peak extraction, combinatorial landmark hashing, and time-offset histogram matching — a pipeline that’s mathematically elegant, noise-robust, and scales from a hobby project to a billion-query service.

pythonaudiofingerprintingrecognitionshazamdsp

See Also

  • Python Arcade Library Think of a magical art table that draws your game characters, listens when you press buttons, and cleans up the mess — that's Python Arcade.
  • Python Barcode Generation Picture the stripy labels on grocery items to understand how Python can create those machine-readable barcodes from numbers.
  • Python Cellular Automata Imagine a checkerboard where each square follows simple rules to turn on or off — and suddenly complex patterns emerge like magic.
  • Python Godot Gdscript Bridge Imagine speaking English to a friend who speaks French, with a translator in the middle — that's how Python talks to the Godot game engine.
  • Python Librosa Audio Analysis Picture a music detective that can look at any song and tell you exactly what notes, beats, and moods are hiding inside — that's what Librosa does for Python.