Python Audio Fingerprinting — Deep Dive
The Shazam algorithm in depth
The foundational paper is Avery Wang’s “An Industrial-Strength Audio Search Algorithm” (2003). The core insight: spectrogram peaks form a constellation that is unique to each recording and survives severe degradation (noise, compression, reverberation).
Step 1: Preprocessing
Convert to mono, resample to a standard rate (e.g., 8 000 Hz for efficiency or 11 025 Hz for better frequency resolution), and apply a high-pass filter to remove DC offset and low-frequency rumble.
import librosa
import numpy as np
y, sr = librosa.load("song.mp3", sr=11025, mono=True)
# High-pass filter at 100 Hz
from scipy.signal import butter, sosfilt
sos = butter(5, 100, btype='high', fs=sr, output='sos')
y = sosfilt(sos, y)
Step 2: Spectrogram computation
from scipy.signal import spectrogram
nperseg = 1024 # ~93 ms window at 11025 Hz
noverlap = nperseg // 2
f, t, Sxx = spectrogram(y, fs=sr, nperseg=nperseg, noverlap=noverlap)
Sxx_db = 10 * np.log10(Sxx + 1e-10)
The choice of window size affects the time-frequency tradeoff. Larger windows give better frequency resolution (important for distinguishing nearby harmonics) but worse time resolution. For fingerprinting, 1024 samples at 11 025 Hz is a good balance.
Step 3: Peak detection (constellation map)
from scipy.ndimage import maximum_filter, minimum_filter
def detect_peaks(Sxx_db, amp_min=10, neighborhood_size=20):
"""Find local maxima in the spectrogram."""
local_max = maximum_filter(Sxx_db, size=neighborhood_size)
# Point is a peak if it equals the local max and exceeds threshold
detected = (Sxx_db == local_max) & (Sxx_db > amp_min)
# Also require it stands out from local minimum
local_min = minimum_filter(Sxx_db, size=neighborhood_size)
detected = detected & ((Sxx_db - local_min) > amp_min)
freq_idx, time_idx = np.where(detected)
return list(zip(time_idx, freq_idx, Sxx_db[freq_idx, time_idx]))
The neighborhood_size parameter controls peak density. Too small: thousands of spurious peaks per second. Too large: you miss genuine features. Shazam reportedly uses an adaptive threshold per frequency band.
Step 4: Combinatorial hashing with target zones
For each peak (the “anchor”), define a target zone: a rectangular region offset in time from the anchor (e.g., 1–10 frames ahead, any frequency). Pair the anchor with every peak in its target zone.
def generate_hashes(peaks, fan_out=15, time_delta_min=1, time_delta_max=10):
"""Generate (hash, time_offset) pairs from constellation peaks."""
# Sort by time
peaks.sort(key=lambda p: p[0])
hashes = []
for i, (t1, f1, _) in enumerate(peaks):
targets = 0
for j in range(i + 1, len(peaks)):
t2, f2, _ = peaks[j]
dt = t2 - t1
if dt < time_delta_min:
continue
if dt > time_delta_max:
break
# Hash: (freq1, freq2, delta_time) → 32-bit integer
h = (f1 & 0x3FF) << 20 | (f2 & 0x3FF) << 10 | (dt & 0x3FF)
hashes.append((h, t1))
targets += 1
if targets >= fan_out:
break
return hashes
The fan_out controls how many pairs per anchor. More pairs mean better recall at the cost of database size and query time. 15–20 is typical.
Step 5: Database storage
CREATE TABLE fingerprints (
hash INTEGER NOT NULL,
song_id INTEGER NOT NULL,
time_offset INTEGER NOT NULL,
INDEX idx_hash (hash)
);
For each song, insert all (hash, song_id, time_offset) tuples. A 3-minute song at 11 025 Hz typically produces 10 000–50 000 fingerprint entries depending on peak density and fan-out.
Step 6: Query and matching
from collections import defaultdict, Counter
def identify(query_hashes, db_cursor):
"""Match query hashes against the database."""
# Collect all candidate matches
candidates = defaultdict(list)
for q_hash, q_time in query_hashes:
db_cursor.execute("SELECT song_id, time_offset FROM fingerprints WHERE hash = %s", (q_hash,))
for song_id, db_time in db_cursor.fetchall():
# Time alignment: the difference should be constant for true matches
delta = db_time - q_time
candidates[song_id].append(delta)
# Score: count how many hashes align at the same time offset
scores = {}
for song_id, deltas in candidates.items():
offset_counts = Counter(deltas)
best_offset, best_count = offset_counts.most_common(1)[0]
scores[song_id] = best_count
if not scores:
return None
best_song = max(scores, key=scores.get)
return best_song, scores[best_song]
The key insight: for a true match, all matching hashes will share the same time offset (the difference between the query’s position in the clip and the song’s absolute time). A histogram of offsets with a sharp peak indicates a confident match.
Scaling to millions of tracks
Database choice
| Scale | Recommended store | Notes |
|---|---|---|
| < 10K songs | SQLite | Simple, single file, fast enough |
| 10K–100K songs | PostgreSQL with hash index | B-tree on hash column, connection pooling |
| 100K–1M songs | Redis (hash → sorted set) | In-memory, sub-millisecond lookups |
| 1M+ songs | Custom sharded store | Partition by hash prefix, distributed |
Optimization techniques
- Hash quantization: Reduce hash space by binning frequencies into 512 or 1024 bands (10-bit each) — keeps hashes in 32-bit integers.
- Batch queries: Instead of one SQL query per hash, batch lookup all query hashes in a single
WHERE hash IN (...)statement. - Bloom filter pre-screening: Before hitting the database, check a Bloom filter to skip hashes with zero matches.
- Inverted index: Store fingerprints as
hash → [(song_id, offset), ...]in memory-mapped files for zero-copy access.
Memory math
At 30K fingerprints per song × 12 bytes per entry (4 hash + 4 song_id + 4 offset), 1M songs requires ~360 GB raw. With compression and smarter encoding, practical systems fit 1M songs in 50–100 GB.
Robustness analysis
What survives fingerprinting
- Lossy compression (MP3 128 kbps+)
- Background noise up to ~0 dB SNR
- Reverberation (moderate room echo)
- Mono/stereo conversion
- Volume changes
- FM radio processing (compression, EQ)
What breaks fingerprints
- Time stretching (changes delta_time in hashes)
- Pitch shifting (changes frequency values in hashes)
- Extreme low bitrate (< 64 kbps destroys spectral peaks)
- Heavy remix or mashup (different peak constellations)
For pitch/tempo-invariant matching, research approaches use chroma fingerprints or self-similarity matrices instead of spectral peaks — at significant computational cost.
Advanced: chromaprint and AcoustID
The open-source alternative to Shazam’s approach is Chromaprint (used by AcoustID/MusicBrainz):
import chromaprint
duration, fingerprint = chromaprint.decode_audio_file("song.mp3")
# fingerprint is a list of 32-bit integers
Chromaprint uses chroma features instead of raw spectral peaks, making it more robust to audio quality variations but less precise for exact-recording matching.
# Using the acoustid Python library
import acoustid
for score, recording_id, title, artist in acoustid.match("API_KEY", "unknown.mp3"):
print(f"{score:.0%} {artist} - {title}")
Building a complete system
class AudioFingerprinter:
def __init__(self, db_path="fingerprints.db"):
self.conn = sqlite3.connect(db_path)
self._create_tables()
def ingest(self, audio_path: str, song_name: str):
y, sr = librosa.load(audio_path, sr=11025, mono=True)
peaks = detect_peaks(compute_spectrogram(y, sr))
hashes = generate_hashes(peaks)
song_id = self._insert_song(song_name)
self._insert_fingerprints(song_id, hashes)
return len(hashes)
def identify(self, audio_path: str, duration: float = 10.0):
y, sr = librosa.load(audio_path, sr=11025, mono=True, duration=duration)
peaks = detect_peaks(compute_spectrogram(y, sr))
hashes = generate_hashes(peaks)
return self._match(hashes)
def identify_from_mic(self, seconds: float = 10.0):
import sounddevice as sd
recording = sd.rec(int(seconds * 11025), samplerate=11025, channels=1)
sd.wait()
peaks = detect_peaks(compute_spectrogram(recording.flatten(), 11025))
hashes = generate_hashes(peaks)
return self._match(hashes)
Tradeoffs
| Approach | Latency | Accuracy | Robustness | Scale |
|---|---|---|---|---|
| Spectral landmarks (Shazam-style) | Fast | Exact match only | High (noise-resistant) | Millions of songs |
| Chromaprint (AcoustID) | Fast | Near-exact | Moderate | Open ecosystem |
| Neural embedding (VGGish, CLAP) | Slower | Similarity-based | Very high | Requires GPU |
One thing to remember: A production audio fingerprinting system combines spectrogram peak extraction, combinatorial landmark hashing, and time-offset histogram matching — a pipeline that’s mathematically elegant, noise-robust, and scales from a hobby project to a billion-query service.
See Also
- Python Arcade Library Think of a magical art table that draws your game characters, listens when you press buttons, and cleans up the mess — that's Python Arcade.
- Python Barcode Generation Picture the stripy labels on grocery items to understand how Python can create those machine-readable barcodes from numbers.
- Python Cellular Automata Imagine a checkerboard where each square follows simple rules to turn on or off — and suddenly complex patterns emerge like magic.
- Python Godot Gdscript Bridge Imagine speaking English to a friend who speaks French, with a translator in the middle — that's how Python talks to the Godot game engine.
- Python Librosa Audio Analysis Picture a music detective that can look at any song and tell you exactly what notes, beats, and moods are hiding inside — that's what Librosa does for Python.