Python Audio Fingerprinting — Core Concepts

Learn how audio fingerprinting works — spectrogram peak extraction, hash generation, database matching — and how to implement it in Python with dejavu and custom pipelines.

What audio fingerprinting does

Audio fingerprinting creates a compact, searchable representation of an audio recording. Given a short sample, even one recorded through a phone in a noisy room, the system can match it against a database of known songs, often in milliseconds. Shazam, SoundHound, YouTube Content ID, and Spotify are commonly cited as using variations of this technique.

The pipeline

1. Spectrogram generation

The audio signal is converted to a time-frequency representation using the Short-Time Fourier Transform (STFT). The result is a 2-D matrix where the x-axis is time, the y-axis is frequency, and the value is amplitude (or power in decibels).
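As a minimal sketch of this step, the snippet below computes a decibel-scaled spectrogram with scipy.signal.stft. The helper name db_spectrogram, the window settings, and the synthetic 440 Hz test tone are illustrative assumptions, not part of any particular fingerprinting library.

```python
import numpy as np
from scipy.signal import stft

def db_spectrogram(audio, sr, nperseg=1024, noverlap=512):
    """Return (freqs_hz, times_s, power_db) for a mono signal (hypothetical helper)."""
    freqs, times, Z = stft(audio, fs=sr, nperseg=nperseg, noverlap=noverlap)
    power = np.abs(Z) ** 2                    # squared magnitude of each STFT bin
    power_db = 10 * np.log10(power + 1e-10)   # small floor avoids log(0)
    return freqs, times, power_db

# Synthetic input: one second of a 440 Hz tone sampled at 8 kHz
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)

freqs, times, power_db = db_spectrogram(tone, sr)
# Rows index frequency, columns index time
loudest_bin = power_db.mean(axis=1).argmax()
print(power_db.shape, freqs[loudest_bin])  # loudest bin lies within one bin width of 440 Hz
```

With nperseg=1024 at 8 kHz, each frequency bin is about 7.8 Hz wide, so the loudest bin can only land near 440 Hz, not exactly on it. Longer windows sharpen frequency resolution at the cost of time resolution.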

2. Peak extraction

From the spectrogram, the system finds local maxima — points that are louder than their neighbors in both time and frequency. These peaks correspond to the most prominent tonal events in the recording. A typical 3-minute song yields a few thousand peaks.

Peaks are robust because:

Background noise is usually broadband and doesn’t create sharp peaks
Recording quality differences (phone mic vs studio) attenuate the overall level but preserve relative peak positions
Peaks from different overlapping sounds rarely collide exactly
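The level-invariance point can be checked with a small experiment. The sketch below is an illustration under simplifying assumptions: it models a "worse recording" only as a uniform gain change (a real phone microphone also colors the spectrum and adds noise), and the helper name peak_mask and the test signal are made up for the example.

```python
import numpy as np
from scipy.ndimage import maximum_filter
from scipy.signal import spectrogram

def peak_mask(audio, sr, size=15):
    """Boolean mask of bins that equal the maximum of their neighborhood."""
    _, _, power = spectrogram(audio, fs=sr, nperseg=512, noverlap=256)
    return power == maximum_filter(power, size=size)

rng = np.random.default_rng(0)
sr = 8000
t = np.arange(2 * sr) / sr
# Two tones plus faint broadband noise as a stand-in for a recording
clean = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)
clean += 0.01 * rng.standard_normal(t.size)

quiet = 0.5 * clean  # same recording, about 6 dB quieter
same_peaks = np.array_equal(peak_mask(clean, sr), peak_mask(quiet, sr))
print(same_peaks)
```

Scaling the signal multiplies every spectrogram bin by the same factor, so which bins are local maxima does not change. An absolute loudness threshold would break this property, which is one reason to threshold relative to the loudest bin.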

3. Combinatorial hashing

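Peaks alone aren’t enough: you need a way to search them fast. The standard approach (from the landmark Shazam paper by Avery Wang, 2003) pairs each peak (the anchor) with a handful of nearby peaks that follow it in time, forming landmark pairs. The number of partners per anchor is the fan-out.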

Each pair produces a hash from three values:

Frequency of peak A
Frequency of peak B
Time difference between A and B

This hash, combined with the absolute time of peak A, forms a fingerprint entry.
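The pairing and hashing described above can be sketched as follows. This is one possible encoding, not the exact scheme from the paper: the function name hash_peak_pairs, the fan_out and max_dt defaults, and the choice of a truncated SHA-1 string as the hash are all assumptions made for illustration.

```python
import hashlib

def hash_peak_pairs(peaks, fan_out=5, max_dt=200):
    """Yield (hash_hex, anchor_time) for each landmark pair.

    peaks: iterable of (time_idx, freq_idx) tuples.
    """
    peaks = sorted(peaks)  # order by time, then frequency
    for i, (t1, f1) in enumerate(peaks):
        # Pair the anchor with up to fan_out peaks that follow it
        for t2, f2 in peaks[i + 1 : i + 1 + fan_out]:
            dt = t2 - t1
            if 0 < dt <= max_dt:
                key = f"{f1}|{f2}|{dt}".encode()
                # Truncate the digest to keep the stored key compact
                yield hashlib.sha1(key).hexdigest()[:20], t1

peaks = [(0, 40), (3, 52), (7, 40), (12, 61)]
fingerprints = list(hash_peak_pairs(peaks, fan_out=2))
print(len(fingerprints))  # 5 pairs for these four peaks
```

Because the hash uses only the two frequencies and the time difference, the same passage produces the same hashes no matter where it occurs in the recording. The absolute anchor time is carried alongside the hash for the alignment check at lookup.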

4. Database storage and lookup

For each known song, all fingerprint hashes are stored in a database (key = hash, value = song ID + time offset). When a query arrives, its hashes are looked up. If many hashes from the query match the same song, and the difference between the stored offset and the query offset is the same for all of them, that song is identified.

The time-offset consistency check is critical — it distinguishes true matches (many hashes aligned in time) from random collisions.
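A toy in-memory version of the lookup and the offset check might look like this. It is a sketch that assumes a plain dictionary in place of a real database; the names build_index and best_match and the hand-written hashes "h1" to "h5" are hypothetical.

```python
from collections import Counter, defaultdict

def build_index(songs):
    """Map {song_id: [(hash, offset), ...]} to {hash: [(song_id, offset), ...]}."""
    index = defaultdict(list)
    for song_id, fingerprints in songs.items():
        for h, offset in fingerprints:
            index[h].append((song_id, offset))
    return index

def best_match(index, query_fingerprints):
    """Return (song_id, offset_delta, votes) for the most consistent alignment."""
    votes = Counter()
    for h, query_offset in query_fingerprints:
        for song_id, db_offset in index.get(h, []):
            # True matches share one delta; random collisions scatter
            votes[(song_id, db_offset - query_offset)] += 1
    if not votes:
        return None
    (song_id, delta), count = votes.most_common(1)[0]
    return song_id, delta, count

songs = {
    "song_a": [("h1", 10), ("h2", 14), ("h3", 21), ("h4", 30)],
    "song_b": [("h2", 5), ("h5", 9), ("h1", 40)],
}
index = build_index(songs)
# The query starts 10 frames into song_a, so every true match has delta 10
query = [("h1", 0), ("h2", 4), ("h3", 11)]
print(best_match(index, query))  # ('song_a', 10, 3)
```

Here song_b also contains two of the query hashes, but at deltas 40 and 1, so each gets a single vote. Counting votes per (song, delta) pair rather than per song is what filters out such collisions.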

Python implementation options

Dejavu

Dejavu is the most popular open-source Python audio fingerprinting library. It handles fingerprinting, database storage (MySQL or PostgreSQL), and recognition:

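# Import path and config keys follow recent Dejavu releases; older
# releases use "from dejavu.recognize import FileRecognizer" and
# different database config keys, so check the version you install.
from dejavu import Dejavu
from dejavu.logic.recognizer.file_recognizer import FileRecognizer

config = {
    "database": {"host": "localhost", "user": "root", "password": "pass", "database": "dejavu"},
    "database_type": "mysql"
}

# Connect to the database and fingerprint every matching file in the folder
djv = Dejavu(config)
djv.fingerprint_directory("music_library/", [".mp3", ".wav"])

# Match an unknown clip against the stored fingerprints
result = djv.recognize(FileRecognizer, "unknown_sample.wav")
print(result)  # match details such as song name and confidence; exact keys vary by version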

Custom implementation

For learning or specialized use cases, build from scratch with SciPy:

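from scipy.ndimage import maximum_filter
from scipy.signal import spectrogram
import numpy as np

def find_peaks(audio, sr, threshold_db=-60, neighborhood=20):
    """Return (time_idx, freq_idx) pairs for local spectrogram maxima."""
    f, t, Sxx = spectrogram(audio, fs=sr, nperseg=1024, noverlap=512)
    # Express power in dB relative to the loudest bin, so the threshold
    # does not depend on how the input samples are scaled
    Sxx_db = 10 * np.log10(Sxx + 1e-10)
    Sxx_db -= Sxx_db.max()

    # A bin is a peak if it equals the maximum of its neighborhood
    # and is no more than |threshold_db| below the loudest bin
    local_max = maximum_filter(Sxx_db, size=neighborhood)
    peaks = (Sxx_db == local_max) & (Sxx_db > threshold_db)

    # Rows of Sxx index frequency, columns index time
    freq_idx, time_idx = np.where(peaks)
    return list(zip(time_idx.tolist(), freq_idx.tolist()))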
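find_peaks returns index pairs, not seconds and hertz. If physical units are needed (for plotting or debugging, say), they can be recovered from the window settings. The helper below is a sketch that assumes the same nperseg and noverlap as find_peaks and scipy.signal.spectrogram's convention of reporting each frame at its window center; peak_to_units is a made-up name.

```python
import numpy as np
from scipy.signal import spectrogram

NPERSEG, NOVERLAP = 1024, 512  # must match the settings used in find_peaks

def peak_to_units(time_idx, freq_idx, sr, nperseg=NPERSEG, noverlap=NOVERLAP):
    """Convert spectrogram indices to (seconds, hertz)."""
    hop = nperseg - noverlap
    seconds = (nperseg / 2 + time_idx * hop) / sr  # center of the frame
    hertz = freq_idx * sr / nperseg                # bin spacing is sr / nperseg
    return seconds, hertz

# Cross-check against the axes SciPy itself reports
f, t, _ = spectrogram(np.zeros(8000), fs=8000, nperseg=NPERSEG, noverlap=NOVERLAP)
seconds, hertz = peak_to_units(3, 56, sr=8000)
print(seconds, hertz)  # 0.256 437.5
```

For hashing, the raw indices are usually enough, since only frequency bins and frame differences enter the hash.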

Matching accuracy factors

Factor	Effect on accuracy
Peak extraction threshold	Too low: too many peaks (slow, false matches). Too high: misses quiet features
Fan-out (pairs per peak)	More pairs: better recall, larger database. Fewer: faster but may miss matches
Query length	5+ seconds typically needed for reliable matching
Noise level	Often still works down to about 0 dB SNR (signal and noise equally loud)
Time stretching and pitch shifting	Change peak spacing or peak frequencies, so tempo-changed or pitch-shifted versions won’t match standard fingerprints
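To measure how noise affects your own pipeline, it helps to generate test queries at a known SNR. The helper below is a minimal sketch; mix_at_snr is a hypothetical name, and white Gaussian noise stands in for real room noise, which is an assumption that flatters most systems.

```python
import numpy as np

def mix_at_snr(signal, noise, snr_db):
    """Scale noise so signal power / noise power equals snr_db, then add it."""
    signal = np.asarray(signal, dtype=float)
    noise = np.asarray(noise, dtype=float)[: signal.size]
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    # Solve p_signal / (gain**2 * p_noise) = 10 ** (snr_db / 10) for gain
    gain = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return signal + gain * noise

rng = np.random.default_rng(0)
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)

# 0 dB SNR: the added noise carries as much power as the signal
noisy = mix_at_snr(tone, rng.standard_normal(sr), snr_db=0)
```

Running the same query clip through this at, for example, 10, 5, 0, and -5 dB and counting correct identifications gives a rough accuracy curve for a given threshold and fan-out.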

Common misconception

Audio fingerprinting is not the same as audio classification or genre detection. Fingerprinting identifies exact recordings (or close variants). It cannot tell you that two different performances of the same song are “the same piece of music” — for that you need higher-level analysis like melody extraction or harmonic comparison.

How it fits with other tools

Use Librosa or SciPy to compute spectrograms and extract peaks. Store fingerprints in PostgreSQL, Redis, or SQLite depending on scale. For the recognition frontend, combine with sounddevice for mic input or process uploaded files via FastAPI.

One thing to remember: Audio fingerprinting works by extracting spectrogram peaks, hashing pairs of peaks into compact codes, and matching those codes against a database — making it resilient to noise and enabling millisecond-speed identification.
