Python Librosa Audio Analysis — Deep Dive
Architecture overview
Librosa is organized into submodules: core (loading, STFT, resampling), feature (spectral and rhythm descriptors), beat (tempo and beat tracking), effects (time-stretch, pitch-shift, HPSS), decompose (NMF, RPCA), segment (structural analysis), and display (matplotlib helpers). All functions operate on plain NumPy arrays, making integration with other scientific Python tools trivial.
The STFT in detail
import librosa
import numpy as np
y, sr = librosa.load("track.wav", sr=22050)
S = librosa.stft(y, n_fft=2048, hop_length=512, win_length=2048, window="hann")
n_fft controls the FFT size (frequency resolution). hop_length controls the stride between windows (time resolution). A 2048-point FFT at 22 050 Hz gives frequency bins roughly 10.8 Hz apart (22 050 / 2048). Each 2048-sample frame covers about 93 ms, and with a hop of 512 successive frames start about 23 ms apart, yielding ~43 frames per second.
The output shape is (1 + n_fft/2, T) — 1025 frequency bins by T time frames. Each entry is a complex number; magnitude gives energy, phase gives timing.
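The arithmetic behind these numbers is easy to check by hand. The sketch below uses a hypothetical helper, stft_geometry, which is not part of librosa, and assumes the default center=True padding:

```python
def stft_geometry(sr, n_fft, hop_length, n_samples):
    """Return the basic time/frequency layout of a centered STFT."""
    return {
        "bin_spacing_hz": sr / n_fft,
        "frame_length_ms": 1000 * n_fft / sr,
        "hop_ms": 1000 * hop_length / sr,
        "frames_per_second": sr / hop_length,
        "n_bins": 1 + n_fft // 2,
        # Assumes center=True: the signal is padded so that frame t is
        # centered on sample t * hop_length.
        "n_frames": 1 + n_samples // hop_length,
    }


# Ten seconds of audio at the settings used above.
geometry = stft_geometry(sr=22050, n_fft=2048, hop_length=512, n_samples=10 * 22050)
for name, value in geometry.items():
    print(f"{name}: {value:.2f}")
```

Doubling n_fft halves the bin spacing but doubles the frame length, which is the time/frequency tradeoff in its simplest form.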
Windowing tradeoffs
The default Hann window balances main-lobe width and side-lobe suppression. For tasks where leakage from strong neighboring partials is the main problem (e.g., tuning estimation), a Blackman-Harris window suppresses side lobes much further, at the cost of a wider main lobe and therefore coarser separation of closely spaced frequencies. Librosa accepts any window that scipy.signal.get_window understands.
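One rough way to put a number on "wider main lobe" is the equivalent noise bandwidth (ENBW), measured in FFT bins. The sketch below computes it in plain Python for the two windows. ENBW is used here only as a proxy for main-lobe width, and the helper names are illustrative rather than librosa or SciPy functions:

```python
import math


def hann(n_points):
    """Periodic Hann window (the FFT-friendly form)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_points) for n in range(n_points)]


def blackman_harris(n_points):
    """Periodic 4-term Blackman-Harris window."""
    a0, a1, a2, a3 = 0.35875, 0.48829, 0.14128, 0.01168
    return [
        a0
        - a1 * math.cos(2 * math.pi * n / n_points)
        + a2 * math.cos(4 * math.pi * n / n_points)
        - a3 * math.cos(6 * math.pi * n / n_points)
        for n in range(n_points)
    ]


def enbw_bins(window):
    """Equivalent noise bandwidth in FFT bins: N * sum(w^2) / sum(w)^2."""
    total = sum(window)
    power = sum(w * w for w in window)
    return len(window) * power / (total * total)


hann_enbw = enbw_bins(hann(2048))
bh_enbw = enbw_bins(blackman_harris(2048))
print(f"Hann ENBW: {hann_enbw:.3f} bins")
print(f"Blackman-Harris ENBW: {bh_enbw:.3f} bins")
```

The Blackman-Harris window spreads each sinusoid over roughly a third more bandwidth than Hann, which is the price paid for its much lower side lobes.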
Mel spectrogram internals
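The mel scale converts linear frequencies to a perceptual scale. The widely quoted HTK form is:
mel = 2595 * log10(1 + f / 700)
Note that librosa uses the Slaney variant of the scale by default (linear below 1 kHz, logarithmic above); pass htk=True to librosa.filters.mel() to get the formula above.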
librosa.filters.mel() builds a bank of triangular filters spaced linearly in mel. Multiplying the power spectrogram by this filter bank yields a mel spectrogram with n_mels bands (default 128). Each band integrates energy over a frequency range that widens as pitch increases.
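To see why the bands widen with frequency, the sketch below places filter edges at equal mel spacing and converts them back to Hz. It uses the HTK formula in plain Python. The function names are illustrative, and librosa's own filter bank adds normalization and, by default, uses a different mel variant:

```python
import math


def hz_to_mel_htk(freq_hz):
    """HTK mel formula: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz_htk(mel):
    """Inverse of hz_to_mel_htk."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_mels, fmin, fmax):
    """Return n_mels + 2 edge frequencies in Hz, equally spaced in mel."""
    mel_min, mel_max = hz_to_mel_htk(fmin), hz_to_mel_htk(fmax)
    step = (mel_max - mel_min) / (n_mels + 1)
    return [mel_to_hz_htk(mel_min + i * step) for i in range(n_mels + 2)]


edges = mel_band_edges(n_mels=8, fmin=0.0, fmax=11025.0)
# Triangular filter i rises from edges[i] to edges[i + 1] and falls to edges[i + 2].
widths = [edges[i + 2] - edges[i] for i in range(8)]
for i, width in enumerate(widths):
    print(f"band {i}: center {edges[i + 1]:8.1f} Hz, width {width:8.1f} Hz")
```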
S_mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128)
S_db = librosa.power_to_db(S_mel, ref=np.max)
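For intuition about what power_to_db does with ref=np.max, here is a plain-Python sketch on a flat list of power values. It assumes librosa's documented defaults amin=1e-10 and top_db=80, and it illustrates the conversion rather than reproducing the library code:

```python
import math


def power_to_db_sketch(power, amin=1e-10, top_db=80.0):
    """Convert power values to dB relative to their maximum, with a floor."""
    ref = max(power)
    log_ref = 10.0 * math.log10(max(amin, ref))
    # amin guards against log(0) for silent bins.
    db = [10.0 * math.log10(max(amin, p)) - log_ref for p in power]
    # Anything more than top_db below the peak is clipped to the floor.
    floor = max(db) - top_db
    return [max(value, floor) for value in db]


db_values = power_to_db_sketch([1.0, 0.1, 1e-12])
print(db_values)
```

The loudest bin maps to 0 dB, a bin with a tenth of the power maps to -10 dB, and near-silent bins are clipped at -80 dB, which keeps the dynamic range bounded for plotting and for models.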
Choosing n_mels is task-dependent. Speech recognition models typically use 40–80 bands; music analysis benefits from 128 or more.
MFCCs — from spectrum to cepstrum
MFCCs apply a Discrete Cosine Transform (DCT) to the log-mel spectrogram:
- Compute mel spectrogram.
- Take the log.
- Apply DCT, keep the first n_mfcc coefficients (commonly 13 or 20).
The DCT decorrelates the mel bands, compressing spectral shape into a compact representation. The first coefficient captures overall energy, the second captures spectral tilt, and higher coefficients encode finer spectral detail.
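A small worked example makes the roles of the first two coefficients concrete. The sketch below implements an orthonormal DCT-II in plain Python (the transform type librosa.feature.mfcc uses by default) and applies it to two made-up log-mel frames. The frame values are invented for illustration:

```python
import math


def dct2_ortho(values):
    """Orthonormal DCT-II of a single log-mel frame."""
    n = len(values)
    coeffs = []
    for k in range(n):
        total = sum(
            v * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
            for i, v in enumerate(values)
        )
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * total)
    return coeffs


# A flat spectrum: the same level (in dB) in all eight bands.
flat_frame = [-20.0] * 8
# A tilted spectrum: level falls by 2 dB per band toward high frequencies.
tilted_frame = [-2.0 * band for band in range(8)]

flat_coeffs = dct2_ortho(flat_frame)
tilted_coeffs = dct2_ortho(tilted_frame)
print("flat:  ", [round(c, 3) for c in flat_coeffs])
print("tilted:", [round(c, 3) for c in tilted_coeffs])
```

The flat frame puts all of its energy into coefficient 0, while the tilted frame also produces a clearly non-zero coefficient 1, matching the "overall energy" and "spectral tilt" reading.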
Delta and delta-delta MFCCs (librosa.feature.delta) add velocity and acceleration of the cepstral trajectory, improving performance in speech recognition.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
mfcc_delta = librosa.feature.delta(mfcc)
mfcc_delta2 = librosa.feature.delta(mfcc, order=2)
features = np.vstack([mfcc, mfcc_delta, mfcc_delta2]) # 39 features per frame
Harmonic-percussive source separation (HPSS)
HPSS splits a spectrogram into harmonic (sustained tones) and percussive (transients) components using median filtering:
y_harmonic, y_percussive = librosa.effects.hpss(y)
Internally, librosa.decompose.hpss applies a horizontal median filter (captures harmonic continuity across time) and a vertical median filter (captures percussive continuity across frequency). Soft masking blends the two when they overlap.
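The median-filtering idea can be sketched on a toy magnitude spectrogram in plain Python. This is a simplified illustration: it uses a 3-point kernel and a plain ratio mask, whereas librosa uses longer kernels and a configurable soft mask, and all function names here are hypothetical:

```python
from statistics import median


def median_filter_time(spec, size=3):
    """Median along time within each frequency row (keeps sustained tones)."""
    half = size // 2
    return [
        [median(row[max(0, t - half): t + half + 1]) for t in range(len(row))]
        for row in spec
    ]


def median_filter_freq(spec, size=3):
    """Median along frequency within each time column (keeps broadband clicks)."""
    half = size // 2
    n_freq, n_time = len(spec), len(spec[0])
    return [
        [
            median(spec[g][t] for g in range(max(0, f - half), min(n_freq, f + half + 1)))
            for t in range(n_time)
        ]
        for f in range(n_freq)
    ]


def harmonic_mask(spec, size=3, eps=1e-12):
    """Ratio mask: near 1 where the harmonic estimate dominates, near 0 otherwise."""
    harm = median_filter_time(spec, size)
    perc = median_filter_freq(spec, size)
    return [
        [h / (h + p + eps) for h, p in zip(h_row, p_row)]
        for h_row, p_row in zip(harm, perc)
    ]


# Toy spectrogram: 5 frequency bins x 7 frames, quiet background,
# a sustained tone in bin 2 and a click at frame 3.
spec = [[0.1] * 7 for _ in range(5)]
spec[2] = [1.0] * 7
for row in spec:
    row[3] = 1.0

mask = harmonic_mask(spec)
print(f"tone bin, away from click: {mask[2][0]:.3f}")
print(f"click frame, away from tone: {mask[0][3]:.3f}")
print(f"where they cross: {mask[2][3]:.3f}")
```

Multiplying the spectrogram by the mask gives the harmonic part, and multiplying by one minus the mask gives the percussive part. Where tone and click overlap, the energy is shared between the two.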
Use the harmonic signal for pitch and chord analysis; use the percussive signal for beat and onset detection. This preprocessing step often improves downstream accuracy, though the gain depends on the material.
Beat tracking pipeline
- Onset strength: librosa.onset.onset_strength() computes a per-frame envelope by taking the positive first-order difference of the log-mel spectrogram across time (spectral flux) and averaging it across frequency bands.
- Tempo estimation: librosa.beat.tempo() autocorrelates the onset envelope and applies a prior favoring tempos near 120 BPM (configurable via start_bpm). In librosa 0.10 and later this function is deprecated in favor of librosa.feature.tempo().
- Beat placement: Dynamic programming aligns beat positions to onset peaks while maintaining regularity. The tightness parameter controls how rigidly beats snap to the estimated tempo vs. local onset peaks.
tempo, beats = librosa.beat.beat_track(y=y, sr=sr, hop_length=512, tightness=100)
beat_times = librosa.frames_to_time(beats, sr=sr, hop_length=512)
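The first two stages of the pipeline can be sketched in plain Python on a synthetic log-mel matrix. This is a deliberately minimal version: it omits the tempo prior and the dynamic-programming beat placement, and the function names are illustrative rather than librosa's:

```python
def onset_envelope(log_mel):
    """Spectral flux: mean over bands of the positive frame-to-frame increase."""
    n_bands = len(log_mel)
    n_frames = len(log_mel[0])
    envelope = [0.0]  # no previous frame to compare against at t = 0
    for t in range(1, n_frames):
        rises = [max(0.0, log_mel[b][t] - log_mel[b][t - 1]) for b in range(n_bands)]
        envelope.append(sum(rises) / n_bands)
    return envelope


def tempo_from_autocorrelation(envelope, frame_rate, min_bpm=60.0, max_bpm=240.0):
    """Pick the lag with the strongest autocorrelation inside the BPM range."""
    min_lag = max(1, int(round(60.0 * frame_rate / max_bpm)))
    max_lag = min(int(round(60.0 * frame_rate / min_bpm)), len(envelope) - 1)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(envelope[t] * envelope[t - lag] for t in range(lag, len(envelope)))
        if score > best_score:
            best_lag, best_score = lag, score
    return 60.0 * frame_rate / best_lag


# Synthetic input: two bands whose energy jumps every 20 frames.
frame_rate = 22050 / 512
log_mel = [[1.0 if t % 20 == 0 else 0.0 for t in range(200)] for _ in range(2)]

envelope = onset_envelope(log_mel)
estimated_bpm = tempo_from_autocorrelation(envelope, frame_rate)
print(f"estimated tempo: {estimated_bpm:.1f} BPM")
```

Without a prior, real onset envelopes often score almost as well at half or double the true tempo, which is the ambiguity the 120 BPM prior is meant to resolve.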
For complex music with tempo changes, process in segments: window the audio, estimate local tempo per segment, and stitch beat arrays together.
Chroma features and key estimation
Chroma maps all frequency energy onto the 12 pitch classes:
chroma = librosa.feature.chroma_cqt(y=y, sr=sr) # CQT-based, better for music
chroma_stft uses the STFT; chroma_cqt uses the Constant-Q Transform, which provides logarithmically spaced frequency bins aligned to musical notes. CQT-based chroma is more accurate for polyphonic music.
To estimate the key of a recording, correlate the average chroma vector against Krumhansl-Kessler key profiles — 24 templates (12 major + 12 minor) — and pick the one with the highest correlation.
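The template-matching step can be sketched in plain Python as follows. The profile values are the commonly cited Krumhansl-Kessler weights. The input is assumed to be a 12-element mean chroma vector with index 0 at C (for example, chroma.mean(axis=1) converted to a list), and estimate_key is a hypothetical helper, not a librosa function:

```python
import math

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
KK_MAJOR = [6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88]
KK_MINOR = [6.33, 2.68, 3.52, 5.38, 2.60, 3.53, 2.54, 4.75, 3.98, 2.69, 3.34, 3.17]


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def rotate(profile, tonic):
    """Shift a C-based profile so its tonic weight lands on pitch class `tonic`."""
    return [profile[(i - tonic) % 12] for i in range(12)]


def estimate_key(mean_chroma):
    """Return (tonic, mode, correlation) for the best of the 24 templates."""
    best = None
    for tonic in range(12):
        for mode, profile in (("major", KK_MAJOR), ("minor", KK_MINOR)):
            score = pearson(mean_chroma, rotate(profile, tonic))
            if best is None or score > best[0]:
                best = (score, PITCH_CLASSES[tonic], mode)
    return best[1], best[2], best[0]


# Sanity check: a chroma vector shaped exactly like the G major template.
tonic, mode, score = estimate_key(rotate(KK_MAJOR, 7))
print(tonic, mode, round(score, 3))
```

On real recordings the top two or three candidates are often a relative major/minor pair or a fifth apart, so it can be worth inspecting the runner-up scores as well.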
Structural segmentation
librosa.segment.recurrence_matrix() builds a self-similarity matrix from features (chroma, MFCCs), revealing repeating sections in music. Spectral clustering or checkerboard-kernel novelty detection on this matrix identifies verse, chorus, and bridge boundaries.
R = librosa.segment.recurrence_matrix(chroma, mode="affinity", sym=True)
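To illustrate what such a matrix encodes, the sketch below builds a dense cosine self-similarity matrix in plain Python. This is a simplification: recurrence_matrix links each frame only to its nearest neighbors by default. The frames are stored as a list of per-frame vectors, which is the transpose of librosa's (n_features, n_frames) layout, and the feature values are invented:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def self_similarity(frames):
    """Dense frame-by-frame similarity matrix."""
    return [[cosine_similarity(a, b) for b in frames] for a in frames]


# Toy "song" with the form A A B B A A; each section has its own feature vector.
section_a = [1.0, 0.0, 0.2]
section_b = [0.0, 1.0, 0.1]
frames = [section_a, section_a, section_b, section_b, section_a, section_a]

matrix = self_similarity(frames)
for row in matrix:
    print(" ".join(f"{value:.2f}" for value in row))
```

The bright blocks away from the main diagonal mark the return of section A, and the block edges are where a novelty detector would place segment boundaries.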
Performance considerations
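- Memory: Loading a 5-minute song at 22 050 Hz mono uses ~26 MB, since librosa.load returns float32 samples. A full STFT with a 2048-point FFT and a hop of 512 takes ~106 MB as complex64 (librosa's default for float32 input), or twice that as complex128.
- Speed: Compute the STFT once and derive the mel spectrogram, MFCCs, and STFT-based chroma from it via each function's S argument rather than recomputing the transform for every feature.
- Batch processing: For large datasets, use librosa.load(duration=...) to process clips and parallelize with joblib or multiprocessing. Librosa does not parallelize across files on its own.
- Caching: librosa.cache (backed by joblib) caches function outputs to disk. It is off by default; enable it by setting the LIBROSA_CACHE_DIR environment variable before importing librosa, and raise LIBROSA_CACHE_LEVEL from its default of 10 to 20 to also cache low-level features such as the STFT.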
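The memory footprint can be estimated with a few lines of arithmetic before loading anything. The sketch below assumes float32 samples (4 bytes) and complex64 STFT values (8 bytes), which are librosa's defaults, plus center=True framing. The helper names are illustrative:

```python
def audio_bytes(duration_s, sr, bytes_per_sample=4):
    """Size of a mono waveform; 4 bytes per sample assumes float32."""
    return int(duration_s * sr) * bytes_per_sample


def stft_bytes(duration_s, sr, n_fft, hop_length, bytes_per_value=8):
    """Size of a centered STFT; 8 bytes per value assumes complex64."""
    n_bins = 1 + n_fft // 2
    n_frames = 1 + int(duration_s * sr) // hop_length
    return n_bins * n_frames * bytes_per_value


five_minutes = 5 * 60
waveform_mb = audio_bytes(five_minutes, 22050) / 1e6
stft_mb = stft_bytes(five_minutes, 22050, n_fft=2048, hop_length=512) / 1e6
print(f"waveform: {waveform_mb:.1f} MB, STFT: {stft_mb:.1f} MB")
```

Running the same estimate over a dataset's duration distribution gives a quick upper bound on how many files a worker can hold in memory at once.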
Production ML pipeline example
import librosa
import numpy as np
from pathlib import Path
def extract_features(path: str, sr: int = 22050, n_mfcc: int = 13) -> np.ndarray:
    y, _ = librosa.load(path, sr=sr, mono=True)

    # Harmonic-percussive separation
    y_h, y_p = librosa.effects.hpss(y)

    # Mel spectrogram from harmonic component
    S = librosa.feature.melspectrogram(y=y_h, sr=sr, n_mels=128)

    # MFCCs + deltas
    mfcc = librosa.feature.mfcc(S=librosa.power_to_db(S), n_mfcc=n_mfcc)
    mfcc_d = librosa.feature.delta(mfcc)
    mfcc_d2 = librosa.feature.delta(mfcc, order=2)

    # Chroma from harmonic
    chroma = librosa.feature.chroma_cqt(y=y_h, sr=sr)

    # Rhythm features from percussive
    tempo, _ = librosa.beat.beat_track(y=y_p, sr=sr)
    # Newer librosa versions return tempo as a 1-element array; reduce it to a float
    tempo = float(np.atleast_1d(tempo)[0])
    onset_env = librosa.onset.onset_strength(y=y_p, sr=sr)

    # Aggregate: mean + std over time axis
    feat_list = [mfcc, mfcc_d, mfcc_d2, chroma]
    aggregated = np.concatenate([
        np.concatenate([f.mean(axis=1), f.std(axis=1)]) for f in feat_list
    ])
    aggregated = np.append(aggregated, [tempo, onset_env.mean(), onset_env.std()])
    return aggregated
This yields a fixed-length feature vector per track regardless of duration — ready for scikit-learn classifiers or as input to a neural network.
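The length of that vector follows directly from the aggregation step. The small sketch below (feature_vector_length is a hypothetical helper) counts it, assuming 12 chroma bins and the three scalar rhythm features used above:

```python
def feature_vector_length(n_mfcc=13, n_chroma=12, n_scalar=3):
    """Length of the aggregated vector produced by extract_features."""
    # Rows being summarized: MFCC, delta, delta-delta, and chroma.
    rows = 3 * n_mfcc + n_chroma
    # Each row contributes a mean and a std; tempo and onset stats are appended.
    return 2 * rows + n_scalar


print(feature_vector_length())           # default settings
print(feature_vector_length(n_mfcc=20))  # richer cepstral description
```

Checking this number against aggregated.shape[0] is a cheap guard against silent changes in feature configuration between training and inference.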
Tradeoffs
| Approach | Pro | Con |
|---|---|---|
| Raw waveform (1-D CNN) | No hand-crafted features | Needs lots of data, slow training |
| Mel spectrogram (2-D CNN) | Perceptually motivated, works well | Loses phase information |
| MFCCs + classifier | Fast, interpretable, small models | Less expressive for complex tasks |
| CQT-based features | Musical pitch alignment | Slower to compute than STFT |
For most practical tasks (genre classification, mood tagging, instrument recognition), mel spectrograms or MFCCs with gradient-boosted trees or small CNNs tend to outperform raw waveform approaches when data is limited.
One thing to remember: Librosa’s power lies in its composable, NumPy-native feature extraction pipeline — STFT, mel, MFCC, chroma, beats — that turns raw audio into structured data ready for any ML framework.