Python Librosa Audio Analysis — Deep Dive

Master Librosa's STFT internals, mel filterbank math, advanced feature pipelines, harmonic-percussive separation, and production-grade audio ML workflows.

Architecture overview

Librosa is organized into submodules: core (loading, STFT, resampling), feature (spectral and rhythm descriptors), beat (tempo and beat tracking), effects (time-stretch, pitch-shift, HPSS), decompose (NMF, RPCA), segment (structural analysis), and display (matplotlib helpers). All functions operate on plain NumPy arrays, making integration with other scientific Python tools trivial.

The STFT in detail

import librosa
import numpy as np

y, sr = librosa.load("track.wav", sr=22050)
S = librosa.stft(y, n_fft=2048, hop_length=512, win_length=2048, window="hann")

n_fft controls the FFT size (frequency resolution). hop_length controls the stride between windows (time resolution). A 2048-point FFT at 22 050 Hz gives frequency bins roughly 10.8 Hz apart (22 050 / 2048). Each frame covers about 93 ms of audio (2048 samples), and with a hop of 512 successive frames start about 23 ms apart, yielding ~43 frames per second with 75% overlap between neighboring frames.

The output shape is (1 + n_fft // 2, T): 1025 frequency bins by T time frames. Each entry is a complex number whose magnitude gives the amplitude of that frequency component in the frame and whose phase gives its alignment within the frame.
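The resolution and shape figures for any choice of sr, n_fft, and hop_length follow from simple arithmetic. A minimal sketch is shown here; the helper name stft_geometry is hypothetical, and the frame count assumes librosa's default center=True padding:

```python
def stft_geometry(n_samples: int, sr: int = 22050,
                  n_fft: int = 2048, hop_length: int = 512) -> dict:
    """Resolution and shape figures for an STFT (illustrative helper)."""
    return {
        "bin_spacing_hz": sr / n_fft,        # distance between frequency bins
        "window_ms": 1000 * n_fft / sr,      # audio covered by one frame
        "hop_ms": 1000 * hop_length / sr,    # step between frame starts
        "frames_per_second": sr / hop_length,
        "n_bins": 1 + n_fft // 2,
        # Assumes center=True, i.e. the signal is padded so frames are centered.
        "n_frames": 1 + n_samples // hop_length,
    }


geometry = stft_geometry(n_samples=22050 * 10)  # ten seconds of audio
for name, value in geometry.items():
    print(f"{name}: {value:.2f}")
```

Halving hop_length doubles the frame rate without changing the bin spacing, while doubling n_fft halves the bin spacing at the cost of a longer analysis window.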

Windowing tradeoffs

The default Hann window balances main-lobe width and side-lobe suppression. For tasks where leakage from strong partials masks weaker neighbors (e.g., tuning estimation), a Blackman-Harris window suppresses side lobes far more strongly, at the cost of a wider main lobe and therefore coarser frequency resolution. Librosa accepts any window specification that scipy.signal.get_window understands.
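The main-lobe versus side-lobe tradeoff can be measured directly. The sketch below builds both windows from their cosine-sum coefficients in plain NumPy (the helper names are hypothetical, and the Blackman-Harris coefficients are the standard 4-term minimum-side-lobe set), then reads the main-lobe width and the peak side-lobe level off a zero-padded FFT:

```python
import numpy as np


def cosine_sum_window(n: int, coeffs) -> np.ndarray:
    """Periodic cosine-sum window: sum_i (-1)^i * a_i * cos(2*pi*i*k/n)."""
    k = np.arange(n)
    return sum((-1) ** i * a * np.cos(2 * np.pi * i * k / n)
               for i, a in enumerate(coeffs))


def lobe_stats(window: np.ndarray, pad: int = 64):
    """Return (main-lobe width at -6 dB in bins, peak side-lobe level in dB)."""
    spectrum = np.abs(np.fft.rfft(window, len(window) * pad))
    db = 20 * np.log10(spectrum / spectrum[0] + 1e-20)
    half_width = np.argmax(db < -6.0)   # first index that falls below -6 dB
    i = 1
    while i < len(db) - 1 and db[i + 1] < db[i]:
        i += 1                          # walk down the main lobe to its first null
    return 2 * half_width / pad, db[i:].max()


hann = cosine_sum_window(1024, [0.5, 0.5])
blackman_harris = cosine_sum_window(1024, [0.35875, 0.48829, 0.14128, 0.01168])

hann_width, hann_sidelobe = lobe_stats(hann)
bh_width, bh_sidelobe = lobe_stats(blackman_harris)
print(f"Hann:            {hann_width:.2f} bins, {hann_sidelobe:.1f} dB")
print(f"Blackman-Harris: {bh_width:.2f} bins, {bh_sidelobe:.1f} dB")
```

The Blackman-Harris window should report a wider main lobe and a much lower peak side lobe than Hann, which is the tradeoff described above.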

Mel spectrogram internals

The mel scale converts linear frequencies to a perceptual scale:

mel = 2595 * log10(1 + f / 700)

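This is the HTK form of the mel scale, which librosa uses when htk=True. By default librosa uses the Slaney form instead, which is linear below 1 kHz and logarithmic above it. In either case, librosa.filters.mel() builds a bank of triangular filters spaced linearly in mel. Multiplying the power spectrogram by this filter bank yields a mel spectrogram with n_mels bands (default 128). Each band integrates energy over a frequency range that widens as frequency increases.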

S_mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128)
S_db = librosa.power_to_db(S_mel, ref=np.max)

Choosing n_mels is task-dependent. Speech recognition models typically use 40–80 bands; music analysis benefits from 128 or more.
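To make the filterbank construction concrete, here is a from-scratch NumPy sketch that uses the HTK formula shown above. It is not librosa's implementation: the function names are hypothetical, and the filters are left with unit peak height, whereas librosa.filters.mel() applies Slaney-style area normalization by default.

```python
import numpy as np


def hz_to_mel_htk(f):
    """HTK mel formula: mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz_htk(m):
    """Inverse of hz_to_mel_htk."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def triangular_mel_bank(sr=22050, n_fft=2048, n_mels=40):
    """Triangular filters whose edges are spaced linearly in mel."""
    fft_freqs = np.linspace(0.0, sr / 2, 1 + n_fft // 2)
    mel_edges = np.linspace(float(hz_to_mel_htk(0.0)),
                            float(hz_to_mel_htk(sr / 2)), n_mels + 2)
    edges = mel_to_hz_htk(mel_edges)
    lower, center, upper = edges[:-2, None], edges[1:-1, None], edges[2:, None]
    rising = (fft_freqs - lower) / (center - lower)     # left slope of each triangle
    falling = (upper - fft_freqs) / (upper - center)    # right slope of each triangle
    return np.maximum(0.0, np.minimum(rising, falling))


bank = triangular_mel_bank()
power_spec = np.ones((1025, 3))     # stand-in for a power spectrogram
mel_spec = bank @ power_spec        # shape (n_mels, n_frames)
```

Counting the non-zero FFT bins per row of bank shows the widening directly: the top band covers many more bins than the bottom one.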

MFCCs — from spectrum to cepstrum

MFCCs apply a Discrete Cosine Transform (DCT) to the log-mel spectrogram:

Compute mel spectrogram.
Take the log.
Apply DCT, keep the first n_mfcc coefficients (commonly 13 or 20).

The DCT decorrelates the mel bands, compressing spectral shape into a compact representation. The first coefficient captures overall energy, the second captures spectral tilt, and higher coefficients encode finer spectral detail.

Delta and delta-delta MFCCs (librosa.feature.delta) add velocity and acceleration of the cepstral trajectory, improving performance in speech recognition.

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
mfcc_delta = librosa.feature.delta(mfcc)
mfcc_delta2 = librosa.feature.delta(mfcc, order=2)
features = np.vstack([mfcc, mfcc_delta, mfcc_delta2])  # 39 features per frame
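The DCT step can be written out in a few lines of NumPy. The sketch below builds an orthonormal type-II DCT matrix by hand (the function names are hypothetical) and applies it to two synthetic log-mel spectra, to illustrate how the first coefficient tracks overall level and the second tracks tilt:

```python
import numpy as np


def dct2_ortho_matrix(n_out: int, n_in: int) -> np.ndarray:
    """First n_out rows of the orthonormal DCT-II basis for length-n_in input."""
    n = np.arange(n_in)
    k = np.arange(n_out)[:, None]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_in)) * np.sqrt(2.0 / n_in)
    basis[0] /= np.sqrt(2.0)    # orthonormal scaling for the k = 0 row
    return basis


def mfcc_from_log_mel(log_mel: np.ndarray, n_mfcc: int = 13) -> np.ndarray:
    """Apply the DCT along the mel axis and keep the first n_mfcc coefficients."""
    return dct2_ortho_matrix(n_mfcc, log_mel.shape[0]) @ log_mel


# Two synthetic log-mel spectra: perfectly flat, and sloping down with frequency.
flat = np.full((128, 5), -20.0)
tilted = np.linspace(0.0, -40.0, 128)[:, None] * np.ones((1, 5))

mfcc_flat = mfcc_from_log_mel(flat)
mfcc_tilted = mfcc_from_log_mel(tilted)
```

For the flat spectrum only coefficient 0 is non-zero; for the tilted one, coefficient 1 is the largest of the remaining coefficients.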

Harmonic-percussive source separation (HPSS)

HPSS splits a spectrogram into harmonic (sustained tones) and percussive (transients) components using median filtering:

y_harmonic, y_percussive = librosa.effects.hpss(y)

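librosa.effects.hpss works on the waveform: it computes the STFT, separates it with librosa.decompose.hpss, and inverts both results back to audio. The decomposition applies a horizontal median filter (along time, which preserves sustained harmonic ridges and suppresses transients) and a vertical median filter (along frequency, which preserves broadband transients and suppresses sustained tones). Soft masks then split each time-frequency bin between the two components in proportion to the filtered energies.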

Use the harmonic signal for pitch and chord analysis; use the percussive signal for beat and onset detection. This preprocessing step often improves downstream accuracy, because each analysis sees less interference from the other component.
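The median-filtering idea is easy to reproduce on a toy magnitude spectrogram. The following is a simplified NumPy sketch, not librosa's code: the function names, the kernel size of 17, and the mask exponent are illustrative assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def median_filter_along(S: np.ndarray, size: int, axis: int) -> np.ndarray:
    """Median filter of odd length `size` along one axis of a 2-D array."""
    pad = [(0, 0), (0, 0)]
    pad[axis] = (size // 2, size // 2)
    padded = np.pad(S, pad, mode="reflect")
    return np.median(sliding_window_view(padded, size, axis=axis), axis=-1)


def hpss_magnitude(S, kernel_size=17, power=2.0, eps=1e-10):
    """Split a magnitude spectrogram (freq x time) with soft masks."""
    harmonic_est = median_filter_along(S, kernel_size, axis=1)    # smooth across time
    percussive_est = median_filter_along(S, kernel_size, axis=0)  # smooth across frequency
    h, p = harmonic_est ** power, percussive_est ** power
    mask_h = h / (h + p + eps)
    mask_p = p / (h + p + eps)
    return S * mask_h, S * mask_p


# Toy spectrogram: one sustained tone (row 10) and one click (column 50).
S = np.zeros((64, 100))
S[10, :] = 1.0
S[:, 50] = 1.0
harmonic, percussive = hpss_magnitude(S)
```

The sustained row ends up in harmonic and the click column in percussive; the single bin where they cross is shared equally between the two.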

Beat tracking pipeline

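Onset strength: librosa.onset.onset_strength() computes a per-frame envelope (spectral flux): it takes the first-order difference of the log-power mel spectrogram across time, keeps only the positive part (energy increases), and averages the result across frequency bands.
Tempo estimation: librosa.feature.tempo() (librosa.beat.tempo() in releases before 0.10) autocorrelates the onset envelope and weights the candidate tempos with a prior favoring values near 120 BPM (configurable via start_bpm).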
Beat placement: Dynamic programming aligns beat positions to onset peaks while maintaining regularity. The tightness parameter controls how rigidly beats snap to the estimated tempo vs. local onset peaks.

tempo, beats = librosa.beat.beat_track(y=y, sr=sr, hop_length=512, tightness=100)
beat_times = librosa.frames_to_time(beats, sr=sr, hop_length=512)
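The first two stages of the pipeline can be sketched in NumPy. This is a simplified illustration rather than librosa's implementation: the function names are hypothetical, and the tempo estimate autocorrelates the whole envelope at once and applies a log-normal prior centered on start_bpm.

```python
import numpy as np


def onset_envelope(log_mel: np.ndarray) -> np.ndarray:
    """Spectral flux: positive time-difference, averaged over mel bands."""
    flux = np.diff(log_mel, axis=1)             # change from frame t to t + 1
    return np.maximum(0.0, flux).mean(axis=0)   # keep energy increases only


def estimate_tempo(envelope, sr=22050, hop_length=512,
                   start_bpm=120.0, std_bpm=1.0) -> float:
    """Pick the autocorrelation lag that best matches a tempo prior."""
    env = envelope - envelope.mean()
    ac = np.correlate(env, env, mode="full")[len(env) - 1:]   # lags 0, 1, 2, ...
    lags = np.arange(1, len(ac))
    bpm = 60.0 * sr / (hop_length * lags)
    # Log-normal prior over tempo, widest near start_bpm.
    prior = np.exp(-0.5 * ((np.log2(bpm) - np.log2(start_bpm)) / std_bpm) ** 2)
    return float(bpm[np.argmax(ac[1:] * prior)])


# Synthetic log-mel spectrogram with a broadband burst every 22 frames.
log_mel = np.zeros((40, 441))
log_mel[:, ::22] = 1.0
envelope = onset_envelope(log_mel)
tempo_bpm = estimate_tempo(envelope)
print(f"estimated tempo: {tempo_bpm:.2f} BPM")
```

A period of 22 frames at a hop of 512 samples corresponds to 60 * 22050 / (512 * 22) BPM, which is what the estimate should recover. Without the prior, half-tempo and double-tempo lags compete with the true period.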

For complex music with tempo changes, process in segments: window the audio, estimate local tempo per segment, and stitch beat arrays together.

Chroma features and key estimation

Chroma maps all frequency energy onto the 12 pitch classes:

chroma = librosa.feature.chroma_cqt(y=y, sr=sr)  # CQT-based, better for music

chroma_stft uses the STFT; chroma_cqt uses the Constant-Q Transform, which provides logarithmically spaced frequency bins aligned to musical notes. CQT-based chroma is more accurate for polyphonic music.

To estimate the key of a recording, correlate the average chroma vector against Krumhansl-Kessler key profiles — 24 templates (12 major + 12 minor) — and pick the one with the highest correlation.
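A minimal sketch of this template-matching step follows. It assumes a chroma matrix of shape (12, T) whose first row is pitch class C, and uses the published Krumhansl-Kessler profile values; the function name estimate_key is hypothetical.

```python
import numpy as np

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
KK_MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                     2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
KK_MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                     2.54, 4.75, 3.98, 2.69, 3.34, 3.17])


def estimate_key(chroma: np.ndarray):
    """Return (key name, correlation) of the best of the 24 key templates."""
    profile = chroma.mean(axis=1)               # average chroma over time
    best_key, best_r = None, -np.inf
    for mode, template in (("major", KK_MAJOR), ("minor", KK_MINOR)):
        for tonic in range(12):
            # Rotating the template moves its tonic to pitch class `tonic`.
            r = np.corrcoef(profile, np.roll(template, tonic))[0, 1]
            if r > best_r:
                best_key, best_r = f"{PITCH_CLASSES[tonic]} {mode}", r
    return best_key, best_r


# Synthetic chroma whose average is exactly the major profile rotated to G.
demo_chroma = np.tile(np.roll(KK_MAJOR, 7)[:, None], (1, 10))
key, score = estimate_key(demo_chroma)
print(key, round(score, 3))
```

On real recordings the winning correlation is well below 1, and relative major/minor pairs often score close together, so it can be worth inspecting the top few candidates rather than only the best one.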

Structural segmentation

librosa.segment.recurrence_matrix() builds a self-similarity matrix from features (chroma, MFCCs), revealing repeating sections in music. Spectral clustering or checkerboard-kernel novelty detection on this matrix identifies verse, chorus, and bridge boundaries.

R = librosa.segment.recurrence_matrix(chroma, mode="affinity", sym=True)
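Checkerboard-kernel novelty detection can be illustrated on a toy example. The sketch below uses a dense cosine self-similarity matrix as a simplified stand-in for the recurrence matrix; the function names and the kernel half-size of 8 frames are illustrative assumptions.

```python
import numpy as np


def cosine_self_similarity(features: np.ndarray) -> np.ndarray:
    """Frame-by-frame cosine similarity for features of shape (d, T)."""
    unit = features / (np.linalg.norm(features, axis=0, keepdims=True) + 1e-12)
    return unit.T @ unit


def checkerboard_novelty(similarity: np.ndarray, half_size: int = 8) -> np.ndarray:
    """Slide a +/- checkerboard kernel along the main diagonal."""
    sign = np.sign(np.arange(-half_size, half_size) + 0.5)
    kernel = np.outer(sign, sign)       # +1 within a section, -1 across sections
    n = similarity.shape[0]
    novelty = np.zeros(n)
    for t in range(half_size, n - half_size):
        block = similarity[t - half_size:t + half_size,
                           t - half_size:t + half_size]
        novelty[t] = np.sum(kernel * block)
    return novelty


# Toy "song": 20 frames of one chroma pattern, then 20 frames of another.
features = np.zeros((12, 40))
features[0, :20] = 1.0
features[7, 20:] = 1.0

similarity = cosine_self_similarity(features)
novelty = checkerboard_novelty(similarity)
boundary = int(np.argmax(novelty))
print("section boundary at frame", boundary)
```

The novelty curve peaks where the frames before and after are self-similar but mutually dissimilar, which here is the switch at frame 20.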

Performance considerations

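Memory: Loading a 5-minute song at 22 050 Hz mono uses ~26 MB, since librosa.load returns float32 samples. A full STFT with n_fft=2048 and hop_length=512 takes ~106 MB as complex64 (the dtype librosa produces for float32 input), or about twice that as complex128.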
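Speed: Compute the STFT once and reuse it rather than recomputing it for each feature. melspectrogram and chroma_stft accept a precomputed power spectrogram through their S argument, and mfcc accepts a log-power mel spectrogram the same way.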
Batch processing: For large datasets, use librosa.load(duration=...) to process clips and parallelize with joblib or multiprocessing. Librosa itself is single-threaded.
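Caching: librosa.cache (backed by joblib) caches function outputs to disk. It is disabled by default; enable it by setting the LIBROSA_CACHE_DIR environment variable before importing librosa, and optionally raise LIBROSA_CACHE_LEVEL (default 10) to cache more functions.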
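Memory budgets are worth estimating before launching a batch job. A small sketch of the arithmetic (the helper name stft_memory_mb is hypothetical, sizes are in decimal megabytes, and the frame count assumes center=True):

```python
import numpy as np


def stft_memory_mb(duration_s: float, sr: int = 22050,
                   n_fft: int = 2048, hop_length: int = 512):
    """Approximate memory for mono float32 audio and its STFT, in MB."""
    n_samples = int(duration_s * sr)
    n_frames = 1 + n_samples // hop_length      # assumes center=True padding
    n_bins = 1 + n_fft // 2
    audio_mb = n_samples * np.dtype(np.float32).itemsize / 1e6
    stft_mb = {
        name: n_bins * n_frames * np.dtype(name).itemsize / 1e6
        for name in ("complex64", "complex128")
    }
    return audio_mb, stft_mb


audio_mb, stft_mb = stft_memory_mb(duration_s=300)   # a 5-minute track
print(f"audio: {audio_mb:.1f} MB")
for name, size in stft_mb.items():
    print(f"STFT as {name}: {size:.1f} MB")
```

The complex STFT is several times larger than the waveform it came from, so pipelines that keep many spectrograms in memory tend to hit limits long before the raw audio does.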

Production ML pipeline example

import librosa
import numpy as np
from pathlib import Path

def extract_features(path: str, sr: int = 22050, n_mfcc: int = 13) -> np.ndarray:
    y, _ = librosa.load(path, sr=sr, mono=True)
    
    # Harmonic-percussive separation
    y_h, y_p = librosa.effects.hpss(y)
    
    # Mel spectrogram from harmonic component
    S = librosa.feature.melspectrogram(y=y_h, sr=sr, n_mels=128)
    
    # MFCCs + deltas
    mfcc = librosa.feature.mfcc(S=librosa.power_to_db(S), n_mfcc=n_mfcc)
    mfcc_d = librosa.feature.delta(mfcc)
    mfcc_d2 = librosa.feature.delta(mfcc, order=2)
    
    # Chroma from harmonic
    chroma = librosa.feature.chroma_cqt(y=y_h, sr=sr)
    
    # Rhythm features from percussive
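    tempo, _ = librosa.beat.beat_track(y=y_p, sr=sr)
    # Some librosa releases return tempo as a 1-element array; force a scalar
    tempo = float(np.atleast_1d(tempo)[0])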
    onset_env = librosa.onset.onset_strength(y=y_p, sr=sr)
    
    # Aggregate: mean + std over time axis
    feat_list = [mfcc, mfcc_d, mfcc_d2, chroma]
    aggregated = np.concatenate([
        np.concatenate([f.mean(axis=1), f.std(axis=1)]) for f in feat_list
    ])
    aggregated = np.append(aggregated, [tempo, onset_env.mean(), onset_env.std()])
    
    return aggregated

This yields a fixed-length feature vector per track regardless of duration — ready for scikit-learn classifiers or as input to a neural network.
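The fixed-length property comes entirely from the aggregation step, which can be checked without any audio. In this sketch, random arrays stand in for the MFCC, delta, delta-delta, and chroma matrices; the helper names aggregate and fake_track are hypothetical.

```python
import numpy as np


def aggregate(frame_features, scalars) -> np.ndarray:
    """Mean and std over time for each matrix, then append track-level scalars."""
    stats = [np.concatenate([f.mean(axis=1), f.std(axis=1)])
             for f in frame_features]
    return np.concatenate(stats + [np.asarray(scalars, dtype=float)])


rng = np.random.default_rng(0)


def fake_track(n_frames: int):
    # Stand-ins for mfcc, delta, delta-delta (13 rows each) and chroma (12 rows).
    return [rng.normal(size=(rows, n_frames)) for rows in (13, 13, 13, 12)]


short_vec = aggregate(fake_track(200), scalars=[120.0, 0.8, 0.3])
long_vec = aggregate(fake_track(5000), scalars=[95.0, 1.1, 0.4])
print(short_vec.shape, long_vec.shape)
```

Both vectors have 2 * (13 + 13 + 13 + 12) + 3 = 105 entries, so tracks of any duration can be stacked into one design matrix.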

Tradeoffs

Approach	Pro	Con
Raw waveform (1-D CNN)	No hand-crafted features	Needs lots of data, slow training
Mel spectrogram (2-D CNN)	Perceptually motivated, works well	Loses phase information
MFCCs + classifier	Fast, interpretable, small models	Less expressive for complex tasks
CQT-based features	Musical pitch alignment	Slower to compute than STFT

For most practical tasks (genre classification, mood tagging, instrument recognition), mel spectrograms or MFCCs with gradient-boosted trees or small CNNs often outperform raw waveform approaches when data is limited.

One thing to remember: Librosa’s power lies in its composable, NumPy-native feature extraction pipeline — STFT, mel, MFCC, chroma, beats — that turns raw audio into structured data ready for any ML framework.
