Python Spectrogram Analysis — Deep Dive

Master STFT math, window function theory, mel filterbank construction, phase spectrograms, reassignment methods, and spectrogram-based ML pipelines in Python.

The math behind the STFT

The STFT of a signal x[n] with window w[n] is:

X[m, k] = Σ_{n=0}^{N-1} x[n + m·H] · w[n] · e^{-j2πkn/N}

Where m is the frame index, k is the frequency bin, N is the FFT size, and H is the hop length. The result is complex: magnitude |X[m,k]| gives energy, and angle(X[m,k]) gives phase.

import numpy as np

def manual_stft(signal, n_fft=2048, hop_length=512, window='hann'):
    win = np.hanning(n_fft) if window == 'hann' else np.ones(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop_length
    stft_matrix = np.zeros((n_fft // 2 + 1, n_frames), dtype=complex)
    
    for m in range(n_frames):
        start = m * hop_length
        frame = signal[start:start + n_fft] * win
        stft_matrix[:, m] = np.fft.rfft(frame)
    
    return stft_matrix

Window function theory

The choice of window function controls the spectral leakage tradeoff:

Main lobe width vs side lobe level

Window	Main lobe width (bins)	Side lobe level (dB)	Use case
Rectangular	1	-13	Maximum frequency resolution, most leakage
Hann	2	-31	General purpose (default in most libraries)
Hamming	2	-42	Speech processing
Blackman	3	-58	When side lobe suppression is critical
Kaiser (β=8)	~2.5	-50	Tunable tradeoff via β parameter
Flat-top	5	-93	Accurate amplitude measurement

Zero-padding

Adding zeros to the window before FFT interpolates the frequency axis without changing resolution. If you have a 1024-sample window, zero-pad to 2048 and get 1025 frequency bins instead of 513 — smoother-looking spectrograms with the same underlying resolution.

frame = np.zeros(2048)
frame[:1024] = signal_chunk * np.hanning(1024)
spectrum = np.fft.rfft(frame)

Mel filterbank construction

The mel scale approximates human pitch perception:

mel(f) = 2595 · log₁₀(1 + f/700)
f(mel) = 700 · (10^(mel/2595) - 1)

A mel filterbank is a set of triangular bandpass filters, linearly spaced in mel but exponentially spaced in Hz:

def mel_filterbank(sr, n_fft, n_mels=128, fmin=0, fmax=None):
    fmax = fmax or sr / 2
    
    # Mel-spaced center frequencies
    mel_min = 2595 * np.log10(1 + fmin / 700)
    mel_max = 2595 * np.log10(1 + fmax / 700)
    mel_points = np.linspace(mel_min, mel_max, n_mels + 2)
    hz_points = 700 * (10 ** (mel_points / 2595) - 1)
    
    # Convert to FFT bin indices
    bin_points = np.floor((n_fft + 1) * hz_points / sr).astype(int)
    
    filters = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bin_points[i], bin_points[i+1], bin_points[i+2]
        # Rising slope
        filters[i, left:center] = np.linspace(0, 1, center - left, endpoint=False)
        # Falling slope
        filters[i, center:right] = np.linspace(1, 0, right - center, endpoint=False)
    
    return filters  # shape: (n_mels, n_fft//2+1)

# Apply: mel_spectrogram = filterbank @ power_spectrogram

Slaney normalization

Librosa’s default normalization divides each filter by its bandwidth in Hz (slaney norm), ensuring each filter contributes equally regardless of bandwidth. Without this, high-frequency filters (which span wider Hz ranges) would dominate.

Phase spectrograms and reconstruction

Phase information

The magnitude spectrogram discards phase, which encodes timing information. Phase is crucial for:

Perfect reconstruction (inverse STFT needs both magnitude and phase)
Time-delay estimation
Source separation algorithms

Griffin-Lim reconstruction

When you only have a magnitude spectrogram (common in generative ML), Griffin-Lim iteratively estimates phase:

def griffin_lim(S_mag, n_iter=32, hop_length=512, n_fft=2048):
    """Reconstruct audio from magnitude spectrogram."""
    # Initialize with random phase
    angles = np.exp(2j * np.pi * np.random.random(S_mag.shape))
    S = S_mag * angles
    
    for _ in range(n_iter):
        audio = librosa.istft(S, hop_length=hop_length)
        S_new = librosa.stft(audio, n_fft=n_fft, hop_length=hop_length)
        angles = np.exp(1j * np.angle(S_new))
        S = S_mag * angles
    
    return librosa.istft(S, hop_length=hop_length)

Librosa provides librosa.griffinlim(). More advanced methods (WaveGlow, HiFi-GAN) use neural vocoders for higher-quality reconstruction.

Instantaneous frequency

The phase derivative across time gives the instantaneous frequency — the “true” frequency at each time-frequency cell, which can be more precise than the bin center:

def instantaneous_frequency(S):
    phase = np.angle(S)
    # Unwrap phase across time axis
    freq = np.diff(np.unwrap(phase, axis=1), axis=1)
    return freq / (2 * np.pi)  # normalize to cycles per frame

Spectrogram reassignment

Standard STFT assigns energy to the center of each time-frequency bin. Reassignment methods redistribute energy to its true center of gravity, producing sharper spectrograms:

freqs, times, S_reassigned = librosa.reassigned_spectrogram(
    y, sr=sr, n_fft=2048, hop_length=512
)

Reassigned spectrograms reveal harmonics, onsets, and fine structure that standard spectrograms blur.

Constant-Q Transform (CQT)

The CQT uses logarithmically spaced frequency bins — constant Q (quality factor, ratio of center frequency to bandwidth):

C = librosa.cqt(y, sr=sr, hop_length=512, n_bins=84, bins_per_octave=12)

Each bin corresponds to a musical semitone. This makes the CQT ideal for pitch tracking, chord recognition, and any task where musical intervals matter. The tradeoff: CQT is slower to compute than STFT and has variable time resolution (better at high frequencies, worse at low).

Spectrogram-based ML pipelines

Audio classification with mel spectrograms

import torch
import torch.nn as nn
import torchaudio

class AudioClassifier(nn.Module):
    def __init__(self, n_classes, sr=22050, n_mels=128):
        super().__init__()
        self.mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=sr, n_mels=n_mels, n_fft=2048, hop_length=512
        )
        self.db = torchaudio.transforms.AmplitudeToDB()
        
        # Treat spectrogram as a single-channel image
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 4 * 4, 256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, n_classes)
        )
    
    def forward(self, waveform):
        spec = self.db(self.mel(waveform)).unsqueeze(1)  # add channel dim
        features = self.cnn(spec)
        return self.classifier(features)

SpecAugment (data augmentation)

SpecAugment applies time and frequency masking to spectrograms during training:

spec_augment = nn.Sequential(
    torchaudio.transforms.FrequencyMasking(freq_mask_param=30),
    torchaudio.transforms.TimeMasking(time_mask_param=100),
)

This simple augmentation dramatically improves speech recognition and audio classification accuracy — it was a key contribution from Google’s 2019 paper.

Live spectrogram visualization

import sounddevice as sd
import matplotlib.pyplot as plt
import numpy as np

BLOCK = 2048
SR = 22050

fig, ax = plt.subplots()
spec_data = np.zeros((BLOCK // 2 + 1, 100))
img = ax.imshow(spec_data, aspect='auto', origin='lower', vmin=-80, vmax=0)

def update_callback(indata, frames, time, status):
    spectrum = np.fft.rfft(indata[:, 0] * np.hanning(BLOCK))
    mag_db = 20 * np.log10(np.abs(spectrum) + 1e-10)
    spec_data[:, :-1] = spec_data[:, 1:]
    spec_data[:, -1] = mag_db
    img.set_data(spec_data)

with sd.InputStream(samplerate=SR, blocksize=BLOCK, channels=1, callback=update_callback):
    plt.show()

Performance comparison

Method	Compute time (3 min audio)	Memory	Frequency resolution
SciPy `spectrogram`	~50 ms	Low	Linear, fixed
Librosa `stft`	~80 ms	Moderate	Linear, fixed
Librosa `melspectrogram`	~120 ms	Moderate	Mel-scaled
Librosa `cqt`	~300 ms	Higher	Log (musical)
torchaudio `MelSpectrogram` (GPU)	~5 ms	GPU VRAM	Mel-scaled

For ML training loops, compute spectrograms on GPU with torchaudio for 10–50× speedup over CPU libraries.

One thing to remember: Spectrograms are the bridge between raw audio waveforms and visual/ML-friendly representations — mastering the STFT parameters, scaling choices (linear, mel, CQT), and augmentation techniques determines the quality of every downstream audio analysis or classification task.

pythonspectrogramaudiofftstftdspmachine-learning