Python Spectrogram Analysis — Deep Dive
The math behind the STFT
The STFT of a signal x[n] with window w[n] is:
X[m, k] = Σ_{n=0}^{N-1} x[n + m·H] · w[n] · e^{-j2πkn/N}
Where m is the frame index, k is the frequency bin, N is the FFT size, and H is the hop length. The result is complex: magnitude |X[m,k]| gives energy, and angle(X[m,k]) gives phase.
import numpy as np
def manual_stft(signal, n_fft=2048, hop_length=512, window='hann'):
win = np.hanning(n_fft) if window == 'hann' else np.ones(n_fft)
n_frames = 1 + (len(signal) - n_fft) // hop_length
stft_matrix = np.zeros((n_fft // 2 + 1, n_frames), dtype=complex)
for m in range(n_frames):
start = m * hop_length
frame = signal[start:start + n_fft] * win
stft_matrix[:, m] = np.fft.rfft(frame)
return stft_matrix
Window function theory
The choice of window function controls the spectral leakage tradeoff:
Main lobe width vs side lobe level
| Window | Main lobe width (bins) | Side lobe level (dB) | Use case |
|---|---|---|---|
| Rectangular | 1 | -13 | Maximum frequency resolution, most leakage |
| Hann | 2 | -31 | General purpose (default in most libraries) |
| Hamming | 2 | -42 | Speech processing |
| Blackman | 3 | -58 | When side lobe suppression is critical |
| Kaiser (β=8) | ~2.5 | -50 | Tunable tradeoff via β parameter |
| Flat-top | 5 | -93 | Accurate amplitude measurement |
Zero-padding
Adding zeros to the window before FFT interpolates the frequency axis without changing resolution. If you have a 1024-sample window, zero-pad to 2048 and get 1025 frequency bins instead of 513 — smoother-looking spectrograms with the same underlying resolution.
frame = np.zeros(2048)
frame[:1024] = signal_chunk * np.hanning(1024)
spectrum = np.fft.rfft(frame)
Mel filterbank construction
The mel scale approximates human pitch perception:
mel(f) = 2595 · log₁₀(1 + f/700)
f(mel) = 700 · (10^(mel/2595) - 1)
A mel filterbank is a set of triangular bandpass filters, linearly spaced in mel but exponentially spaced in Hz:
def mel_filterbank(sr, n_fft, n_mels=128, fmin=0, fmax=None):
fmax = fmax or sr / 2
# Mel-spaced center frequencies
mel_min = 2595 * np.log10(1 + fmin / 700)
mel_max = 2595 * np.log10(1 + fmax / 700)
mel_points = np.linspace(mel_min, mel_max, n_mels + 2)
hz_points = 700 * (10 ** (mel_points / 2595) - 1)
# Convert to FFT bin indices
bin_points = np.floor((n_fft + 1) * hz_points / sr).astype(int)
filters = np.zeros((n_mels, n_fft // 2 + 1))
for i in range(n_mels):
left, center, right = bin_points[i], bin_points[i+1], bin_points[i+2]
# Rising slope
filters[i, left:center] = np.linspace(0, 1, center - left, endpoint=False)
# Falling slope
filters[i, center:right] = np.linspace(1, 0, right - center, endpoint=False)
return filters # shape: (n_mels, n_fft//2+1)
# Apply: mel_spectrogram = filterbank @ power_spectrogram
Slaney normalization
Librosa’s default normalization divides each filter by its bandwidth in Hz (slaney norm), ensuring each filter contributes equally regardless of bandwidth. Without this, high-frequency filters (which span wider Hz ranges) would dominate.
Phase spectrograms and reconstruction
Phase information
The magnitude spectrogram discards phase, which encodes timing information. Phase is crucial for:
- Perfect reconstruction (inverse STFT needs both magnitude and phase)
- Time-delay estimation
- Source separation algorithms
Griffin-Lim reconstruction
When you only have a magnitude spectrogram (common in generative ML), Griffin-Lim iteratively estimates phase:
def griffin_lim(S_mag, n_iter=32, hop_length=512, n_fft=2048):
"""Reconstruct audio from magnitude spectrogram."""
# Initialize with random phase
angles = np.exp(2j * np.pi * np.random.random(S_mag.shape))
S = S_mag * angles
for _ in range(n_iter):
audio = librosa.istft(S, hop_length=hop_length)
S_new = librosa.stft(audio, n_fft=n_fft, hop_length=hop_length)
angles = np.exp(1j * np.angle(S_new))
S = S_mag * angles
return librosa.istft(S, hop_length=hop_length)
Librosa provides librosa.griffinlim(). More advanced methods (WaveGlow, HiFi-GAN) use neural vocoders for higher-quality reconstruction.
Instantaneous frequency
The phase derivative across time gives the instantaneous frequency — the “true” frequency at each time-frequency cell, which can be more precise than the bin center:
def instantaneous_frequency(S):
phase = np.angle(S)
# Unwrap phase across time axis
freq = np.diff(np.unwrap(phase, axis=1), axis=1)
return freq / (2 * np.pi) # normalize to cycles per frame
Spectrogram reassignment
Standard STFT assigns energy to the center of each time-frequency bin. Reassignment methods redistribute energy to its true center of gravity, producing sharper spectrograms:
freqs, times, S_reassigned = librosa.reassigned_spectrogram(
y, sr=sr, n_fft=2048, hop_length=512
)
Reassigned spectrograms reveal harmonics, onsets, and fine structure that standard spectrograms blur.
Constant-Q Transform (CQT)
The CQT uses logarithmically spaced frequency bins — constant Q (quality factor, ratio of center frequency to bandwidth):
C = librosa.cqt(y, sr=sr, hop_length=512, n_bins=84, bins_per_octave=12)
Each bin corresponds to a musical semitone. This makes the CQT ideal for pitch tracking, chord recognition, and any task where musical intervals matter. The tradeoff: CQT is slower to compute than STFT and has variable time resolution (better at high frequencies, worse at low).
Spectrogram-based ML pipelines
Audio classification with mel spectrograms
import torch
import torch.nn as nn
import torchaudio
class AudioClassifier(nn.Module):
def __init__(self, n_classes, sr=22050, n_mels=128):
super().__init__()
self.mel = torchaudio.transforms.MelSpectrogram(
sample_rate=sr, n_mels=n_mels, n_fft=2048, hop_length=512
)
self.db = torchaudio.transforms.AmplitudeToDB()
# Treat spectrogram as a single-channel image
self.cnn = nn.Sequential(
nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d((4, 4)),
)
self.classifier = nn.Sequential(
nn.Flatten(),
nn.Linear(128 * 4 * 4, 256), nn.ReLU(), nn.Dropout(0.3),
nn.Linear(256, n_classes)
)
def forward(self, waveform):
spec = self.db(self.mel(waveform)).unsqueeze(1) # add channel dim
features = self.cnn(spec)
return self.classifier(features)
SpecAugment (data augmentation)
SpecAugment applies time and frequency masking to spectrograms during training:
spec_augment = nn.Sequential(
torchaudio.transforms.FrequencyMasking(freq_mask_param=30),
torchaudio.transforms.TimeMasking(time_mask_param=100),
)
This simple augmentation dramatically improves speech recognition and audio classification accuracy — it was a key contribution from Google’s 2019 paper.
Live spectrogram visualization
import sounddevice as sd
import matplotlib.pyplot as plt
import numpy as np
BLOCK = 2048
SR = 22050
fig, ax = plt.subplots()
spec_data = np.zeros((BLOCK // 2 + 1, 100))
img = ax.imshow(spec_data, aspect='auto', origin='lower', vmin=-80, vmax=0)
def update_callback(indata, frames, time, status):
spectrum = np.fft.rfft(indata[:, 0] * np.hanning(BLOCK))
mag_db = 20 * np.log10(np.abs(spectrum) + 1e-10)
spec_data[:, :-1] = spec_data[:, 1:]
spec_data[:, -1] = mag_db
img.set_data(spec_data)
with sd.InputStream(samplerate=SR, blocksize=BLOCK, channels=1, callback=update_callback):
plt.show()
Performance comparison
| Method | Compute time (3 min audio) | Memory | Frequency resolution |
|---|---|---|---|
SciPy spectrogram | ~50 ms | Low | Linear, fixed |
Librosa stft | ~80 ms | Moderate | Linear, fixed |
Librosa melspectrogram | ~120 ms | Moderate | Mel-scaled |
Librosa cqt | ~300 ms | Higher | Log (musical) |
torchaudio MelSpectrogram (GPU) | ~5 ms | GPU VRAM | Mel-scaled |
For ML training loops, compute spectrograms on GPU with torchaudio for 10–50× speedup over CPU libraries.
One thing to remember: Spectrograms are the bridge between raw audio waveforms and visual/ML-friendly representations — mastering the STFT parameters, scaling choices (linear, mel, CQT), and augmentation techniques determines the quality of every downstream audio analysis or classification task.
See Also
- Python Arcade Library Think of a magical art table that draws your game characters, listens when you press buttons, and cleans up the mess — that's Python Arcade.
- Python Audio Fingerprinting Ever wonder how Shazam identifies a song from just a few seconds of noisy audio? Audio fingerprinting is the magic behind it, and Python can do it too.
- Python Barcode Generation Picture the stripy labels on grocery items to understand how Python can create those machine-readable barcodes from numbers.
- Python Cellular Automata Imagine a checkerboard where each square follows simple rules to turn on or off — and suddenly complex patterns emerge like magic.
- Python Godot Gdscript Bridge Imagine speaking English to a friend who speaks French, with a translator in the middle — that's how Python talks to the Godot game engine.