Python Spectrogram Analysis — Core Concepts
What a spectrogram is
A spectrogram is a 2-D representation of sound: time on the x-axis, frequency on the y-axis, and intensity encoded as color or brightness at each point. It is the standard visualization for audio analysis, used in speech recognition, music information retrieval, bioacoustics, sonar, and medical diagnostics.
How it is computed: the STFT
The Short-Time Fourier Transform (STFT) slices an audio signal into overlapping windows and applies the Discrete Fourier Transform (DFT) to each window:
- Slide a window (e.g., 2048 samples wide) across the signal
- Multiply each window by a tapering function (Hann, Hamming) to reduce edge artifacts
- Compute the FFT of each windowed segment
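- Collect the complex FFT output of each segment as one column of a matrix
The result is a complex matrix with shape (frequency_bins, time_frames), where frequency_bins = 1 + n_fft/2 for a real-valued signal. Taking the absolute value gives the magnitude spectrogram; squaring the magnitude gives the power spectrogram.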
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Load the file as mono, resampled to 22,050 Hz
y, sr = librosa.load("audio.wav", sr=22050)
# Magnitude of the complex STFT, shape (1 + n_fft/2, n_frames)
S = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))
# Convert to dB relative to the loudest bin, then plot
S_db = librosa.amplitude_to_db(S, ref=np.max)
fig, ax = plt.subplots()
img = librosa.display.specshow(S_db, sr=sr, hop_length=512,
                               x_axis='time', y_axis='hz', ax=ax)
fig.colorbar(img, ax=ax, format='%+2.0f dB')
ax.set_title("Spectrogram")
fig.tight_layout()
fig.savefig("spectrogram.png")
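The four steps listed above can also be written out by hand. The following is a minimal NumPy sketch, not how librosa implements it: the helper name `stft_magnitude` is hypothetical, the input is a synthetic tone rather than a file, and no padding or centering is applied (librosa.stft centers and pads frames by default, so its frame count will differ).

```python
import numpy as np

def stft_magnitude(signal, n_fft=2048, hop_length=512):
    """Magnitude spectrogram from a hand-rolled STFT (no padding, no centering)."""
    window = np.hanning(n_fft)                         # tapering function
    n_frames = 1 + (len(signal) - n_fft) // hop_length
    columns = []
    for i in range(n_frames):
        start = i * hop_length                         # slide the window
        segment = signal[start:start + n_fft] * window # taper the segment
        columns.append(np.fft.rfft(segment))           # FFT of one windowed segment
    stft = np.stack(columns, axis=1)                   # complex, (freq_bins, time_frames)
    return np.abs(stft)

# Synthetic one-second 440 Hz tone, so the sketch needs no audio file
sr = 22050
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)

S_manual = stft_magnitude(tone)
peak_bin = int(np.argmax(S_manual.mean(axis=1)))
print(S_manual.shape, peak_bin * sr / 2048)
```

The strongest row should sit at the bin nearest 440 Hz, which is a quick sanity check that the frequency axis is laid out as expected.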
Key parameters
| Parameter | Controls | Tradeoff |
|---|---|---|
| n_fft (window size) | Frequency resolution | Larger = better freq resolution, worse time resolution |
| hop_length (step size) | Spacing between time frames | Smaller = more frames, smoother time axis, more computation; the window length still limits how sharply events are localized |
| window function | Spectral leakage | Hann balances main-lobe width and side-lobe suppression |
With n_fft=2048 at sr=22050, frequency bins are spaced ~10.8 Hz apart (22050 / 2048). With hop_length=512, consecutive frames start ~23 ms apart, while each window itself spans ~93 ms.
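The arithmetic behind these numbers is short enough to check directly. This sketch uses only the three parameter values from the text; the derived variable names are chosen here for illustration.

```python
sr = 22050
n_fft = 2048
hop_length = 512

bin_width_hz = sr / n_fft            # spacing between adjacent frequency bins
hop_ms = 1000 * hop_length / sr      # time step between consecutive frames
window_ms = 1000 * n_fft / sr        # duration covered by one analysis window
n_bins = 1 + n_fft // 2              # rows in the STFT of a real-valued signal
overlap = 1 - hop_length / n_fft     # fraction shared by neighbouring windows

print(f"{bin_width_hz:.2f} Hz per bin, {n_bins} bins")
print(f"{hop_ms:.2f} ms hop, {window_ms:.2f} ms window, {overlap:.0%} overlap")
```

Doubling n_fft halves the bin width but doubles the window duration, which is the frequency/time tradeoff in the table above.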
Types of spectrograms
Linear spectrogram
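Frequency axis is linear (Hz). Good for seeing the full spectrum, but most of the vertical space goes to high frequencies, where human pitch discrimination is coarse, while the low frequencies that carry most speech and musical pitch information are squeezed into the bottom.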
Mel spectrogram
Maps frequencies to the mel scale, which approximates human pitch perception. Low frequencies get more resolution; high frequencies are compressed.
S_mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
S_mel_db = librosa.power_to_db(S_mel, ref=np.max)
The log-scaled mel spectrogram is one of the most common inputs for audio ML models (speech recognition, music classification).
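To see why the mel scale compresses high frequencies, it helps to place points evenly in mel and convert them back to Hz. The sketch below assumes the HTK-style formula mel = 2595 * log10(1 + f / 700); librosa's default mel scale is the Slaney variant, so its exact values differ, and the helper names here are hypothetical.

```python
import math

def hz_to_mel(f_hz):
    """HTK-style Hz-to-mel conversion (an assumption; librosa defaults to Slaney)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

sr = 22050
n_points = 10
mel_max = hz_to_mel(sr / 2)  # top of the range is the Nyquist frequency

# Points evenly spaced in mel, converted back to Hz
centers_hz = [mel_to_hz(mel_max * i / (n_points - 1)) for i in range(n_points)]
gaps_hz = [b - a for a, b in zip(centers_hz, centers_hz[1:])]

for f, gap in zip(centers_hz, gaps_hz):
    print(f"{f:8.1f} Hz, next point {gap:7.1f} Hz higher")
```

The gaps grow steadily toward the top of the range: equal steps in mel cover narrow bands at low frequencies and wide bands at high ones.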
Log-frequency spectrogram
Uses a logarithmic frequency axis (like musical octaves). Each octave occupies the same visual height, making harmonic relationships visible. The Constant-Q Transform (CQT) computes this directly.
C = np.abs(librosa.cqt(y, sr=sr))
librosa.display.specshow(librosa.amplitude_to_db(C), sr=sr, x_axis='time', y_axis='cqt_hz')
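The "same height per octave" property follows from geometrically spaced bin centers. Here is a small sketch of that spacing, assuming a lowest bin near the note C1 (about 32.70 Hz), 12 bins per octave, and 84 bins in total; these are, to the best of my knowledge, librosa.cqt's defaults, but treat them as assumptions.

```python
fmin = 32.70            # assumed lowest bin, roughly the note C1
bins_per_octave = 12    # one bin per semitone
n_bins = 84             # seven octaves

# Each bin center is a fixed ratio above the previous one
freqs = [fmin * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]
ratios = [b / a for a, b in zip(freqs, freqs[1:])]

print(f"semitone ratio: {ratios[0]:.5f}")
print(f"one octave up from bin 0: {freqs[12]:.2f} Hz")
print(f"highest bin: {freqs[-1]:.1f} Hz")
```

Because the ratio between neighbouring bins is constant, every doubling of frequency spans exactly 12 rows, regardless of where on the axis it occurs.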
Decibel scaling
Raw magnitude spectrograms have a huge dynamic range, so quiet sounds are invisible next to loud ones. Converting to decibels compresses the range to something visually useful: use librosa.amplitude_to_db() for magnitude spectrograms and librosa.power_to_db() for power spectrograms (such as the default mel spectrogram output).
Rule of thumb: if your spectrogram looks mostly black with a few bright spots, you probably need dB scaling.
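The conversion itself is a logarithm plus a floor. The following sketch shows the idea for magnitude values; `magnitude_to_db` is a hypothetical helper written for illustration, and its `top_db` and `amin` defaults are assumptions rather than a copy of librosa's implementation.

```python
import numpy as np

def magnitude_to_db(S, top_db=80.0, amin=1e-5):
    """Hypothetical helper: dB relative to the peak, floored at -top_db."""
    S = np.maximum(np.asarray(S, dtype=float), amin)  # avoid log10(0)
    db = 20.0 * np.log10(S / S.max())                 # peak value maps to 0 dB
    return np.maximum(db, -top_db)                    # clip very quiet values

# Magnitudes spanning several orders of magnitude, including silence
mags = np.array([1.0, 0.1, 0.001, 0.0])
db = magnitude_to_db(mags)
print(db)
```

Each factor of 10 in magnitude becomes a step of 20 dB, so values that differ by a factor of 1000 end up only 60 dB apart on the color scale. For power values the multiplier would be 10 instead of 20.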
Reading a spectrogram
- Horizontal bright bands: Sustained tonal sounds (vowels, instrument notes, hums)
- Vertical bright streaks: Transients (drum hits, clicks, consonants like “t” and “k”)
- Rising/falling bright lines: Pitch glides (sirens, bird calls, portamento)
- Harmonic series: Evenly spaced horizontal lines above a fundamental — the signature of a musical note
- Broadband noise: Diffuse brightness across many frequencies (wind, white noise, “s” sounds)
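Two of these patterns are easy to reproduce with synthetic signals. The sketch below is an illustration under simple assumptions (a pure 1 kHz tone and a single-sample click, with a hand-rolled, unpadded STFT); the `spectrogram` helper is hypothetical.

```python
import numpy as np

sr, n_fft, hop = 22050, 2048, 512
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 1000.0 * t)  # sustained 1 kHz tone
click = np.zeros(sr)
click[sr // 2] = 1.0                         # single-sample click at 0.5 s

def spectrogram(x):
    """Unpadded magnitude STFT, shape (freq_bins, time_frames)."""
    window = np.hanning(n_fft)
    frames = [x[s:s + n_fft] * window for s in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1)).T

S_tone, S_click = spectrogram(tone), spectrogram(click)

# Tone: the strongest row is the same in every frame (a horizontal band)
tone_peak_bins = S_tone.argmax(axis=0)
# Click: only frames whose window contains the click have energy (a vertical streak)
click_active_frames = np.flatnonzero(S_click.sum(axis=0) > 1e-6)

print(np.unique(tone_peak_bins) * sr / n_fft)
print(click_active_frames * hop / sr)
```

The click shows up in several adjacent frames because overlapping windows all contain it, which is why transients look like streaks with some width rather than single-pixel lines.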
Common misconception
A spectrogram is not a waveform. A waveform shows amplitude over time (1-D). A spectrogram shows frequency content over time (2-D). The conversion is lossy in one direction: a magnitude spectrogram can always be computed from a waveform, but it discards phase, so the waveform cannot be perfectly reconstructed from it (though algorithms like Griffin-Lim can estimate a plausible phase).
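A small experiment makes the role of phase concrete. This sketch looks at a single FFT frame only, using random noise as a stand-in for audio: a signal and its time-reversed copy are different waveforms, yet their magnitude spectra match.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(2048)   # stand-in for one frame of audio
x_reversed = x[::-1]            # a different waveform with the same samples

mag_a = np.abs(np.fft.rfft(x))
mag_b = np.abs(np.fft.rfft(x_reversed))

same_magnitude = np.allclose(mag_a, mag_b)
same_waveform = np.allclose(x, x_reversed)
print(same_magnitude, same_waveform)
```

The difference between the two signals lives entirely in the phase. A full spectrogram with overlapping frames constrains the signal more than a single frame does, which is the redundancy that iterative methods such as Griffin-Lim rely on.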
How it fits with other tools
Generate spectrograms with SciPy (scipy.signal.spectrogram), Librosa, or torchaudio. Visualize with Matplotlib. Feed mel spectrograms to PyTorch/TensorFlow for classification. Combine with sounddevice for live-updating spectrograms.
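For the SciPy route, a minimal sketch might look like the following. It assumes a synthetic 440 Hz tone instead of a file and expresses the 512-sample hop through `noverlap`, since scipy.signal.spectrogram takes the overlap rather than the step. Note that it returns power spectral density by default, not magnitude.

```python
import numpy as np
from scipy import signal

sr = 22050
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440.0 * t)   # synthetic one-second tone

n_fft, hop = 2048, 512
freqs, times, Sxx = signal.spectrogram(
    x, fs=sr, window="hann", nperseg=n_fft, noverlap=n_fft - hop
)

# Frequency of the strongest row, averaged over time
peak_hz = freqs[Sxx.mean(axis=1).argmax()]
print(Sxx.shape, peak_hz)
```

The returned `freqs` and `times` arrays give the axis values in Hz and seconds, so they can be passed straight to Matplotlib's pcolormesh for plotting.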
One thing to remember: A spectrogram transforms a 1-D audio signal into a 2-D time-frequency image using the STFT — and the choice between linear, mel, and log-frequency scaling depends on whether your goal is general analysis, ML, or musical pitch work.