Python Spectrogram Analysis — Core Concepts

Learn how spectrograms are computed from the STFT, the differences between linear, mel, and log spectrograms, and how to create and interpret them with Python.

What a spectrogram is

A spectrogram is a 2-D representation of sound: time on the x-axis, frequency on the y-axis, and intensity encoded as color or brightness. It is the standard visualization for audio analysis, used in speech recognition, music information retrieval, bioacoustics, sonar, and medical diagnostics.

How it is computed: the STFT

The Short-Time Fourier Transform (STFT) slices an audio signal into overlapping windows and applies the Discrete Fourier Transform (DFT) to each window:

1. Slide a window (e.g., 2048 samples wide) across the signal
2. Multiply each window by a tapering function (Hann, Hamming) to reduce edge artifacts
3. Compute the FFT of each windowed segment
4. Collect the magnitude (or power) of each FFT into columns of a matrix

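The STFT itself is a complex matrix with shape (frequency_bins, time_frames), where frequency_bins = n_fft/2 + 1 for a real-valued signal. Taking the absolute value gives the magnitude spectrogram; squaring the magnitude gives the power spectrogram.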
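The steps above can be sketched in plain NumPy. This is a minimal illustration, not how librosa implements it: the helper name stft_magnitude is hypothetical, the signal is a synthetic 440 Hz sine, and no padding is applied, so only full windows are used. librosa.stft pads the signal by default (center=True), so its frame count will differ.

```python
import numpy as np

def stft_magnitude(y, n_fft=2048, hop_length=512):
    """Magnitude spectrogram: slide, taper, FFT, collect."""
    window = np.hanning(n_fft)  # symmetric Hann taper
    n_frames = 1 + (len(y) - n_fft) // hop_length  # full windows only, no padding
    columns = []
    for i in range(n_frames):
        start = i * hop_length
        segment = y[start:start + n_fft] * window  # slide the window and taper it
        spectrum = np.fft.rfft(segment)            # complex, n_fft // 2 + 1 bins
        columns.append(np.abs(spectrum))           # keep only the magnitude
    return np.stack(columns, axis=1)               # (frequency_bins, time_frames)

# One second of a 440 Hz sine at 22050 Hz as a stand-in signal
sr = 22050
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 440 * t)

S = stft_magnitude(y)
print(S.shape)  # (1025, 40)
```

Each column of S is the spectrum of one windowed segment, which is why the matrix is indexed as (frequency, time).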

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Load as mono, resampled to 22050 Hz
y, sr = librosa.load("audio.wav", sr=22050)
# Complex STFT -> magnitude spectrogram, shape (1 + n_fft/2, n_frames)
S = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))

fig, ax = plt.subplots()
# Convert to dB relative to the peak so quiet detail stays visible
img = librosa.display.specshow(librosa.amplitude_to_db(S, ref=np.max),
                                sr=sr, hop_length=512, x_axis='time', y_axis='hz', ax=ax)
fig.colorbar(img, ax=ax, format='%+2.0f dB')
ax.set_title("Spectrogram")
fig.tight_layout()
fig.savefig("spectrogram.png")

Key parameters

Parameter	Controls	Tradeoff
n_fft (window size)	Frequency resolution	Larger = better freq resolution, worse time resolution
hop_length (step size)	Time resolution	Smaller = more frames, smoother time axis, more computation
window function	Spectral leakage	Hann balances main-lobe width and side-lobe suppression

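With n_fft=2048 at sr=22050, adjacent frequency bins are ~10.8 Hz apart (sr / n_fft). With hop_length=512, successive frames start ~23 ms apart (hop_length / sr), while each frame analyzes ~93 ms of audio (n_fft / sr), so neighboring frames overlap by 75%.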
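The resolution figures for these settings follow from three divisions. A short sketch of the arithmetic, using the same parameter values as the example above:

```python
sr = 22050
n_fft = 2048
hop_length = 512

bin_width_hz = sr / n_fft          # spacing between adjacent frequency bins
hop_ms = 1000 * hop_length / sr    # spacing between the starts of adjacent frames
window_ms = 1000 * n_fft / sr      # audio analyzed by a single frame
n_bins = n_fft // 2 + 1            # bins from 0 Hz up to the Nyquist frequency

print(f"{bin_width_hz:.2f} Hz per bin, {n_bins} bins")  # 10.77 Hz per bin, 1025 bins
print(f"{hop_ms:.1f} ms hop, {window_ms:.1f} ms window")  # 23.2 ms hop, 92.9 ms window
```

Doubling n_fft halves bin_width_hz but doubles window_ms, which is the time-frequency tradeoff listed in the table.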

Types of spectrograms

Linear spectrogram

Frequency axis is linear (Hz). Good for seeing the full spectrum, but it spends most of the visual space on high frequencies, where human hearing distinguishes pitch less finely.

Mel spectrogram

Maps frequencies to the mel scale, which approximates human pitch perception. Low frequencies get more resolution; high frequencies are compressed.

S_mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
S_mel_db = librosa.power_to_db(S_mel, ref=np.max)

This is the most common input for audio ML models (speech recognition, music classification).
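To see why low frequencies get more resolution, it can help to compute where mel band centers fall in Hz. The sketch below assumes the HTK-style formula mel = 2595 * log10(1 + f / 700); librosa uses a different (Slaney) variant unless htk=True is passed, so the exact numbers will not match its filterbank. The helper names are illustrative.

```python
import numpy as np

def hz_to_mel(f_hz):
    """HTK-style mel formula (an assumption; not librosa's default variant)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

sr = 22050
n_mels = 128

# Band centers equally spaced on the mel axis between 0 Hz and Nyquist
mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels)
centers_hz = mel_to_hz(mel_points)

# Equal mel steps translate to growing steps in Hz
spacing = np.diff(centers_hz)
print(f"lowest spacing: {spacing[0]:.1f} Hz, highest spacing: {spacing[-1]:.1f} Hz")
```

The spacing between neighboring centers grows steadily with frequency, so many bands cover the low end and few cover the high end.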

Log-frequency spectrogram

Uses a logarithmic frequency axis (like musical octaves). Each octave occupies the same visual height, making harmonic relationships visible. The Constant-Q Transform (CQT) computes this directly.

C = np.abs(librosa.cqt(y, sr=sr))
librosa.display.specshow(librosa.amplitude_to_db(C), sr=sr, x_axis='time', y_axis='cqt_hz')
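The "each octave occupies the same height" property comes from geometrically spaced bin centers. A minimal sketch, assuming a starting frequency near the note C1, 12 bins per octave, and 84 bins; these mirror common CQT settings but are assumptions here, not values stated in this article:

```python
import numpy as np

fmin = 32.70          # roughly the note C1 (assumed starting frequency)
bins_per_octave = 12  # one bin per semitone
n_bins = 84           # seven octaves

k = np.arange(n_bins)
freqs = fmin * 2.0 ** (k / bins_per_octave)  # geometric spacing

# Stepping up by bins_per_octave bins doubles the frequency
print(freqs[0], freqs[12], freqs[24])

# The ratio between neighboring bins is constant (one semitone)
ratios = freqs[1:] / freqs[:-1]
print(ratios.min(), ratios.max())
```

Because the ratio between neighboring bins is constant, a note and its octave are always the same number of rows apart, wherever they sit on the axis.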

Decibel scaling

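Raw magnitude spectrograms have a huge dynamic range, so quiet sounds are invisible next to loud ones. Converting to decibels compresses the range to something visually useful: use librosa.amplitude_to_db() for magnitude spectrograms (such as np.abs of an STFT) and librosa.power_to_db() for power spectrograms (such as the output of librosa.feature.melspectrogram).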

Rule of thumb: if your spectrogram looks mostly black with a few bright spots, you probably need dB scaling.
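The conversion itself is short. The sketch below is a simplified stand-in for what a function like librosa.amplitude_to_db does, not its actual implementation: the helper name to_db and the amin and top_db values are illustrative assumptions.

```python
import numpy as np

def to_db(magnitude, top_db=80.0, amin=1e-5):
    """Magnitudes to dB relative to the loudest value, floored at -top_db."""
    magnitude = np.maximum(magnitude, amin)          # avoid taking the log of zero
    db = 20.0 * np.log10(magnitude / magnitude.max())
    return np.maximum(db, -top_db)                   # limit the displayed range

S = np.array([1.0, 0.1, 0.001, 0.0])
S_db = to_db(S)
print(S_db)  # approximately 0, -20, -60, -80
```

A value 1000 times quieter than the peak sits at -60 dB instead of being 0.1% of the color range, which is why quiet detail becomes visible. For power values, the factor is 10 instead of 20.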

Reading a spectrogram

Horizontal bright bands: Sustained tonal sounds (vowels, instrument notes, hums)
Vertical bright streaks: Transients (drum hits, clicks, consonants like “t” and “k”)
Rising/falling bright lines: Pitch glides (sirens, bird calls, portamento)
Harmonic series: Evenly spaced horizontal lines above a fundamental — the signature of a musical note
Broadband noise: Diffuse brightness across many frequencies (wind, white noise, “s” sounds)
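Two of these patterns can be checked numerically on synthetic signals. The sketch below uses a small hand-rolled spectrogram helper (hypothetical, NumPy only) with an 8000 Hz sample rate chosen so that 1000 Hz lands exactly on a bin; it compares a sustained tone with a single-sample click.

```python
import numpy as np

def magnitude_spectrogram(y, n_fft=512, hop=128):
    """Hann-windowed magnitude spectrogram, shape (freq_bins, frames)."""
    window = np.hanning(n_fft)
    frames = [y[i:i + n_fft] * window for i in range(0, len(y) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1)).T

sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)  # sustained tone
click = np.zeros(sr)
click[sr // 2] = 1.0                 # single-sample transient

S_tone = magnitude_spectrogram(tone)
S_click = magnitude_spectrogram(click)

# Tone: the same frequency bin dominates every frame (a horizontal band)
peak_bins = S_tone.argmax(axis=0)
print(np.unique(peak_bins) * sr / 512)  # a single frequency, 1000 Hz

# Click: only frames whose window overlaps the impulse carry energy,
# and within those frames every bin has the same magnitude (a vertical streak)
active_frames = np.flatnonzero(S_click.sum(axis=0) > 1e-9)
print(len(active_frames), "of", S_click.shape[1], "frames are non-silent")
```

The tone is narrow in frequency and long in time; the click is the opposite.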

Common misconception

A spectrogram is not a waveform. A waveform shows amplitude over time (1-D). A spectrogram shows frequency content over time (2-D). The conversion is lossy in one direction: a magnitude spectrogram can always be computed from the waveform, but it discards phase, so the waveform cannot be perfectly reconstructed from it (though algorithms like Griffin-Lim can approximate it by estimating the missing phase).
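A small experiment shows what discarding phase costs. The sketch below works on a single DFT frame rather than a full overlapping STFT, and uses random noise as a stand-in signal: a real signal and its time-reversed copy have identical magnitude spectra even though the waveforms differ.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)  # any real-valued signal
x_rev = x[::-1]                # the same samples played backwards

mag = np.abs(np.fft.rfft(x))
mag_rev = np.abs(np.fft.rfft(x_rev))

print(np.allclose(mag, mag_rev))  # True: identical magnitude spectra
print(np.allclose(x, x_rev))      # False: different waveforms
```

Magnitude alone cannot tell these two frames apart; only the phase can. In a full STFT, overlapping frames constrain the phase further, which is the redundancy that iterative methods such as Griffin-Lim rely on.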

How it fits with other tools

Generate spectrograms with SciPy (scipy.signal.spectrogram), Librosa, or torchaudio. Visualize with Matplotlib. Feed mel spectrograms to PyTorch/TensorFlow for classification. Combine with sounddevice for live-updating spectrograms.
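For comparison with the Librosa example, here is a hedged sketch of the SciPy route. It uses a synthetic 440 Hz sine in place of a loaded file, and maps the earlier parameters onto SciPy's names: nperseg plays the role of n_fft, and noverlap is the window size minus the hop.

```python
import numpy as np
from scipy import signal

sr = 22050
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 440 * t)  # synthetic stand-in for loaded audio

# nperseg ~ n_fft; noverlap = nperseg - hop_length
f, frame_times, Sxx = signal.spectrogram(
    y, fs=sr, window="hann", nperseg=2048, noverlap=2048 - 512, mode="magnitude"
)

# f holds bin frequencies in Hz, frame_times holds frame centers in seconds
peak_hz = f[Sxx.mean(axis=1).argmax()]
print(Sxx.shape, peak_hz)
```

SciPy returns the frequency and time axes alongside the matrix, so the result can be passed directly to Matplotlib's pcolormesh without a display helper.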

One thing to remember: A spectrogram transforms a 1-D audio signal into a 2-D time-frequency image using the STFT — and the choice between linear, mel, and log-frequency scaling depends on whether your goal is general analysis, ML, or musical pitch work.
