Python Noise Reduction — Deep Dive
Classical spectral subtraction
The simplest denoising algorithm subtracts the estimated noise spectrum from the signal spectrum:
import numpy as np
import librosa

def spectral_subtract(y, sr, noise_clip, n_fft=2048, hop_length=512, alpha=2.0, beta=0.01):
    """
    Spectral subtraction with oversubtraction factor alpha
    and spectral floor beta.
    """
    # STFT of signal and noise
    S = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
    N = librosa.stft(noise_clip, n_fft=n_fft, hop_length=hop_length)
    S_mag = np.abs(S)
    S_phase = np.angle(S)
    # Average noise magnitude per frequency bin
    N_mag_mean = np.mean(np.abs(N), axis=1, keepdims=True)
    # Subtract with oversubtraction and floor
    clean_mag = S_mag - alpha * N_mag_mean
    clean_mag = np.maximum(clean_mag, beta * S_mag)  # spectral floor
    # Reconstruct with the noisy phase; length=len(y) keeps the output as long as the input
    clean = clean_mag * np.exp(1j * S_phase)
    return librosa.istft(clean, hop_length=hop_length, length=len(y))
Alpha (oversubtraction factor) accounts for noise estimation errors — typical values are 1.0–4.0. Higher values remove more noise but increase distortion. Beta (spectral floor) prevents the result from going to zero, which would create silent gaps that sound unnatural.
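To see how the two parameters interact, here is a toy calculation on three made-up magnitude bins (the values and the helper name `subtract_with_floor` are illustrative assumptions, not taken from real audio). It applies the same subtract-then-floor rule as `spectral_subtract`.

```python
import numpy as np

def subtract_with_floor(S_mag, N_mag, alpha=2.0, beta=0.01):
    """Apply oversubtraction, then clamp each bin to the spectral floor."""
    return np.maximum(S_mag - alpha * N_mag, beta * S_mag)

S_mag = np.array([1.0, 0.5, 0.2])  # hypothetical noisy magnitudes
N_mag = 0.2                        # hypothetical mean noise magnitude

mild = subtract_with_floor(S_mag, N_mag, alpha=1.0)
strong = subtract_with_floor(S_mag, N_mag, alpha=2.0)

print("alpha=1.0:", mild)
print("alpha=2.0:", strong)
```

With the larger alpha, the strong bins lose more energy. The weakest bin ends up at the floor `beta * S_mag` in both cases rather than at zero or a negative value.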
Limitations
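Spectral subtraction reuses the phase of the noisy signal for reconstruction. Because that phase is itself corrupted by noise, the reconstructed signal contains artifacts, especially at low SNR. Isolated spectral peaks that survive the subtraction are also heard as short tonal bursts, known as musical noise. More advanced methods reduce these artifacts.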
Wiener filtering
Wiener filtering estimates the optimal linear filter that minimizes mean squared error between the clean and estimated signals:
def wiener_filter(y, sr, noise_clip, n_fft=2048, hop_length=512):
    S = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
    N = librosa.stft(noise_clip, n_fft=n_fft, hop_length=hop_length)
    S_power = np.abs(S) ** 2
    # Average noise power per frequency bin
    N_power_mean = np.mean(np.abs(N) ** 2, axis=1, keepdims=True)
    # Wiener gain: SNR / (SNR + 1), with SNR estimated as (S_power - N_power) / N_power
    gain = np.maximum(S_power - N_power_mean, 0) / (S_power + 1e-10)
    # The real-valued gain scales the magnitude and leaves the noisy phase unchanged
    clean = gain * S
    return librosa.istft(clean, hop_length=hop_length, length=len(y))
Wiener filtering produces fewer musical noise artifacts than spectral subtraction because the gain function is smooth. It is the basis for many professional denoising tools.
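The code comment above writes the gain as SNR / (SNR + 1), while the code itself computes (S_power - N_power) / S_power. The short check below, on made-up power values, illustrates that the two forms agree when the SNR is estimated as noise-subtracted power divided by noise power. All variable names here are illustrative.

```python
import numpy as np

noisy_power = np.array([0.5, 1.0, 2.0, 11.0])  # hypothetical |S|^2 values
noise_power = 1.0                              # hypothetical noise power

# Estimated clean power, clamped at zero as in wiener_filter
clean_power = np.maximum(noisy_power - noise_power, 0.0)
snr = clean_power / noise_power

gain_from_snr = snr / (snr + 1.0)
gain_direct = clean_power / (noisy_power + 1e-10)

print(gain_from_snr)
print(gain_direct)
```

Bins at or below the noise floor get a gain of zero, and the gain approaches 1 as the estimated SNR grows.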
Iterative Wiener filtering
Run the Wiener filter multiple times, using the previous output to refine the signal power estimate:
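def iterative_wiener(y, sr, noise_clip, iterations=3, n_fft=2048, hop_length=512):
    Y = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
    N = librosa.stft(noise_clip, n_fft=n_fft, hop_length=hop_length)
    N_power = np.mean(np.abs(N) ** 2, axis=1, keepdims=True)
    # Initial clean-power estimate: power spectral subtraction
    X_power = np.maximum(np.abs(Y) ** 2 - N_power, 0)
    for _ in range(iterations):
        gain = X_power / (X_power + N_power + 1e-10)
        X_power = np.abs(gain * Y) ** 2  # refine the clean-power estimate from the last output
    return librosa.istft(gain * Y, hop_length=hop_length, length=len(y))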
Convergence typically occurs within 2–3 iterations.
Adaptive noise estimation
When no clean noise sample is available, estimate the noise floor adaptively:
Minimum statistics
Track the minimum power in each frequency band over a sliding window. The noise floor is approximately the minimum power, since speech and music are intermittent:
def minimum_statistics(S_power, window_frames=50):
    n_freq, n_time = S_power.shape
    noise_est = np.zeros_like(S_power)
    for t in range(n_time):
        start = max(0, t - window_frames)
        noise_est[:, t] = np.min(S_power[:, start:t+1], axis=1) * 1.5  # bias correction
    return noise_est
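The loop above makes one pass per frame, which can be slow for long recordings. As a sketch, the same trailing-window minimum can be computed without a Python loop using NumPy's `sliding_window_view` (NumPy 1.20 or later). The function name `minimum_statistics_vectorized` and the `bias` parameter are illustrative assumptions, and the 1.5 factor is carried over from the loop version unchanged.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def minimum_statistics_vectorized(S_power, window_frames=50, bias=1.5):
    """Trailing-window minimum per frequency bin, scaled by a bias factor.

    S_power: (n_freq, n_time) power spectrogram.
    """
    n_freq = S_power.shape[0]
    # Pad the past with +inf so early frames only see frames that exist
    pad = np.full((n_freq, window_frames), np.inf)
    padded = np.concatenate([pad, S_power], axis=1)
    # Shape (n_freq, n_time, window_frames + 1): each window ends at frame t
    windows = sliding_window_view(padded, window_frames + 1, axis=1)
    return windows.min(axis=-1) * bias
```

This trades memory-friendly strided views for speed. The result should match the loop version frame for frame.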
MMSE-based estimation
The Minimum Mean Square Error (MMSE) approach estimates a speech presence probability for each time-frequency bin and updates the noise estimate mainly where speech is likely absent. Methods of this kind are used in professional voice processing, such as hearing aids and telecommunications.
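A heavily simplified sketch of this idea follows. It is not a reference implementation of any particular published estimator: the function name `spp_noise_tracker`, the fixed SNR assumed under speech presence (`xi_h1_db`), the equal priors for speech presence and absence, the smoothing constant `alpha_noise`, and the noise-only start of the clip are all assumptions chosen for illustration.

```python
import numpy as np

def spp_noise_tracker(S_power, init_frames=5, xi_h1_db=15.0, alpha_noise=0.8):
    """Track noise power per bin using a soft speech-presence decision.

    S_power: (n_freq, n_time) power spectrogram.
    """
    xi = 10.0 ** (xi_h1_db / 10.0)                     # assumed SNR when speech is present
    noise = np.mean(S_power[:, :init_frames], axis=1)  # assumes the clip starts with noise only
    noise_est = np.empty_like(S_power, dtype=float)

    for t in range(S_power.shape[1]):
        frame = S_power[:, t]
        post_snr = frame / (noise + 1e-12)
        # Posterior speech presence probability with equal priors
        exponent = np.minimum(post_snr * xi / (1.0 + xi), 50.0)  # clipped to avoid underflow
        p_speech = 1.0 / (1.0 + (1.0 + xi) * np.exp(-exponent))
        # Soft update: keep the old estimate where speech is likely,
        # trust the current frame where it is not
        noise_now = p_speech * noise + (1.0 - p_speech) * frame
        noise = alpha_noise * noise + (1.0 - alpha_noise) * noise_now
        noise_est[:, t] = noise

    return noise_est
```

The returned array has the same shape as `S_power` and could be used in place of a fixed noise power in the Wiener gain. A known weakness of this kind of tracker is that the estimate can stall if the probability stays close to 1 for a long time, which practical systems usually guard against.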
Deep learning denoising
Neural network denoisers learn the mapping from noisy spectrograms to clean spectrograms (or directly from noisy waveforms to clean waveforms).
RNNoise
RNNoise (by Jean-Marc Valin of Opus codec fame) is a lightweight GRU-based model designed for real-time speech denoising at 48 kHz. It runs in ~5% of a single CPU core.
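# Using the rnnoise-python wrapper. Class and method names differ between
# RNNoise wrappers, so check the documentation of the one you install.
import rnnoise
denoiser = rnnoise.RNNoise()
# Process 10 ms frames (480 samples at 48 kHz), one frame per call;
# noisy_frame is one such frame taken from your input stream
clean_frame = denoiser.process_frame(noisy_frame)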
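Frame-based denoisers like this need the input cut into fixed-size frames. The sketch below shows one way to do that with NumPy only. The helper names `split_into_frames` and `process_stream` are hypothetical, and `process_frame` stands for any callable that takes one frame and returns a cleaned frame of the same length, which is an assumption about the wrapper you use.

```python
import numpy as np

FRAME_SIZE = 480  # 10 ms at 48 kHz

def split_into_frames(audio, frame_size=FRAME_SIZE):
    """Zero-pad a 1-D signal to a whole number of frames and reshape it."""
    audio = np.asarray(audio, dtype=np.float32)
    n_frames = -(-len(audio) // frame_size)  # ceiling division
    padded = np.zeros(n_frames * frame_size, dtype=np.float32)
    padded[:len(audio)] = audio
    return padded.reshape(n_frames, frame_size)

def process_stream(audio, process_frame, frame_size=FRAME_SIZE):
    """Run a per-frame denoiser over a signal and trim the padding."""
    frames = split_into_frames(audio, frame_size)
    if frames.shape[0] == 0:
        return np.zeros(0, dtype=np.float32)
    cleaned = np.concatenate([process_frame(frame) for frame in frames])
    return cleaned[:len(audio)]
```

Because recurrent denoisers keep state between calls, frames should be passed in order through a single denoiser instance.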
DTLN (Dual-signal Transformation LSTM Network)
DTLN stacks two LSTM-based stages, one operating on STFT magnitudes and one on a learned time-domain feature representation, and reaches high quality at low latency:
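import tensorflow as tf
# "dtln_model" is a placeholder path to a DTLN model exported as a SavedModel
model = tf.saved_model.load("dtln_model")
# Process blocks of audio; the expected block shape and dtype depend on how the model was exported
clean = model.signatures["serving_default"](tf.constant(noisy_block))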
Training a custom denoiser
import torch
import torch.nn as nn

class SimpleDenoiser(nn.Module):
    def __init__(self, n_fft=512):
        super().__init__()
        n_freq = n_fft // 2 + 1
        self.lstm = nn.LSTM(n_freq, 256, num_layers=2, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(512, n_freq)  # 512 = 2 directions x 256 hidden units
        self.sigmoid = nn.Sigmoid()

    def forward(self, noisy_mag):
        # noisy_mag: (batch, time, freq)
        lstm_out, _ = self.lstm(noisy_mag)
        mask = self.sigmoid(self.fc(lstm_out))
        return noisy_mag * mask  # ratio mask

# Training data: pairs of (noisy_spectrogram, clean_spectrogram)
# Loss: MSE between predicted and clean magnitude
Training data is generated by mixing clean speech (LibriSpeech, VCTK) with noise (UrbanSound8K, DEMAND dataset) at various SNR levels.
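The mixing step can be sketched as follows: scale the noise so that the clean-to-noise power ratio hits a target SNR in dB, then add it to the clean signal. The function name `mix_at_snr` and its parameters are illustrative assumptions, and loading audio from the datasets named above is left out.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db, rng=None):
    """Mix a random segment of `noise` into `clean` at the requested SNR (dB)."""
    rng = np.random.default_rng() if rng is None else rng
    # Repeat the noise if it is shorter than the clean signal
    if len(noise) < len(clean):
        reps = -(-len(clean) // len(noise))  # ceiling division
        noise = np.tile(noise, reps)
    # Pick a random segment of the right length
    start = rng.integers(0, len(noise) - len(clean) + 1)
    noise = noise[start:start + len(clean)]
    # Scale the noise so that clean_power / scaled_noise_power matches the target
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise
```

Drawing `snr_db` from a range for each training example (for instance with `rng.uniform`) exposes the model to both mild and severe noise.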
Real-time pipeline
import sounddevice as sd
import numpy as np
from scipy.signal import stft, istft

class RealtimeDenoiser:
    def __init__(self, sr=16000, n_fft=512, hop=256):
        self.sr = sr
        self.n_fft = n_fft
        self.hop = hop
        # SciPy expects the overlap between segments, not the hop size
        self.noverlap = n_fft - hop
        self.noise_est = None
        self.calibrated = False

    def calibrate(self, noise_seconds=2.0):
        """Record ambient noise for calibration."""
        print("Recording background noise...")
        noise = sd.rec(int(noise_seconds * self.sr), samplerate=self.sr,
                       channels=1, dtype='float32')
        sd.wait()
        _, _, Sxx = stft(noise.flatten(), fs=self.sr, nperseg=self.n_fft, noverlap=self.noverlap)
        # Average noise power per frequency bin
        self.noise_est = np.mean(np.abs(Sxx) ** 2, axis=1, keepdims=True)
        self.calibrated = True
        print("Calibration complete.")

    def process_block(self, block):
        """Denoise one mono block given as a 1-D array."""
        if not self.calibrated:
            return block
        _, _, S = stft(block, fs=self.sr, nperseg=self.n_fft, noverlap=self.noverlap)
        S_power = np.abs(S) ** 2
        # Wiener gain from the calibrated noise power
        gain = np.maximum(S_power - self.noise_est, 0) / (S_power + 1e-10)
        clean_S = gain * S
        _, clean = istft(clean_S, fs=self.sr, nperseg=self.n_fft, noverlap=self.noverlap)
        return clean[:len(block)]
Latency budget
| Component | Latency |
|---|---|
| Input buffer (512 samples @ 16kHz) | 32 ms |
| FFT computation | < 1 ms |
| Wiener gain application | < 1 ms |
| IFFT + overlap-add | < 1 ms |
| Output buffer | 32 ms |
| Total | ~66 ms |
For lower latency, reduce FFT size (at the cost of frequency resolution) or use time-domain neural models.
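The buffer rows in the table follow from block size divided by sample rate. A small sketch of that arithmetic is shown below. The helper names are hypothetical, and `compute_ms` is an assumed allowance for the FFT, gain, and IFFT steps, each of which the table lists only as under 1 ms.

```python
def buffer_latency_ms(n_samples, sr):
    """Time needed to fill a buffer of n_samples at sample rate sr, in ms."""
    return 1000.0 * n_samples / sr

def pipeline_latency_ms(block_size, sr, compute_ms=0.0):
    """Input buffering plus processing plus output buffering."""
    return 2 * buffer_latency_ms(block_size, sr) + compute_ms

for block_size in (512, 256, 128):
    total = pipeline_latency_ms(block_size, 16000, compute_ms=2.0)
    print(f"{block_size} samples: {total:.0f} ms")
```

Halving the block size halves the buffering delay, but it also halves the number of frequency bins available to the filter.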
Quality metrics
| Metric | Measures | Tool or formula |
|---|---|---|
| SNR improvement | How much noise was reduced (dB) | Output SNR minus input SNR, each computed as 10 * log10(signal_power / noise_power) |
| PESQ | Perceptual speech quality (1–4.5) | pypesq library |
| STOI | Short-time objective intelligibility (0–1) | pystoi library |
| SI-SDR | Scale-invariant signal-to-distortion ratio | torchmetrics.audio |
Always evaluate with both objective metrics and listening tests — metrics can miss artifacts that ears catch.
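For quick experiments, the two signal-level metrics from the table can be sketched in plain NumPy when a clean reference is available. The function names are illustrative, and removing the mean before computing SI-SDR is a common convention assumed here, not the only one. Library implementations may differ in such details.

```python
import numpy as np

def snr_db(reference, estimate):
    """SNR of `estimate` relative to a clean `reference`, in dB."""
    noise = estimate - reference
    return 10 * np.log10(np.sum(reference ** 2) / (np.sum(noise ** 2) + 1e-12))

def snr_improvement_db(clean, noisy, denoised):
    """Output SNR minus input SNR, in dB."""
    return snr_db(clean, denoised) - snr_db(clean, noisy)

def si_sdr_db(reference, estimate):
    """Scale-invariant signal-to-distortion ratio, in dB."""
    reference = reference - np.mean(reference)
    estimate = estimate - np.mean(estimate)
    # Project the estimate onto the reference to find the best-matching scale
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + 1e-12)
    target = scale * reference
    residual = estimate - target
    return 10 * np.log10(np.sum(target ** 2) / (np.sum(residual ** 2) + 1e-12))
```

Unlike plain SNR, SI-SDR does not change when the estimate is rescaled, which matters when a denoiser alters the overall level.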
Tradeoffs
| Method | Quality | Latency | Complexity | Best for |
|---|---|---|---|---|
| Spectral subtraction | Fair | Low | Simple | Quick prototypes |
| Wiener filter | Good | Low | Moderate | General-purpose |
| noisereduce library | Good | Offline | Simple API | Batch processing |
| RNNoise | Very good (speech) | Very low | Pre-trained | Real-time voice |
| DTLN | Excellent | Low | GPU helpful | Production speech apps |
| Custom neural model | Best | Medium | High | Domain-specific noise |
One thing to remember: Noise reduction spans a spectrum from simple spectral subtraction (fast, artifact-prone) to deep-learning models (high-quality, compute-intensive) — choose based on your latency requirements, noise type, and whether you need real-time processing or batch cleanup.