Python Noise Reduction — Deep Dive

Classical spectral subtraction

Spectral subtraction, one of the simplest denoising algorithms, subtracts an estimated noise magnitude spectrum from the magnitude spectrum of the noisy signal:

import numpy as np
import librosa

def spectral_subtract(y, sr, noise_clip, n_fft=2048, hop_length=512, alpha=2.0, beta=0.01):
    """
    Spectral subtraction with oversubtraction factor alpha
    and spectral floor beta.
    """
    # STFT of signal and noise
    S = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
    N = librosa.stft(noise_clip, n_fft=n_fft, hop_length=hop_length)
    
    S_mag = np.abs(S)
    S_phase = np.angle(S)
    N_mag_mean = np.mean(np.abs(N), axis=1, keepdims=True)
    
    # Subtract with oversubtraction and floor
    clean_mag = S_mag - alpha * N_mag_mean
    clean_mag = np.maximum(clean_mag, beta * S_mag)  # spectral floor
    
    # Reconstruct
    clean = clean_mag * np.exp(1j * S_phase)
    # Pass length so the output has the same number of samples as the input
    return librosa.istft(clean, hop_length=hop_length, length=len(y))

Alpha (the oversubtraction factor) compensates for errors in the noise estimate; typical values are 1.0–4.0. Higher values remove more noise but increase distortion. Beta (the spectral floor) keeps every bin at no less than a small fraction of its original magnitude. Without it, bins would drop to zero, leaving gaps and isolated tonal residue ("musical noise") that sound unnatural.
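
To make the roles of alpha and beta concrete, here is a minimal sketch that applies the same subtraction rule to a single frame of magnitudes. The helper name subtract_frame and the toy values are illustrative assumptions, not part of the function above.

```python
import numpy as np

def subtract_frame(sig_mag, noise_mag, alpha=2.0, beta=0.01):
    """Apply oversubtraction and a spectral floor to one frame of magnitudes."""
    reduced = sig_mag - alpha * noise_mag
    # Bins that would fall below beta * sig_mag are held at that floor
    return np.maximum(reduced, beta * sig_mag)

sig_mag = np.array([10.0, 1.0, 0.5])    # one strong bin, two noise-dominated bins
noise_mag = np.array([0.5, 0.5, 0.5])   # flat noise estimate (toy values)

mild = subtract_frame(sig_mag, noise_mag, alpha=1.0)
strong = subtract_frame(sig_mag, noise_mag, alpha=2.0)
print(mild, strong)
```

With alpha=1.0 the strong bin keeps 95% of its magnitude. With alpha=2.0 it keeps 90%, and both weak bins end up at the floor.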

Limitations

Spectral subtraction modifies only the magnitude and reuses the noisy phase for reconstruction. Because that phase is itself corrupted by noise, the reconstructed signal has audible artifacts, especially at low SNR. More advanced methods address this.

Wiener filtering

Wiener filtering estimates the optimal linear filter that minimizes mean squared error between the clean and estimated signals:

def wiener_filter(y, sr, noise_clip, n_fft=2048, hop_length=512):
    S = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
    N = librosa.stft(noise_clip, n_fft=n_fft, hop_length=hop_length)
    
    S_power = np.abs(S) ** 2
    N_power_mean = np.mean(np.abs(N) ** 2, axis=1, keepdims=True)
    
    # Wiener gain: SNR / (SNR + 1)
    gain = np.maximum(S_power - N_power_mean, 0) / (S_power + 1e-10)
    
    clean = gain * S
    # Pass length so the output has the same number of samples as the input
    return librosa.istft(clean, hop_length=hop_length, length=len(y))

Wiener filtering tends to produce fewer musical noise artifacts than spectral subtraction because its gain varies smoothly with the estimated SNR of each bin. Variants of it are the basis for many professional denoising tools.
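
The code comment "SNR / (SNR + 1)" and the expression actually computed in wiener_filter describe the same gain when the clean power is estimated by power subtraction. The sketch below checks this numerically on made-up power values; all variable names are illustrative.

```python
import numpy as np

noisy_power = np.array([0.5, 1.0, 2.0, 10.0, 100.0])  # |Y|^2 per bin (toy values)
noise_power = 1.0                                      # estimated noise power

# Estimate clean power by power subtraction, clipped at zero
clean_power = np.maximum(noisy_power - noise_power, 0.0)

# Textbook form: gain = SNR / (SNR + 1), with SNR = clean_power / noise_power
snr = clean_power / noise_power
gain_textbook = snr / (snr + 1.0)

# Form used in wiener_filter: clipped difference divided by the noisy power
gain_code = clean_power / (noisy_power + 1e-10)

print(gain_textbook)
print(gain_code)
```

Both forms give gains of 0, 0, 0.5, 0.9 and 0.99 for these bins: bins at or below the noise estimate are muted, and bins far above it pass almost unchanged.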

Iterative Wiener filtering

Run the Wiener filter multiple times, using the previous output to refine the signal power estimate:

def iterative_wiener(y, sr, noise_clip, iterations=3, **kwargs):
    clean = y.copy()
    for _ in range(iterations):
        clean = wiener_filter(clean, sr, noise_clip, **kwargs)
    return clean

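In practice 2–3 iterations are usually enough. Because every pass subtracts the same noise estimate from an already cleaned signal, additional passes mostly add attenuation and can start to remove the signal itself.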
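
How quickly repeated passes build up attenuation can be seen by following a single time-frequency bin. This is a simplified sketch: it assumes each pass sees exactly the power left by the previous one and ignores STFT resynthesis effects. cumulative_gain is a hypothetical helper, not part of the functions above.

```python
def cumulative_gain(noisy_power, noise_power, iterations=3, eps=1e-10):
    """Overall amplitude gain on one bin after each repeated Wiener pass."""
    power = float(noisy_power)
    total = 1.0
    history = []
    for _ in range(iterations):
        # Same gain rule as wiener_filter, applied to the current power
        gain = max(power - noise_power, 0.0) / (power + eps)
        total *= gain
        power = (gain ** 2) * power   # power remaining after this pass
        history.append(total)
    return history

weak_bin = cumulative_gain(4.0, 1.0)      # bin about 6 dB above the noise estimate
strong_bin = cumulative_gain(100.0, 1.0)  # bin about 20 dB above the noise estimate
print(weak_bin)
print(strong_bin)
```

Under these assumptions the weak bin is fully muted by the third pass, while the strong bin keeps more than 96% of its amplitude. This is one reason to keep the iteration count small.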

Adaptive noise estimation

When no clean noise sample is available, estimate the noise floor adaptively:

Minimum statistics

Track the minimum power in each frequency band over a sliding window. The noise floor is approximately the minimum power, since speech and music are intermittent:

def minimum_statistics(S_power, window_frames=50):
    n_freq, n_time = S_power.shape
    noise_est = np.zeros_like(S_power)
    
    for t in range(n_time):
        start = max(0, t - window_frames)
        noise_est[:, t] = np.min(S_power[:, start:t+1], axis=1) * 1.5  # bias correction
    
    return noise_est
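
The per-frame loop above can be slow for long recordings. As a sketch, the same sliding minimum can be computed without a Python loop using NumPy's sliding_window_view (available in NumPy 1.20 and later). The function name, the bias parameter, and the assumption that S_power is a floating-point array are mine.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def minimum_statistics_vectorized(S_power, window_frames=50, bias=1.5):
    """Sliding minimum over the current frame and up to window_frames past frames."""
    n_freq, n_time = S_power.shape
    # Pad the past with +inf so early frames only see frames that exist
    pad = np.full((n_freq, window_frames), np.inf, dtype=S_power.dtype)
    padded = np.concatenate([pad, S_power], axis=1)
    # windows has shape (n_freq, n_time, window_frames + 1)
    windows = sliding_window_view(padded, window_frames + 1, axis=1)
    return windows.min(axis=-1) * bias

# Toy spectrogram: flat noise floor of 1.0 with a loud burst in frames 10-19
S_power = np.ones((3, 40))
S_power[:, 10:20] = 50.0
noise_est = minimum_statistics_vectorized(S_power, window_frames=15)
```

Because the burst is shorter than the window, the estimate stays at the biased floor (1.5) throughout. A burst longer than the window would leak into the noise estimate, so the window length should exceed the longest expected stretch of continuous signal.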

MMSE-based estimation

Minimum Mean Square Error (MMSE) noise estimators model a speech presence probability for each time-frequency bin and update the noise estimate mainly where speech is judged to be absent. This family of methods is used in professional voice processing such as hearing aids and telecommunications.
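
A heavily simplified sketch of this idea follows. It is not a reference implementation of any published estimator: the fixed a priori SNR of 15 dB, the smoothing constant, the initialization from the first few frames, and the name spp_noise_tracker are all assumptions chosen for illustration.

```python
import numpy as np

def spp_noise_tracker(S_power, xi_db=15.0, smooth=0.8, init_frames=5):
    """Track noise power per bin, weighting updates by speech presence probability."""
    xi = 10.0 ** (xi_db / 10.0)                        # assumed SNR when speech is present
    noise = np.mean(S_power[:, :init_frames], axis=1)  # assumes the recording starts with noise only
    noise_est = np.empty_like(S_power)
    for t in range(S_power.shape[1]):
        frame = S_power[:, t]
        post_snr = frame / np.maximum(noise, 1e-12)
        # Probability that speech is present in each bin (equal priors assumed)
        exponent = np.minimum(post_snr * xi / (1.0 + xi), 200.0)
        p_speech = 1.0 / (1.0 + (1.0 + xi) * np.exp(-exponent))
        # Likely speech: keep the old estimate. Likely noise: follow the frame.
        update = (1.0 - p_speech) * frame + p_speech * noise
        noise = smooth * noise + (1.0 - smooth) * update
        noise_est[:, t] = noise
    return noise_est

# Toy input: noise floor of 1.0 with a loud "speech" burst in frames 20-29
S_power = np.ones((4, 60))
S_power[:, 20:30] = 1000.0
noise_est = spp_noise_tracker(S_power)
```

In this toy case the burst is classified as speech with probability close to 1, so the noise estimate stays at the floor instead of rising with the burst.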

Deep learning denoising

Neural network denoisers learn the mapping from noisy spectrograms to clean spectrograms (or directly from noisy waveforms to clean waveforms).

RNNoise

RNNoise (by Jean-Marc Valin of Opus codec fame) is a lightweight GRU-based model designed for real-time speech denoising at 48 kHz. It runs in ~5% of a single CPU core.

# Using the rnnoise-python wrapper
import rnnoise

denoiser = rnnoise.RNNoise()
# Process 10ms frames (480 samples at 48kHz)
clean_frame = denoiser.process_frame(noisy_frame)

DTLN (Dual-signal Transformation LSTM Network)

DTLN uses two LSTM stages — one in the STFT domain and one in the time domain — achieving excellent quality with low latency:

import tensorflow as tf

model = tf.saved_model.load("dtln_model")
# Process blocks of audio
clean = model.signatures["serving_default"](tf.constant(noisy_block))

Training a custom denoiser

import torch
import torch.nn as nn

class SimpleDenoiser(nn.Module):
    def __init__(self, n_fft=512):
        super().__init__()
        n_freq = n_fft // 2 + 1
        self.lstm = nn.LSTM(n_freq, 256, num_layers=2, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(512, n_freq)
        self.sigmoid = nn.Sigmoid()
    
    def forward(self, noisy_mag):
        # noisy_mag: (batch, time, freq)
        lstm_out, _ = self.lstm(noisy_mag)
        mask = self.sigmoid(self.fc(lstm_out))
        return noisy_mag * mask  # ratio mask

# Training data: pairs of (noisy_spectrogram, clean_spectrogram)
# Loss: MSE between predicted and clean magnitude

Training data is generated by mixing clean speech (LibriSpeech, VCTK) with noise (UrbanSound8K, DEMAND dataset) at various SNR levels.
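
Mixing at a chosen SNR comes down to scaling the noise before adding it. The sketch below shows one way to do this; mix_at_snr is a hypothetical helper, and the sine wave and random noise are stand-ins for real speech and noise recordings.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale noise so that clean + noise has the requested SNR in dB."""
    # Repeat or crop the noise so it matches the clean signal's length
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[:len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)  # stand-in for speech
noise = rng.standard_normal(8000)                            # stand-in for recorded noise

# One noisy version of the same utterance per target SNR
noisy = {snr: mix_at_snr(clean, noise, snr) for snr in (0, 5, 10, 20)}
```

Each entry pairs with the same clean target, which gives the (noisy, clean) pairs the training loop needs.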

Real-time pipeline

import sounddevice as sd
import numpy as np
from scipy.signal import stft, istft

class RealtimeDenoiser:
    def __init__(self, sr=16000, n_fft=512, hop=256):
        self.sr = sr
        self.n_fft = n_fft
        self.hop = hop
        self.noise_est = None
        self.calibrated = False
    
    def calibrate(self, noise_seconds=2.0):
        """Record ambient noise for calibration."""
        print("Recording background noise...")
        noise = sd.rec(int(noise_seconds * self.sr), samplerate=self.sr,
                       channels=1, dtype='float32')
        sd.wait()
        # noverlap is the overlap in samples: frame length minus hop
        _, _, Sxx = stft(noise.flatten(), fs=self.sr, nperseg=self.n_fft,
                         noverlap=self.n_fft - self.hop)
        self.noise_est = np.mean(np.abs(Sxx) ** 2, axis=1, keepdims=True)
        self.calibrated = True
        print("Calibration complete.")
    
    def process_block(self, block):
        if not self.calibrated:
            return block
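        # noverlap is the overlap in samples: frame length minus hop
        overlap = self.n_fft - self.hop
        _, _, S = stft(block, fs=self.sr, nperseg=self.n_fft, noverlap=overlap)
        S_power = np.abs(S) ** 2
        # Wiener gain from the calibrated noise power
        gain = np.maximum(S_power - self.noise_est, 0) / (S_power + 1e-10)
        clean_S = gain * S
        _, clean = istft(clean_S, fs=self.sr, nperseg=self.n_fft, noverlap=overlap)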
        return clean[:len(block)]

Latency budget

| Component | Latency |
| --- | --- |
| Input buffer (512 samples @ 16 kHz) | 32 ms |
| FFT computation | < 1 ms |
| Wiener gain application | < 1 ms |
| IFFT + overlap-add | < 1 ms |
| Output buffer | 32 ms |
| Total | ~66 ms |

For lower latency, reduce FFT size (at the cost of frequency resolution) or use time-domain neural models.
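
The tradeoff between latency and frequency resolution can be worked out directly. In this sketch, the 2 ms processing allowance is an assumption standing in for the three sub-millisecond stages in the table, and the function names are illustrative.

```python
def buffer_latency_ms(n_samples, sr):
    """Time needed to fill a buffer of n_samples at sample rate sr."""
    return 1000.0 * n_samples / sr

def budget(n_fft, sr=16000, processing_ms=2.0):
    """Rough end-to-end latency: input buffer + processing + output buffer."""
    buf = buffer_latency_ms(n_fft, sr)
    return {
        "buffer_ms": buf,
        "total_ms": 2 * buf + processing_ms,
        "bin_width_hz": sr / n_fft,  # frequency resolution of the FFT
    }

for n_fft in (512, 256, 128):
    print(n_fft, budget(n_fft))
```

Halving the FFT size from 512 to 256 samples cuts the estimated total from 66 ms to 34 ms, but each frequency bin widens from 31.25 Hz to 62.5 Hz.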

Quality metrics

| Metric | Measures | Tool |
| --- | --- | --- |
| SNR improvement | How much noise was reduced (dB) | Output SNR minus input SNR, with SNR = 10 * log10(signal_power / noise_power) |
| PESQ | Perceptual speech quality (1–4.5) | pypesq library |
| STOI | Short-time objective intelligibility (0–1) | pystoi library |
| SI-SDR | Scale-invariant signal-to-distortion ratio | torchmetrics.audio |

Always evaluate with both objective metrics and listening tests — metrics can miss artifacts that ears catch.
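
The two metrics that need no external model, SNR improvement and SI-SDR, can be sketched in plain NumPy. This assumes the clean, noisy, and denoised signals are time-aligned and equal in length; the function names are illustrative, and the "denoised" signal here is simulated, not the output of a real denoiser.

```python
import numpy as np

def snr_db(reference, estimate):
    """SNR of estimate against reference, treating the difference as noise."""
    noise = estimate - reference
    return 10.0 * np.log10(np.sum(reference ** 2) / (np.sum(noise ** 2) + 1e-12))

def snr_improvement_db(clean, noisy, denoised):
    """Output SNR minus input SNR, both measured against the clean signal."""
    return snr_db(clean, denoised) - snr_db(clean, noisy)

def si_sdr_db(reference, estimate):
    """Scale-invariant SDR: project the estimate onto the reference first."""
    reference = reference - np.mean(reference)
    estimate = estimate - np.mean(estimate)
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + 1e-12)
    target = scale * reference
    residual = estimate - target
    return 10.0 * np.log10(np.sum(target ** 2) / (np.sum(residual ** 2) + 1e-12))

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
interference = rng.standard_normal(16000)
noisy = clean + 0.5 * interference
denoised = clean + 0.1 * interference   # simulated: noise amplitude cut by a factor of 5

improvement = snr_improvement_db(clean, noisy, denoised)
```

Cutting the noise amplitude by a factor of 5 corresponds to an improvement of 20 * log10(5), about 14 dB. Unlike plain SNR, SI-SDR does not change if the denoised signal is rescaled, which matters when a model alters the overall level.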

Tradeoffs

| Method | Quality | Latency | Complexity | Best for |
| --- | --- | --- | --- | --- |
| Spectral subtraction | Fair | Low | Simple | Quick prototypes |
| Wiener filter | Good | Low | Moderate | General-purpose |
| noisereduce library | Good | Offline | Simple API | Batch processing |
| RNNoise | Very good (speech) | Very low | Pre-trained | Real-time voice |
| DTLN | Excellent | Low | GPU helpful | Production speech apps |
| Custom neural model | Best | Medium | High | Domain-specific noise |

One thing to remember: Noise reduction spans a spectrum from simple spectral subtraction (fast, artifact-prone) to deep-learning models (high-quality, compute-intensive) — choose based on your latency requirements, noise type, and whether you need real-time processing or batch cleanup.
