Python Noise Reduction — Deep Dive

Implement spectral subtraction from scratch, explore Wiener filtering, deep-learning denoising with RNNoise and DTLN, adaptive methods, and real-time noise reduction pipelines in Python.

Classical spectral subtraction

The simplest denoising algorithm subtracts an estimate of the noise magnitude spectrum from the magnitude spectrum of the noisy signal, then reuses the noisy phase to reconstruct the audio:

import numpy as np
import librosa

def spectral_subtract(y, sr, noise_clip, n_fft=2048, hop_length=512, alpha=2.0, beta=0.01):
    """
    Spectral subtraction with oversubtraction factor alpha
    and spectral floor beta.
    """
    # STFT of signal and noise
    S = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
    N = librosa.stft(noise_clip, n_fft=n_fft, hop_length=hop_length)
    
    S_mag = np.abs(S)
    S_phase = np.angle(S)
    N_mag_mean = np.mean(np.abs(N), axis=1, keepdims=True)
    
    # Subtract with oversubtraction and floor
    clean_mag = S_mag - alpha * N_mag_mean
    clean_mag = np.maximum(clean_mag, beta * S_mag)  # spectral floor
    
    # Reconstruct
    clean = clean_mag * np.exp(1j * S_phase)
    # length=len(y) trims or pads the output to match the input length
    return librosa.istft(clean, hop_length=hop_length, length=len(y))

Alpha (the oversubtraction factor) compensates for errors in the noise estimate; typical values are 1.0–4.0. Higher values remove more noise but increase distortion. Beta (the spectral floor) keeps every bin at no less than a fraction beta of its noisy magnitude, which prevents fully zeroed bins that sound like unnatural gaps.
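To make the effect of the two parameters concrete, here is a minimal sketch that applies the same subtraction rule to a single frame. The magnitudes are made-up values for four frequency bins, and `subtract_frame` is an illustrative helper rather than part of the function above.

```python
import numpy as np

def subtract_frame(noisy_mag, noise_mag, alpha=2.0, beta=0.01):
    """Apply oversubtraction and a spectral floor to one frame of magnitudes."""
    clean = noisy_mag - alpha * noise_mag
    return np.maximum(clean, beta * noisy_mag)

# Hypothetical magnitudes for four frequency bins (arbitrary units).
noisy_mag = np.array([10.0, 4.0, 1.5, 1.0])
noise_mag = np.array([1.0, 1.0, 1.0, 1.0])

mild = subtract_frame(noisy_mag, noise_mag, alpha=1.0)
strong = subtract_frame(noisy_mag, noise_mag, alpha=4.0)

print("alpha=1:", mild)
print("alpha=4:", strong)
```

With the larger alpha, the weaker bins fall below zero after subtraction and are replaced by the floor, so only the strongest bin keeps a substantial share of its energy.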

Limitations

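Spectral subtraction changes only the magnitude and reuses the noisy phase for reconstruction. At low SNR the noisy phase is a poor estimate of the clean phase, so the reconstructed signal has audible artifacts. Bin-by-bin subtraction also leaves isolated spectral peaks that are heard as "musical noise". More advanced methods address these problems.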

Wiener filtering

Wiener filtering estimates the optimal linear filter that minimizes mean squared error between the clean and estimated signals:

def wiener_filter(y, sr, noise_clip, n_fft=2048, hop_length=512):
    S = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
    N = librosa.stft(noise_clip, n_fft=n_fft, hop_length=hop_length)
    
    S_power = np.abs(S) ** 2
    N_power_mean = np.mean(np.abs(N) ** 2, axis=1, keepdims=True)
    
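    # Wiener gain SNR / (SNR + 1); with SNR estimated as
    # (S_power - N_power) / N_power this simplifies to the ratio below
    gain = np.maximum(S_power - N_power_mean, 0) / (S_power + 1e-10)
    
    clean = gain * S
    return librosa.istft(clean, hop_length=hop_length, length=len(y))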

Wiener filtering typically produces fewer musical noise artifacts than spectral subtraction because its gain rises more gradually with the estimated SNR. It is the basis for many professional denoising tools.
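One way to compare the two rules is to write both as a gain applied to the noisy spectrum, as a function of the a posteriori SNR gamma (noisy power divided by noise power). The sketch below assumes gamma is known exactly. The function names are illustrative and do not come from any library.

```python
import numpy as np

def subtraction_gain(gamma, alpha=1.0, beta=0.0):
    """Magnitude spectral subtraction written as a gain on the noisy magnitude."""
    return np.maximum(1.0 - alpha / np.sqrt(gamma), beta)

def wiener_gain(gamma):
    """Wiener gain xi / (xi + 1), with the SNR xi estimated as max(gamma - 1, 0)."""
    xi = np.maximum(gamma - 1.0, 0.0)
    return xi / (xi + 1.0)

gamma = np.array([1.0, 2.0, 4.0, 10.0, 100.0])  # noisy power / noise power
for g, gs, gw in zip(gamma, subtraction_gain(gamma), wiener_gain(gamma)):
    print(f"gamma={g:6.1f}  subtraction={gs:.3f}  wiener={gw:.3f}")
```

For alpha = 1, the Wiener gain is never below the subtraction gain, so it attenuates moderately noisy bins less harshly.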

Iterative Wiener filtering

Run the Wiener filter multiple times, using the previous output to refine the signal power estimate:

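def iterative_wiener(y, sr, noise_clip, iterations=3, n_fft=2048, hop_length=512):
    Y = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
    N = librosa.stft(noise_clip, n_fft=n_fft, hop_length=hop_length)
    N_power = np.mean(np.abs(N) ** 2, axis=1, keepdims=True)
    clean_power = np.maximum(np.abs(Y) ** 2 - N_power, 0)  # initial clean-power estimate
    for _ in range(iterations):
        # Recompute the gain from the latest clean-power estimate; always filter the noisy STFT
        gain = clean_power / (clean_power + N_power + 1e-10)
        clean_power = np.abs(gain * Y) ** 2
    return librosa.istft(gain * Y, hop_length=hop_length, length=len(y))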

Convergence typically occurs within 2–3 iterations.

Adaptive noise estimation

When no clean noise sample is available, estimate the noise floor adaptively:

Minimum statistics

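Track the minimum power in each frequency band over a sliding window. Because speech and music are intermittent, each band regularly falls back to the noise floor, so the windowed minimum approximates it. The minimum sits below the average noise power, which is why the code scales it by a bias-correction factor (1.5 here, a rough heuristic):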

def minimum_statistics(S_power, window_frames=50):
    n_freq, n_time = S_power.shape
    noise_est = np.zeros_like(S_power)
    
    for t in range(n_time):
        start = max(0, t - window_frames)
        noise_est[:, t] = np.min(S_power[:, start:t+1], axis=1) * 1.5  # bias correction
    
    return noise_est
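
The per-frame loop above can be slow on long recordings. A vectorized variant is sketched here, assuming NumPy 1.20 or later for `sliding_window_view`. It is intended to return the same values as the loop. `minimum_statistics_fast` and the synthetic input are illustrative, not part of the original code.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def minimum_statistics_fast(S_power, window_frames=50, bias=1.5):
    """Sliding minimum over the current frame and the previous window_frames frames."""
    # Pad the past with +inf so early frames only see frames that exist.
    pad = np.full((S_power.shape[0], window_frames), np.inf)
    padded = np.concatenate([pad, S_power], axis=1)
    # Shape (n_freq, n_time, window_frames + 1): one window per output frame.
    windows = sliding_window_view(padded, window_frames + 1, axis=1)
    return windows.min(axis=-1) * bias

# Synthetic, already-smoothed noise power near 1.0 with two loud bursts.
rng = np.random.default_rng(0)
S_power = 1.0 + 0.05 * rng.standard_normal((4, 400))
S_power[:, 100:130] += 50.0  # hypothetical speech burst
S_power[:, 250:270] += 50.0
est = minimum_statistics_fast(S_power)

print("estimate range after warm-up:", est[:, 50:].min(), est[:, 50:].max())
```

Both bursts are shorter than the window, so the estimate stays near the noise floor throughout. A burst longer than the window would pull the estimate up, which is the main tuning consideration for `window_frames`.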

MMSE-based estimation

The Minimum Mean Square Error (MMSE) approach estimates a speech presence probability for each time-frequency bin and updates the noise estimate mainly where speech is likely absent. Estimators of this kind are used in professional voice processing such as hearing aids and telecommunications.
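
The idea can be sketched as a soft-decision recursive update. This is a simplified illustration, not a complete MMSE estimator. The fixed a priori SNR (`xi_h1_db`), the smoothing constant, the equal priors for speech presence and absence, and the assumption that the first few frames contain only noise are all choices made for this example.

```python
import numpy as np

def spp_noise_tracker(S_power, init_frames=5, xi_h1_db=15.0, smooth=0.8):
    """Track noise power per bin, updating mostly where speech is unlikely."""
    xi = 10.0 ** (xi_h1_db / 10.0)  # assumed SNR when speech is present
    noise = np.mean(S_power[:, :init_frames], axis=1)  # assume noise-only start
    noise_est = np.empty_like(S_power)
    for t in range(S_power.shape[1]):
        frame = S_power[:, t]
        gamma = frame / (noise + 1e-12)  # a posteriori SNR
        # Speech presence probability under equal priors.
        p_speech = 1.0 / (1.0 + (1.0 + xi) * np.exp(-gamma * xi / (1.0 + xi)))
        # Soft decision: trust the new frame only as far as speech seems absent.
        noise_frame = p_speech * noise + (1.0 - p_speech) * frame
        noise = smooth * noise + (1.0 - smooth) * noise_frame
        noise_est[:, t] = noise
    return noise_est
```

Unlike the sliding minimum, this update needs no long window, so it can follow slowly changing noise. Practical estimators usually add a safeguard so the estimate cannot lock up when the probability stays near one for a long time.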

Deep learning denoising

Neural network denoisers learn the mapping from noisy spectrograms to clean spectrograms (or directly from noisy waveforms to clean waveforms).

RNNoise

RNNoise (by Jean-Marc Valin of Opus codec fame) is a lightweight GRU-based model designed for real-time speech denoising at 48 kHz. It runs in ~5% of a single CPU core.

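# Using the rnnoise-python wrapper (class and method names vary between RNNoise
# wrappers, so check the documentation of the one you install)
import rnnoise

denoiser = rnnoise.RNNoise()
# Process 10 ms frames (480 samples at 48 kHz);
# noisy_frame is one such frame in the sample format the wrapper expects
clean_frame = denoiser.process_frame(noisy_frame)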
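
Because the model works on fixed 480-sample frames, longer audio has to be split and reassembled. The helper below is a sketch that assumes mono audio in a 1-D NumPy array and takes the per-frame denoiser as a plain callable, so it does not depend on any particular wrapper. `split_into_frames` and `denoise_stream` are illustrative names.

```python
import numpy as np

FRAME_SIZE = 480  # 10 ms at 48 kHz

def split_into_frames(audio, frame_size=FRAME_SIZE):
    """Zero-pad mono audio to whole frames and reshape to (n_frames, frame_size)."""
    n_frames = -(-len(audio) // frame_size)  # ceiling division
    padded = np.zeros(n_frames * frame_size, dtype=audio.dtype)
    padded[:len(audio)] = audio
    return padded.reshape(n_frames, frame_size)

def denoise_stream(audio, process_frame, frame_size=FRAME_SIZE):
    """Run a per-frame denoiser over the audio and trim the padding afterwards."""
    if len(audio) == 0:
        return audio.copy()
    frames = split_into_frames(audio, frame_size)
    cleaned = [np.asarray(process_frame(frame)) for frame in frames]
    return np.concatenate(cleaned)[:len(audio)]

# Identity "denoiser" as a stand-in; pass the wrapper's per-frame method instead.
audio = np.arange(1000, dtype=np.float32)
out = denoise_stream(audio, lambda frame: frame)
```

The frames must be processed in order with the same denoiser instance, since a recurrent model carries state from one frame to the next.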

DTLN (Dual-signal Transformation LSTM Network)

DTLN stacks two LSTM-based stages: the first masks the STFT magnitude, and the second operates on a learned feature representation of the time-domain signal. The result is high quality with low latency:

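import tensorflow as tf

model = tf.saved_model.load("dtln_model")
infer = model.signatures["serving_default"]
# Process blocks of audio; noisy_block must match the input shape the model was exported with
outputs = infer(tf.constant(noisy_block))
# A signature call returns a dict of tensors keyed by output name
clean = next(iter(outputs.values()))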

Training a custom denoiser

import torch
import torch.nn as nn

class SimpleDenoiser(nn.Module):
    def __init__(self, n_fft=512):
        super().__init__()
        n_freq = n_fft // 2 + 1
        self.lstm = nn.LSTM(n_freq, 256, num_layers=2, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(512, n_freq)
        self.sigmoid = nn.Sigmoid()
    
    def forward(self, noisy_mag):
        # noisy_mag: (batch, time, freq)
        lstm_out, _ = self.lstm(noisy_mag)
        mask = self.sigmoid(self.fc(lstm_out))
        return noisy_mag * mask  # ratio mask

# Training data: pairs of (noisy_spectrogram, clean_spectrogram)
# Loss: MSE between predicted and clean magnitude

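Training pairs are generated by mixing clean speech (LibriSpeech, VCTK) with noise (UrbanSound8K, DEMAND) at various SNR levels. Because the LSTM above is bidirectional, the model needs the whole utterance and suits offline use; switch to a unidirectional LSTM for streaming.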
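
Mixing at a target SNR comes down to scaling the noise before adding it. The sketch below assumes mono float arrays at the same sample rate. `mix_at_snr` is an illustrative helper, and the random crop of the noise clip is one choice among several.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db, rng=None):
    """Return (noisy, scaled_noise) where clean + scaled_noise has the requested SNR."""
    if rng is None:
        rng = np.random.default_rng()
    # Loop the noise if it is too short, then take a random crop of matching length.
    if len(noise) < len(clean):
        noise = np.tile(noise, -(-len(clean) // len(noise)))
    start = rng.integers(0, len(noise) - len(clean) + 1)
    noise = noise[start:start + len(clean)]

    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Choose the scale so that clean_power / (scale^2 * noise_power) = 10^(snr_db / 10).
    scale = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    scaled_noise = scale * noise
    return clean + scaled_noise, scaled_noise

# Synthetic stand-ins for a speech clip and a noise clip.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
noise = rng.standard_normal(8000)
noisy, scaled_noise = mix_at_snr(clean, noise, snr_db=5.0, rng=rng)
```

Drawing `snr_db` at random for each training example, for instance from a range such as -5 to 20 dB, exposes the model to a variety of conditions.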

Real-time pipeline

import sounddevice as sd
import numpy as np
from scipy.signal import stft, istft

class RealtimeDenoiser:
    def __init__(self, sr=16000, n_fft=512, hop=256):
        self.sr = sr
        self.n_fft = n_fft
        self.hop = hop
        self.noise_est = None
        self.calibrated = False
    
    def calibrate(self, noise_seconds=2.0):
        """Record ambient noise for calibration."""
        print("Recording background noise...")
        noise = sd.rec(int(noise_seconds * self.sr), samplerate=self.sr,
                       channels=1, dtype='float32')
        sd.wait()
        # scipy's noverlap is the overlap length, i.e. frame length minus hop
        _, _, Sxx = stft(noise.flatten(), fs=self.sr, nperseg=self.n_fft,
                         noverlap=self.n_fft - self.hop)
        self.noise_est = np.mean(np.abs(Sxx) ** 2, axis=1, keepdims=True)
        self.calibrated = True
        print("Calibration complete.")
    
    def process_block(self, block):
        """Denoise one block; it must be at least n_fft samples long."""
        if not self.calibrated:
            return block
        block = np.asarray(block).flatten()  # accept (frames, 1) input from sounddevice
        noverlap = self.n_fft - self.hop
        _, _, S = stft(block, fs=self.sr, nperseg=self.n_fft, noverlap=noverlap)
        S_power = np.abs(S) ** 2
        gain = np.maximum(S_power - self.noise_est, 0) / (S_power + 1e-10)
        clean_S = gain * S
        _, clean = istft(clean_S, fs=self.sr, nperseg=self.n_fft, noverlap=noverlap)
        return clean[:len(block)]

Latency budget

Component	Latency
Input buffer (512 samples @ 16kHz)	32 ms
FFT computation	< 1 ms
Wiener gain application	< 1 ms
IFFT + overlap-add	< 1 ms
Output buffer	32 ms
Total	~66 ms

For lower latency, reduce FFT size (at the cost of frequency resolution) or use time-domain neural models.
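
The arithmetic behind the table, and the trade-off against frequency resolution, can be reproduced with a few lines. This sketch assumes one input and one output buffer of `n_fft` samples each, plus a fixed processing allowance of 2 ms. That allowance is an assumption standing in for the three "< 1 ms" rows.

```python
def buffer_latency_ms(n_samples, sample_rate):
    """Time needed to fill (or drain) a buffer of n_samples."""
    return 1000.0 * n_samples / sample_rate

def pipeline_latency_ms(n_fft, sample_rate, processing_ms=2.0):
    """Input buffer + processing + output buffer."""
    return 2 * buffer_latency_ms(n_fft, sample_rate) + processing_ms

def bin_width_hz(n_fft, sample_rate):
    """Spacing between adjacent FFT bins."""
    return sample_rate / n_fft

for n_fft in (512, 256, 128):
    total = pipeline_latency_ms(n_fft, 16000)
    width = bin_width_hz(n_fft, 16000)
    print(f"n_fft={n_fft}: about {total:.0f} ms latency, {width:.2f} Hz per bin")
```

Halving the FFT size halves the buffering delay but doubles the bin width, so narrowband noise becomes harder to separate from nearby signal content.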

Quality metrics

Metric	Measures	Tool
SNR improvement	How much noise was reduced (dB)	Output SNR minus input SNR, each `10 * log10(signal_power / noise_power)`
PESQ	Perceptual speech quality (1–4.5)	`pypesq` library
STOI	Short-time objective intelligibility (0–1)	`pystoi` library
SI-SDR	Scale-invariant signal-to-distortion ratio	`torchmetrics.audio`

Always evaluate with both objective metrics and listening tests — metrics can miss artifacts that ears catch.
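
When a clean reference is available, SNR improvement and SI-SDR can be computed directly with NumPy. The sketch below uses the standard definitions. The function names and the synthetic signals are illustrative, and the "denoised" signal is simulated by halving the noise.

```python
import numpy as np

def snr_db(reference, estimate):
    """SNR of estimate against the clean reference, in dB."""
    error = estimate - reference
    return 10.0 * np.log10(np.sum(reference ** 2) / (np.sum(error ** 2) + 1e-12))

def si_sdr_db(reference, estimate):
    """Scale-invariant SDR: project the estimate onto the reference first."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + 1e-12)
    target = scale * reference
    residual = estimate - target
    return 10.0 * np.log10(np.sum(target ** 2) / (np.sum(residual ** 2) + 1e-12))

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
noise = 0.3 * rng.standard_normal(16000)
noisy = clean + noise
denoised = clean + 0.5 * noise  # simulated output with the noise amplitude halved

improvement = snr_db(clean, denoised) - snr_db(clean, noisy)
print(f"SNR improvement: {improvement:.2f} dB")
print(f"SI-SDR noisy: {si_sdr_db(clean, noisy):.2f} dB, "
      f"denoised: {si_sdr_db(clean, denoised):.2f} dB")
```

SI-SDR is unaffected by rescaling the estimate, which makes it a fairer comparison than plain SNR when a denoiser changes the overall level.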

Tradeoffs

Method	Quality	Latency	Complexity	Best for
Spectral subtraction	Fair	Low	Simple	Quick prototypes
Wiener filter	Good	Low	Moderate	General-purpose
noisereduce library	Good	Offline	Simple API	Batch processing
RNNoise	Very good (speech)	Very low	Pre-trained	Real-time voice
DTLN	Excellent	Low	Pre-trained, real-time on CPU	Production speech apps
Custom neural model	Potentially best	Medium	High	Domain-specific noise

One thing to remember: Noise reduction spans a spectrum from simple spectral subtraction (fast, artifact-prone) to deep-learning models (high-quality, compute-intensive) — choose based on your latency requirements, noise type, and whether you need real-time processing or batch cleanup.
