Python Pydub Audio Processing — Core Concepts

Understand Pydub's AudioSegment model, slicing operations, volume control, format conversion, and effect chaining for practical audio manipulation in Python.

What Pydub does

Pydub provides a high-level interface for audio file manipulation in Python. It handles loading, slicing, concatenating, volume adjustment, format conversion, and basic effects. Under the hood it relies on FFmpeg (or libav) for codec support and stores audio data as raw PCM samples in memory.

Install with pip install pydub. WAV files work without extra tools; other formats such as MP3 or OGG need FFmpeg (or libav) on your system path.
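One way to confirm the FFmpeg requirement is to look for the executable on the path before loading compressed files. The sketch below uses only the standard library; the helper name ffmpeg_available is made up for illustration and is not part of Pydub.

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if an ffmpeg or avconv executable is on the system path."""
    # Hypothetical helper for illustration, not a Pydub function.
    return any(shutil.which(name) is not None for name in ("ffmpeg", "avconv"))


if ffmpeg_available():
    print("FFmpeg (or libav) found")
else:
    print("FFmpeg not found: formats other than WAV will likely fail to load")
```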

AudioSegment — the core object

Everything revolves around AudioSegment, which holds the raw audio data plus metadata (sample rate, bit depth, channels):

from pydub import AudioSegment

audio = AudioSegment.from_file("podcast.mp3")
print(len(audio))          # duration in milliseconds
print(audio.frame_rate)    # e.g. 44100
print(audio.channels)      # 1 (mono) or 2 (stereo)
print(audio.sample_width)  # bytes per sample (2 = 16-bit)

AudioSegments are immutable — every operation returns a new object, leaving the original unchanged.

Slicing and concatenation

Slicing uses millisecond indices, just like Python string slicing:

first_ten_seconds = audio[:10000]
last_five_seconds = audio[-5000:]
middle = audio[10000:20000]

Concatenate with the + operator:

combined = first_ten_seconds + last_five_seconds

Insert silence with AudioSegment.silent(duration=2000) for a two-second gap.

Volume control

Adjust volume in decibels:

louder = audio + 6    # +6 dB: roughly doubles the amplitude
quieter = audio - 10  # -10 dB: sounds roughly half as loud

Fade in and out:

faded = audio.fade_in(2000).fade_out(3000)

Normalize so the loudest peak sits just below full scale (0.1 dB of headroom by default):

from pydub.effects import normalize
normalized = normalize(audio)

Overlaying audio

The overlay method mixes two audio segments together, playing them simultaneously:

music = AudioSegment.from_file("background.mp3")
voice = AudioSegment.from_file("narration.wav")

# lower the music volume, then overlay
music_quiet = music - 12
mixed = music_quiet.overlay(voice, position=0)

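The position parameter (in ms) controls when the overlay starts. The result always has the length of the segment that overlay is called on (music_quiet here), so an overlaid segment that runs past its end is cut off. Set loop=True to repeat the overlaid segment until the base segment ends.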

Format conversion and export

Export to any FFmpeg-supported format:

audio.export("output.wav", format="wav")
audio.export("output.ogg", format="ogg", codec="libvorbis")
audio.export("output.mp3", format="mp3", bitrate="192k")

Convert between sample rates and channel counts:

mono = audio.set_channels(1)
resampled = audio.set_frame_rate(22050)

Splitting and silence detection

Pydub includes utilities for splitting audio on silence:

from pydub.silence import split_on_silence

chunks = split_on_silence(audio,
    min_silence_len=700,    # ms of silence to trigger split
    silence_thresh=-40,     # dBFS threshold
    keep_silence=200        # ms of silence to keep at edges
)

This is useful for chopping recordings into sentences or removing dead air from podcasts.

Raw sample access

Access raw bytes with audio.raw_data, or convert to a NumPy array for DSP work. For stereo audio the samples are interleaved (left, right, left, right, and so on):

import numpy as np
samples = np.array(audio.get_array_of_samples())

After processing, cast the samples back to their original integer dtype if needed, then wrap them in a new AudioSegment:

processed = AudioSegment(
    samples.tobytes(),
    frame_rate=audio.frame_rate,
    sample_width=audio.sample_width,
    channels=audio.channels
)

Practical considerations

Pydub loads the entire audio file into memory as raw PCM. A 5-minute stereo WAV at 44.1 kHz and 16-bit depth occupies about 53 MB (300 s × 44,100 frames/s × 2 channels × 2 bytes), and a compressed file such as an MP3 expands to the same size once decoded. For very long files, consider processing in chunks. Operations are CPU-bound and run over the whole sample buffer; for performance-critical DSP, extract samples to NumPy and use vectorized operations.
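The memory figure above follows from simple arithmetic, which can be wrapped in a small estimator. The function name pcm_size_bytes is hypothetical and not part of Pydub.

```python
def pcm_size_bytes(duration_s: float, frame_rate: int, channels: int, sample_width: int) -> int:
    """Estimate the in-memory size of uncompressed PCM audio in bytes."""
    # frames per second * bytes per frame * seconds
    return int(duration_s * frame_rate * channels * sample_width)


# 5 minutes of 16-bit (2 bytes per sample) stereo audio at 44.1 kHz.
size = pcm_size_bytes(duration_s=5 * 60, frame_rate=44100, channels=2, sample_width=2)
print(f"{size / 1_000_000:.1f} MB")  # 52.9 MB
```

The same calculation with the values from a loaded segment (frame_rate, channels, sample_width) gives a quick check on whether a file is worth processing in chunks.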

The one thing to remember: Pydub wraps audio manipulation in Pythonic operators — + to join, [] to slice, +/- for volume — making common audio tasks feel as natural as working with strings.
