Python Speech Recognition — Deep Dive

How the library works internally

When you call recognizer.listen(source), the library reads audio chunks from the microphone in a loop. It computes the RMS energy of each chunk and compares it to recognizer.energy_threshold. Once energy exceeds the threshold, recording begins. Recording stops after recognizer.pause_threshold seconds (default 0.8) of below-threshold energy.
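The loop described above can be sketched with the standard library alone. This is a simplified illustration, not the library's actual code: the function names `rms_16bit` and `capture_phrase` and the `seconds_per_chunk` parameter are assumptions made for this example, and the input is assumed to be little-endian 16-bit mono PCM.

```python
import math
import struct


def rms_16bit(chunk: bytes) -> float:
    """Root-mean-square amplitude of little-endian 16-bit mono PCM."""
    count = len(chunk) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", chunk[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def capture_phrase(chunks, energy_threshold=300.0, pause_threshold=0.8,
                   seconds_per_chunk=0.064):
    """Return the chunks of the first phrase found in an iterable of PCM chunks.

    Recording starts at the first chunk whose RMS exceeds the threshold and
    stops once pause_threshold seconds of consecutive quiet chunks are seen.
    """
    recorded = []
    quiet_chunks = 0
    for chunk in chunks:
        loud = rms_16bit(chunk) > energy_threshold
        if not recorded:
            if loud:
                recorded.append(chunk)  # speech onset: start recording
            continue
        recorded.append(chunk)
        quiet_chunks = 0 if loud else quiet_chunks + 1
        if quiet_chunks * seconds_per_chunk >= pause_threshold:
            break  # enough trailing silence: the phrase is over
    return recorded
```

The trailing quiet chunks are kept in the result, so a real pipeline would likely trim them before recognition.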

The captured frames are concatenated into an AudioData object containing raw PCM bytes. For recognition, the library converts this to the format required by the chosen engine, typically 16-bit mono audio encoded as WAV or FLAC at a sample rate the engine accepts (16 kHz is common), and either sends it over HTTP or processes it locally.
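To make the "raw PCM to engine format" step concrete, here is a standard-library sketch that wraps raw PCM bytes in a WAV container in memory. In practice AudioData's own `get_wav_data()` does this for you; the helper name `pcm_to_wav_bytes` is hypothetical and only illustrates what the conversion involves.

```python
import io
import wave


def pcm_to_wav_bytes(raw_pcm: bytes, sample_rate: int = 16000,
                     sample_width: int = 2, channels: int = 1) -> bytes:
    """Wrap raw PCM frames in a WAV (RIFF) container, entirely in memory."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)  # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(raw_pcm)
    return buffer.getvalue()
```

Note that this only adds a header. Changing the sample rate or bit depth requires actual resampling, which this sketch does not attempt.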

Energy threshold tuning

The default energy threshold (300) works in quiet rooms but fails in noisy environments. adjust_for_ambient_noise sets the threshold dynamically:

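import speech_recognition as sr

recognizer = sr.Recognizer()

with sr.Microphone() as source:
    # Sample two seconds of background noise and set the threshold from it
    recognizer.adjust_for_ambient_noise(source, duration=2)
    print(f"Energy threshold: {recognizer.energy_threshold}")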

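Continuously running applications should rely on dynamic energy adjustment. It is enabled by default in current releases, and the values below are the library defaults, shown here as the settings to tune: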

recognizer.dynamic_energy_threshold = True
recognizer.dynamic_energy_adjustment_damping = 0.15
recognizer.dynamic_energy_ratio = 1.5

This makes the threshold adapt over time, tracking slowly changing noise floors like HVAC systems or traffic.
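One way to picture the adaptation is as an exponential moving average that pulls the threshold toward a multiple of the current ambient energy. The sketch below is modeled on that idea, and to my understanding the library uses a similar update, but treat the function `update_threshold` and its exact form as an assumption rather than the library's code.

```python
def update_threshold(threshold, ambient_energy, seconds_per_buffer,
                     damping=0.15, ratio=1.5):
    """Move the threshold toward ratio * ambient_energy for one quiet buffer."""
    # Raising damping to the buffer length makes the adaptation speed depend
    # on elapsed time rather than on how large each audio chunk happens to be.
    effective_damping = damping ** seconds_per_buffer
    target = ambient_energy * ratio
    return threshold * effective_damping + target * (1 - effective_damping)


# Starting at the default of 300 in a room whose ambient energy is 100,
# the threshold settles near 100 * 1.5 = 150.
threshold = 300.0
for _ in range(200):
    threshold = update_threshold(threshold, ambient_energy=100.0,
                                 seconds_per_buffer=0.064)
```

A damping value closer to 1 adapts more slowly; a larger ratio leaves more headroom above the noise floor before speech is detected.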

Background listening

For non-blocking capture, use listen_in_background:

def callback(recognizer, audio):
    try:
        text = recognizer.recognize_google(audio)
        print(f"Heard: {text}")
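    except sr.UnknownValueError:
        pass  # speech was unintelligible
    except sr.RequestError as error:
        # An uncaught exception here would end the listener thread
        print(f"API request failed: {error}")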

stop_listening = recognizer.listen_in_background(
    sr.Microphone(), callback, phrase_time_limit=10
)

# ... do other work ...
# stop_listening(wait_for_stop=True)  # when done

This spawns a daemon thread that continuously captures audio and fires the callback for each detected phrase. The stop_listening callable terminates the thread cleanly.
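Because the callback appears to run on the listener thread itself, a slow network call inside it can delay capture of the next phrase. One possible pattern, sketched here with hypothetical helper names (`make_enqueue_callback`, `drain`) and an injected `transcribe` function, is to have the callback only enqueue audio and do recognition elsewhere.

```python
import queue


def make_enqueue_callback(audio_queue):
    """Build a listen_in_background-style callback that only enqueues audio."""
    def callback(recognizer, audio):
        audio_queue.put(audio)  # cheap, so the listener resumes immediately
    return callback


def drain(audio_queue, transcribe):
    """Transcribe everything currently queued on the calling thread."""
    texts = []
    while True:
        try:
            audio = audio_queue.get_nowait()
        except queue.Empty:
            break
        texts.append(transcribe(audio))
    return texts
```

In an application, `make_enqueue_callback(q)` would be passed to listen_in_background, and a worker thread or the main loop would call `drain(q, recognizer.recognize_google)` periodically.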

Processing long audio files

The record() method loads the entire file into memory, which is impractical for hour-long recordings. Process in chunks instead:

with sr.AudioFile("long_lecture.wav") as source:
    duration = source.DURATION
    chunk_duration = 30  # seconds
    
    transcript_parts = []
    offset = 0
    while offset < duration:
        audio = recognizer.record(source, duration=chunk_duration)
        try:
            text = recognizer.recognize_google(audio)
            transcript_parts.append(text)
        except sr.UnknownValueError:
            transcript_parts.append("[inaudible]")
        offset += chunk_duration
    
    full_transcript = " ".join(transcript_parts)

The trade-off: shorter chunks are more reliable (engines have time limits) but lose context at boundaries, potentially splitting words. Overlap chunks by a few seconds and deduplicate to mitigate this.
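The deduplication step might look like the following sketch, which joins two chunk transcripts by removing the longest run of words that ends the first and begins the second. The function name `merge_overlapping` and the `max_overlap_words` limit are assumptions for illustration. It only handles exact word matches, whereas real engines may transcribe the overlapping audio slightly differently each time.

```python
def merge_overlapping(previous: str, current: str,
                      max_overlap_words: int = 12) -> str:
    """Join two transcripts, dropping words duplicated at the boundary."""
    prev_words = previous.split()
    cur_words = current.split()
    limit = min(max_overlap_words, len(prev_words), len(cur_words))
    # Try the longest candidate overlap first so a full match wins.
    for size in range(limit, 0, -1):
        tail = [w.lower() for w in prev_words[-size:]]
        head = [w.lower() for w in cur_words[:size]]
        if tail == head:
            return " ".join(prev_words + cur_words[size:])
    return " ".join(prev_words + cur_words)  # no overlap found
```

For fuzzier boundaries, a similarity measure such as difflib.SequenceMatcher could replace the exact comparison.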

Streaming recognition with Google Cloud

For real-time transcription, Google Cloud Speech-to-Text supports streaming via gRPC. While speech_recognition does not expose streaming directly, you can use the google-cloud-speech library alongside it:

import speech_recognition as sr
from google.cloud import speech

client = speech.SpeechClient()
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    enable_automatic_punctuation=True,
)
streaming_config = speech.StreamingRecognitionConfig(
    config=config, interim_results=True
)

def audio_generator():
    with sr.Microphone(sample_rate=16000) as source:
        while True:
            buffer = source.stream.read(4096)
            yield speech.StreamingRecognizeRequest(audio_content=buffer)

responses = client.streaming_recognize(streaming_config, audio_generator())
for response in responses:
    for result in response.results:
        if result.is_final:
            print(result.alternatives[0].transcript)

This provides sub-second latency, interim results for live display, and punctuation prediction.
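Part of that latency is set by how much audio each request carries. Assuming the value passed to `source.stream.read` is a frame count and the audio is 16-bit mono, a small calculation shows what a 4096-frame read means at 16 kHz. The helper `buffer_stats` is hypothetical.

```python
def buffer_stats(frames_per_read, sample_rate=16000, sample_width=2, channels=1):
    """Return (seconds of audio, size in bytes) for one read of the stream."""
    seconds = frames_per_read / sample_rate
    size_bytes = frames_per_read * sample_width * channels
    return seconds, size_bytes


seconds, size_bytes = buffer_stats(4096)
# seconds is about 0.256 and size_bytes is 8192
```

Smaller reads lower the delay before audio reaches the service at the cost of more requests per second.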

Whisper integration

OpenAI’s Whisper runs locally and excels at accuracy across languages and accents:

with sr.AudioFile("recording.wav") as source:  # placeholder path to a WAV, AIFF, or FLAC file
    audio = recognizer.record(source)
text = recognizer.recognize_whisper(audio, model="base", language="english")

Model sizes trade accuracy for speed:

Model    Parameters   English WER   Speed (RTF)
tiny     39M          ~7.6%         0.03x
base     74M          ~5.0%         0.07x
small    244M         ~3.4%         0.2x
medium   769M         ~2.9%         0.5x
large    1550M        ~2.5%         1.0x

For GPU acceleration, install a CUDA-enabled build of PyTorch; openai-whisper runs on the GPU when PyTorch can see one. The large model on a modern GPU transcribes in near real-time.
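The real-time factor (RTF) column translates directly into a processing-time estimate: processing time is roughly audio duration multiplied by RTF. The sketch below uses the table's figures, which depend on hardware and should be treated as rough guides. The helper names are hypothetical.

```python
# RTF values copied from the table above; actual numbers vary by hardware.
RTF = {"tiny": 0.03, "base": 0.07, "small": 0.2, "medium": 0.5, "large": 1.0}


def estimated_seconds(audio_seconds, model):
    """Rough processing time for a recording of the given length."""
    return audio_seconds * RTF[model]


def largest_model_within(audio_seconds, budget_seconds):
    """Pick the largest model whose estimate fits the time budget, or None."""
    choice = None
    for model in RTF:  # dicts keep insertion order: smallest model first
        if estimated_seconds(audio_seconds, model) <= budget_seconds:
            choice = model
    return choice
```

For example, a one-hour recording with the base model comes out at about 3600 * 0.07 = 252 seconds under these assumptions.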

Pre-processing audio for better accuracy

Recognition accuracy depends heavily on audio quality. Pre-process with Pydub or NumPy:

from pydub import AudioSegment
from pydub.effects import normalize, high_pass_filter

audio = AudioSegment.from_file("recording.mp3")
audio = high_pass_filter(audio, cutoff=200)   # remove low rumble
audio = normalize(audio)                       # consistent volume
audio = audio.set_channels(1)                  # mono
audio = audio.set_frame_rate(16000)            # standard ASR rate
audio.export("clean.wav", format="wav")

Additional techniques:

  • Noise reduction — use the noisereduce library with a noise profile sample
  • Voice activity detection — trim non-speech segments with webrtcvad before sending to the engine
  • Normalization — consistent RMS levels improve recognition across different recording conditions
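As an illustration of the RMS normalization item, here is a dependency-free sketch that scales 16-bit samples to a target RMS level and clips anything that would overflow. The function `normalize_rms` and the default target of 3000 are assumptions chosen for the example; pydub or NumPy would normally do this faster.

```python
import math
from array import array


def normalize_rms(samples: array, target_rms: float = 3000.0) -> array:
    """Scale 16-bit samples so their RMS matches target_rms, clipping to int16."""
    if len(samples) == 0:
        return array("h")
    current = math.sqrt(sum(s * s for s in samples) / len(samples))
    if current == 0:
        return array("h", samples)  # pure silence: nothing to scale
    gain = target_rms / current
    return array(
        "h",
        (max(-32768, min(32767, round(s * gain))) for s in samples),
    )
```

Clipping distorts the signal, so a target well below the 16-bit maximum is the safer choice.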

Error handling patterns

Production systems need robust error handling beyond basic try/except:

import time
from functools import wraps

def retry_recognition(max_retries=3, backoff=1.0):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except sr.RequestError:
                    if attempt < max_retries - 1:
                        time.sleep(backoff * (2 ** attempt))
                    else:
                        raise
            return None
        return wrapper
    return decorator

@retry_recognition(max_retries=3)
def transcribe(recognizer, audio):
    return recognizer.recognize_google(audio)

For UnknownValueError, implement fallback chains — try Google first, fall back to Whisper, then to Sphinx:

def transcribe_with_fallback(recognizer, audio):
    engines = [
        ("google", lambda: recognizer.recognize_google(audio)),
        ("whisper", lambda: recognizer.recognize_whisper(audio, model="base")),
        ("sphinx", lambda: recognizer.recognize_sphinx(audio)),
    ]
    for name, recognize in engines:
        try:
            return recognize()
        except (sr.UnknownValueError, sr.RequestError):
            continue
    return "[transcription failed]"

Building a voice command system

Combine continuous listening with command parsing:

import re

# play_song, set_timer, and get_info are application handlers defined elsewhere.
COMMANDS = {
    r"play (.+)": lambda m: play_song(m.group(1)),
    r"set timer (\d+) minutes": lambda m: set_timer(int(m.group(1))),
    r"what('s| is) the (time|weather)": lambda m: get_info(m.group(2)),
}

def process_command(text):
    text = text.lower().strip()
    for pattern, handler in COMMANDS.items():
        match = re.match(pattern, text)
        if match:
            handler(match)
            return True
    return False

def callback(recognizer, audio):
    try:
        text = recognizer.recognize_google(audio)
        if not process_command(text):
            print(f"Unknown command: {text}")
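    except sr.UnknownValueError:
        pass  # speech was unintelligible
    except sr.RequestError as error:
        # An uncaught exception here would end the listener thread
        print(f"API request failed: {error}")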

stop = recognizer.listen_in_background(sr.Microphone(), callback)

Performance and resource considerations

  • Network latency — cloud engines add 200-1000ms round-trip; batch non-urgent transcription
  • Memory — Whisper large loads ~3GB into VRAM; use smaller models on constrained devices
  • CPU — Sphinx runs on CPU but is the least accurate; Whisper on CPU is slow for large models
  • Concurrency — Recognizer is not thread-safe; create separate instances per thread
  • Rate limits — Google Web Speech has undocumented rate limits; for heavy use, switch to Cloud with a paid quota
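For the concurrency point, one way to give each thread its own instance is thread-local storage. The sketch below takes a factory argument so it stays independent of any particular library. In real use the factory would be sr.Recognizer, and the helper name `get_recognizer` is hypothetical.

```python
import threading

_local = threading.local()


def get_recognizer(factory):
    """Return this thread's instance, creating it on first use."""
    if not hasattr(_local, "recognizer"):
        _local.recognizer = factory()  # e.g. factory=sr.Recognizer
    return _local.recognizer
```

Each worker thread calls `get_recognizer(sr.Recognizer)` and gets an object no other thread touches, so settings such as the energy threshold never race.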

Testing speech recognition code

Testing audio code is challenging because microphone input is non-deterministic. Strategies:

  • Record test fixtures as WAV files and use AudioFile in tests
  • Mock the recognition engine to return known strings
  • Use AudioData constructor directly with known PCM bytes
  • Measure word error rate (WER) against reference transcripts for accuracy regression testing
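Word error rate is the word-level edit distance between the reference and the recognized text, divided by the number of reference words. A minimal sketch follows, assuming whitespace tokenization and case-insensitive comparison. Production evaluations usually also normalize punctuation and numerals.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1] / len(ref)
```

In a regression test, transcribe each fixture file, compute the WER against its reference transcript, and fail the test if the value rises above an agreed limit.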

The one thing to remember: Building reliable speech-to-text means layering audio pre-processing, engine selection, error handling with fallbacks, and threshold tuning on top of SpeechRecognition’s simple API — the library handles plumbing, but production quality comes from the pipeline around it.

