Python Speech Recognition — Deep Dive
How the library works internally
When you call recognizer.listen(source), the library reads audio chunks from the microphone in a loop. It computes the RMS energy of each chunk and compares it to recognizer.energy_threshold. Once energy exceeds the threshold, recording begins. Recording stops after recognizer.pause_threshold seconds (default 0.8) of below-threshold energy.
The captured frames are concatenated into an AudioData object containing raw PCM bytes. For recognition, the library converts this to the format required by the chosen engine — typically 16-bit mono WAV at 16 kHz — and sends it via HTTP or processes it locally.
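The start/stop logic described above can be sketched with the standard library alone. This is a simplified illustration, not the library's actual implementation: the helper names `rms`, `capture_phrase`, and `tone` are hypothetical, the input is assumed to be 16-bit little-endian mono PCM, and details such as pre-roll buffering and minimum phrase length are left out.

```python
import math
import struct

def rms(chunk: bytes) -> float:
    """Root-mean-square amplitude of 16-bit little-endian mono PCM."""
    count = len(chunk) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", chunk[:count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)

def capture_phrase(chunks, seconds_per_chunk,
                   energy_threshold=300.0, pause_threshold=0.8):
    """Collect chunks from the first loud one until the trailing pause."""
    # Express the pause as a whole number of chunks to avoid float drift.
    pause_chunks = math.ceil(pause_threshold / seconds_per_chunk)
    frames, quiet_chunks, recording = [], 0, False
    for chunk in chunks:
        loud = rms(chunk) > energy_threshold
        if not recording:
            if not loud:
                continue          # still waiting for speech to start
            recording = True
        frames.append(chunk)
        quiet_chunks = 0 if loud else quiet_chunks + 1
        if quiet_chunks >= pause_chunks:
            break                 # enough trailing silence: phrase is over
    return frames

def tone(amplitude, n=2000):
    """Synthetic square wave: n samples, 0.125 s at 16 kHz."""
    return struct.pack(f"<{n}h", *([amplitude, -amplitude] * (n // 2)))

quiet, loud = tone(10), tone(5000)
stream = [quiet] * 5 + [loud] * 10 + [quiet] * 20
frames = capture_phrase(stream, seconds_per_chunk=0.125)
print(len(frames))  # 17: ten loud chunks plus seven chunks of trailing pause
```

The leading quiet chunks are discarded, and the phrase ends once the quiet run covers at least `pause_threshold` seconds.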
Energy threshold tuning
The default energy threshold (300) works in quiet rooms but fails in noisy environments. adjust_for_ambient_noise sets the threshold dynamically:
import speech_recognition as sr
recognizer = sr.Recognizer()
with sr.Microphone() as source:
    recognizer.adjust_for_ambient_noise(source, duration=2)
    print(f"Energy threshold: {recognizer.energy_threshold}")
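For continuously running applications, keep dynamic energy adjustment enabled (it is on by default) and tune its parameters. The values shown here are the library defaults: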
recognizer.dynamic_energy_threshold = True
recognizer.dynamic_energy_adjustment_damping = 0.15
recognizer.dynamic_energy_ratio = 1.5
This makes the threshold adapt over time, tracking slowly changing noise floors like HVAC systems or traffic.
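As a rough sketch of how such an adaptive threshold behaves, the update can be modeled as an exponential moving average toward `ratio` times the ambient energy. The function name `update_threshold` is hypothetical, and the formula reflects my reading of the library's behavior rather than a documented contract, so treat it as an assumption.

```python
def update_threshold(threshold, chunk_energy, seconds_per_chunk,
                     damping=0.15, ratio=1.5):
    """One smoothing step toward ratio * ambient energy."""
    # Damping is defined per second, so rescale it to the chunk length.
    decay = damping ** seconds_per_chunk
    target = chunk_energy * ratio
    return threshold * decay + target * (1 - decay)

threshold = 300.0
for _ in range(100):  # ten seconds of ambient noise at RMS 400, 0.1 s chunks
    threshold = update_threshold(threshold, 400.0, 0.1)
print(round(threshold))  # 600, i.e. 1.5 times the ambient energy
```

A damping value closer to 1 makes the threshold follow the noise floor more slowly; a larger ratio leaves more headroom between ambient noise and the level that counts as speech.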
Background listening
For non-blocking capture, use listen_in_background:
def callback(recognizer, audio):
    try:
        text = recognizer.recognize_google(audio)
        print(f"Heard: {text}")
    except sr.UnknownValueError:
        pass  # speech was unintelligible; skip this phrase

stop_listening = recognizer.listen_in_background(
    sr.Microphone(), callback, phrase_time_limit=10
)
# ... do other work ...
# stop_listening(wait_for_stop=True)  # when done
This spawns a daemon thread that continuously captures audio and fires the callback for each detected phrase. The stop_listening callable terminates the thread cleanly.
Processing long audio files
Calling record() without a duration loads the entire file into memory, which is impractical for hour-long recordings. Process in chunks instead:
with sr.AudioFile("long_lecture.wav") as source:
    duration = source.DURATION
    chunk_duration = 30  # seconds
    transcript_parts = []
    offset = 0
    while offset < duration:
        # Each record() call continues where the previous one stopped
        audio = recognizer.record(source, duration=chunk_duration)
        try:
            text = recognizer.recognize_google(audio)
            transcript_parts.append(text)
        except sr.UnknownValueError:
            transcript_parts.append("[inaudible]")
        offset += chunk_duration
full_transcript = " ".join(transcript_parts)
The trade-off: shorter chunks are more reliable (engines have time limits) but lose context at boundaries, potentially splitting words. Overlap chunks by a few seconds and deduplicate to mitigate this.
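The deduplication step could look something like the sketch below. It assumes the chunks were transcribed with overlapping audio (reading the overlap itself is left out; re-opening the file and passing record()'s offset argument is one option) and that the engine returns the same words for the shared region, which real engines do not guarantee. The helper name `merge_overlapping` is hypothetical.

```python
def merge_overlapping(parts, max_overlap=10):
    """Join chunk transcripts, dropping words repeated across a boundary."""
    merged = []
    for part in parts:
        words = part.split()
        limit = min(max_overlap, len(merged), len(words))
        overlap = 0
        # Find the longest run where the tail of the merged text
        # equals the head of the new chunk (case-insensitive).
        for size in range(limit, 0, -1):
            tail = [w.lower() for w in merged[-size:]]
            head = [w.lower() for w in words[:size]]
            if tail == head:
                overlap = size
                break
        merged.extend(words[overlap:])
    return " ".join(merged)

parts = ["the quick brown fox jumps", "fox jumps over the lazy", "the lazy dog"]
print(merge_overlapping(parts))  # the quick brown fox jumps over the lazy dog
```

Exact word matching is the simplest possible rule. When the engine transcribes the overlap differently in each chunk, a fuzzy comparison or timestamp-based alignment would be needed instead.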
Streaming recognition with Google Cloud
For real-time transcription, Google Cloud Speech-to-Text supports streaming via gRPC. While speech_recognition does not expose streaming directly, you can use the google-cloud-speech library alongside it:
import speech_recognition as sr
from google.cloud import speech

client = speech.SpeechClient()
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    enable_automatic_punctuation=True,
)
streaming_config = speech.StreamingRecognitionConfig(
    config=config, interim_results=True
)

def audio_generator():
    # Yield raw 16-bit PCM from the microphone as streaming requests
    with sr.Microphone(sample_rate=16000) as source:
        while True:
            buffer = source.stream.read(4096)
            yield speech.StreamingRecognizeRequest(audio_content=buffer)

responses = client.streaming_recognize(streaming_config, audio_generator())
for response in responses:
    for result in response.results:
        if result.is_final:
            print(result.alternatives[0].transcript)
This provides sub-second latency, interim results for live display, and punctuation prediction.
Whisper integration
OpenAI’s Whisper runs locally and excels at accuracy across languages and accents:
audio = recognizer.record(source)
text = recognizer.recognize_whisper(audio, model="base", language="english")
Model sizes trade accuracy for speed:
| Model | Parameters | English WER | Speed (RTF) |
|---|---|---|---|
| tiny | 39M | ~7.6% | 0.03x |
| base | 74M | ~5.0% | 0.07x |
| small | 244M | ~3.4% | 0.2x |
| medium | 769M | ~2.9% | 0.5x |
| large | 1550M | ~2.5% | 1.0x |
For GPU acceleration, install a CUDA-enabled build of PyTorch alongside openai-whisper. The large model on a modern GPU transcribes in near real-time.
Pre-processing audio for better accuracy
Recognition accuracy depends heavily on audio quality. Pre-process with Pydub or NumPy:
from pydub import AudioSegment
from pydub.effects import normalize, high_pass_filter
audio = AudioSegment.from_file("recording.mp3")
audio = high_pass_filter(audio, cutoff=200) # remove low rumble
audio = normalize(audio) # consistent volume
audio = audio.set_channels(1) # mono
audio = audio.set_frame_rate(16000) # standard ASR rate
audio.export("clean.wav", format="wav")
Additional techniques:
- Noise reduction: use the noisereduce library with a noise profile sample
- Voice activity detection: trim non-speech segments with webrtcvad before sending to the engine
- Normalization: consistent RMS levels improve recognition across different recording conditions
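To illustrate the normalization item, here is a minimal standard-library sketch of RMS normalization. It assumes 16-bit native-endian mono PCM; the function name `normalize_rms` and the target level of 3000 are illustrative choices, not recommended values. Note that pydub's normalize() works on peak level, so this is a different technique from the one in the pydub example.

```python
import array
import math

def normalize_rms(pcm: bytes, target_rms: float = 3000.0) -> bytes:
    """Scale 16-bit native-endian mono PCM so its RMS matches target_rms."""
    samples = array.array("h", pcm)
    if not samples:
        return pcm
    current = math.sqrt(sum(s * s for s in samples) / len(samples))
    if current == 0:
        return pcm  # pure digital silence: nothing to scale
    gain = target_rms / current
    # Clamp to the int16 range so loud peaks saturate instead of wrapping.
    scaled = (max(-32768, min(32767, round(s * gain))) for s in samples)
    return array.array("h", scaled).tobytes()

quiet = array.array("h", [100, -100] * 100).tobytes()
louder = array.array("h", normalize_rms(quiet))
print(sorted(set(louder)))  # [-3000, 3000]
```

Clamping keeps the output valid but distorts peaks, so a conservative target level is safer for recordings with a wide dynamic range.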
Error handling patterns
Production systems need robust error handling beyond basic try/except:
import time
from functools import wraps
import speech_recognition as sr

def retry_recognition(max_retries=3, backoff=1.0):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except sr.RequestError:
                    if attempt < max_retries - 1:
                        # Exponential backoff: 1x, 2x, 4x, ... the base delay
                        time.sleep(backoff * (2 ** attempt))
                    else:
                        raise
            return None
        return wrapper
    return decorator

@retry_recognition(max_retries=3)
def transcribe(recognizer, audio):
    return recognizer.recognize_google(audio)
For UnknownValueError, implement fallback chains — try Google first, fall back to Whisper, then to Sphinx:
def transcribe_with_fallback(recognizer, audio):
    engines = [
        ("google", lambda: recognizer.recognize_google(audio)),
        ("whisper", lambda: recognizer.recognize_whisper(audio, model="base")),
        ("sphinx", lambda: recognizer.recognize_sphinx(audio)),
    ]
    for name, recognize in engines:
        try:
            return recognize()
        except (sr.UnknownValueError, sr.RequestError):
            continue  # try the next engine in the chain
    return "[transcription failed]"
Building a voice command system
Combine continuous listening with command parsing:
import re

# play_song, set_timer, and get_info are application-defined handlers
COMMANDS = {
    r"play (.+)": lambda m: play_song(m.group(1)),
    r"set timer (\d+) minutes": lambda m: set_timer(int(m.group(1))),
    r"what('s| is) the (time|weather)": lambda m: get_info(m.group(2)),
}

def process_command(text):
    text = text.lower().strip()
    for pattern, handler in COMMANDS.items():
        match = re.match(pattern, text)
        if match:
            handler(match)
            return True
    return False

def callback(recognizer, audio):
    try:
        text = recognizer.recognize_google(audio)
        if not process_command(text):
            print(f"Unknown command: {text}")
    except sr.UnknownValueError:
        pass

stop = recognizer.listen_in_background(sr.Microphone(), callback)
Performance and resource considerations
- Network latency: cloud engines add 200-1000 ms round-trip; batch non-urgent transcription
- Memory: Whisper large loads ~3GB into VRAM; use smaller models on constrained devices
- CPU: Sphinx runs on CPU but is the least accurate; Whisper on CPU is slow for large models
- Concurrency: Recognizer is not thread-safe; create separate instances per thread
- Rate limits: Google Web Speech has undocumented rate limits; for heavy use, switch to Cloud with a paid quota
Testing speech recognition code
Testing audio code is challenging because microphone input is non-deterministic. Strategies:
- Record test fixtures as WAV files and use AudioFile in tests
- Mock the recognition engine to return known strings
- Use the AudioData constructor directly with known PCM bytes
- Measure word error rate (WER) against reference transcripts for accuracy regression testing
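For the WER strategy, a small dependency-free sketch is shown below. It uses the common definition of WER as word-level edit distance divided by the number of reference words; the function name `word_error_rate` is hypothetical, and the lowercase-and-split normalization is a simplifying assumption (punctuation is not stripped).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)

wer = word_error_rate("turn on the kitchen lights", "turn on kitchen light")
print(f"{wer:.2f}")  # 0.40: one deletion and one substitution over five words
```

In a regression test, the result would be compared against a stored baseline for each fixture so that an engine or pre-processing change that hurts accuracy fails the build.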
The one thing to remember: Building reliable speech-to-text means layering audio pre-processing, engine selection, error handling with fallbacks, and threshold tuning on top of SpeechRecognition’s simple API — the library handles plumbing, but production quality comes from the pipeline around it.