Python Speech Recognition — Core Concepts

Understand the SpeechRecognition library's Recognizer and AudioData model, supported engines, microphone capture, and error handling for speech-to-text in Python.

What SpeechRecognition does

The speech_recognition library provides a unified Python interface for converting spoken audio into text. It abstracts away the differences between recognition engines: whether you use Google, Sphinx, Wit.ai, or Whisper, the surrounding code stays the same, and only the recognize_* call and any credentials it needs change.

Install with pip install SpeechRecognition. For microphone support, also install PyAudio (pip install PyAudio).

The Recognizer object

Recognizer is the central class. It holds configuration (energy thresholds, pause timing) and provides methods for each supported engine:

import speech_recognition as sr

recognizer = sr.Recognizer()

Key engine methods:

recognize_google(audio) — Google Web Speech API (free, no key needed for basic use)
recognize_google_cloud(audio, credentials_json) — Google Cloud Speech-to-Text
recognize_sphinx(audio) — CMU Sphinx (offline, no network required)
recognize_whisper(audio, model) — OpenAI Whisper (local, high accuracy)
recognize_wit(audio, key) — Wit.ai
recognize_azure(audio, key) — Microsoft Azure Speech

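All methods accept an AudioData object and, by default, return the recognized text. Depending on the method and library version, some can also return the engine's raw response (show_all=True) or a confidence score alongside the text.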
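
Because every engine method follows the same recognize_<name>(audio, ...) naming pattern, the engine can be chosen at runtime. The helper below is a minimal sketch of that idea; transcribe and its engine argument are hypothetical names introduced for illustration, not part of the library.

```python
def transcribe(recognizer, audio, engine="google", **engine_kwargs):
    """Call recognizer.recognize_<engine>(audio, **engine_kwargs).

    `transcribe` is an illustrative helper, not a library function.
    `recognizer` is expected to be an sr.Recognizer and `audio` an
    sr.AudioData, but any object exposing recognize_* methods works.
    """
    method = getattr(recognizer, f"recognize_{engine}", None)
    if method is None:
        raise ValueError(f"no such engine method: recognize_{engine}")
    # Extra keyword arguments (language, key, model, ...) pass straight through.
    return method(audio, **engine_kwargs)
```

With a real recognizer this might be called as transcribe(recognizer, audio, engine="sphinx") or transcribe(recognizer, audio, language="es-ES").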

Capturing from a microphone

The Microphone class wraps PyAudio to capture live audio:

with sr.Microphone() as source:
    recognizer.adjust_for_ambient_noise(source, duration=1)
    print("Speak now...")
    audio = recognizer.listen(source)

adjust_for_ambient_noise samples the room for the given duration and sets the energy threshold so that background hum is ignored. listen() blocks until the input rises above that threshold, keeps recording until a sufficiently long pause follows, and returns an AudioData object.
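
To make the energy threshold concrete, the sketch below computes the root-mean-square (RMS) energy of a buffer of 16-bit PCM samples and compares it with a threshold. It is a simplified illustration of the idea, not the library's own implementation; rms_energy, is_speech, tone, and the synthetic signals are assumptions made for the example. The value 300 mirrors the library's default energy_threshold.

```python
import math
import struct


def rms_energy(pcm_bytes):
    """Return the RMS amplitude of little-endian signed 16-bit PCM audio."""
    count = len(pcm_bytes) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", pcm_bytes[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def is_speech(pcm_bytes, energy_threshold=300):
    """Treat a buffer as speech when its energy exceeds the threshold."""
    return rms_energy(pcm_bytes) > energy_threshold


def tone(amplitude, seconds=1.0, rate=16000, freq=440.0):
    """Synthesize a sine tone as 16-bit PCM, standing in for microphone input."""
    n = int(seconds * rate)
    values = (int(amplitude * math.sin(2 * math.pi * freq * i / rate)) for i in range(n))
    return struct.pack(f"<{n}h", *values)


quiet_hum = tone(amplitude=100)      # low-level background noise
loud_voice = tone(amplitude=10000)   # well above the noise floor

print(is_speech(quiet_hum))   # False: RMS is about 71, below 300
print(is_speech(loud_voice))  # True: RMS is about 7071, above 300
```

adjust_for_ambient_noise effectively picks the threshold for you by measuring the room, so a fixed value like 300 is only a starting point.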

Key parameters on listen():

timeout — max seconds to wait for speech to start (raises WaitTimeoutError)
phrase_time_limit — max seconds to record once speech starts
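
The two parameters combine naturally into a bounded capture helper. The sketch below needs a working microphone when used for real; listen_once is a hypothetical helper name, and the import sits inside the function only so that the definition loads on machines where the library is not installed.

```python
def listen_once(recognizer, source, timeout=5, phrase_time_limit=10):
    """Capture a single phrase, or return None if nobody starts speaking.

    `listen_once` is an illustrative helper, not part of the library.
    `source` must be an already-opened audio source, for example the
    object bound by `with sr.Microphone() as source:`.
    """
    import speech_recognition as sr  # lazy import, see the note above

    try:
        return recognizer.listen(
            source, timeout=timeout, phrase_time_limit=phrase_time_limit
        )
    except sr.WaitTimeoutError:
        # No speech began within `timeout` seconds.
        return None
```

Inside a with sr.Microphone() as source: block, audio = listen_once(recognizer, source) then yields either an AudioData object or None, which the caller can use to re-prompt the user.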

Loading from files

For pre-recorded audio, use AudioFile:

with sr.AudioFile("lecture.wav") as source:
    audio = recognizer.record(source)
    # or record a specific segment:
    # audio = recognizer.record(source, offset=10, duration=30)

AudioFile reads WAV, AIFF, and FLAC natively. MP3 and other formats must be converted first, for example to WAV with Pydub (which itself relies on ffmpeg) or with the ffmpeg command-line tool; the library does not perform this conversion for you.
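
The offset and duration arguments to record() also allow a long recording to be processed in fixed-size windows rather than in one request. The following is a sketch under stated assumptions: it handles WAV input only (the standard-library wave module reads the length), it uses the Google Web Speech engine, and chunk_windows, wav_duration, and transcribe_in_chunks are hypothetical helper names. Splitting at fixed times can cut a word in half, so treat it as a starting point.

```python
import wave


def chunk_windows(total_seconds, chunk_seconds):
    """Return (offset, duration) pairs that together cover total_seconds."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    windows = []
    offset = 0
    while offset < total_seconds:
        windows.append((offset, min(chunk_seconds, total_seconds - offset)))
        offset += chunk_seconds
    return windows


def wav_duration(path):
    """Length of a WAV file in seconds, read from its header."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def transcribe_in_chunks(path, chunk_seconds=30):
    """Transcribe a WAV file window by window (needs network access)."""
    import speech_recognition as sr  # lazy import keeps the helpers above stdlib-only

    recognizer = sr.Recognizer()
    pieces = []
    for offset, duration in chunk_windows(wav_duration(path), chunk_seconds):
        # Reopen the file for each window so `offset` counts from the start.
        with sr.AudioFile(path) as source:
            audio = recognizer.record(source, offset=offset, duration=duration)
        try:
            pieces.append(recognizer.recognize_google(audio))
        except sr.UnknownValueError:
            pieces.append("")  # nothing intelligible in this window
    return " ".join(p for p in pieces if p)
```

For a 65-second file and 30-second chunks, chunk_windows produces three windows: two full ones and a final 5-second remainder.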

Recognition and error handling

try:
    text = recognizer.recognize_google(audio)
    print(f"You said: {text}")
except sr.UnknownValueError:
    print("Could not understand the audio")
except sr.RequestError as e:
    print(f"Engine error: {e}")

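UnknownValueError means the engine received audio but could not match it to words. RequestError means the engine itself could not be used: for cloud-based engines, typically a failed network request, an invalid key, or an exhausted quota; for offline engines, typically a missing installation or missing language data.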
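
The two exception types suggest different reactions: a RequestError is worth retrying on another engine, while an UnknownValueError usually means the audio itself was the problem. The sketch below encodes that policy. transcribe_with_fallback is a hypothetical helper, the order (Google first, Sphinx second) is an assumption, and the Sphinx fallback only works if pocketsphinx is installed.

```python
def transcribe_with_fallback(recognizer, audio,
                             engines=("recognize_google", "recognize_sphinx")):
    """Try each engine method in order and return (text, method_name).

    Returns (None, method_name) when an engine ran but understood nothing.
    Raises RuntimeError if every engine failed with RequestError.
    Illustrative helper only; not part of the library.
    """
    import speech_recognition as sr  # lazy import keeps the sketch loadable anywhere

    errors = []
    for name in engines:
        try:
            return getattr(recognizer, name)(audio), name
        except sr.UnknownValueError:
            # The engine worked but found no words; stop rather than
            # hand unintelligible audio to the next engine.
            return None, name
        except sr.RequestError as exc:
            # Engine unreachable or unusable: remember why and move on.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all engines failed: " + "; ".join(errors))
```

Returning the method name alongside the text makes it easy to log which engine actually produced a transcript.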

Language support

Most engines accept a language parameter using BCP-47 codes:

text = recognizer.recognize_google(audio, language="es-ES")  # Spanish
text = recognizer.recognize_google(audio, language="ja-JP")  # Japanese

Available languages depend on the engine. Google supports over 120 languages; Sphinx works offline but ships with only US English, and other language models must be installed separately.

AudioData internals

AudioData stores raw PCM audio as bytes, with a sample rate and sample width. You can export it for other uses:

wav_bytes = audio.get_wav_data()
raw_bytes = audio.get_raw_data()
flac_bytes = audio.get_flac_data()

This makes it easy to save recordings, pipe audio to other libraries, or debug by listening to what was actually captured.
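
As a sketch of the debugging use case, the helpers below write a capture to disk and read its basic properties back with the standard-library wave module. save_capture and describe_wav are hypothetical names; audio is assumed to be any object with a get_wav_data() method, such as an AudioData instance.

```python
import io
import wave


def save_capture(audio, path):
    """Write the WAV encoding of `audio` to `path` and return the byte count."""
    wav_bytes = audio.get_wav_data()
    with open(path, "wb") as f:
        f.write(wav_bytes)
    return len(wav_bytes)


def describe_wav(wav_bytes):
    """Return (sample_rate, sample_width_bytes, duration_seconds) for WAV data."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getsampwidth(), wav.getnframes() / rate
```

After a capture, calling save_capture(audio, "capture.wav") and playing the file back is often the quickest way to tell a bad microphone from a bad transcription.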

Choosing an engine

Engine	Network	Cost	Accuracy	Setup
Google Web	Yes	Free (limited)	Good	None
Google Cloud	Yes	Paid	Excellent	Service account credentials (JSON)
Whisper	No	Free	Excellent	`pip install openai-whisper`
Sphinx	No	Free	Fair	`pip install pocketsphinx`
Azure	Yes	Paid	Excellent	API key

For quick prototyping, Google Web works without configuration. For production offline use, Whisper offers the best accuracy. Sphinx is lightweight but less accurate on diverse accents.

The one thing to remember: SpeechRecognition gives you a single API surface — Recognizer, AudioData, and engine methods — that works the same way whether audio comes from a microphone or a file, and whether the engine runs locally or in the cloud.
