Python Speech Recognition — Core Concepts
What SpeechRecognition does
The speech_recognition library provides a unified Python interface for converting spoken audio into text. It abstracts away the differences between recognition engines — you write the same code whether you use Google, Sphinx, Wit.ai, or Whisper.
Install with pip install SpeechRecognition. For microphone support, also install pyaudio.
The Recognizer object
Recognizer is the central class. It holds configuration (energy thresholds, pause timing) and provides methods for each supported engine:
import speech_recognition as sr
recognizer = sr.Recognizer()
Key engine methods:
- recognize_google(audio): Google Web Speech API (free, no key needed for basic use)
- recognize_google_cloud(audio, credentials_json): Google Cloud Speech-to-Text
- recognize_sphinx(audio): CMU Sphinx (offline, no network required)
- recognize_whisper(audio, model): OpenAI Whisper (local, high accuracy)
- recognize_wit(audio, key): Wit.ai
- recognize_azure(audio, key): Microsoft Azure Speech
All methods accept an AudioData object and, by default, return the recognized text as a string. Most also accept show_all=True to return the engine's raw response instead.
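Because every engine method follows the same call pattern, the engine can be picked at runtime by name. The following is a minimal sketch and not part of the library: `ENGINE_METHODS`, `transcribe`, and the short engine names are hypothetical, and the mapping covers only the methods listed above.

```python
# Hypothetical helper: maps a short engine name (made up for this example)
# to the name of the Recognizer method listed above.
ENGINE_METHODS = {
    "google": "recognize_google",
    "google_cloud": "recognize_google_cloud",
    "sphinx": "recognize_sphinx",
    "whisper": "recognize_whisper",
    "wit": "recognize_wit",
    "azure": "recognize_azure",
}


def transcribe(recognizer, audio, engine="google", **engine_kwargs):
    """Call the recognizer method registered for `engine` on `audio`.

    Extra keyword arguments (key, language, model, ...) are passed through
    unchanged, since each engine expects different ones.
    """
    try:
        method_name = ENGINE_METHODS[engine]
    except KeyError:
        raise ValueError(f"unknown engine: {engine!r}") from None
    return getattr(recognizer, method_name)(audio, **engine_kwargs)
```

With a real recognizer, a call might look like `transcribe(sr.Recognizer(), audio, engine="sphinx")`. Switching engines then means changing one argument rather than the surrounding code.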
Capturing from a microphone
The Microphone class wraps PyAudio to capture live audio:
with sr.Microphone() as source:
    recognizer.adjust_for_ambient_noise(source, duration=1)
    print("Speak now...")
    audio = recognizer.listen(source)
adjust_for_ambient_noise samples the room for the given duration and sets the recognizer's energy threshold so that steady background noise is not mistaken for speech. listen() blocks until speech is detected, records until silence returns, and returns an AudioData object.
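To make the idea of an energy threshold concrete, here is a simplified illustration that uses only the standard library. It is not the library's implementation: the function names, the fixed ratio of 1.5, and the assumption of 16-bit mono PCM in native byte order are all choices made for this sketch.

```python
import array
import math


def rms(chunk: bytes) -> float:
    """Root-mean-square amplitude of 16-bit signed PCM samples."""
    samples = array.array("h", chunk)  # "h" = signed 16-bit, native byte order
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def calibrate_threshold(ambient_chunks, ratio=1.5):
    """Set the threshold a fixed ratio above the average ambient energy."""
    levels = [rms(chunk) for chunk in ambient_chunks]
    if not levels:
        raise ValueError("need at least one ambient chunk")
    return ratio * sum(levels) / len(levels)


def is_speech(chunk: bytes, threshold: float) -> bool:
    """Treat a chunk as speech when its energy exceeds the threshold."""
    return rms(chunk) > threshold
```

Calibrating on a second of room noise and then comparing each incoming chunk against the result is the general shape of what the paragraph above describes.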
Key parameters on listen():
- timeout: max seconds to wait for speech to start (raises WaitTimeoutError if exceeded)
- phrase_time_limit: max seconds to record once speech starts
Loading from files
For pre-recorded audio, use AudioFile:
with sr.AudioFile("lecture.wav") as source:
    audio = recognizer.record(source)
    # or record a specific segment:
    # audio = recognizer.record(source, offset=10, duration=30)
AudioFile reads WAV, AIFF, and FLAC natively. For MP3 or other formats, convert to one of these first, for example with pydub (which relies on ffmpeg) or with the ffmpeg command-line tool.
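The offset and duration arguments also make it possible to process a long recording in fixed-size pieces. The sketch below assumes the total length in seconds is already known. `chunk_windows` and `record_in_chunks` are hypothetical helpers, and `open_source` stands for any zero-argument callable that returns a fresh audio source, such as `lambda: sr.AudioFile("lecture.wav")`.

```python
def chunk_windows(total_seconds, chunk_seconds):
    """Yield (offset, duration) pairs covering total_seconds without overlap."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    offset = 0.0
    while offset < total_seconds:
        # The final window is shortened so it does not run past the end.
        yield offset, min(chunk_seconds, total_seconds - offset)
        offset += chunk_seconds


def record_in_chunks(recognizer, open_source, total_seconds, chunk_seconds=30):
    """Record each window separately and return the pieces in order.

    The source is reopened for every window so that each offset is
    measured from the start of the file.
    """
    chunks = []
    for offset, duration in chunk_windows(total_seconds, chunk_seconds):
        with open_source() as source:
            chunks.append(recognizer.record(source, offset=offset, duration=duration))
    return chunks
```

Fixed boundaries can split a word in two, so a real pipeline may prefer overlapping windows or cuts at silences.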
Recognition and error handling
try:
    text = recognizer.recognize_google(audio)
    print(f"You said: {text}")
except sr.UnknownValueError:
    print("Could not understand the audio")
except sr.RequestError as e:
    print(f"Engine error: {e}")
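UnknownValueError means the engine received audio but could not match it to words. RequestError means the engine itself could not complete the request: for cloud-based engines this is typically a network failure or a rejected key, and for offline engines such as Sphinx it signals a problem with the local installation.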
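A RequestError from a cloud engine can be transient (a dropped connection, for example), while an UnknownValueError will usually repeat for the same audio, so one common pattern is to retry only the former. The sketch below is a generic retry helper and not something the library provides. `with_retries` and its parameters are hypothetical, and the caller passes in the exception type to retry on.

```python
import time


def with_retries(call, retry_on, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run call() and retry with exponential backoff when retry_on is raised.

    Any other exception propagates immediately. The last failure is
    re-raised once all attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise
            # Wait longer after each failure: base_delay, then 2x, then 4x.
            sleep(base_delay * 2 ** attempt)
```

With the recognizer from earlier, this might be used as `with_retries(lambda: recognizer.recognize_google(audio), retry_on=sr.RequestError)`, leaving UnknownValueError to propagate to the caller.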
Language support
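Most engines accept a language parameter. Google's engines take BCP-47 tags such as es-ES; the accepted values can differ for other engines, so check each method's documentation: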
text = recognizer.recognize_google(audio, language="es-ES") # Spanish
text = recognizer.recognize_google(audio, language="ja-JP") # Japanese
Available languages depend on the engine. Google supports over 120 languages; Sphinx supports fewer but works offline.
AudioData internals
AudioData stores raw PCM audio as bytes, with a sample rate and sample width. You can export it for other uses:
wav_bytes = audio.get_wav_data()
raw_bytes = audio.get_raw_data()
flac_bytes = audio.get_flac_data()
This makes it easy to save recordings, pipe audio to other libraries, or debug by listening to what was actually captured.
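The relationship between the raw bytes, the sample rate, and the sample width can be shown with the standard library alone. This sketch assumes mono PCM and uses hypothetical helper names. It mirrors the kind of data AudioData holds and does not call the library itself.

```python
import io
import wave


def pcm_duration_seconds(raw: bytes, sample_rate: int, sample_width: int) -> float:
    """Length of mono PCM audio: bytes / (samples per second * bytes per sample)."""
    return len(raw) / (sample_rate * sample_width)


def pcm_to_wav_bytes(raw: bytes, sample_rate: int, sample_width: int) -> bytes:
    """Wrap raw mono PCM frames in a WAV container, entirely in memory."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)              # assumption: mono audio
        wav.setsampwidth(sample_width)   # bytes per sample, e.g. 2 for 16-bit
        wav.setframerate(sample_rate)    # samples per second, e.g. 16000
        wav.writeframes(raw)
    return buffer.getvalue()
```

For a captured AudioData object, the bytes returned by audio.get_wav_data() can be written straight to a file opened in binary mode, which is often the quickest way to listen back to what was recorded.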
Choosing an engine
| Engine | Network | Cost | Accuracy | Setup |
|---|---|---|---|---|
| Google Web | Yes | Free (limited) | Good | None |
| Google Cloud | Yes | Paid | Excellent | Credentials JSON |
| Whisper | No | Free | Excellent | pip install openai-whisper |
| Sphinx | No | Free | Fair | pip install pocketsphinx |
| Azure | Yes | Paid | Excellent | API key |
For quick prototyping, Google Web works without configuration. For production offline use, Whisper offers the best accuracy. Sphinx is lightweight but less accurate on diverse accents.
The one thing to remember: SpeechRecognition gives you a single API surface — Recognizer, AudioData, and engine methods — that works the same way whether audio comes from a microphone or a file, and whether the engine runs locally or in the cloud.