Python Speech Recognition — Core Concepts

What SpeechRecognition does

The speech_recognition library provides a unified Python interface for converting spoken audio into text. It abstracts away the differences between recognition engines — you write the same code whether you use Google, Sphinx, Wit.ai, or Whisper.

Install the library with pip install SpeechRecognition. Microphone input additionally requires PyAudio (pip install pyaudio).

The Recognizer object

Recognizer is the central class. It holds configuration (energy thresholds, pause timing) and provides methods for each supported engine:

import speech_recognition as sr

recognizer = sr.Recognizer()

Key engine methods:

  • recognize_google(audio) — Google Web Speech API (free, no key needed for basic use)
  • recognize_google_cloud(audio, credentials_json) — Google Cloud Speech-to-Text
  • recognize_sphinx(audio) — CMU Sphinx (offline, no network required)
  • recognize_whisper(audio, model) — OpenAI Whisper (local, high accuracy)
  • recognize_wit(audio, key) — Wit.ai
  • recognize_azure(audio, key) — Microsoft Azure Speech

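Each method takes an AudioData object as its first argument and, by default, returns the recognized text as a string. Some engines can return a richer result on request (for example, show_all=True on recognize_google returns the raw API response).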
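Because the engine methods share the recognize_<engine> naming pattern listed above, the engine can be picked at runtime. The following is a minimal sketch rather than part of the library: transcribe is a hypothetical helper, and it assumes only that the recognizer passed in exposes methods with those names.

```python
def transcribe(recognizer, audio, engine="google", **options):
    """Call recognizer.recognize_<engine>(audio, **options) and return its result.

    `transcribe` is a hypothetical helper, not a SpeechRecognition function.
    """
    method = getattr(recognizer, f"recognize_{engine}", None)
    if not callable(method):
        raise ValueError(f"Unsupported engine: {engine!r}")
    return method(audio, **options)


# Possible usage (needs the library installed and an AudioData object `audio`):
#   import speech_recognition as sr
#   text = transcribe(sr.Recognizer(), audio, engine="google", language="en-US")
```

Keeping the engine name as a plain string makes it easy to move the choice into a configuration file or a command-line flag.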

Capturing from a microphone

The Microphone class wraps PyAudio to capture live audio:

with sr.Microphone() as source:
    recognizer.adjust_for_ambient_noise(source, duration=1)
    print("Speak now...")
    audio = recognizer.listen(source)

adjust_for_ambient_noise samples the room for the given duration and sets recognizer.energy_threshold so that background hum is ignored. listen() blocks until speech is detected, records until silence returns, and returns an AudioData object.
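To make the idea of an energy threshold concrete, the sketch below computes the root-mean-square amplitude of a chunk of 16-bit PCM audio and compares it with a threshold. It illustrates the concept only: rms_energy and is_speech are hypothetical names, the threshold value of 300 is just an example, and the library's own detection logic is more involved.

```python
import math
import struct


def rms_energy(pcm_bytes):
    """Root-mean-square amplitude of signed 16-bit little-endian mono PCM."""
    count = len(pcm_bytes) // 2          # two bytes per sample
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", pcm_bytes[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def is_speech(pcm_bytes, energy_threshold=300):
    """Treat a chunk as speech when its energy exceeds the threshold."""
    return rms_energy(pcm_bytes) > energy_threshold


quiet_chunk = struct.pack("<4h", 10, -10, 10, -10)          # low-level hum
loud_chunk = struct.pack("<4h", 1000, -1000, 1000, -1000)   # someone talking
print(is_speech(quiet_chunk), is_speech(loud_chunk))
```

A threshold calibrated in a quiet room will sit lower than one calibrated next to a fan, which is why calibration is usually done in the environment where recording will happen.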

Key parameters on listen():

  • timeout — max seconds to wait for speech to start (raises WaitTimeoutError)
  • phrase_time_limit — max seconds to record once speech starts

Loading from files

For pre-recorded audio, use AudioFile:

with sr.AudioFile("lecture.wav") as source:
    audio = recognizer.record(source)
    # or record a specific segment:
    # audio = recognizer.record(source, offset=10, duration=30)

AudioFile reads WAV (PCM), AIFF/AIFF-C, and native FLAC files. Other formats such as MP3 must be converted first, for example with Pydub (which relies on ffmpeg) or with ffmpeg directly.
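The offset and duration arguments to record() shown above also make it possible to work through a long recording in segments. The sketch below uses only the standard library to measure a WAV file and plan the segments; wav_duration_seconds and segment_plan are hypothetical helpers, and the 30-second segment length is an arbitrary choice.

```python
import wave


def wav_duration_seconds(path):
    """Length of a WAV file in seconds, read from its header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def segment_plan(total_seconds, segment_seconds=30):
    """Return (offset, duration) pairs that cover total_seconds in order."""
    if segment_seconds <= 0:
        raise ValueError("segment_seconds must be positive")
    plan = []
    offset = 0.0
    while offset < total_seconds:
        # The last segment is shortened so it does not run past the end.
        plan.append((offset, min(segment_seconds, total_seconds - offset)))
        offset += segment_seconds
    return plan


# Possible usage with the library (assumes `recognizer` and "lecture.wav"):
#   for offset, duration in segment_plan(wav_duration_seconds("lecture.wav")):
#       with sr.AudioFile("lecture.wav") as source:
#           audio = recognizer.record(source, offset=offset, duration=duration)
```

The usage comment reopens the file for every segment on the assumption that offset is counted from the current position of the stream, so a fresh AudioFile keeps each offset relative to the start of the recording.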

Recognition and error handling

try:
    text = recognizer.recognize_google(audio)
    print(f"You said: {text}")
except sr.UnknownValueError:
    print("Could not understand the audio")
except sr.RequestError as e:
    print(f"Engine error: {e}")

UnknownValueError means the engine received audio but could not match it to words. RequestError means the engine could not be used at all: the network request failed, the API key was rejected, or, for an offline engine, a required component is missing.
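One way to act on these two errors is to try a second engine when the first one fails. The sketch below is an assumption about how such a fallback could be structured, not a feature of the library: transcribe_with_fallback is a hypothetical helper, and the caller decides which exception types count as recoverable.

```python
def transcribe_with_fallback(engines, audio, recoverable):
    """Try each (name, recognize) pair in order and return (name, text).

    `recoverable` is a tuple of exception types that should trigger the next
    engine, for example (sr.UnknownValueError, sr.RequestError).
    """
    failures = []
    for name, recognize in engines:
        try:
            return name, recognize(audio)
        except recoverable as exc:
            failures.append(f"{name}: {exc}")
    raise RuntimeError("All engines failed: " + "; ".join(failures))


# Possible usage (assumes `recognizer` and `audio` from the earlier snippets):
#   engines = [("google", recognizer.recognize_google),
#              ("sphinx", recognizer.recognize_sphinx)]
#   name, text = transcribe_with_fallback(
#       engines, audio, (sr.UnknownValueError, sr.RequestError))
```

Putting a cloud engine first and an offline engine second keeps the application usable when the network is down, at the cost of lower accuracy in that case.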

Language support

Most engines accept a language parameter using BCP-47 codes:

text = recognizer.recognize_google(audio, language="es-ES")  # Spanish
text = recognizer.recognize_google(audio, language="ja-JP")  # Japanese

Available languages, and the exact form of the language code, depend on the engine. Google supports over 120 languages; Sphinx supports fewer but works offline.

AudioData internals

AudioData stores raw PCM audio as bytes, with a sample rate and sample width. You can export it for other uses:

wav_bytes = audio.get_wav_data()
raw_bytes = audio.get_raw_data()
flac_bytes = audio.get_flac_data()

This makes it easy to save recordings, pipe audio to other libraries, or debug by listening to what was actually captured.
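As a small illustration of those uses, the sketch below measures a captured clip and writes it to disk. Both functions are hypothetical helpers; they rely only on the sample_rate and sample_width attributes and the get_raw_data() and get_wav_data() methods described above, and they assume the audio is single-channel.

```python
from pathlib import Path


def duration_seconds(audio):
    """Length of an AudioData-like object in seconds (mono PCM assumed)."""
    bytes_per_second = audio.sample_rate * audio.sample_width
    return len(audio.get_raw_data()) / bytes_per_second


def save_wav(audio, path):
    """Write the WAV-encoded audio to `path`; return the number of bytes written."""
    return Path(path).write_bytes(audio.get_wav_data())


# Possible usage (assumes `audio` came from listen() or record()):
#   print(f"Captured {duration_seconds(audio):.1f} s")
#   save_wav(audio, "debug_capture.wav")
```

Listening to the saved file is often the quickest way to tell whether a poor transcript comes from the engine or from the recording itself.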

Choosing an engine

Engine        Network  Cost            Accuracy   Setup
Google Web    Yes      Free (limited)  Good       None
Google Cloud  Yes      Paid            Excellent  API key
Whisper       No       Free            Excellent  pip install openai-whisper
Sphinx        No       Free            Fair       pip install pocketsphinx
Azure         Yes      Paid            Excellent  API key

For quick prototyping, Google Web works without configuration. For production offline use, Whisper offers the best accuracy. Sphinx is lightweight but less accurate on diverse accents.

The one thing to remember: SpeechRecognition gives you a single API surface — Recognizer, AudioData, and engine methods — that works the same way whether audio comes from a microphone or a file, and whether the engine runs locally or in the cloud.
