Python Voice Assistant Integration — Deep Dive
Voice Pipeline Architecture
A production voice assistant is an audio pipeline with three stages: capture → process → synthesize. Each stage introduces latency and potential errors. The engineering challenge is minimizing both while maintaining a natural conversational feel.
Microphone → VAD → STT → NLU + Dialog → NLG → TTS → Speaker
              ↑                                        |
              └────── Turn management ─────────────────┘
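Because every stage adds latency, it helps to measure each one separately before optimizing. The sketch below is one possible way to do that with the standard library; the `StageTimer` class and the stage names are illustrative assumptions, not part of any library used in this article.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per pipeline stage, in milliseconds."""

    def __init__(self):
        self.elapsed_ms: dict[str, float] = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Add to any earlier measurement for the same stage
            spent = (time.perf_counter() - start) * 1000.0
            self.elapsed_ms[name] = self.elapsed_ms.get(name, 0.0) + spent

    def total_ms(self) -> float:
        return sum(self.elapsed_ms.values())


timer = StageTimer()
with timer.stage("stt"):
    time.sleep(0.01)  # stand-in for a real transcription call
with timer.stage("tts"):
    pass              # stand-in for a real synthesis call
```

Wrapping the real STT, chatbot, and TTS calls in `timer.stage(...)` gives a per-turn breakdown that can be logged or compared against a target.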
Speech-to-Text (STT)
Local STT with Whisper
OpenAI’s Whisper model provides excellent accuracy without sending audio to the cloud:
import whisper
import numpy as np
import sounddevice as sd

model = whisper.load_model("base")  # tiny, base, small, medium, large

def record_audio(duration: float = 5.0, sample_rate: int = 16000) -> np.ndarray:
    audio = sd.rec(
        int(duration * sample_rate),
        samplerate=sample_rate,
        channels=1,
        dtype="float32",
    )
    sd.wait()  # block until the recording finishes
    return audio.flatten()

def transcribe(audio: np.ndarray) -> str:
    # Whisper expects 16 kHz mono float32 samples; fp16=False avoids a warning on CPU
    result = model.transcribe(audio, language="en", fp16=False)
    return result["text"].strip()
Whisper model sizes and tradeoffs:
| Model | Parameters | VRAM | Speed (rel.) | WER (English) |
|---|---|---|---|---|
| tiny | 39M | ~1 GB | ~32x | 7.6% |
| base | 74M | ~1 GB | ~16x | 5.0% |
| small | 244M | ~2 GB | ~6x | 3.4% |
| medium | 769M | ~5 GB | ~2x | 2.9% |
| large | 1.5B | ~10 GB | 1x | 2.7% |
For real-time applications, tiny or base provide the best latency-accuracy tradeoff.
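The table can be turned into a simple selection rule. The sketch below hard-codes the figures from the table and picks the lowest-WER model that fits a VRAM budget and a minimum relative speed. The names `ModelSpec` and `pick_model` and the example thresholds are illustrative assumptions, not part of Whisper's API.

```python
from typing import NamedTuple, Optional


class ModelSpec(NamedTuple):
    name: str
    vram_gb: float    # approximate VRAM, from the table
    rel_speed: float  # speed relative to "large", from the table
    wer_pct: float    # English WER, from the table


MODELS = [
    ModelSpec("tiny", 1, 32, 7.6),
    ModelSpec("base", 1, 16, 5.0),
    ModelSpec("small", 2, 6, 3.4),
    ModelSpec("medium", 5, 2, 2.9),
    ModelSpec("large", 10, 1, 2.7),
]


def pick_model(max_vram_gb: float, min_rel_speed: float) -> Optional[str]:
    """Return the lowest-WER model meeting both constraints, or None."""
    candidates = [
        m for m in MODELS
        if m.vram_gb <= max_vram_gb and m.rel_speed >= min_rel_speed
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda m: m.wer_pct).name
```

For example, `pick_model(2, 10)` returns `"base"`: `small` fits in 2 GB but is too slow for the requested speed, and `base` beats `tiny` on WER.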
Streaming STT with faster-whisper
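The faster-whisper library uses CTranslate2 for optimized inference. Its transcribe() call returns segments as a lazy generator, so text can be consumed as each segment is decoded rather than after the whole file has been processed: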
from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu", compute_type="int8")

def transcribe_streaming(audio_path: str) -> str:
    # segments is a generator: decoding happens as it is iterated
    segments, info = model.transcribe(
        audio_path,
        beam_size=5,
        language="en",
        vad_filter=True,
        vad_parameters=dict(min_silence_duration_ms=500),
    )
    return " ".join(segment.text.strip() for segment in segments)
Cloud STT with Google
For highest accuracy and language support:
from google.cloud import speech_v1

def transcribe_google(audio_bytes: bytes, sample_rate: int = 16000) -> str:
    client = speech_v1.SpeechClient()
    audio = speech_v1.RecognitionAudio(content=audio_bytes)
    config = speech_v1.RecognitionConfig(
        encoding=speech_v1.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=sample_rate,
        language_code="en-US",
        model="latest_long",
        enable_automatic_punctuation=True,
        speech_contexts=[
            speech_v1.SpeechContext(
                phrases=["book a flight", "cancel reservation", "check status"],
                boost=15.0,
            )
        ],
    )
    response = client.recognize(config=config, audio=audio)
    # Each result carries ranked alternatives; take the top one
    return " ".join(r.alternatives[0].transcript for r in response.results)
The speech_contexts parameter biases recognition toward domain-specific phrases, which can improve accuracy for specialized vocabulary such as product names or commands.
Voice Activity Detection (VAD)
Silero VAD
Silero VAD is a lightweight model that detects speech segments in audio:
import torch

model_vad, utils = torch.hub.load(
    "snakers4/silero-vad", "silero_vad", force_reload=False
)
get_speech_timestamps, _, read_audio, *_ = utils

def detect_speech_segments(audio_path: str) -> list[dict]:
    wav = read_audio(audio_path, sampling_rate=16000)
    timestamps = get_speech_timestamps(wav, model_vad, sampling_rate=16000)
    return timestamps  # [{"start": 1000, "end": 5000}, ...] in samples, not milliseconds
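By default the start and end values are sample indices, so downstream code usually converts them to seconds and joins segments separated by very short gaps. A minimal sketch of that post-processing follows; the helper names `to_seconds` and `merge_close` and the 0.3 s gap are assumptions for illustration, not Silero utilities.

```python
def to_seconds(timestamps: list[dict], sampling_rate: int = 16000) -> list[tuple[float, float]]:
    """Convert sample-index timestamps to (start, end) pairs in seconds."""
    return [
        (t["start"] / sampling_rate, t["end"] / sampling_rate)
        for t in timestamps
    ]


def merge_close(segments: list[tuple[float, float]], max_gap_s: float = 0.3) -> list[tuple[float, float]]:
    """Merge consecutive segments whose gap is at most max_gap_s seconds.

    Assumes segments are sorted by start time.
    """
    merged: list[tuple[float, float]] = []
    for start, end in segments:
        if merged and start - merged[-1][1] <= max_gap_s:
            # Extend the previous segment instead of starting a new one
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

Merging keeps a brief breath or hesitation from splitting one utterance into several transcription requests.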
Endpoint Detection for Turn Management
Detecting when the user has finished speaking is critical. Too early and you cut them off; too late and the conversation feels sluggish:
class EndpointDetector:
    def __init__(self, silence_threshold_ms: int = 700, min_speech_ms: int = 300):
        self.silence_threshold = silence_threshold_ms
        self.min_speech = min_speech_ms
        self.reset()

    def process_frame(self, is_speech: bool, frame_duration_ms: int = 30) -> str:
        if is_speech:
            self.speech_started = True
            self.speech_duration_ms += frame_duration_ms
            self.silence_duration_ms = 0
            return "speaking"
        if self.speech_started:
            self.silence_duration_ms += frame_duration_ms
            if self.silence_duration_ms >= self.silence_threshold:
                # Enough trailing silence: end the turn, or discard a burst too
                # short to count as speech so it cannot stall the detector.
                long_enough = self.speech_duration_ms >= self.min_speech
                self.reset()
                return "endpoint" if long_enough else "silence"
            return "pause"
        return "silence"

    def reset(self):
        self.speech_started = False
        self.speech_duration_ms = 0
        self.silence_duration_ms = 0
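The endpoint rule is easier to tune when it can be run offline over a recorded sequence of per-frame speech flags. The sketch below restates the rule as a pure function under that assumption; `find_endpoint` and `frames_until_endpoint` are hypothetical helper names, and this version discards bursts shorter than the minimum speech duration.

```python
import math
from typing import Optional, Sequence


def frames_until_endpoint(silence_threshold_ms: int = 700, frame_ms: int = 30) -> int:
    """Number of consecutive silent frames needed to reach the threshold."""
    return math.ceil(silence_threshold_ms / frame_ms)


def find_endpoint(
    flags: Sequence[bool],
    frame_ms: int = 30,
    silence_threshold_ms: int = 700,
    min_speech_ms: int = 300,
) -> Optional[int]:
    """Return the index of the frame at which the turn ends, or None."""
    speech_ms = 0
    silence_ms = 0
    for i, is_speech in enumerate(flags):
        if is_speech:
            speech_ms += frame_ms
            silence_ms = 0
        elif speech_ms > 0:
            silence_ms += frame_ms
            if silence_ms >= silence_threshold_ms:
                if speech_ms >= min_speech_ms:
                    return i
                # Burst was too short to be speech: start over
                speech_ms = 0
                silence_ms = 0
    return None
```

With 30 ms frames a 700 ms threshold needs 24 silent frames, so the effective delay is 720 ms rather than 700 ms. That rounding matters when the threshold is tuned in small steps.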
Text-to-Speech (TTS)
Offline TTS with pyttsx3
Simple, no-network-required TTS:
import pyttsx3

engine = pyttsx3.init()
engine.setProperty("rate", 175)  # Words per minute
engine.setProperty("volume", 0.9)

# List available voices; their order and gender vary by platform
voices = engine.getProperty("voices")
if len(voices) > 1:
    engine.setProperty("voice", voices[1].id)  # Often a female voice

def speak(text: str):
    engine.say(text)
    engine.runAndWait()  # blocks until playback finishes
Neural TTS with Coqui
High-quality open-source neural TTS:
from TTS.api import TTS

tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC", progress_bar=False)

def synthesize(text: str, output_path: str = "response.wav") -> str:
    tts.tts_to_file(text=text, file_path=output_path)
    return output_path
Cloud TTS with Streaming
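For the lowest perceived latency, start playback as soon as audio is available. The example below uses the unary synthesize_speech call, which returns the whole clip at once, so playback begins only after synthesis completes. Splitting a long response into sentences and synthesizing them one at a time approximates streaming: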
import io
import wave

import pyaudio
from google.cloud import texttospeech_v1

def stream_tts(text: str):
    client = texttospeech_v1.TextToSpeechClient()
    synthesis_input = texttospeech_v1.SynthesisInput(text=text)
    voice = texttospeech_v1.VoiceSelectionParams(
        language_code="en-US",
        name="en-US-Neural2-F",
    )
    audio_config = texttospeech_v1.AudioConfig(
        audio_encoding=texttospeech_v1.AudioEncoding.LINEAR16,
        sample_rate_hertz=24000,
    )
    response = client.synthesize_speech(
        input=synthesis_input, voice=voice, audio_config=audio_config
    )
    # LINEAR16 responses include a WAV header; read the raw PCM frames
    # so the header is not played as noise.
    with wave.open(io.BytesIO(response.audio_content), "rb") as wav:
        pcm = wav.readframes(wav.getnframes())
    # Play audio
    p = pyaudio.PyAudio()
    stream = p.open(format=pyaudio.paInt16, channels=1, rate=24000, output=True)
    try:
        stream.write(pcm)
    finally:
        stream.stop_stream()
        stream.close()
        p.terminate()
Complete Voice Assistant Pipeline
Async Pipeline with Latency Optimization
import asyncio
from dataclasses import dataclass

import numpy as np
import sounddevice as sd
from faster_whisper import WhisperModel

# EndpointDetector and synthesize() come from the earlier sections.
# detect_speech(frame) -> bool and play_audio(path) are assumed to be defined elsewhere.

@dataclass
class VoiceConfig:
    stt_model: str = "base"
    tts_model: str = "neural"
    silence_threshold_ms: int = 700
    sample_rate: int = 16000

class VoiceAssistant:
    def __init__(self, chatbot, config: VoiceConfig | None = None):
        self.chatbot = chatbot
        self.config = config or VoiceConfig()
        self.stt = WhisperModel(self.config.stt_model, device="cpu", compute_type="int8")
        self.vad = EndpointDetector(silence_threshold_ms=self.config.silence_threshold_ms)
        self.is_listening = False

    def _transcribe(self, audio: np.ndarray) -> str:
        segments, _ = self.stt.transcribe(audio, beam_size=3, language="en")
        return " ".join(s.text.strip() for s in segments).strip()

    async def process_utterance(self, audio: np.ndarray) -> str:
        """Return the path of the synthesized reply, or "" if nothing was heard."""
        # STT (run in a worker thread so the event loop is not blocked)
        transcript = await asyncio.to_thread(self._transcribe, audio)
        if not transcript:
            return ""
        # Chatbot processing
        response = await self.chatbot.handle(transcript)
        # TTS (synthesize audio, also off the event loop)
        return await asyncio.to_thread(synthesize, response.text)

    async def conversation_loop(self):
        print("Voice assistant ready. Speak now...")
        self.is_listening = True
        while self.is_listening:
            # Record until endpoint detected
            audio = await self._record_until_endpoint()
            if audio is None:
                continue
            # Process and respond
            response_audio = await self.process_utterance(audio)
            if response_audio:
                play_audio(response_audio)

    async def _record_until_endpoint(self) -> np.ndarray | None:
        frames = []
        frame_duration_ms = 30
        frame_size = int(self.config.sample_rate * frame_duration_ms / 1000)
        stream = sd.InputStream(
            samplerate=self.config.sample_rate, channels=1, dtype="float32"
        )
        stream.start()
        try:
            while True:
                audio_frame, _ = stream.read(frame_size)  # shape: (frame_size, 1)
                frame = audio_frame[:, 0].copy()  # keep a flat copy of the mono channel
                is_speech = detect_speech(frame)
                status = self.vad.process_frame(is_speech, frame_duration_ms)
                if status in ("speaking", "pause"):
                    frames.append(frame)
                elif status == "endpoint":
                    break
                else:
                    frames.clear()  # no usable speech captured yet
                await asyncio.sleep(0.001)  # yield to other tasks
        finally:
            stream.stop()
            stream.close()
        if frames:
            return np.concatenate(frames)
        return None
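The recording loop calls a per-frame `detect_speech` function that the listing does not define. A minimal energy-based stand-in is sketched below, assuming float samples in the range -1.0 to 1.0; the 0.01 threshold is an arbitrary starting point, and a model-based VAD such as Silero would normally replace it in production.

```python
import math
from typing import Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square level of one audio frame (0.0 for an empty frame)."""
    if len(frame) == 0:
        return 0.0
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def detect_speech(frame: Sequence[float], threshold: float = 0.01) -> bool:
    """Treat a frame as speech when its RMS level reaches the threshold.

    The threshold is an assumption; calibrate it against the room's noise floor.
    """
    return rms(frame) >= threshold
```

With NumPy frames, pass a one-dimensional array (for example `frame.ravel()`) so that each element is a single sample. Energy thresholds misfire on loud background noise, which is the main reason to prefer a trained VAD once the pipeline works end to end.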
Wake Word Detection
Using Porcupine
import struct

import pvporcupine
import pyaudio

def listen_for_wake_word(access_key: str, keyword: str = "porcupine") -> bool:
    # Porcupine 2.0 and later require a Picovoice access key
    porcupine = pvporcupine.create(access_key=access_key, keywords=[keyword])
    pa = pyaudio.PyAudio()
    stream = pa.open(
        rate=porcupine.sample_rate,
        channels=1,
        format=pyaudio.paInt16,
        input=True,
        frames_per_buffer=porcupine.frame_length,
    )
    print(f"Listening for '{keyword}'...")
    try:
        while True:
            pcm = stream.read(porcupine.frame_length)
            # Convert raw bytes to the 16-bit integers Porcupine expects
            pcm = struct.unpack_from("h" * porcupine.frame_length, pcm)
            keyword_index = porcupine.process(pcm)
            if keyword_index >= 0:
                print("Wake word detected!")
                return True
    finally:
        stream.close()
        pa.terminate()
        porcupine.delete()
Latency Budget
For a natural-feeling voice assistant, target under 2 seconds total response time:
| Stage | Target | Optimization |
|---|---|---|
| Audio capture | 700ms | Tune the endpoint-detection silence threshold |
| STT | 200-500ms | Use tiny/base Whisper, int8 quantization |
| Chatbot | 50-200ms | Template responses, cached model |
| TTS | 200-500ms | Stream audio, use neural cache |
| Total | ~1.5s | |
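Summing the per-stage targets shows where the roughly 1.5 s figure comes from and how much headroom is left under the 2 s goal. The sketch below encodes the table as (low, high) millisecond ranges; the dictionary keys and the `over_budget` helper are illustrative names, not part of any library.

```python
# (low, high) targets in milliseconds, taken from the table
BUDGET_MS = {
    "audio_capture": (700, 700),
    "stt": (200, 500),
    "chatbot": (50, 200),
    "tts": (200, 500),
}


def total_latency_ms(budget: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Return the best-case and worst-case totals across all stages."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


def over_budget(measured_ms: dict[str, float], budget: dict[str, tuple[int, int]]) -> list[str]:
    """List the stages whose measured latency exceeds the upper target."""
    return [stage for stage, ms in measured_ms.items() if ms > budget[stage][1]]
```

The totals come to 1150 ms in the best case and 1900 ms in the worst, so the worst case leaves only about 100 ms of slack under the 2 s target. Endpoint detection alone accounts for 700 ms, which makes the silence threshold the most valuable single setting to tune.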
Latency Optimization Techniques
- Speculative STT: Start transcribing before the user finishes speaking, using partial results.
- TTS pre-warming: Keep the TTS model loaded and ready; avoid cold-start per request.
- Response chunking: Split long responses into sentences and start TTS on the first sentence while the chatbot finishes generating.
- Audio compression: Use Opus codec for network transmission to reduce bandwidth and latency.
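Response chunking, for instance, can be sketched with a small sentence splitter that works on a stream of text fragments. This is a naive illustration under the assumption that sentences end in `.`, `!`, or `?` followed by whitespace; it will mis-split abbreviations such as "Dr. Smith", and the function names are hypothetical.

```python
import re
from typing import Iterable, Iterator

# Split after sentence-ending punctuation that is followed by whitespace
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> list[str]:
    """Split a complete response into sentences."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


def stream_sentences(fragments: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as it is complete in the incoming text."""
    buffer = ""
    for fragment in fragments:
        buffer += fragment
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains at the end
```

Each yielded sentence can be handed to the TTS engine immediately, so the first sentence plays while later ones are still being generated.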
Error Recovery
Voice interactions need graceful error handling:
class VoiceErrorHandler:
    def __init__(self, max_retries: int = 2):
        self.max_retries = max_retries
        self.retry_count = 0

    def handle_empty_transcript(self) -> str:
        self.retry_count += 1
        if self.retry_count <= self.max_retries:
            return "I didn't catch that. Could you say it again?"
        self.retry_count = 0
        return "I'm having trouble hearing you. Try moving closer to the microphone."

    def handle_success(self) -> None:
        # Call after a usable transcript so earlier failures do not count against later turns
        self.retry_count = 0

    def handle_low_confidence(self, transcript: str, confidence: float) -> str:
        if confidence < 0.5:
            return f"Did you say '{transcript}'?"
        return transcript  # Accept and proceed

    def handle_timeout(self) -> str:
        return "Are you still there? Say something if you need help."
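One limitation of returning a plain string from the low-confidence handler is that the caller cannot tell a confirmation question from an accepted transcript. A possible alternative, sketched here with hypothetical names (`Decision`, `route_transcript`) and the same 0.5 threshold, returns an explicit action alongside the prompt.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    action: str  # "accept", "confirm", or "reprompt"
    prompt: str  # what to say to the user ("" when accepting)


def route_transcript(transcript: str, confidence: float, confirm_below: float = 0.5) -> Decision:
    """Decide how to proceed based on the transcript and its confidence."""
    text = transcript.strip()
    if not text:
        return Decision("reprompt", "I didn't catch that. Could you say it again?")
    if confidence < confirm_below:
        return Decision("confirm", f"Did you say '{text}'?")
    return Decision("accept", "")
```

The dialog loop can then branch on `decision.action` and only forward accepted transcripts to the chatbot.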
The one thing to remember: A production voice assistant optimizes every millisecond across the STT→chatbot→TTS pipeline, using streaming processing, VAD-based endpoint detection, and speculative execution to keep total response time under two seconds.