Python Voice Assistant Integration — Deep Dive

Build production voice assistants in Python with streaming STT, low-latency TTS, wake word detection, and real-time audio pipelines.

Voice Pipeline Architecture

A production voice assistant is an audio pipeline with three stages: capture → process → synthesize. Each stage introduces latency and potential errors. The engineering challenge is minimizing both while maintaining a natural conversational feel.

Microphone → VAD → STT → NLU + Dialog → NLG → TTS → Speaker
              ↑                                        |
              └────── Turn management ─────────────────┘

Speech-to-Text (STT)

Local STT with Whisper

OpenAI’s Whisper model provides excellent accuracy without sending audio to the cloud:

import whisper
import numpy as np
import sounddevice as sd

model = whisper.load_model("base")  # tiny, base, small, medium, large

def record_audio(duration: float = 5.0, sample_rate: int = 16000) -> np.ndarray:
    audio = sd.rec(
        int(duration * sample_rate),
        samplerate=sample_rate,
        channels=1,
        dtype="float32",
    )
    sd.wait()
    return audio.flatten()

def transcribe(audio: np.ndarray) -> str:
    result = model.transcribe(audio, language="en", fp16=False)
    return result["text"].strip()

Whisper model sizes and tradeoffs:

Model	Parameters	VRAM	Speed (rel.)	WER (English)
tiny	39M	~1 GB	~32x	7.6%
base	74M	~1 GB	~16x	5.0%
small	244M	~2 GB	~6x	3.4%
medium	769M	~5 GB	~2x	2.9%
large	1.5B	~10 GB	1x	2.7%

For real-time applications, tiny or base provide the best latency-accuracy tradeoff.
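
Memory is the other constraint when choosing a size. As a rough sketch, the helper below picks the largest model whose approximate VRAM figure from the table fits a given budget. The function name and the lookup list are illustrative, not part of the whisper package, and the figures are the approximate ones quoted above.

```python
from typing import Optional

# Approximate VRAM needs in GB, copied from the table above, smallest model first.
WHISPER_VRAM_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large", 10),
]

def largest_model_that_fits(vram_gb: float) -> Optional[str]:
    """Return the largest model whose approximate VRAM need fits the budget."""
    fitting = [name for name, need in WHISPER_VRAM_GB if need <= vram_gb]
    return fitting[-1] if fitting else None

print(largest_model_that_fits(4))
```

Fitting in memory says nothing about speed, so for real-time use the result is an upper bound and a smaller model may still be the better choice.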

Streaming STT with faster-whisper

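The faster-whisper library uses CTranslate2 for optimized inference. Its transcribe() call returns a lazy generator of segments, so decoding runs as you iterate and text can be consumed one segment at a time: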

from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu", compute_type="int8")

def transcribe_streaming(audio_path: str) -> str:
    segments, info = model.transcribe(
        audio_path,
        beam_size=5,
        language="en",
        vad_filter=True,
        vad_parameters=dict(min_silence_duration_ms=500),
    )
    return " ".join(segment.text for segment in segments)
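
Because the segments arrive lazily, partial text can be surfaced before the whole file is decoded. The sketch below yields a growing transcript per segment. The helper name running_transcript is hypothetical, and the fake segments are stand-ins for the objects produced by model.transcribe(), of which only the .text attribute is used here.

```python
from types import SimpleNamespace
from typing import Iterable, Iterator

def running_transcript(segments: Iterable) -> Iterator[str]:
    """Yield the transcript so far each time a new non-empty segment arrives."""
    parts = []
    for segment in segments:
        text = segment.text.strip()
        if text:
            parts.append(text)
            yield " ".join(parts)

# Stand-in for the generator returned by model.transcribe(...)
fake_segments = (
    SimpleNamespace(text=t) for t in [" Book a flight", " to Boston.", " "]
)
for partial in running_transcript(fake_segments):
    print(partial)
```

A UI or a downstream intent matcher could act on each partial string instead of waiting for the final join.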

Cloud STT with Google

A cloud recognizer offers broad language coverage and phrase biasing, at the cost of sending audio over the network:

from google.cloud import speech_v1

def transcribe_google(audio_bytes: bytes, sample_rate: int = 16000) -> str:
    client = speech_v1.SpeechClient()
    audio = speech_v1.RecognitionAudio(content=audio_bytes)
    config = speech_v1.RecognitionConfig(
        encoding=speech_v1.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=sample_rate,
        language_code="en-US",
        model="latest_long",
        enable_automatic_punctuation=True,
        speech_contexts=[
            speech_v1.SpeechContext(
                phrases=["book a flight", "cancel reservation", "check status"],
                boost=15.0,
            )
        ],
    )
    response = client.recognize(config=config, audio=audio)
    return " ".join(r.alternatives[0].transcript for r in response.results)

The speech_contexts parameter biases recognition toward domain-specific phrases, which helps with specialized vocabulary that a general model would otherwise mishear.

Voice Activity Detection (VAD)

Silero VAD

Silero VAD is a lightweight model that detects speech segments in audio:

import torch

model_vad, utils = torch.hub.load(
    "snakers4/silero-vad", "silero_vad", force_reload=False
)
get_speech_timestamps, _, read_audio, *_ = utils

def detect_speech_segments(audio_path: str) -> list[dict]:
    wav = read_audio(audio_path, sampling_rate=16000)
    timestamps = get_speech_timestamps(wav, model_vad, sampling_rate=16000)
    return timestamps  # [{"start": 1000, "end": 5000}, ...]
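
By default the start and end values are sample indices, not milliseconds, so they usually need converting before they are compared with durations elsewhere in the pipeline. A minimal sketch, assuming the 16 kHz rate used above (both helper names are hypothetical):

```python
def timestamps_to_seconds(timestamps, sample_rate=16000):
    """Convert sample-index timestamps to seconds."""
    return [
        {"start": ts["start"] / sample_rate, "end": ts["end"] / sample_rate}
        for ts in timestamps
    ]

def total_speech_seconds(timestamps, sample_rate=16000):
    """Total duration of detected speech across all segments, in seconds."""
    return sum(ts["end"] - ts["start"] for ts in timestamps) / sample_rate

example = [{"start": 16000, "end": 48000}]
print(timestamps_to_seconds(example))
print(total_speech_seconds(example))
```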

Endpoint Detection for Turn Management

Detecting when the user has finished speaking is critical. Too early and you cut them off; too late and the conversation feels sluggish:

class EndpointDetector:
    def __init__(self, silence_threshold_ms: int = 700, min_speech_ms: int = 300):
        self.silence_threshold = silence_threshold_ms
        self.min_speech = min_speech_ms
        self.speech_started = False
        self.speech_duration_ms = 0
        self.silence_duration_ms = 0

    def process_frame(self, is_speech: bool, frame_duration_ms: int = 30) -> str:
        if is_speech:
            self.speech_started = True
            self.speech_duration_ms += frame_duration_ms
            self.silence_duration_ms = 0
            return "speaking"
        elif self.speech_started:
            self.silence_duration_ms += frame_duration_ms
            if (self.silence_duration_ms >= self.silence_threshold
                    and self.speech_duration_ms >= self.min_speech):
                self.reset()
                return "endpoint"
            return "pause"
        return "silence"

    def reset(self):
        self.speech_started = False
        self.speech_duration_ms = 0
        self.silence_duration_ms = 0
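
Since silence is counted in whole frames, the detector fires on the first frame at which the accumulated silence reaches the threshold, which can be slightly later than the configured value. The arithmetic can be sketched as follows (helper names are illustrative):

```python
import math

def frames_until_endpoint(silence_threshold_ms: int = 700,
                          frame_duration_ms: int = 30) -> int:
    """Number of consecutive silent frames needed before an endpoint fires."""
    return math.ceil(silence_threshold_ms / frame_duration_ms)

def effective_silence_ms(silence_threshold_ms: int = 700,
                         frame_duration_ms: int = 30) -> int:
    """Actual silence duration at the moment the endpoint fires."""
    return frames_until_endpoint(silence_threshold_ms, frame_duration_ms) * frame_duration_ms

print(frames_until_endpoint(700, 30), effective_silence_ms(700, 30))
```

With the defaults above, 24 silent frames are required, so the endpoint fires after 720 ms of silence rather than 700 ms. Shorter frames narrow that gap.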

Text-to-Speech (TTS)

Offline TTS with pyttsx3

Simple, no-network-required TTS:

import pyttsx3

engine = pyttsx3.init()
engine.setProperty("rate", 175)  # Words per minute
engine.setProperty("volume", 0.9)

# Pick a voice; which voices exist, and in what order, depends on the OS
voices = engine.getProperty("voices")
if len(voices) > 1:
    engine.setProperty("voice", voices[1].id)  # often a female voice, not guaranteed

def speak(text: str):
    engine.say(text)
    engine.runAndWait()

Neural TTS with Coqui

High-quality open-source neural TTS:

from TTS.api import TTS

tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC", progress_bar=False)

def synthesize(text: str, output_path: str = "response.wav"):
    tts.tts_to_file(text=text, file_path=output_path)
    return output_path

Cloud TTS with Streaming

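Cloud neural voices give high quality without a local model. The example below requests the whole clip in a single synthesize_speech call and then plays it. To cut perceived latency, split long responses into sentences and synthesize them one at a time, so playback can start on the first sentence: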

import io
import wave

import pyaudio
from google.cloud import texttospeech_v1

def stream_tts(text: str):
    client = texttospeech_v1.TextToSpeechClient()
    synthesis_input = texttospeech_v1.SynthesisInput(text=text)
    voice = texttospeech_v1.VoiceSelectionParams(
        language_code="en-US",
        name="en-US-Neural2-F",
    )
    audio_config = texttospeech_v1.AudioConfig(
        audio_encoding=texttospeech_v1.AudioEncoding.LINEAR16,
        sample_rate_hertz=24000,
    )

    response = client.synthesize_speech(
        input=synthesis_input, voice=voice, audio_config=audio_config
    )

    # LINEAR16 responses include a WAV header; decode it so the header
    # bytes are not played as a click at the start of the audio
    with wave.open(io.BytesIO(response.audio_content), "rb") as wav:
        pcm = wav.readframes(wav.getnframes())
        rate = wav.getframerate()

    # Play the raw 16-bit mono PCM
    p = pyaudio.PyAudio()
    stream = p.open(format=pyaudio.paInt16, channels=1, rate=rate, output=True)
    try:
        stream.write(pcm)
    finally:
        stream.stop_stream()
        stream.close()
        p.terminate()

Complete Voice Assistant Pipeline

Async Pipeline with Latency Optimization

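import asyncio
from dataclasses import dataclass

import numpy as np
import sounddevice as sd
from faster_whisper import WhisperModel

# EndpointDetector and synthesize() come from the earlier sections;
# detect_speech() and play_audio() are helper functions you supply.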

@dataclass
class VoiceConfig:
    stt_model: str = "base"
    tts_model: str = "neural"
    silence_threshold_ms: int = 700
    sample_rate: int = 16000

class VoiceAssistant:
    def __init__(self, chatbot, config: VoiceConfig = VoiceConfig()):
        self.chatbot = chatbot
        self.config = config
        self.stt = WhisperModel(config.stt_model, device="cpu", compute_type="int8")
        self.vad = EndpointDetector(silence_threshold_ms=config.silence_threshold_ms)
        self.is_listening = False

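    async def process_utterance(self, audio: np.ndarray) -> str:
        # STT is blocking and CPU-bound, so run it off the event loop
        def _transcribe() -> str:
            segments, _ = self.stt.transcribe(audio, beam_size=3, language="en")
            return " ".join(s.text for s in segments).strip()

        transcript = await asyncio.to_thread(_transcribe)

        if not transcript:
            return ""

        # Chatbot processing
        response = await self.chatbot.handle(transcript)

        # TTS (synthesize audio to a file, also in a worker thread)
        audio_path = await asyncio.to_thread(synthesize, response.text)

        return audio_path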

    async def conversation_loop(self):
        print("Voice assistant ready. Speak now...")
        self.is_listening = True

        while self.is_listening:
            # Record until endpoint detected
            audio = await self._record_until_endpoint()
            if audio is None:
                continue

            # Process and respond
            response_audio = await self.process_utterance(audio)
            if response_audio:
                play_audio(response_audio)

    async def _record_until_endpoint(self) -> np.ndarray | None:
        frames = []
        frame_duration_ms = 30
        frame_size = int(self.config.sample_rate * frame_duration_ms / 1000)

        stream = sd.InputStream(
            samplerate=self.config.sample_rate, channels=1, dtype="float32"
        )
        stream.start()

        try:
            while True:
                # stream.read() blocks, so run it in a worker thread to keep
                # the event loop responsive
                audio_frame, _ = await asyncio.to_thread(stream.read, frame_size)
                is_speech = detect_speech(audio_frame)  # frame-level VAD, supplied separately
                status = self.vad.process_frame(is_speech, frame_duration_ms)

                if status in ("speaking", "pause"):
                    frames.append(audio_frame)
                elif status == "endpoint":
                    break
        finally:
            stream.stop()
            stream.close()

        if frames:
            return np.concatenate(frames).flatten()
        return None
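
The pipeline calls detect_speech() on each frame but does not define it. One possible stand-in is a simple energy gate, sketched below. The 0.01 RMS threshold is an arbitrary assumption for float32 audio in the range -1 to 1 and needs tuning per microphone; a model-based VAD such as Silero will be more robust in noisy rooms.

```python
import math

def detect_speech(frame, threshold: float = 0.01) -> bool:
    """Hypothetical energy-based VAD: True when the frame's RMS exceeds threshold."""
    # Accept NumPy arrays of any shape as well as plain sequences of floats.
    samples = frame.ravel().tolist() if hasattr(frame, "ravel") else list(frame)
    if not samples:
        return False
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms > threshold

print(detect_speech([0.0] * 480), detect_speech([0.1, -0.1] * 240))
```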

Wake Word Detection

Using Porcupine

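import struct

import pvporcupine
import pyaudio

def listen_for_wake_word(access_key: str, keyword: str = "porcupine"):
    # Porcupine 2.x and later require a Picovoice AccessKey
    porcupine = pvporcupine.create(access_key=access_key, keywords=[keyword])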
    pa = pyaudio.PyAudio()
    stream = pa.open(
        rate=porcupine.sample_rate,
        channels=1,
        format=pyaudio.paInt16,
        input=True,
        frames_per_buffer=porcupine.frame_length,
    )

    print(f"Listening for '{keyword}'...")
    try:
        while True:
            pcm = stream.read(porcupine.frame_length)
            pcm = struct.unpack_from("h" * porcupine.frame_length, pcm)
            keyword_index = porcupine.process(pcm)
            if keyword_index >= 0:
                print("Wake word detected!")
                return True
    finally:
        stream.close()
        pa.terminate()
        porcupine.delete()

Latency Budget

For a natural-feeling voice assistant, target under 2 seconds total response time:

Stage	Target	Optimization
Audio capture	700ms	Tune the endpoint silence threshold
STT	200-500ms	Use tiny/base Whisper, int8 quantization
Chatbot	50-200ms	Template responses, cached model
TTS	200-500ms	Stream audio, cache audio for common responses
Total	~1.5s
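
Summing the per-stage targets shows where the roughly 1.5 s figure sits. The sketch below encodes the table as (low, high) millisecond ranges and also flags stages whose measured time exceeds the upper target; the dictionary keys and function names are illustrative.

```python
# (low, high) targets in milliseconds, taken from the table above
BUDGET_MS = {
    "audio_capture": (700, 700),
    "stt": (200, 500),
    "chatbot": (50, 200),
    "tts": (200, 500),
}

def total_budget_ms(budget):
    """Return the (best case, worst case) end-to-end latency in milliseconds."""
    best = sum(low for low, _ in budget.values())
    worst = sum(high for _, high in budget.values())
    return best, worst

def over_budget(measured_ms, budget):
    """List the stages whose measured latency exceeds the upper target."""
    return [stage for stage, ms in measured_ms.items() if ms > budget[stage][1]]

print(total_budget_ms(BUDGET_MS))
print(over_budget({"stt": 650, "tts": 300}, BUDGET_MS))
```

The targets add up to between 1150 ms and 1900 ms, so hitting the upper end of every stage still leaves only about 100 ms of headroom under the two-second goal.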

Latency Optimization Techniques

Speculative STT: Start transcribing before the user finishes speaking, using partial results.
TTS pre-warming: Keep the TTS model loaded and ready; avoid cold-start per request.
Response chunking: Split long responses into sentences and start TTS on the first sentence while the chatbot finishes generating.
Audio compression: Use Opus codec for network transmission to reduce bandwidth and latency.
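
Response chunking, for example, can be sketched as a generator that emits each sentence as soon as it is complete, so TTS can start while later text is still arriving. The function name and the punctuation-based splitting rule are assumptions; a naive rule like this will also split after abbreviations such as "Dr.".

```python
import re
from typing import Iterable, Iterator

# Split on whitespace that follows sentence-ending punctuation
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def sentences_from_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a stream of text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last ends with sentence punctuation
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]  # possibly incomplete; keep for the next chunk
    if buffer.strip():
        yield buffer.strip()

chunks = ["Your flight is ", "booked. It leaves", " at 9 am. Anything else?"]
for sentence in sentences_from_stream(chunks):
    print(sentence)
```

Each yielded sentence can be handed to the TTS stage immediately, while the chatbot keeps producing the rest of the response.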

Error Recovery

Voice interactions need graceful error handling:

class VoiceErrorHandler:
    def __init__(self, max_retries: int = 2):
        self.max_retries = max_retries
        self.retry_count = 0

    def handle_empty_transcript(self) -> str:
        self.retry_count += 1
        if self.retry_count <= self.max_retries:
            return "I didn't catch that. Could you say it again?"
        self.retry_count = 0
        return "I'm having trouble hearing you. Try moving closer to the microphone."

    def handle_low_confidence(self, transcript: str, confidence: float) -> str:
        if confidence < 0.5:
            return f"Did you say '{transcript}'?"
        return transcript  # Accept and proceed

    def handle_timeout(self) -> str:
        return "Are you still there? Say something if you need help."
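
handle_low_confidence() needs a confidence score, which Whisper-style models do not report directly. One heuristic, sketched here under that assumption, is to convert each segment's average token log-probability into a probability and weight it by segment duration. It relies on the start, end, and avg_logprob fields of faster-whisper segments; the function name is hypothetical, and the result is a rough signal rather than a calibrated probability.

```python
import math
from types import SimpleNamespace

def estimate_confidence(segments) -> float:
    """Duration-weighted mean of exp(avg_logprob); 0.0 when there is no speech."""
    total = 0.0
    weighted = 0.0
    for seg in segments:
        duration = max(seg.end - seg.start, 0.0)
        weighted += math.exp(seg.avg_logprob) * duration
        total += duration
    return weighted / total if total > 0 else 0.0

# Stand-ins for faster-whisper segments
fake = [
    SimpleNamespace(start=0.0, end=1.0, avg_logprob=0.0),
    SimpleNamespace(start=1.0, end=2.0, avg_logprob=math.log(0.5)),
]
print(round(estimate_confidence(fake), 2))
```

The 0.5 cutoff in handle_low_confidence() would then need tuning against real recordings, since this score tends to move with audio quality and speaking style.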

The one thing to remember: A production voice assistant optimizes every millisecond across the STT→chatbot→TTS pipeline, using streaming processing, VAD-based endpoint detection, and speculative execution to keep total response time under two seconds.
