Python Text to Speech pyttsx3 — Deep Dive

Driver architecture

pyttsx3 uses a driver abstraction to interface with platform-specific speech engines. When you call pyttsx3.init(driverName=None), the library auto-detects the platform and loads the appropriate driver:

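  • sapi5 (Windows) — wraps the COM-based Speech API 5. Drives the speech engine through a SAPI.SpVoice COM object (recent pyttsx3 releases create it with comtypes; older ones used win32com.client.Dispatch("SAPI.SpVoice")). Supports SAPI-compliant voices from Microsoft and third-party vendors like Ivona. OneCore voices in Windows 10+ typically show up only once they are registered as SAPI5 voices.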
  • nsss (macOS) — wraps NSSpeechSynthesizer from the AppKit framework via PyObjC. Accesses system voices including the high-quality “enhanced” and “premium” variants.
  • espeak (Linux) — wraps the espeak-ng command-line synthesizer. Communicates by spawning processes or using the espeak C library via ctypes.

You can force a specific driver: pyttsx3.init(driverName='espeak'). This is useful on systems where multiple engines are available.
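
One way to make the driver choice explicit is to map sys.platform to a driver name yourself and fall back to auto-detection if that driver fails to load. This is a minimal sketch that assumes the three driver names listed above; pick_driver and init_engine are hypothetical helper names, not part of pyttsx3.

```python
import sys

# Driver names as listed above; the mapping itself is an illustrative assumption.
_DRIVERS = {"win32": "sapi5", "darwin": "nsss"}


def pick_driver(platform=None):
    """Return the driver name that matches a sys.platform string."""
    platform = sys.platform if platform is None else platform
    # Anything that is not Windows or macOS is treated as an espeak platform.
    return _DRIVERS.get(platform, "espeak")


def init_engine(preferred=None):
    """Try a specific driver first, then fall back to auto-detection."""
    import pyttsx3  # imported lazily so pick_driver works without it

    try:
        return pyttsx3.init(driverName=preferred or pick_driver())
    except Exception:
        # Driver missing or failed to load: let pyttsx3 choose.
        return pyttsx3.init()
```

Calling init_engine("espeak") would request espeak explicitly, while init_engine() uses the platform mapping.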

Engine singleton and threading

pyttsx3.init() caches engines per driver name: repeated calls with the same driverName return the same Engine instance for as long as a reference to it is alive. This matters for threading because the engine is not thread-safe. Calling say() or runAndWait() from multiple threads simultaneously can cause crashes or deadlocks.

A safe pattern for multi-threaded applications:

import queue
import threading

import pyttsx3

class TTSWorker:
    def __init__(self):
        self.queue = queue.Queue()
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()
    
    def _run(self):
        engine = pyttsx3.init()
        while True:
            text = self.queue.get()
            if text is None:
                break
            engine.say(text)
            engine.runAndWait()
    
    def speak(self, text):
        self.queue.put(text)
    
    def stop(self):
        self.queue.put(None)

tts = TTSWorker()
tts.speak("Thread-safe speech")

This dedicates one thread to TTS and funnels all requests through a queue. The engine lives entirely within that thread.

Voice management in depth

Voices expose several properties beyond name and ID:

import pyttsx3

engine = pyttsx3.init()
for voice in engine.getProperty('voices'):
    print(f"ID:       {voice.id}")
    print(f"Name:     {voice.name}")
    print(f"Languages: {voice.languages}")
    print(f"Gender:   {voice.gender}")
    print(f"Age:      {voice.age}")

On Windows, voice IDs are registry paths like HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Speech\Voices\Tokens\TTS_MS_EN-US_DAVID_11.0. On macOS, they are bundle identifiers like com.apple.speech.synthesis.voice.Alex.

To filter voices by language:

voices = engine.getProperty('voices')
english_voices = [v for v in voices if any('en' in str(lang) for lang in v.languages)]
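
The substring test above can also match tags that merely contain "en" somewhere. A slightly stricter sketch normalizes each language entry and matches on a prefix instead. It assumes that language entries may arrive as either str or bytes depending on the driver; language_tags and voices_for_language are hypothetical helpers, and the demo voices are stand-in objects rather than real pyttsx3 Voice instances.

```python
from types import SimpleNamespace


def language_tags(voice):
    """Return a voice's language entries as lowercase strings."""
    tags = []
    for lang in getattr(voice, "languages", None) or []:
        if isinstance(lang, bytes):
            # Assumption: byte entries may carry a leading non-printable
            # priority byte, so keep only the printable characters.
            lang = lang.decode("utf-8", errors="ignore")
        tags.append("".join(ch for ch in str(lang) if ch.isprintable()).lower())
    return tags


def voices_for_language(voices, prefix):
    """Keep voices with at least one tag that starts with prefix (e.g. 'en')."""
    prefix = prefix.lower()
    return [v for v in voices
            if any(tag.startswith(prefix) for tag in language_tags(v))]


# Stand-in voice objects so the sketch runs without a speech engine.
demo = [
    SimpleNamespace(name="A", languages=["en_US"]),
    SimpleNamespace(name="B", languages=[b"\x05de"]),
    SimpleNamespace(name="C", languages=[]),
]
print([v.name for v in voices_for_language(demo, "en")])
```

With a real engine, voices_for_language(engine.getProperty('voices'), "en") would be the equivalent call.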

Installing additional voices varies by platform:

  • Windows — Settings → Time & Language → Speech → Add voices
  • macOS — System Preferences → Accessibility → Spoken Content → System Voice → Manage Voices
  • Linux — sudo apt install espeak-ng-data for additional language packs

Rate and prosody tuning

Speech rate is measured in words per minute. The default is approximately 200 WPM:

# Comfortable listening speed
engine.setProperty('rate', 160)

# Fast narration
engine.setProperty('rate', 250)

# Slow, deliberate speech
engine.setProperty('rate', 100)
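
Fixed numbers like these can also be expressed relative to whatever rate the engine currently reports. The sketch below scales the current value and clamps the result; scaled_rate, apply_relative_rate, and the bounds of 80 and 400 WPM are illustrative assumptions, not limits enforced by pyttsx3.

```python
def scaled_rate(current, factor, lowest=80, highest=400):
    """Scale a words-per-minute value and clamp it to a sane range."""
    return max(lowest, min(highest, round(current * factor)))


def apply_relative_rate(engine, factor):
    """Read the engine's current rate and set a scaled version of it."""
    new_rate = scaled_rate(engine.getProperty("rate"), factor)
    engine.setProperty("rate", new_rate)
    return new_rate


print(scaled_rate(200, 0.8))   # 160
print(scaled_rate(200, 5.0))   # clamped to 400
```

apply_relative_rate(engine, 0.8) would then slow any engine to 80% of its current rate.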

eSpeak itself supports pitch adjustment through the -p flag of its command-line tool (0 to 99, default 50), though pyttsx3 does not expose this directly. A workaround on Linux is to call espeak-ng yourself:

import subprocess

def speak_with_pitch(text, pitch=50, rate=150):
    subprocess.run(['espeak-ng', '-p', str(pitch), '-s', str(rate), text])

SSML and pronunciation control

pyttsx3 does not support SSML (Speech Synthesis Markup Language) directly. However, the underlying SAPI5 engine on Windows does. You can pass SSML to it by driving the COM object yourself, bypassing pyttsx3 (this requires the pywin32 package):

import win32com.client

speaker = win32com.client.Dispatch("SAPI.SpVoice")
# SSML requires xml:lang on the root <speak> element.
ssml = '''<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <prosody rate="slow" pitch="low">
        This is slow and low-pitched speech.
    </prosody>
    <break time="500ms"/>
    <say-as interpret-as="spell-out">API</say-as>
</speak>'''
speaker.Speak(ssml, 8)  # 8 = SVSFIsXML: parse the input as XML markup

For cross-platform pronunciation hints, pre-process text to replace abbreviations and symbols:

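import re

PRONUNCIATIONS = {
    "API": "A P I",
    "SQL": "sequel",
    "GIF": "jif",
    "etc.": "et cetera",
    "e.g.": "for example",
}

# Match whole tokens only, so "API" is not rewritten inside "RAPID".
# Longer keys come first so overlapping entries resolve predictably.
_PATTERN = re.compile("|".join(
    rf"(?<!\w){re.escape(key)}(?!\w)"
    for key in sorted(PRONUNCIATIONS, key=len, reverse=True)
))

def preprocess(text):
    return _PATTERN.sub(lambda m: PRONUNCIATIONS[m.group(0)], text)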

File output strategies

save_to_file only queues a synthesis job; the audio file is actually written when runAndWait() processes the queue:

engine.save_to_file("Chapter one content here", "chapter1.wav")
engine.save_to_file("Chapter two content here", "chapter2.wav")
engine.runAndWait()

Both files are written during the single runAndWait() call. Output format depends on the driver:

  • SAPI5 — WAV by default; some voices support MP3 via stream configuration
  • nsss — AIFF by default
  • espeak — WAV by default

For consistent MP3 output, generate WAV first and convert with pydub (which needs ffmpeg or libav installed for MP3 export):

from pydub import AudioSegment

engine.save_to_file(text, "temp.wav")
engine.runAndWait()

audio = AudioSegment.from_wav("temp.wav")
audio.export("output.mp3", format="mp3", bitrate="192k")

Batch audiobook generation

Generate an audiobook from structured text:

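import json
from pathlib import Path

import pyttsx3

# Expects a JSON list of objects, each with a "text" field.
chapters = json.loads(Path("chapters.json").read_text(encoding="utf-8"))
Path("audiobook").mkdir(exist_ok=True)  # the output directory must exist

engine = pyttsx3.init()
engine.setProperty('rate', 160)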

for i, chapter in enumerate(chapters):
    filename = f"audiobook/chapter_{i+1:02d}.wav"
    engine.save_to_file(chapter['text'], filename)

engine.runAndWait()

For long texts, split into paragraphs and add pauses between them:

def text_to_utterances(text, pause_ms=800):
    paragraphs = text.strip().split('\n\n')
    utterances = []
    for p in paragraphs:
        utterances.append(p.strip())
        utterances.append(f"[pause:{pause_ms}]")  # marker for post-processing
    return utterances

Since pyttsx3 does not support inline pauses natively, generate separate audio files per paragraph and concatenate with Pydub, inserting silence segments between them.
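
The concatenation step can be sketched with the standard-library wave module instead of pydub, which keeps the example free of third-party dependencies. It assumes every input is an uncompressed PCM WAV file with the same channel count, sample width, and sample rate (so it would not apply directly to AIFF output); concat_wavs is a hypothetical helper name.

```python
import wave


def concat_wavs(paths, out_path, pause_ms=800):
    """Join WAV files, inserting pause_ms of silence between them."""
    params = None
    chunks = []
    for path in paths:
        with wave.open(str(path), "rb") as src:
            fmt = (src.getnchannels(), src.getsampwidth(), src.getframerate())
            if params is None:
                params = fmt
            elif fmt != params:
                raise ValueError(f"{path} has a different audio format")
            chunks.append(src.readframes(src.getnframes()))

    if params is None:
        raise ValueError("no input files given")

    channels, width, rate = params
    # Zero bytes are silence for signed PCM (16-bit and wider).
    silence = b"\x00" * (channels * width * int(rate * pause_ms / 1000))

    with wave.open(str(out_path), "wb") as dst:
        dst.setnchannels(channels)
        dst.setsampwidth(width)
        dst.setframerate(rate)
        # Silence goes between paragraphs only, not after the last one.
        dst.writeframes(silence.join(chunks))
```

A call such as concat_wavs(["p1.wav", "p2.wav"], "chapter.wav", pause_ms=800) would produce one file with an 800 ms gap between the two paragraphs.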

Integration with speech recognition

Build a conversational loop combining pyttsx3 with the SpeechRecognition library:

import speech_recognition as sr
import pyttsx3

recognizer = sr.Recognizer()
engine = pyttsx3.init()

def speak(text):
    engine.say(text)
    engine.runAndWait()

def listen():
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source, timeout=5)
        return recognizer.recognize_google(audio)

speak("What would you like to know?")
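while True:
    try:
        query = listen()
        # generate_answer is a placeholder for your own response logic
        response = generate_answer(query)
        speak(response)
    except sr.WaitTimeoutError:
        speak("I didn't hear anything. Try again.")
    except sr.UnknownValueError:
        speak("Sorry, I couldn't understand that.")
    except sr.RequestError:
        # recognize_google needs network access to reach the recognition service
        speak("The speech service is unavailable right now.")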

Performance characteristics

Synthesis latency varies by platform and voice:

Platform        Voice type       Latency (first word)   Throughput
Windows SAPI5   Standard         ~50 ms                 Fast
Windows SAPI5   OneCore Neural   ~200 ms                Moderate
macOS nsss      Enhanced         ~100 ms                Fast
Linux espeak    Default          ~20 ms                 Very fast

For latency-sensitive applications, pre-generate common responses and cache the audio files. Only synthesize dynamically for unpredictable content.
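
A simple way to implement such a cache is to key each audio file on a hash of the text plus the settings that affect the output. In this sketch, cache_key and cached_audio are hypothetical helpers, the tts_cache directory name is arbitrary, and synthesize is a caller-supplied function that writes audio for a text to a given path.

```python
import hashlib
from pathlib import Path


def cache_key(text, voice_id="", rate=0):
    """Hash everything that changes the audio, so settings never collide."""
    raw = f"{voice_id}|{rate}|{text}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:16]


def cached_audio(text, synthesize, cache_dir="tts_cache", voice_id="", rate=0):
    """Return the path of a cached file, synthesizing only on a miss.

    synthesize(text, path) is supplied by the caller (hypothetical hook).
    """
    folder = Path(cache_dir)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{cache_key(text, voice_id, rate)}.wav"
    if not path.exists():
        synthesize(text, str(path))
    return path
```

With pyttsx3, synthesize could be a small function that calls engine.save_to_file(text, path) followed by engine.runAndWait().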

Common issues and fixes

  • Engine hangs on runAndWait() — usually caused by calling from a thread that is not the engine’s owner thread; use the queue pattern described above
  • No voices found on Linux — install espeak-ng: sudo apt install espeak-ng
  • Garbled audio on macOS — ensure only one NSSpeechSynthesizer instance exists; the singleton pattern in pyttsx3 handles this, but creating multiple engines via different methods breaks it
  • Rate changes not taking effect — set properties before calling say(), not after
  • Memory with very long texts — break text into chunks of a few hundred words each to keep memory usage stable
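
The chunking advice in the last item can be sketched as a splitter that prefers sentence boundaries. The sentence detection here is deliberately naive (it breaks after ., ! or ? followed by whitespace), and chunk_text with its default of 300 words is an illustrative assumption rather than anything pyttsx3 provides.

```python
import re


def chunk_text(text, max_words=300):
    """Split text into chunks of at most max_words, preferring sentence ends."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        words = sentence.split()
        if not words:
            continue
        # Start a new chunk when this sentence would overflow the current one.
        if current and count + len(words) > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        # A single sentence longer than max_words is cut at word boundaries.
        while len(words) > max_words:
            chunks.append(" ".join(words[:max_words]))
            words = words[max_words:]
        current.extend(words)
        count += len(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk can then be passed to engine.say() (or save_to_file) in turn, with runAndWait() called after each one.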

Alternatives comparison

Library      Offline   Quality          Languages   Setup
pyttsx3      Yes       OS-dependent     Many        Minimal
gTTS         No        Good (Google)    50+         Minimal (needs internet)
Coqui TTS    Yes       Neural quality   20+         Heavy (models)
Azure TTS    No        Excellent        100+        API key + cost
ElevenLabs   No        Premium          29+         API key + cost

pyttsx3 wins on simplicity and offline operation. For production applications where voice quality matters, neural TTS engines are worth the extra complexity.

The one thing to remember: pyttsx3’s value is offline TTS with a three-line API and no cloud service behind it, but production use requires thread-safe wrappers, text preprocessing for pronunciation, and often a pydub post-processing step for consistent audio output across platforms.
