Python Text to Speech pyttsx3 — Deep Dive
Driver architecture
pyttsx3 uses a driver abstraction to interface with platform-specific speech engines. When you call pyttsx3.init(driverName=None), the library auto-detects the platform and loads the appropriate driver:
- sapi5 (Windows): wraps the COM-based Speech API 5. Communicates with the speech engine via win32com.client.Dispatch("SAPI.SpVoice"). Supports SAPI-compliant voices from Microsoft, third-party vendors like Ivona, and OneCore voices in Windows 10+.
- nsss (macOS): wraps NSSpeechSynthesizer from the AppKit framework via PyObjC. Accesses system voices including the high-quality “enhanced” and “premium” variants.
- espeak (Linux): wraps the espeak-ng command-line synthesizer. Communicates by spawning processes or using the espeak C library via ctypes.
You can force a specific driver: pyttsx3.init(driverName='espeak'). This is useful on systems where multiple engines are available.
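The auto-detection described above amounts to a small platform lookup. The sketch below makes that choice explicit so it can be logged or overridden. default_driver_name is a hypothetical helper, not part of pyttsx3, and the mapping is an assumption based on the three drivers listed above.

```python
import sys

# Assumed mapping from sys.platform values to the driver names listed above
_DRIVER_BY_PLATFORM = {"win32": "sapi5", "darwin": "nsss"}


def default_driver_name(platform=None):
    """Return the driver name expected for a platform string (hypothetical helper)."""
    platform = sys.platform if platform is None else platform
    # Anything that is not Windows or macOS falls back to espeak
    return _DRIVER_BY_PLATFORM.get(platform, "espeak")
```

Passing the result as pyttsx3.init(driverName=default_driver_name()) should behave like the default call while keeping the chosen driver visible in your own code.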
Engine singleton and threading
pyttsx3.init() caches engines per driver name: calling it again with the same driverName returns the same Engine instance for as long as a reference to it is alive. This matters for threading because the engine is not thread-safe. Calling say() or runAndWait() from multiple threads at once can cause crashes or deadlocks.
A safe pattern for multi-threaded applications:
import queue
import threading

import pyttsx3

class TTSWorker:
    def __init__(self):
        self.queue = queue.Queue()
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        # Create the engine inside the worker thread so it is only ever used there
        engine = pyttsx3.init()
        while True:
            text = self.queue.get()
            if text is None:  # sentinel: shut the worker down
                break
            engine.say(text)
            engine.runAndWait()

    def speak(self, text):
        self.queue.put(text)

    def stop(self):
        self.queue.put(None)

tts = TTSWorker()
tts.speak("Thread-safe speech")
tts.stop()
tts.thread.join()  # wait for queued speech before the program exits
This dedicates one thread to TTS and funnels all requests through a queue. The engine lives entirely within that thread.
Voice management in depth
Voices expose several properties beyond name and ID:
import pyttsx3

engine = pyttsx3.init()
for voice in engine.getProperty('voices'):
    print(f"ID: {voice.id}")
    print(f"Name: {voice.name}")
    print(f"Languages: {voice.languages}")
    print(f"Gender: {voice.gender}")
    print(f"Age: {voice.age}")
On Windows, voice IDs are registry paths like HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Speech\Voices\Tokens\TTS_MS_EN-US_DAVID_11.0. On macOS, they are bundle identifiers like com.apple.speech.synthesis.voice.Alex.
To filter voices by language:
voices = engine.getProperty('voices')
english_voices = [v for v in voices if any('en' in str(lang) for lang in v.languages)]
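To go from a filtered list to an actual voice change, the matching voice's id can be passed to engine.setProperty('voice', ...). The sketch below wraps the same substring match in a helper that returns an id. find_voice_id is a hypothetical name, and the handling of bytes entries is an assumption: some drivers may report languages as bytes or leave the list empty, so check what your platform returns.

```python
def _as_text(lang):
    """Decode a language entry that may be bytes (assumed for some drivers) or str."""
    if isinstance(lang, bytes):
        return lang.decode("utf-8", errors="ignore")
    return str(lang)


def find_voice_id(voices, language_code):
    """Return the id of the first voice whose languages mention language_code, else None."""
    wanted = language_code.lower()
    for voice in voices:
        if any(wanted in _as_text(lang).lower() for lang in voice.languages):
            return voice.id
    return None
```

A None result means no voice advertised that language, in which case matching on voice.name may be the more practical fallback.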
Installing additional voices varies by platform:
- Windows: Settings → Time & Language → Speech → Add voices
- macOS: System Preferences → Accessibility → Spoken Content → System Voice → Manage Voices
- Linux: sudo apt install espeak-ng-data for additional language packs
Rate and prosody tuning
Speech rate is measured in words per minute. The default is approximately 200 WPM:
# Comfortable listening speed
engine.setProperty('rate', 160)
# Fast narration
engine.setProperty('rate', 250)
# Slow, deliberate speech
engine.setProperty('rate', 100)
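Because the rate is expressed in words per minute, a rough narration time follows from the word count: duration ≈ words / rate × 60 seconds. The sketch below applies that arithmetic. estimate_duration_seconds is a hypothetical helper, and real timings will vary with punctuation pauses and the voice in use.

```python
def estimate_duration_seconds(text, rate_wpm=200):
    """Rough speaking time: word count divided by words per minute, in seconds."""
    if rate_wpm <= 0:
        raise ValueError("rate_wpm must be positive")
    word_count = len(text.split())
    return word_count / rate_wpm * 60


# 400 words at 160 WPM come to roughly 150 seconds of audio
print(estimate_duration_seconds("word " * 400, rate_wpm=160))
```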
The espeak-ng synthesizer also supports pitch adjustment through its -p command-line flag, though pyttsx3 does not expose this directly. A workaround is to call espeak-ng yourself:
import subprocess

def speak_with_pitch(text, pitch=50, rate=150):
    # -p sets pitch (0 to 99, default 50); -s sets speed in words per minute
    subprocess.run(['espeak-ng', '-p', str(pitch), '-s', str(rate), text])
SSML and pronunciation control
pyttsx3 does not support SSML (Speech Synthesis Markup Language) directly. However, the underlying SAPI5 engine on Windows does. You can inject SSML by accessing the COM object:
import win32com.client

SVSF_IS_XML = 8  # SpeechVoiceSpeakFlags.SVSFIsXML: parse the input as XML markup

speaker = win32com.client.Dispatch("SAPI.SpVoice")
ssml = '''<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <prosody rate="slow" pitch="low">
    This is slow and low-pitched speech.
  </prosody>
  <break time="500ms"/>
  <say-as interpret-as="spell-out">API</say-as>
</speak>'''
speaker.Speak(ssml, SVSF_IS_XML)
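When the spoken text comes from user input, characters such as & and < must be escaped before they are embedded in SSML, or the markup will no longer parse. The sketch below builds a minimal document with the standard library's escaping helpers. build_ssml is a hypothetical helper, the default attribute values are assumptions, and xml:lang is included because SSML 1.0 requires it on the speak element.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text, rate="medium", pitch="medium", lang="en-US"):
    """Wrap plain text in a minimal SSML document, escaping XML special characters."""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(lang)}>"
        f"<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)}>"
        f"{escape(text)}"
        "</prosody></speak>"
    )


print(build_ssml("Tom & Jerry <live>", rate="slow"))
```

The returned string can be handed to the SAPI Speak call shown above. Whether a given voice honours every prosody value is engine-dependent.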
For cross-platform pronunciation hints, pre-process text to replace abbreviations and symbols:
PRONUNCIATIONS = {
    "API": "A P I",
    "SQL": "sequel",
    "GIF": "jif",
    "etc.": "et cetera",
    "e.g.": "for example",
}

def preprocess(text):
    for abbrev, spoken in PRONUNCIATIONS.items():
        text = text.replace(abbrev, spoken)
    return text
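One caveat with plain str.replace is that it also matches inside longer words, so "RAPID" would become "RA P ID". The variant below is a sketch that uses a single regular expression with lookarounds so abbreviations are replaced only when they stand alone. preprocess_whole_words is a hypothetical name, and the shortened dictionary is only for illustration.

```python
import re

PRONUNCIATIONS = {
    "API": "A P I",
    "SQL": "sequel",
    "etc.": "et cetera",
}

# One alternation, longest keys first; lookarounds keep matches on whole tokens
_PATTERN = re.compile(
    r"(?<!\w)(?:"
    + "|".join(re.escape(k) for k in sorted(PRONUNCIATIONS, key=len, reverse=True))
    + r")(?!\w)"
)


def preprocess_whole_words(text):
    """Replace abbreviations only when they stand alone, in a single pass."""
    return _PATTERN.sub(lambda m: PRONUNCIATIONS[m.group(0)], text)


print(preprocess_whole_words("The RAPID API uses SQL, etc."))
# The RAPID A P I uses sequel, et cetera
```

Doing the substitution in one pass also means text produced by one replacement is never rewritten by another.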
File output strategies
save_to_file queues a file-write request; the audio is written when runAndWait() processes the queue:
engine.save_to_file("Chapter one content here", "chapter1.wav")
engine.save_to_file("Chapter two content here", "chapter2.wav")
engine.runAndWait()
Both files are written during the single runAndWait() call. Output format depends on the driver:
- SAPI5 — WAV by default; some voices support MP3 via stream configuration
- nsss — AIFF by default
- espeak — WAV by default
For consistent MP3 output, generate WAV first and convert with Pydub:
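import pyttsx3
from pydub import AudioSegment

text = "Text to convert"  # placeholder input
engine = pyttsx3.init()
engine.save_to_file(text, "temp.wav")
engine.runAndWait()
# MP3 export relies on ffmpeg being installed and on PATH
audio = AudioSegment.from_wav("temp.wav")
audio.export("output.mp3", format="mp3", bitrate="192k")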
Batch audiobook generation
Generate an audiobook from structured text:
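import json
from pathlib import Path

import pyttsx3

# Assumes chapters.json holds a list of objects, each with a "text" field
chapters = json.loads(Path("chapters.json").read_text())
Path("audiobook").mkdir(exist_ok=True)  # the output directory must exist

engine = pyttsx3.init()
engine.setProperty('rate', 160)
for i, chapter in enumerate(chapters, start=1):
    filename = f"audiobook/chapter_{i:02d}.wav"
    engine.save_to_file(chapter['text'], filename)
engine.runAndWait()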
For long texts, split into paragraphs and add pauses between them:
def text_to_utterances(text, pause_ms=800):
    paragraphs = text.strip().split('\n\n')
    utterances = []
    for p in paragraphs:
        utterances.append(p.strip())
        utterances.append(f"[pause:{pause_ms}]")  # marker for post-processing
    return utterances
Since pyttsx3 does not support inline pauses natively, generate separate audio files per paragraph and concatenate with Pydub, inserting silence segments between them.
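The concatenation step can be sketched as follows. To stay dependency-free, this version uses the standard-library wave module instead of Pydub. It assumes every per-paragraph file is uncompressed PCM WAV with the same channel count, sample width, and frame rate. concat_wavs_with_silence is a hypothetical helper.

```python
import wave


def concat_wavs_with_silence(paths, out_path, pause_ms=800):
    """Join WAV files into out_path with pause_ms of silence between them."""
    if not paths:
        raise ValueError("paths must not be empty")
    fmt = None  # (channels, sample width, frame rate) of the first file
    with wave.open(out_path, "wb") as out:
        for index, path in enumerate(paths):
            with wave.open(path, "rb") as src:
                src_fmt = (src.getnchannels(), src.getsampwidth(), src.getframerate())
                frames = src.readframes(src.getnframes())
            if fmt is None:
                fmt = src_fmt
                out.setnchannels(fmt[0])
                out.setsampwidth(fmt[1])
                out.setframerate(fmt[2])
            elif src_fmt != fmt:
                raise ValueError(f"{path} has a different audio format")
            if index > 0:
                # Zero bytes are silence for 16-bit PCM (8-bit WAV would need 0x80)
                n_frames = int(fmt[2] * pause_ms / 1000)
                out.writeframes(b"\x00" * (n_frames * fmt[0] * fmt[1]))
            out.writeframes(frames)
```

Files that are not real WAV data, such as AIFF output saved under a .wav name, will be rejected by wave.open, so on such platforms a conversion step or Pydub is likely the better fit.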
Integration with speech recognition
Build a conversational loop combining pyttsx3 with the SpeechRecognition library:
import pyttsx3
import speech_recognition as sr

recognizer = sr.Recognizer()
engine = pyttsx3.init()

def speak(text):
    engine.say(text)
    engine.runAndWait()

def listen():
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source, timeout=5)
    return recognizer.recognize_google(audio)

def generate_answer(query):
    # Placeholder: swap in your own answer logic
    return f"You asked: {query}"

speak("What would you like to know?")
while True:
    try:
        query = listen()
        response = generate_answer(query)
        speak(response)
    except sr.WaitTimeoutError:
        speak("I didn't hear anything. Try again.")
    except sr.UnknownValueError:
        speak("Sorry, I couldn't understand that.")
    except sr.RequestError:
        # recognize_google needs network access to reach the recognition service
        speak("The speech service is unavailable right now.")
Performance characteristics
Synthesis latency varies by platform and voice:
| Platform | Voice Type | Latency (first word) | Throughput |
|---|---|---|---|
| Windows SAPI5 | Standard | ~50ms | Fast |
| Windows SAPI5 | OneCore Neural | ~200ms | Moderate |
| macOS nsss | Enhanced | ~100ms | Fast |
| Linux espeak | Default | ~20ms | Very fast |
For latency-sensitive applications, pre-generate common responses and cache the audio files. Only synthesize dynamically for unpredictable content.
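A minimal sketch of that caching idea follows. It keys each phrase by a hash of its text and only synthesizes on a cache miss. cached_audio_path and the synthesize callback are hypothetical names. With pyttsx3, the callback would typically call engine.save_to_file(text, str(path)) followed by engine.runAndWait().

```python
import hashlib
from pathlib import Path


def cached_audio_path(text, cache_dir, synthesize):
    """Return a cached audio file for text, calling synthesize(text, path) on a miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the text so arbitrary phrases map to safe, stable file names
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    path = cache_dir / f"{key}.wav"
    if not path.exists():
        synthesize(text, path)
    return path
```

If the voice or rate can change between runs, those settings probably belong in the hashed key as well, otherwise stale audio would be served.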
Common issues and fixes
- Engine hangs on runAndWait(): usually caused by calling from a thread that is not the engine’s owner thread; use the queue pattern described above
- No voices found on Linux: install espeak-ng with sudo apt install espeak-ng
- Garbled audio on macOS: ensure only one NSSpeechSynthesizer instance exists; the singleton pattern in pyttsx3 handles this, but creating multiple engines via different methods breaks it
- Rate changes not taking effect: set properties before calling say(), not after
- Memory with very long texts: break text into chunks of a few hundred words each to keep memory usage stable
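The chunking advice in the last item can be sketched as a simple word-count splitter. chunk_words is a hypothetical helper, and the default of 300 words is an assumption standing in for "a few hundred".

```python
def chunk_words(text, max_words=300):
    """Split text into chunks of at most max_words words, preserving word order."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


chunks = chunk_words("one two three four five", max_words=2)
print(chunks)  # ['one two', 'three four', 'five']
```

This version can cut in the middle of a sentence. Splitting on sentence or paragraph boundaries first would likely sound more natural, at the cost of uneven chunk sizes.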
Alternatives comparison
| Library | Offline | Quality | Languages | Setup |
|---|---|---|---|---|
| pyttsx3 | Yes | OS-dependent | Many | Minimal |
| gTTS | No | Good (Google) | 50+ | Minimal (needs internet) |
| Coqui TTS | Yes | Neural quality | 20+ | Heavy (models) |
| Azure TTS | No | Excellent | 100+ | API key + cost |
| ElevenLabs | No | Premium | 29+ | API key + cost |
pyttsx3 wins on simplicity and offline operation. For production applications where voice quality matters, neural TTS engines are worth the extra complexity.
The one thing to remember: pyttsx3’s value is offline TTS with a three-line API and minimal setup, but production use requires thread-safe wrappers, text preprocessing for pronunciation, and often a Pydub post-processing step for consistent audio output across platforms.