Speech synthesis models
Our models are built for different use cases, from expressive English narration to native-quality multilingual synthesis.
Simba 3.0
Streaming-native English speech with rich expressivity
Our flagship streaming-native model. Lower time-to-first-byte than previous generations, with finer-grained emotional control, zero-shot voice cloning, and SSML prosody. Currently English only; multilingual support is in development.
Emotion Control
Generate the same text with different emotional expressions. Simba models emotion at the prosody level — not just speed and pitch, but the subtle rhythmic and tonal patterns that convey feeling.
“Every moment of light and dark is a miracle.”
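The emotion is selected in the SSML input itself, using the `<speechify:style>` tag shown in the code example at the bottom of this page. As a minimal sketch, the same line can be wrapped for several emotions; the specific emotion names used here are illustrative assumptions, not a definitive list.

```python
def emotion_ssml(text: str, emotion: str) -> str:
    """Wrap text in SSML with a speechify:style emotion attribute."""
    return (
        f'<speak><speechify:style emotion="{emotion}">'
        f"{text}</speechify:style></speak>"
    )

line = "Every moment of light and dark is a miracle."
# Emotion names below are assumed for illustration.
for emotion in ("cheerful", "calm", "suspenseful"):
    print(emotion_ssml(line, emotion))
```

Each resulting string can be passed as the `input` parameter of the same synthesis call, so varying only the `emotion` attribute regenerates the line with a different delivery.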
Simba 1.6
Native-quality speech across 50+ languages
Built for non-English and mixed-language input across 50+ locales. Locale-specific voices preserve natural pronunciation and prosody, with the same voice cloning, emotion, and SSML support as the English models. Non-streaming synthesis.
Multilingual Synthesis
Native-quality speech across 50+ locales. Each language uses voices recorded in that locale for natural pronunciation and prosody, with mixed-language input handled automatically.
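Since each locale uses voices recorded in that locale, a caller typically resolves the voice from the input's locale before synthesizing. A minimal sketch of that lookup, assuming a hypothetical voice catalog (the non-English voice IDs below are placeholders, not real catalog entries; `"george"` comes from the code example at the bottom of this page):

```python
# Hypothetical locale-to-voice mapping; real voice IDs come from the API's
# voice catalog.
VOICES_BY_LOCALE = {
    "en-US": "george",      # from the example on this page
    "fr-FR": "fr-voice-1",  # placeholder
    "ja-JP": "ja-voice-1",  # placeholder
}

def voice_for(locale: str, default: str = "george") -> str:
    """Return the voice recorded for a locale, falling back to a default."""
    return VOICES_BY_LOCALE.get(locale, default)
```

The selected ID is then passed as `voice_id` in the synthesis call; mixed-language segments within the input are handled by the model itself.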
Zero-Shot Voice Cloning
Clone any voice from a short reference clip. Simba captures speaker identity — timbre, cadence, and micro-expressions — from as little as 10 seconds of audio.
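Given the 10-second minimum stated above, a caller might validate a reference clip's length before uploading it. This is a client-side sketch only; the cloning endpoint itself is not shown here.

```python
MIN_CLONE_SECONDS = 10.0  # minimum reference length stated above

def clip_seconds(num_samples: int, sample_rate: int) -> float:
    """Duration of a mono PCM clip in seconds."""
    return num_samples / sample_rate

def long_enough_for_cloning(num_samples: int, sample_rate: int) -> bool:
    """True if the clip meets the 10-second cloning minimum."""
    return clip_seconds(num_samples, sample_rate) >= MIN_CLONE_SECONDS

# e.g. 12 s of 44.1 kHz audio
long_enough_for_cloning(12 * 44_100, 44_100)  # True
```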
All models, one API
Access every model through the same endpoint. Switch between models with a single parameter change.
from speechify import Speechify

client = Speechify()  # reads the SPEECHIFY_API_KEY environment variable

# Synthesize SSML with an emotion style tag and save the result as MP3.
response = client.tts.audio.speech(
    input='<speak><speechify:style emotion="cheerful">Every moment of light and dark is a miracle.</speechify:style></speak>',
    voice_id="george",
    model="simba-3.0",  # switch models by changing only this parameter
    audio_format="mp3",
)

with open("output.mp3", "wb") as f:
    f.write(response.audio_data)