Speech synthesis models

Our models are built for different use cases — from expressive English narration to native-quality multilingual synthesis.

Model

Simba 3.0

Streaming-native English speech with rich expressivity

Our flagship streaming-native model. Lower time-to-first-byte than previous generations, with finer-grained emotional control, zero-shot voice cloning, and SSML prosody. Currently English only; multilingual support is in development.

Latency: <300ms
Languages: English
Voices: 1,000+
Sample rate: 24kHz
Streaming-native architecture · Emotional expression · Zero-shot voice cloning · SSML prosody control
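The practical payoff of a streaming-native model is that playback can start before synthesis finishes. The SDK's streaming interface is not shown on this page, so the sketch below only demonstrates the consumption pattern, with a stand-in generator of fake audio chunks where the real streaming response would be:

```python
import io

def play_stream(chunks, sink):
    """Write encoded-audio chunks to a sink as they arrive, so playback
    can begin well before the full utterance has been synthesized."""
    total = 0
    for chunk in chunks:  # each chunk is a bytes object of encoded audio
        sink.write(chunk)  # in practice: feed an audio player or socket
        total += len(chunk)
    return total

# Stand-in for a streaming response: three fake 64-byte chunks.
fake_chunks = (bytes(64) for _ in range(3))
buffer = io.BytesIO()
assert play_stream(fake_chunks, buffer) == 192
```

With a real streaming endpoint, `fake_chunks` would be replaced by the iterator the SDK returns; the consumer side stays the same.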
Demo

Emotion Control

Generate the same text with different emotional expressions. Simba models emotion at the prosody level — not just speed and pitch, but the subtle rhythmic and tonal patterns that convey feeling.

Neutral · Happy · Sad · Excited · Calm · Mystery

“Every moment of light and dark is a miracle.”
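Emotion is selected through the `<speechify:style emotion="…">` SSML tag used in the API example at the bottom of this page. A minimal helper for wrapping plain text, assuming the lowercased demo labels above are valid emotion values (the page's own code sample passes `cheerful`, so the accepted set is evidently wider than the six buttons shown):

```python
# Lowercased names from the demo buttons above; the exact value strings
# the API accepts are an assumption.
EMOTIONS = {"neutral", "happy", "sad", "excited", "calm", "mystery"}

def with_emotion(text: str, emotion: str) -> str:
    """Wrap plain text in the SSML style tag shown in the API example."""
    if emotion not in EMOTIONS:
        raise ValueError(f"unknown emotion: {emotion!r}")
    return (
        f'<speak><speechify:style emotion="{emotion}">'
        f"{text}</speechify:style></speak>"
    )

ssml = with_emotion("Every moment of light and dark is a miracle.", "excited")
```

The resulting string is what you pass as `input` to the `speech()` call shown below under "All models, one API".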

Model

Simba 1.6

Native-quality speech across 50+ languages

Built for non-English and mixed-language input across 50+ locales. Locale-specific voices preserve natural pronunciation and prosody, with the same voice cloning, emotion, and SSML support as the English models. Non-streaming synthesis.

Latency: <750ms
Languages: 50+
Voices: 1,000+
Sample rate: 24kHz
50+ languages · Mixed-language input · Zero-shot voice cloning · Emotional expression · SSML prosody control
Demo

Multilingual Synthesis

Native-quality speech across 50+ locales. Each language uses voices recorded in that locale for natural pronunciation and prosody, with mixed-language input handled automatically.
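Since both models share one endpoint, routing to the multilingual model is a matter of swapping the `model` parameter. A small sketch that builds the keyword arguments for `client.tts.audio.speech()`; the model id strings follow the names on this page, and the voice id `"greta"` is purely hypothetical:

```python
def speech_kwargs(text: str, voice_id: str, lang: str) -> dict:
    """Build kwargs for client.tts.audio.speech(), picking the English
    model for 'en' and the multilingual model otherwise. The exact model
    id strings the API accepts are an assumption based on this page."""
    model = "simba-3.0" if lang == "en" else "simba-1.6"
    return {
        "input": text,
        "voice_id": voice_id,
        "model": model,
        "audio_format": "mp3",
    }

# Hypothetical German voice id; mixed-language input needs no extra flags.
kwargs = speech_kwargs("Guten Morgen, wie geht es dir?", "greta", "de")
# response = client.tts.audio.speech(**kwargs)
```

Only the `model` (and locale-specific `voice_id`) changes between the English and multilingual paths; everything else in the request stays identical.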

Demo

Zero-Shot Voice Cloning

Clone any voice from a short reference clip. Simba captures speaker identity — timbre, cadence, and micro-expressions — from as little as 10 seconds of audio.

Reference (original speaker) · Clone (Simba output)
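The upload endpoint for cloning is not shown on this page, so the sketch below covers only a local pre-check: verifying that a reference clip meets the ten-second guideline before sending it anywhere. The WAV here is synthetic silent audio standing in for a real recording:

```python
import io
import wave

MIN_CLONE_SECONDS = 10.0  # "as little as 10 seconds" per the page

def clip_seconds(wav_bytes: bytes) -> float:
    """Duration of a WAV clip, for validating a cloning reference locally."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        return w.getnframes() / w.getframerate()

def make_silent_wav(seconds: float, rate: int = 16_000) -> bytes:
    """Synthetic mono 16-bit PCM clip, used only as stand-in reference audio."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)        # 16-bit samples
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * int(rate * seconds))
    return buf.getvalue()

ref = make_silent_wav(12.0)
assert clip_seconds(ref) >= MIN_CLONE_SECONDS
```

A check like this is cheap client-side insurance: rejecting a three-second clip locally beats a round trip that yields a poor-quality clone.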

All models, one API

Access every model through the same endpoint. Switch between models with a single parameter change.

python
from speechify import Speechify

client = Speechify()  # uses SPEECHIFY_API_KEY env var

response = client.tts.audio.speech(
    input='<speak><speechify:style emotion="cheerful">Every moment of light and dark is a miracle.</speechify:style></speak>',
    voice_id="george",
    model="simba-3.0",
    audio_format="mp3",
)

with open("output.mp3", "wb") as f:
    f.write(response.audio_data)