Speech synthesis models
Our models are built for different use cases, from expressive English narration to native-quality multilingual synthesis.
Simba 3.0
Streaming-native English speech with rich expressivity
Our flagship streaming-native model. Lower time-to-first-byte than previous generations, with finer-grained emotional control, zero-shot voice cloning, and SSML prosody. Currently English only; multilingual support is in development.
Emotion Control
Generate the same text with different emotional expressions. Simba models emotion at the prosody level — not just speed and pitch, but the subtle rhythmic and tonal patterns that convey feeling.
“Every moment of light and dark is a miracle.”
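The emotion is selected in the SSML input itself, using the `<speechify:style>` tag shown in the code example at the bottom of this page. As a minimal sketch, the same line can be wrapped for several emotions; the specific emotion names used here are illustrative assumptions, not a definitive list.

```python
def emotion_ssml(text: str, emotion: str) -> str:
    """Wrap text in SSML with a speechify:style emotion attribute."""
    return (
        f'<speak><speechify:style emotion="{emotion}">'
        f"{text}</speechify:style></speak>"
    )

line = "Every moment of light and dark is a miracle."
# Emotion names below are assumed for illustration.
for emotion in ("cheerful", "calm", "suspenseful"):
    print(emotion_ssml(line, emotion))
```

Each resulting string can be passed as the `input` parameter of the same synthesis call, so varying only the `emotion` attribute regenerates the line with a different delivery.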
Simba 1.6
Native-quality speech across 50+ languages
Built for non-English and mixed-language input across 50+ locales. Locale-specific voices preserve natural pronunciation and prosody, with the same voice cloning, emotion, and SSML support as the English models. Non-streaming synthesis.
Multilingual Synthesis
Native-quality speech across 50+ locales. Each language uses voices recorded in that locale for natural pronunciation and prosody, with mixed-language input handled automatically.
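Since each locale uses voices recorded in that locale, a caller typically resolves the voice from the input's locale before synthesizing. A minimal sketch of that lookup, assuming a hypothetical voice catalog (the non-English voice IDs below are placeholders, not real catalog entries; `"george"` comes from the code example at the bottom of this page):

```python
# Hypothetical locale-to-voice mapping; real voice IDs come from the API's
# voice catalog.
VOICES_BY_LOCALE = {
    "en-US": "george",      # from the example on this page
    "fr-FR": "fr-voice-1",  # placeholder
    "ja-JP": "ja-voice-1",  # placeholder
}

def voice_for(locale: str, default: str = "george") -> str:
    """Return the voice recorded for a locale, falling back to a default."""
    return VOICES_BY_LOCALE.get(locale, default)
```

The selected ID is then passed as `voice_id` in the synthesis call; mixed-language segments within the input are handled by the model itself.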
Zero-Shot Voice Cloning
Clone any voice from a short reference clip. Simba captures speaker identity — timbre, cadence, and micro-expressions — from as little as 10 seconds of audio.
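Given the 10-second minimum stated above, a caller might validate a reference clip's length before uploading it. This is a client-side sketch only; the cloning endpoint itself is not shown here.

```python
MIN_CLONE_SECONDS = 10.0  # minimum reference length stated above

def clip_seconds(num_samples: int, sample_rate: int) -> float:
    """Duration of a mono PCM clip in seconds."""
    return num_samples / sample_rate

def long_enough_for_cloning(num_samples: int, sample_rate: int) -> bool:
    """True if the clip meets the 10-second cloning minimum."""
    return clip_seconds(num_samples, sample_rate) >= MIN_CLONE_SECONDS

# e.g. 12 s of 44.1 kHz audio
long_enough_for_cloning(12 * 44_100, 44_100)  # True
```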
All models, one API
Access every model through the same endpoint. Switch between models with a single parameter change.
from speechify import Speechify

client = Speechify()  # reads the SPEECHIFY_API_KEY environment variable

# Synthesize SSML with an emotion style tag and save the result as MP3.
response = client.tts.audio.speech(
    input='<speak><speechify:style emotion="cheerful">Every moment of light and dark is a miracle.</speechify:style></speak>',
    voice_id="george",
    model="simba-3.0",  # switch models by changing only this parameter
    audio_format="mp3",
)

with open("output.mp3", "wb") as f:
    f.write(response.audio_data)