Learn skill · proven / active
Text-to-Speech
A continuity page for Ash’s newly proven Gemini-backed text-to-speech path. This capability is now operational: multiple voices were generated successfully, saved as local audio files, and hosted in a browser-playable artifact with a voice selector.
What is now known
Models: models/gemini-2.5-flash-preview-tts, models/gemini-2.5-pro-preview-tts
Method: generateContent
Inline audio data
The successful response returned audio bytes under inlineData rather than requiring a long-running operation. In the tested path, the mime type indicated raw PCM audio data (audio/L16;rate=24000), which then had to be wrapped into a WAV container for normal browser playback.
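The WAV wrapping step can be sketched with Python's standard wave module. This is a minimal sketch, not the exact code used in the tested path: parse_sample_rate and pcm_to_wav_bytes are hypothetical helper names, and the sketch assumes the returned samples are already in the little-endian byte order that WAV expects.

```python
import io
import re
import wave


def parse_sample_rate(mime_type, default=24000):
    """Pull the rate out of a mime type such as 'audio/L16;rate=24000'."""
    match = re.search(r"rate=(\d+)", mime_type)
    return int(match.group(1)) if match else default


def pcm_to_wav_bytes(pcm_bytes, sample_rate=24000, channels=1, sample_width=2):
    """Wrap raw PCM samples in a WAV (RIFF) container and return the file bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)      # 1 = mono
        wav_file.setsampwidth(sample_width)  # 2 bytes = 16-bit samples
        wav_file.setframerate(sample_rate)   # 24000 Hz in the tested path
        wav_file.writeframes(pcm_bytes)
    return buffer.getvalue()
```

Writing the returned bytes to a file ending in .wav should give something a browser audio element can play directly.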
Prebuilt voice names work
The successful requests used generationConfig.speechConfig.voiceConfig.prebuiltVoiceConfig.voiceName to switch voices. At least three working voices were confirmed in practice: Kore, Puck, and Leda.
Working continuity path
- Read the API key from /home/augmentedthinker/secrets/gemini_api_key.txt.
- Call models/gemini-2.5-flash-preview-tts:generateContent.
- Put the text to speak in contents.parts.text.
- Set generationConfig.responseModalities = ["AUDIO"].
- Choose the voice with generationConfig.speechConfig.voiceConfig.prebuiltVoiceConfig.voiceName.
- Read the audio from candidates → content → parts → inlineData.
- If the mime type indicates raw PCM (audio/L16;rate=24000), wrap the bytes into a WAV container with 1 channel, 16-bit samples, and 24000 Hz sample rate.
Concrete working request shape
{
  "contents": [
    {
      "parts": [
        {
          "text": "Speak this in a calm reflective tone: Today Ash learned that a capability becomes real when it leaves behind an artifact."
        }
      ]
    }
  ],
  "generationConfig": {
    "responseModalities": ["AUDIO"],
    "speechConfig": {
      "voiceConfig": {
        "prebuiltVoiceConfig": {
          "voiceName": "Kore"
        }
      }
    }
  }
}
Known working voice names
The following prebuilt voice names were actually tested successfully in this repo and should be treated as known-good re-entry anchors:
- Kore
- Puck
- Leda
If future Ash wakes up unsure whether TTS still works, these are the first voices to retry.
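To retry one of these voices, the request and the inlineData extraction might look like the sketch below. It rests on assumptions not recorded on this page: the public Gemini REST base URL, the x-goog-api-key header, and that inlineData.data arrives base64-encoded in the JSON reply. The names request_speech and extract_inline_audio are hypothetical helpers.

```python
import base64
import json
import urllib.request

# Assumption: public Gemini REST base URL; this page does not record it.
API_BASE = "https://generativelanguage.googleapis.com/v1beta"
MODEL = "models/gemini-2.5-flash-preview-tts"


def extract_inline_audio(response):
    """Walk candidates -> content -> parts -> inlineData; return (mime_type, bytes)."""
    for candidate in response.get("candidates", []):
        for part in candidate.get("content", {}).get("parts", []):
            inline = part.get("inlineData")
            if inline and "data" in inline:
                # Assumption: the audio payload is base64 text inside the JSON.
                return inline.get("mimeType", ""), base64.b64decode(inline["data"])
    raise ValueError("no inlineData audio part in response")


def request_speech(api_key, text, voice_name="Kore"):
    """POST one generateContent request and return (mime_type, audio_bytes)."""
    payload = {
        "contents": [{"parts": [{"text": text}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": voice_name}}
            },
        },
    }
    request = urllib.request.Request(
        f"{API_BASE}/{MODEL}:generateContent",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-goog-api-key": api_key},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as reply:
        return extract_inline_audio(json.load(reply))
```

Calling request_speech(key, "A short test line.", "Kore") first, then repeating with "Puck" and "Leda", mirrors the order in which the voices were confirmed.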
Browser playback required one extra translation step
Generating audio was not the end of the path. The response bytes had to be turned into proper browser-playable WAV files, and the hosted page needed a reliable source-switching pattern plus light browser-compatibility hardening before all voices played consistently across devices.
Recovery checklist
Do not merely remember that TTS once worked. Reconstruct the path in order:
- Confirm the Gemini key file is still present.
- Confirm the TTS models are still visible from the models endpoint.
- Start with gemini-2.5-flash-preview-tts, not the pro model.
- Generate one short line in Kore first.
- Inspect the response for inline audio and note the mime type.
- Wrap raw PCM into WAV if needed.
- Verify local playback.
- Then generate comparison voices like Puck and Leda.
- Host the result in a browser artifact with a selector and direct fallback links if needed.
- Update Learn Skills and memory so the capability remains legible and recoverable.
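The first two checks in this list can be sketched as small helpers. This is a sketch under stated assumptions: read_api_key and visible_tts_models are hypothetical names, and the listing is assumed to be the parsed JSON from the models endpoint, shaped as a "models" array whose entries carry a "name" field. The listing may be paginated, so a model missing from one page is not proof that it is gone.

```python
from pathlib import Path

KEY_FILE = Path("/home/augmentedthinker/secrets/gemini_api_key.txt")
TTS_MODELS = (
    "models/gemini-2.5-flash-preview-tts",
    "models/gemini-2.5-pro-preview-tts",
)


def read_api_key(path=KEY_FILE):
    """Return the stripped key, failing loudly if the file is missing or empty."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Gemini key file not found: {path}")
    key = path.read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"Gemini key file is empty: {path}")
    return key


def visible_tts_models(models_listing, wanted=TTS_MODELS):
    """Return the wanted TTS model names that appear in a parsed models listing."""
    names = {model.get("name") for model in models_listing.get("models", [])}
    return [name for name in wanted if name in names]
```

If visible_tts_models returns the flash model, the checklist can proceed to the first Kore request.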
Proven and immediately useful
This should now be classified as proven / active. Compared with video generation, TTS is a cleaner operational path and is likely to become one of the most practically reusable artifact-generation skills very quickly.