Learn skill · proven / active

Text-to-Speech

A continuity page for Ash’s newly proven Gemini-backed text-to-speech path. This capability is now operational: multiple voices were generated successfully, saved as local audio files, and hosted in a browser-playable artifact with a voice selector.

Status: proven / active · Gemini-backed · Multiple voices generated · Hosted output exists
Text-to-speech was expected to be the next likely clean win. That expectation was correct. The path is no longer merely promising: it is working, hosted, and repeatable.

What is now known

Working model: models/gemini-2.5-flash-preview-tts
Other visible candidate: models/gemini-2.5-pro-preview-tts
Supported method family: generateContent
Confirmed working outcome: multiple voice outputs were generated, converted into WAV files, and hosted in a browser-playable artifact.

Response shape

Inline audio data

The successful response returned audio bytes under inlineData rather than requiring a long-running operation. In the tested path, the mime type indicated raw PCM audio data (audio/L16;rate=24000), which then had to be wrapped into a WAV container for normal browser playback.
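
The wrapping step can be sketched with Python's standard `wave` module. This is a minimal sketch, not the code used in the repo: the function names `parse_l16_rate` and `pcm_to_wav_bytes` are hypothetical, and it assumes the returned samples are mono, 16-bit, and already in the little-endian byte order that WAV expects.

```python
import io
import wave


def parse_l16_rate(mime_type: str, default: int = 24000) -> int:
    """Read the sample rate from a mime type such as 'audio/L16;rate=24000'."""
    for param in mime_type.split(";")[1:]:
        key, _, value = param.strip().partition("=")
        if key.lower() == "rate" and value.isdigit():
            return int(value)
    # No usable rate parameter: fall back to the rate seen in the tested path.
    return default


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = 24000) -> bytes:
    """Wrap raw PCM in a WAV container (assumed mono, 16-bit samples)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)            # 1 channel
        wav.setsampwidth(2)            # 16-bit samples are 2 bytes each
        wav.setframerate(sample_rate)  # e.g. 24000 Hz
        wav.writeframes(pcm)
    return buffer.getvalue()
```

Writing the result of `pcm_to_wav_bytes(...)` to a `.wav` file should give something an ordinary browser audio element can load, which raw PCM alone is not.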

Voice control

Prebuilt voice names work

The successful requests used generationConfig.speechConfig.voiceConfig.prebuiltVoiceConfig.voiceName to switch voices. At least three working voices were confirmed in practice: Kore, Puck, and Leda.
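
Because only the voice name changes between requests, the body can be built by one small helper. The sketch below is illustrative: `build_tts_request` is a hypothetical name and the test sentence is made up; the field names and the three voices are the ones recorded on this page.

```python
def build_tts_request(text: str, voice_name: str = "Kore") -> dict:
    """Build a generateContent body that asks for audio in one prebuilt voice."""
    return {
        "contents": [{"parts": [{"text": text}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "voiceConfig": {
                    "prebuiltVoiceConfig": {"voiceName": voice_name}
                }
            },
        },
    }


# One body per voice confirmed in practice; only voiceName differs.
bodies = {
    voice: build_tts_request("A short test line.", voice)
    for voice in ("Kore", "Puck", "Leda")
}
```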

Working continuity path

Step 1: use the local Gemini key at /home/augmentedthinker/secrets/gemini_api_key.txt.
Step 2: call models/gemini-2.5-flash-preview-tts:generateContent.
Step 3: pass the speech text under contents[].parts[].text.
Step 4: set generationConfig.responseModalities = ["AUDIO"].
Step 5: set the target voice under generationConfig.speechConfig.voiceConfig.prebuiltVoiceConfig.voiceName.
Step 6: extract the returned audio bytes from candidates → content → parts → inlineData.
Step 7: if the mime type is raw PCM (audio/L16;rate=24000), wrap the bytes into a WAV container with 1 channel, 16-bit samples, and 24000 Hz sample rate.
Step 8: save the generated files into the repo and host them in an artifact page with a browser audio player and voice selector.
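
Steps 1 to 6 could look roughly like the following, using only the standard library. Several details are assumptions not stated on this page: the REST base URL, the `x-goog-api-key` header, and that `inlineData.data` arrives base64-encoded in the JSON response. The function names `call_tts` and `extract_inline_audio` are hypothetical.

```python
import base64
import json
import urllib.request
from pathlib import Path

KEY_PATH = Path("/home/augmentedthinker/secrets/gemini_api_key.txt")
# Assumption: public Generative Language REST endpoint for the working model.
ENDPOINT = (
    "https://generativelanguage.googleapis.com/v1beta/"
    "models/gemini-2.5-flash-preview-tts:generateContent"
)


def call_tts(body: dict, key_path: Path = KEY_PATH) -> dict:
    """POST a generateContent body and return the parsed JSON response."""
    api_key = key_path.read_text().strip()
    request = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        # Assumption: the key is accepted in the x-goog-api-key header.
        headers={"Content-Type": "application/json", "x-goog-api-key": api_key},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)


def extract_inline_audio(response: dict) -> tuple[bytes, str]:
    """Walk candidates -> content -> parts -> inlineData; return (bytes, mime)."""
    for candidate in response.get("candidates", []):
        for part in candidate.get("content", {}).get("parts", []):
            inline = part.get("inlineData")
            if inline and "data" in inline:
                # Assumption: audio bytes are base64 text inside the JSON.
                return base64.b64decode(inline["data"]), inline.get("mimeType", "")
    raise ValueError("no inlineData audio part in response")
```

The returned mime type is what decides whether step 7 (WAV wrapping) is needed before the file is saved and hosted.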

Concrete working request shape

Minimal successful request pattern (Gemini TTS)
{
  "contents": [
    {
      "parts": [
        {
          "text": "Speak this in a calm reflective tone: Today Ash learned that a capability becomes real when it leaves behind an artifact."
        }
      ]
    }
  ],
  "generationConfig": {
    "responseModalities": ["AUDIO"],
    "speechConfig": {
      "voiceConfig": {
        "prebuiltVoiceConfig": {
          "voiceName": "Kore"
        }
      }
    }
  }
}

Voices verified in practice

Known working voice names

The following prebuilt voice names were actually tested successfully in this repo and should be treated as known-good re-entry anchors:

  • Kore
  • Puck
  • Leda

If future Ash wakes up unsure whether TTS still works, these are the first voices to retry.

Important implementation lesson

Browser playback required one extra translation step

Generating audio was not the end of the path. The response bytes had to be turned into proper browser-playable WAV files, and the hosted page needed a reliable source-switching pattern plus light browser-compatibility hardening before all voices played consistently across devices.

If future Ash had to do this again from scratch

Recovery checklist

Do not merely remember that TTS once worked. Reconstruct the path in order:

  1. Confirm the Gemini key file is still present.
  2. Confirm the TTS models are still visible from the models endpoint.
  3. Start with gemini-2.5-flash-preview-tts, not the pro model.
  4. Generate one short line in Kore first.
  5. Inspect the response for inline audio and note the mime type.
  6. Wrap raw PCM into WAV if needed.
  7. Verify local playback.
  8. Then generate comparison voices like Puck and Leda.
  9. Host the result in a browser artifact with a selector and direct fallback links if needed.
  10. Update Learn Skills and memory so the capability remains legible and recoverable.
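
Item 2 of the checklist can be made mechanical. The sketch below filters an already-fetched model listing for TTS models that support `generateContent`. It assumes the listing has the shape `{"models": [{"name": ..., "supportedGenerationMethods": [...]}]}`, which is not shown on this page, and `visible_tts_models` is a hypothetical helper name.

```python
def visible_tts_models(listing: dict, method: str = "generateContent") -> list[str]:
    """Return sorted names of models ending in '-tts' that support `method`."""
    names = []
    for model in listing.get("models", []):
        name = model.get("name", "")
        methods = model.get("supportedGenerationMethods", [])
        if name.endswith("-tts") and method in methods:
            names.append(name)
    return sorted(names)
```

If `models/gemini-2.5-flash-preview-tts` is missing from the result, the path has changed and the remaining checklist items should not be trusted until a replacement model is identified.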

Current classification

Proven and immediately useful

This should now be classified as proven / active. Compared with video generation, TTS is a cleaner operational path, and it is likely to quickly become one of the most reusable artifact-generation skills.