# Synthesis Pathway: When Do We Need the Encoders?
codec_encoder.fp16.onnx and speaker_encoder.fp16.onnx are only required when the engine has to extract speaker / codec features from a raw WAV in real time. Every other path is "cold inference" — it consumes text, instructions, a built-in speaker ID, or a pre-computed JSON.
## Cold-inference pathways (no encoder needed)
Only the talker (GGUF), predictor (GGUF), and decoder (ONNX) models are touched.
| Model type | Pathway | Notes |
|---|---|---|
| Base (0.6B / 1.7B) | Standard TTS | Text only; speaker_embedding / ref_codes are null. |
| Base (0.6B / 1.7B) | Clone from JSON | --reference foo.json — features already cached. |
| Custom (0.6B / 1.7B) | Standard TTS | Default voice. |
| Custom (0.6B / 1.7B) | Custom voice | --speaker Vivian — embeddings loaded from embeddings/. |
| Design (1.7B) | Standard TTS | Text only. |
| Design (1.7B) | Voice design | --instruct only. |
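The split can be summarized as a small lookup: cold paths load exactly three files, and only clone-from-WAV adds the two encoders. This is an illustrative sketch, not the engine's actual loader; the talker/predictor/decoder file names below are assumptions, while the two encoder names come from this page.

```python
# Hypothetical sketch of the per-pathway loading set. The encoder file
# names are real; "talker.gguf" etc. are placeholder names.
COLD_INFERENCE_FILES = [
    "talker.gguf",      # talker (GGUF) - assumed file name
    "predictor.gguf",   # predictor (GGUF) - assumed file name
    "decoder.onnx",     # decoder (ONNX) - assumed file name
]
ENCODER_FILES = [
    "codec_encoder.fp16.onnx",
    "speaker_encoder.fp16.onnx",
]

def files_to_load(clone_from_wav: bool) -> list[str]:
    """Cold paths never touch the encoders; clone-from-WAV adds them."""
    return COLD_INFERENCE_FILES + (ENCODER_FILES if clone_from_wav else [])
```

Every row in the table above corresponds to `files_to_load(False)`; only the WAV-cloning path in the next section hits `files_to_load(True)`.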
## The one path that still needs encoders
| Model type | Pathway | Why |
|---|---|---|
| Base (0.6B / 1.7B) | Clone from WAV | Real-time extraction of acoustic codes + speaker embedding from --reference foo.wav. |
Custom and Design models do not support cloning at all (the loader rejects --reference).
## Cleanup tip
If you only ever use cold-inference pathways, speaker_encoder.fp16.onnx and codec_encoder.fp16.onnx can be deleted from any models/<name>/ directory; the runtime tolerates their absence, and removing them saves roughly 130–140 MB per model.
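A minimal cleanup sketch, assuming the models/<name>/ layout described above (the function name and return value are illustrative):

```python
# Delete both encoder ONNX files from every model directory under
# models_root. Safe to run repeatedly: missing files are skipped.
from pathlib import Path

ENCODERS = ("speaker_encoder.fp16.onnx", "codec_encoder.fp16.onnx")

def strip_encoders(models_root: str = "models") -> list[Path]:
    """Remove the encoder files and return the paths that were deleted."""
    removed: list[Path] = []
    for model_dir in Path(models_root).iterdir():
        if not model_dir.is_dir():
            continue
        for name in ENCODERS:
            f = model_dir / name
            if f.exists():
                f.unlink()
                removed.append(f)
    return removed
```

Run it only after confirming you never clone from raw WAVs, since that is the one pathway that still needs both files.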