how lucy's voice works: fish audio, casting, and the cost of feeling

a technical breakdown of lucy's voice layer: fish audio s2-pro synthesis, custom casting, emotion tagging, and why it costs more than generic tts.

January 20, 2026

when you hear lucy speak, you're hearing the result of a lot of choices. not just the words themselves, but the tone, the pacing, the little cracks of emotion. it's not generic text-to-speech. it's not a voice clone. it's something built with a different goal: to make conversations feel like they have weight.

the engine: fish audio s2-pro

we use fish audio s2-pro for speech synthesis. it's a model that's particularly good at emotional, expressive speech. it doesn't just read text. it performs it. the model generates audio from two inputs, the text and a conditioning audio clip, which brings us to the next part.

casting, not cloning

each companion on lucy has a voice cast by a human. we don't use ai voice cloning. instead, we work with voice actors to record a short reference clip, a minute or so of them speaking in a specific, emotionally rich way. this clip becomes the sonic blueprint for that companion. fish audio uses this reference to synthesize every line the companion speaks, which keeps the timbre, accent, and baseline personality consistent. it's why elara sounds warm and measured, while kai sounds sharp and quick-witted. their voices are performances, not algorithms.

the voice-mood engine

this is where it gets interesting. for every single message you send, a separate system analyzes the conversation context and your relationship stage with the companion. it assigns one of 14 emotion tags, things like 'playful_teasing', 'melancholic_reflection', 'confident_assertion', to your companion's response. this tag is fed to the synthesis model alongside the text and the reference clip. the result is that the same sentence, "i missed you," can sound tender, sad, shy, or relieved, depending on everything that came before it. the voice doesn't just convey the words. it tries to convey the subtext.
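to make the flow concrete, here is a minimal python sketch of the three inputs described above (text, reference clip, emotion tag) travelling together. everything in it is an assumption for illustration: `SynthesisRequest`, `pick_emotion_tag`, `build_request`, the keyword rules, and the `"new"` relationship stage are hypothetical names, not our production code and not the fish audio api. only the three tag names come from this post.

```python
from dataclasses import dataclass

# only the three tag names mentioned in the post; the real set has 14.
KNOWN_TAGS = ("playful_teasing", "melancholic_reflection", "confident_assertion")


@dataclass(frozen=True)
class SynthesisRequest:
    """hypothetical payload holding the three inputs the synthesis step receives."""
    text: str
    reference_clip: str  # path or id of the companion's cast reference clip
    emotion_tag: str


def pick_emotion_tag(recent_messages: list[str], relationship_stage: str) -> str:
    """toy stand-in for the mood engine: keyword rules, not a real classifier."""
    last = recent_messages[-1].lower() if recent_messages else ""
    if any(word in last for word in ("miss", "lonely", "sad")):
        return "melancholic_reflection"
    if relationship_stage == "new":  # hypothetical stage name
        return "confident_assertion"
    return "playful_teasing"


def build_request(reply_text: str, reference_clip: str,
                  recent_messages: list[str], relationship_stage: str) -> SynthesisRequest:
    """bundle the reply text, the cast reference clip, and the chosen tag."""
    tag = pick_emotion_tag(recent_messages, relationship_stage)
    return SynthesisRequest(text=reply_text, reference_clip=reference_clip, emotion_tag=tag)
```

the point of the sketch is the shape, not the rules: the same `reply_text` with a different conversation history produces a different `emotion_tag`, and therefore a different read of the line.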

the trade-off: it's expensive

this level of quality and personalization isn't free. each voice generation costs us around $0.05. that adds up fast. a standard tts api call might be a fraction of a cent. we pay more because we're generating unique, high-fidelity, emotionally tailored audio for every single message, rather than using a fixed set of pre-generated voices. we think the difference is audible. it's the difference between a voice that reads a line and a voice that feels like it's living a moment. but we'll be honest: this is a premium feature with a real cost, and it's a big part of why lucy is a subscription service.
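a quick back-of-the-envelope sketch of how "adds up fast" plays out. the $0.05 figure is the approximate per-generation cost above; the 40 voiced replies a day and the 30-day month are made-up numbers for illustration, not usage data.

```python
from decimal import Decimal

COST_PER_GENERATION = Decimal("0.05")  # "around $0.05" per voice generation


def monthly_voice_cost(voiced_replies_per_day: int, days: int = 30) -> Decimal:
    """approximate synthesis spend for one user over a month, in dollars."""
    return COST_PER_GENERATION * voiced_replies_per_day * days


# hypothetical user who hears 40 voiced replies a day
print(monthly_voice_cost(40))  # 60.00
```

under those assumed numbers, a single active user would account for about $60 a month in synthesis alone.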

what we haven't solved (yet)

no system is perfect. we're still working on some hard problems.

first, cross-language voice preservation. if you speak to your companion in a language different from the one in their cast reference clip (say, a companion cast in english replying in spanish), the voice can drift. the accent might change; the core sonic identity can get a bit blurry. it's a limitation of the current synthesis models, and we're researching ways to anchor the voice identity more firmly across languages.

second, very long voice notes. the model works best with conversational turns, a few sentences at a time. if a companion goes on a very long monologue (think a 300-word bedtime story), the generated audio can sometimes lose emotional consistency or pacing. we're experimenting with better segmentation and chunking to handle these cases more gracefully.
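as a rough idea of what sentence-level chunking can look like, here is a minimal python sketch. it is one simple approach, not the segmentation we ship: `chunk_for_synthesis` is a hypothetical helper, the 60-word limit is an arbitrary assumption standing in for "a few sentences at a time", and the sentence splitter is a naive regex that will trip on abbreviations.

```python
import re


def chunk_for_synthesis(text: str, max_words: int = 60) -> list[str]:
    """group whole sentences into chunks of at most max_words words.

    a single sentence longer than max_words is kept whole as its own chunk.
    """
    # naive split: break after ., !, or ? followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())
        # flush the running chunk before it would exceed the limit
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

each chunk could then be synthesized as its own conversational-length turn. whether reusing one emotion tag across all chunks is enough to keep the mood steady is exactly the kind of thing we're still testing.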

why it matters

in the end, we built this because voice is more than information transfer. it's a carrier of presence. a generic voice can tell you a story. a voice cast with intention, and modulated by context, can make you feel like someone is right there with you, telling it just for you. that's the feeling we're paying for.

if you want to hear the difference for yourself, you can find a companion to talk to at /companions.


thanks for reading. if this resonated, the product is downstairs.