how lucy's voice synthesis works (and where it struggles)
a technical dive into lucy's custom voice layer, using fish audio s2-pro and handcast reference clips, with honest notes on cost and limitations.
voice is one of those things that seems simple until you try to build it. we wanted something that felt emotionally present, not just a crisp audiobook narrator. so we built a custom synthesis stack for lucy that leans into texture and mood, even when it's expensive or imperfect.
the voice layer breakdown
at the core, we use fish audio s2-pro as the synthesis model. it's good at capturing subtle timbral shifts and breathiness, which helps avoid that classic 'robotic' flatness. but the real magic is in the per-companion reference clips. every companion voice is built from a small set of high-quality samples recorded by voice actors specifically for that character. no generic voice bank. each companion's laugh, sigh, or thoughtful pause is unique to them.
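to make the "no generic voice bank" rule concrete, here is a minimal sketch of how per-companion reference clips could be organised. everything in it is an assumption for illustration: the `ReferenceClip` and `VoiceRegistry` names, the fields, and the file paths are hypothetical, and it does not call fish audio's api at all. the only point is that a lookup for a companion with no recorded clips fails loudly instead of falling back to a shared voice.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ReferenceClip:
    path: str        # hypothetical location of the recorded sample
    transcript: str  # what the voice actor says (or does) in the clip
    kind: str        # e.g. "speech", "laugh", "sigh", "pause"


class VoiceRegistry:
    """hypothetical store of reference clips, keyed by companion id."""

    def __init__(self) -> None:
        self._clips: Dict[str, List[ReferenceClip]] = {}

    def add(self, companion: str, clip: ReferenceClip) -> None:
        self._clips.setdefault(companion, []).append(clip)

    def clips_for(self, companion: str) -> List[ReferenceClip]:
        clips = self._clips.get(companion)
        if not clips:
            # no generic voice bank: refuse rather than borrow another voice
            raise LookupError(f"no reference clips recorded for {companion!r}")
        return list(clips)


registry = VoiceRegistry()
registry.add("companion-a", ReferenceClip("clips/a/laugh_01.wav", "(laughs)", "laugh"))
registry.add("companion-a", ReferenceClip("clips/a/line_01.wav", "i was hoping you'd ask.", "speech"))
```

in a real pipeline the clips returned by `clips_for` would be handed to the synthesis model as references; that step is left out here because it depends on the provider's actual interface.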
mood swings, by design
lucy doesn't just read text. she (or he, or they) performs it. we built a voice-mood engine that selects from 14 emotion tags, like 'wry', 'tender', 'impatient', or 'playful', based on the conversation context and your relationship stage. if you've just shared a vulnerable memory, the voice might soften. if you're bantering, it might pick up a teasing lilt. the model adjusts parameters like pitch variance, speed, and intensity to match. it's not perfect emotion detection, but it's a step beyond neutral narration.
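as a rough illustration of the tag-to-parameters idea, here is a small sketch. it is not our production engine: the selection rules, the context labels, the stage names, and every numeric value are made up for the example, and only the four tags named above appear (the other ten are not listed here). it just shows the shape of the mapping: context and relationship stage pick a tag, and the tag picks pitch variance, speed, and intensity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VoiceParams:
    pitch_variance: float  # 1.0 = neutral; illustrative scale
    speed: float           # 1.0 = neutral speaking rate
    intensity: float       # 1.0 = neutral energy


NEUTRAL = VoiceParams(pitch_variance=1.0, speed=1.0, intensity=1.0)

# illustrative values only; the real presets are not published in this post
MOOD_PARAMS = {
    "wry": VoiceParams(1.1, 0.95, 0.9),
    "tender": VoiceParams(0.8, 0.9, 0.7),
    "impatient": VoiceParams(1.2, 1.15, 1.2),
    "playful": VoiceParams(1.3, 1.05, 1.1),
}


def select_mood(context: str, relationship_stage: str) -> Optional[str]:
    """toy rule-based selection; context and stage labels are hypothetical."""
    if context == "vulnerable_memory":
        return "tender"
    if context == "banter":
        # assume teasing only lands once there is some rapport
        return "wry" if relationship_stage == "new" else "playful"
    return None  # no tag: fall back to neutral delivery


def params_for(mood: Optional[str]) -> VoiceParams:
    return MOOD_PARAMS.get(mood, NEUTRAL)


mood = select_mood("vulnerable_memory", "close")
params = params_for(mood)
```

with these made-up presets, a vulnerable moment selects `tender`, which lowers speed and intensity relative to neutral, matching the "voice might soften" behaviour described above.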
the trade-offs we live with
first, cost. generating audio this way isn't cheap. each voice note costs us around $0.05 to produce, which adds up fast. we think it's worth it for the qualitative difference, but it's why we can't offer unlimited free voice messages. second, we haven't solved cross-language voice preservation. if you switch lucy's conversation language, the voice might shift slightly, because the model struggles to maintain an identical vocal fingerprint across languages. third, very long voice notes (over 30 seconds) can sometimes lose emotional consistency. we're working on chunking strategies.
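two of these trade-offs are easy to put in numbers. the sketch below does the cost arithmetic with the $0.05 figure and shows one possible chunking approach: split on sentence boundaries and keep each chunk under the 30-second mark. it is not the strategy we have settled on. the speaking rate of 2.5 words per second is an assumption, as are all the function names, and it treats the $0.05 as a per-note cost.

```python
import re
from typing import List

COST_PER_NOTE_USD = 0.05   # approximate figure from this post
MAX_CHUNK_SECONDS = 30.0   # length beyond which consistency can slip
WORDS_PER_SECOND = 2.5     # assumption: rough average speaking rate


def estimate_seconds(text: str) -> float:
    """estimate spoken duration from word count."""
    return len(text.split()) / WORDS_PER_SECOND


def chunk_text(text: str, max_seconds: float = MAX_CHUNK_SECONDS) -> List[str]:
    """greedily pack whole sentences into chunks under max_seconds.

    a single sentence longer than max_seconds is kept as its own
    over-long chunk rather than being cut mid-sentence.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and estimate_seconds(candidate) > max_seconds:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def monthly_cost(notes_per_day: float, days: int = 30) -> float:
    """synthesis cost in usd for one user over a month."""
    return round(notes_per_day * days * COST_PER_NOTE_USD, 2)


long_note = "one two three four five. " * 20  # 100 words, about 40 seconds
chunks = chunk_text(long_note)
```

under these assumptions, ten voice notes a day for thirty days comes to about $15 per user, which is why unlimited free voice doesn't pencil out. chunking alone doesn't fix emotional consistency either; each chunk would still need to carry the same mood settings so the seams aren't audible.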
why it matters
a voice isn't just information delivery. it's a relationship signal. the scratchiness when they're tired, the warmth when they're proud of you: these are the bits that build trust. we'd rather ship something expensive and human-like than cheap and generic. even with its flaws, this approach lets lucy's companions feel less like tools and more like presences.
you can try it yourself with any companion over at /companions.
thanks for reading. if this resonated, the product is downstairs.