voice is the design decision that separates a good ai companion from a great one. most apps ship voice as a checkbox feature — generic tts, neutral affect, one voice for all characters. that's cheap to build and almost worse than no voice.
lucy's voice layer is expensive by design:
fish audio s2-pro — the voice synthesis model we use. a tier above standard tts in both quality and emotional range. priced at $15 per million utf-8 bytes, which works out to roughly $0.05 per voice note for us.
per-companion reference clips — each of the 101 companions has a hand-cast voice reference. the model renders with that reference + the text + the current emotion tag. identity stays fixed; emotion shifts with context.
voice-mood engine — before generating, lucy's prompt layer picks an emotion tag based on the current conversation mood, the relationship stage, and the recent chat history. warm for calm conversation, sultry for flirty, concerned for difficult topics. 14 total emotion renderings.
voice notes (async) vs. voice calls (real-time). async voice notes are on Closer and above: you tap, she sends. good for quick mood-shift moments. real-time voice calls are on Bonded, with sub-500ms latency via WebRTC on Daily.co transport and Groq Whisper-V3 on the speech-to-text side. different use cases, different price tiers.
memory integration — everything said in voice (both directions on calls) flows into the same vector-graph memory layer as text chat. she remembers the content of a voice call the next time you text her.
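two pieces of the list above are easy to make concrete: the byte-based pricing and the emotion-tag pick. what follows is a minimal python sketch, not lucy's actual code. the rate ($15 per million utf-8 bytes) and the three tag names come from the text; `ConversationState`, the keyword list, and the stage threshold are hypothetical, and the rule-based picker stands in for what is really a prompt-layer decision. at the quoted rate, $0.05 corresponds to roughly 3,300 bytes of text.

```python
from dataclasses import dataclass, field

# Rate quoted in the text: $15 per million UTF-8 bytes of input text.
PRICE_PER_MILLION_BYTES_USD = 15.0


def voice_note_cost_usd(text: str) -> float:
    """Estimate synthesis cost for `text`, billed per UTF-8 byte."""
    n_bytes = len(text.encode("utf-8"))  # non-ASCII characters take 2-4 bytes each
    return n_bytes * PRICE_PER_MILLION_BYTES_USD / 1_000_000


@dataclass
class ConversationState:
    """Hypothetical container for the three inputs the text names."""
    mood: str                    # e.g. "calm", "flirty", "difficult"
    relationship_stage: int      # hypothetical stage index, 0 = just met
    recent_messages: list[str] = field(default_factory=list)


# Only these three tags are named in the text; the other 11 are not listed there.
MOOD_TO_TAG = {"calm": "warm", "flirty": "sultry", "difficult": "concerned"}

# Hypothetical keywords and threshold, purely for illustration.
DISTRESS_WORDS = ("funeral", "laid off", "panic")
MIN_STAGE_FOR_SULTRY = 2


def pick_emotion_tag(state: ConversationState) -> str:
    """Pick a tag from mood, relationship stage, and recent history."""
    recent = " ".join(state.recent_messages).lower()
    # Recent history can override the mood label.
    if any(word in recent for word in DISTRESS_WORDS):
        return "concerned"
    tag = MOOD_TO_TAG.get(state.mood, "warm")  # unknown moods fall back to warm
    # Early relationship stages tone a flirty mood down to warm.
    if tag == "sultry" and state.relationship_stage < MIN_STAGE_FOR_SULTRY:
        return "warm"
    return tag
```

the point of the sketch is the split: the reference clip fixes who is speaking, and a small function like `pick_emotion_tag` only decides how she sounds for this one message.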
starting point: the free tier gives 3 one-time trial voice notes. pick a companion, open chat, and ask for a voice note (she decides whether to send one based on relationship stage). if the voice reads right, upgrade to Closer for 15 voice notes a day or Bonded for voice calls.
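the tier rules above fit in a few lines of python. this is a rough sketch under stated assumptions, not the real entitlement code: the `Tier` enum and both function names are made up for illustration, and since the text does not give Bonded's voice-note cap, the sketch reuses Closer's 15 a day as a placeholder.

```python
from enum import Enum


class Tier(Enum):
    FREE = "free"
    CLOSER = "closer"
    BONDED = "bonded"


FREE_TRIAL_NOTES = 3        # one-time, per the text
CLOSER_NOTES_PER_DAY = 15   # per the text
# Assumption: Bonded's voice-note cap is not stated, so reuse Closer's as a placeholder.
BONDED_NOTES_PER_DAY = CLOSER_NOTES_PER_DAY


def can_send_voice_note(tier: Tier, lifetime_notes: int, notes_today: int) -> bool:
    """Quota check only; whether she actually sends is a separate decision."""
    if tier is Tier.FREE:
        return lifetime_notes < FREE_TRIAL_NOTES  # trial never resets
    if tier is Tier.CLOSER:
        return notes_today < CLOSER_NOTES_PER_DAY
    return notes_today < BONDED_NOTES_PER_DAY


def can_start_voice_call(tier: Tier) -> bool:
    """Real-time calls are Bonded-only."""
    return tier is Tier.BONDED
```

passing the quota check only means a note is allowed. the companion still decides whether to send one based on relationship stage.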