the quality score dips that matter

why we treat every drop in conversation quality as a signal, not noise, and how that helps us catch silent user churn before it's too late.

January 30, 2026

our quality-scorer keeps a constant watch over conversations. it evaluates recent interactions against a 10-point rubric and tags common failure modes: hallucination (making things up), fixation (getting stuck), ooc (out-of-character behavior), and generic (bland, forgettable replies). the scores oscillate: one window comes in at 8.5, and three hours later the next is at 5.5. the temptation is to write that oscillation off as noise, background static in a complex system. the discipline is to treat every dip as a real signal.
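
to make "treat every dip as a signal" concrete, here is a minimal sketch of flagging drops between consecutive scoring windows. it is not our production scorer: the `WindowScore` type, its field names, and the 3.0-point `min_drop` default are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class WindowScore:
    window_start: str  # hypothetical field: ISO timestamp of the scoring window
    score: float       # rubric score on the 10-point scale


def find_dips(windows, min_drop=3.0):
    """Return (previous, current) pairs where the score fell by at least min_drop.

    Assumes `windows` is already ordered by time.
    """
    dips = []
    for prev, cur in zip(windows, windows[1:]):
        if prev.score - cur.score >= min_drop:
            dips.append((prev, cur))
    return dips


# illustrative data only, mirroring the 8.5 -> 5.5 example above
windows = [
    WindowScore("2026-01-29T09:00", 8.5),
    WindowScore("2026-01-29T12:00", 5.5),
    WindowScore("2026-01-29T15:00", 7.0),
]
dips = find_dips(windows)
```

with this sample data, only the 8.5 to 5.5 transition is flagged; the later recovery to 7.0 is not a dip.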

what a three-point drop usually means

a drop like that doesn’t happen randomly. it usually means one of three things:

  • a specific user session went badly. maybe lucy misunderstood context, or gave a response that felt jarring or unhelpful.
  • a specific companion prompt started exploring an edge case that the underlying model fumbles, like navigating nuanced emotional tone or handling abrupt topic shifts.
  • memory retrieval surfaced something the current companion isn’t equipped to handle gracefully, leading to confusion or irrelevance.

for a companion product, the worst outcome isn’t a user who rage-quits and tells you why. it’s the silent bad session: the user leaves and never comes back, and nothing in the logs shows them leaving angry. they just… vanish. that’s churn you rarely see coming unless you’re watching for these dips.

quality-scorer plus review as early warning

so we pair the scorer with targeted session review. when scores drop, we look. not at every single session, but at the ones that dipped hardest. it’s not about finding blame; it’s about finding patterns. did three users hit the same edge case around the same time? did a new prompt rollout coincide with a dip? is there a memory type we’re consistently mishandling?
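
the review step described above could be sketched roughly like this: pick the lowest-scoring sessions, then count which failure tags and prompt versions they share. the `Session` fields, the `prompt_version` label, and both function names are hypothetical, chosen only to illustrate the idea.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Session:
    session_id: str
    score: float
    tags: list = field(default_factory=list)  # failure-mode tags from the scorer
    prompt_version: str = "unknown"           # hypothetical rollout label


def review_queue(sessions, limit=5):
    """Return the `limit` lowest-scoring sessions, worst first."""
    return sorted(sessions, key=lambda s: s.score)[:limit]


def shared_patterns(sessions, min_count=2):
    """Count tags and prompt versions that appear in at least min_count sessions."""
    counts = Counter()
    for s in sessions:
        for tag in set(s.tags):
            counts[("tag", tag)] += 1
        counts[("prompt", s.prompt_version)] += 1
    return {key: n for key, n in counts.items() if n >= min_count}


# illustrative data only
sessions = [
    Session("a", 3.5, ["fixation"], "v12"),
    Session("b", 4.0, ["fixation", "generic"], "v12"),
    Session("c", 8.0, [], "v11"),
    Session("d", 4.5, ["ooc"], "v12"),
]
queue = review_queue(sessions, limit=3)
patterns = shared_patterns(queue)
```

here the three worst sessions all ran on the same (made-up) prompt version and two of them share a fixation tag, which is the kind of overlap worth a closer look.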

this isn’t hypothetical. we’ve caught issues this way: a companion struggling with certain cultural references, a memory retrieval bug that made past conversations feel disjointed, even a subtle tone shift that made lucy feel less like herself. small fixes, sometimes tweaking prompts, sometimes improving context handling, can lift those scores back up. but only if you know where to look.

the operator-level takeaway

the number that matters isn’t the average score across a month. it’s ‘did anyone have a conversation scored 3.5 yesterday?’ because that specific user is probably churning. averages smooth over pain points; outliers tell you where the system is failing real people. if you ignore the dips, you’re ignoring the users who need you most, the ones on the verge of leaving.
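
a small sketch of why the average hides this, assuming scores are grouped per user per day. the `daily_summary` function, the data shape, and the 3.5 floor are assumptions for illustration, not a description of our dashboards.

```python
from statistics import mean


def daily_summary(scores_by_user, floor=3.5):
    """Report the overall average alongside users whose worst score hit the floor."""
    all_scores = [s for scores in scores_by_user.values() for s in scores]
    at_risk = sorted(
        user
        for user, scores in scores_by_user.items()
        if scores and min(scores) <= floor
    )
    average = round(mean(all_scores), 2) if all_scores else None
    return {"average": average, "at_risk_users": at_risk}


# illustrative data only
yesterday = {
    "user_1": [8.5, 9.0],
    "user_2": [8.0, 8.5, 9.0],
    "user_3": [3.5],
}
summary = daily_summary(yesterday)
```

in this sample the day averages 7.75, which looks healthy, while `user_3` had exactly the kind of 3.5 conversation the average smooths over.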

so we watch the dips. we learn from them. and we try, always, to make sure the next conversation is a little better.

if you're curious how lucy handles context and memory in your conversations, you can explore more at /companions or sign up at /signup.


thanks for reading. if this resonated, the product is downstairs.