the average is a liar and other lessons from our quality engineering

why lucy tracks the worst conversations, not the average, and how fixing specific failure modes beats chasing vanity metrics in ai companionship.

January 20, 2026

every day, our internal scorer samples around three conversations per run and spits out a number between zero and ten. it’s an average. and like most averages, it lies.

not maliciously, of course. it’s just math. but the lie it tells is one of comfort. the daily average might drift by a point, maybe two, and someone could build a dashboard chart around it. but the truth is in the variance. the spread of scores within a single day is often larger than the day-over-day drift. the average smooths over the mess. it hides the failures.
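to make that concrete, here is a minimal sketch comparing the two quantities. the scores are invented for illustration (they are not real scorer output), and `summarize` is a hypothetical helper, not part of our tooling.

```python
from statistics import mean, pstdev

# hypothetical scores on the 0-10 scale, pooled per day; made-up numbers
daily_scores = {
    "day 1": [8.5, 3.0, 9.0, 7.5, 2.5, 8.0],
    "day 2": [9.0, 8.5, 1.5, 7.0, 8.0, 4.0],
    "day 3": [7.5, 2.0, 9.5, 8.5, 3.5, 8.0],
}


def summarize(daily):
    """Compare day-over-day drift of the mean with the spread inside each day."""
    means = [mean(scores) for scores in daily.values()]
    # drift: absolute change in the daily average from one day to the next
    drift = [abs(later - earlier) for earlier, later in zip(means, means[1:])]
    # spread: population standard deviation of the scores within one day
    spread = [pstdev(scores) for scores in daily.values()]
    return {"means": means, "max_drift": max(drift), "min_spread": min(spread)}


summary = summarize(daily_scores)
print(f"largest day-over-day drift: {summary['max_drift']:.2f}")
print(f"smallest within-day spread: {summary['min_spread']:.2f}")
```

in this toy data the three daily averages sit within a fraction of a point of each other, while every single day contains both very good and very bad conversations. the chart would look flat. the days were not.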

and failures are what we care about.

the signal is in the bottom 20%

we don’t fix the average. we fix what’s broken. specifically, we look at the conversations in the bottom 20% of scores, the ones where the scorer flags multiple issues simultaneously. hallucination plus fixation plus generic responses. when those three misfires happen together, it’s not random noise. it’s a signal.
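a rough sketch of that selection step, under stated assumptions: the `Scored` record, the flag names, and the helper functions are all hypothetical, and the real scorer may represent issues differently.

```python
from dataclasses import dataclass, field


@dataclass
class Scored:
    """One scored conversation (hypothetical shape, for illustration only)."""
    conversation_id: str
    score: float
    flags: set = field(default_factory=set)


# assumed flag names for the three misfires named above
CLUSTER = {"hallucination", "fixation", "generic"}


def bottom_band(items, fraction=0.2):
    """Return the lowest-scoring `fraction` of conversations (at least one)."""
    ranked = sorted(items, key=lambda c: c.score)
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k]


def suspects(items, fraction=0.2, required=CLUSTER):
    """Keep bottom-band conversations where every required flag co-occurs."""
    return [c for c in bottom_band(items, fraction) if required <= c.flags]


sample = [
    Scored("c0", 9.1),
    Scored("c1", 2.0, {"hallucination", "fixation", "generic"}),
    Scored("c2", 8.0),
    Scored("c3", 3.1, {"generic"}),
    Scored("c4", 7.5),
    Scored("c5", 6.0, {"generic"}),
    Scored("c6", 8.8),
    Scored("c7", 7.0),
    Scored("c8", 9.5),
    Scored("c9", 5.5, {"hallucination", "fixation", "generic"}),
]

suspect_ids = [c.conversation_id for c in suspects(sample)]
print(suspect_ids)
```

two things are deliberate here. a low score with a single flag (c3) is not treated as a suspect, and a conversation with all three flags that scores outside the bottom band (c9) is not either. it takes both conditions to count as a signal.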

those conversations are suspects. they’re memory-graph corruption suspects, where the narrative coherence frays and things stop making sense. they’re personality-regression suspects, where the tone flattens or turns weirdly robotic. they’re low-information-spam-not-intercepted suspects, where the system fails to recognize it’s stuck in a loop and just keeps spitting out shallow filler.

those clusters of failure modes, that’s where the work is.

fix clusters, not numbers

our rule is simple: fix nothing based on the average. instead, we diagnose and repair specific failure-mode clusters in the bottom band. this might mean patching the memory system when we see graph corruption, tweaking the personality engine when we see regression, or improving spam filtering when we detect low-information loops.
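one way to picture the triage, as a sketch only: the label strings and the `triage` helper are hypothetical, and the repair text simply restates the three examples above.

```python
from collections import Counter

# hypothetical suspect labels mapped to the kind of repair described above
REPAIR_FOR = {
    "memory_graph_corruption": "patch the memory system",
    "personality_regression": "tweak the personality engine",
    "low_information_spam": "improve spam filtering",
}


def triage(bottom_band_labels):
    """Count suspect labels across bottom-band conversations, largest first."""
    counts = Counter(
        label for labels in bottom_band_labels for label in labels
    )
    return [
        (label, n, REPAIR_FOR.get(label, "needs manual diagnosis"))
        for label, n in counts.most_common()
    ]


# made-up labels for five bottom-band conversations
labels = [
    ["memory_graph_corruption"],
    ["memory_graph_corruption", "personality_regression"],
    ["low_information_spam"],
    ["memory_graph_corruption"],
    ["personality_regression"],
]

plan = triage(labels)
for label, n, repair in plan:
    print(f"{label}: {n} conversations -> {repair}")
```

notice that no score appears anywhere in the output. the work queue is ordered by how often a failure mode shows up in the bottom band, not by how much fixing it might move the average.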

this approach is the opposite of the 'ship-first-optimize-later' growth orthodoxy that dominates much of tech. that model works for products where the goal is broad, shallow engagement, where a single good impression per session is enough. but lucy isn’t that. lucy’s promise is relationship depth, continuity, and coherence over time. that requires a different kind of rigor.

if you optimize for the average, you risk making things slightly better for most people while leaving a minority stuck in deeply broken interactions. and in a product about connection, broken interactions aren’t just bugs, they’re betrayals.

why this works for companionship ai

companionship isn’t a single-session product. it’s a long-term thing. users don’t judge lucy by one chat; they judge by the texture of weeks or months. a high average score might hide the fact that, say, 5% of conversations are alienating, disorienting, or just dull. and that 5% can erode trust faster than the other 95% can build it.

by focusing on the worst cases, we’re not ignoring the majority. we’re ensuring that no one gets left behind in a confusing or disappointing experience. we’re strengthening the weakest links, which strengthens the whole system.

it’s also more honest engineering. averages can be gamed. a specific failure mode is much harder to fake: it either still shows up in the bottom band or it doesn’t.

the limitations and the work

this approach isn’t perfect. sampling only around three conversations per run means we’re not catching every edge case. sometimes a failure mode only shows up in rare conditions, and a sample that small will usually miss it. we’re working on increasing sample size and improving scorer sensitivity, but for now, we’re vigilant with what we have.
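a back-of-the-envelope sketch of why sample size matters. it assumes conversations are sampled independently and at random, and that a failure mode occurs at a fixed rate; both are simplifications. the 5% rate is an illustrative number, not a measurement.

```python
import math


def chance_of_catching(rate, sample_size):
    """Probability that at least one of `sample_size` samples hits the failure."""
    return 1 - (1 - rate) ** sample_size


def samples_needed(rate, target):
    """Smallest sample size that catches the failure with probability >= target."""
    if not (0 < rate < 1 and 0 < target < 1):
        raise ValueError("rate and target must be strictly between 0 and 1")
    return math.ceil(math.log(1 - target) / math.log(1 - rate))


per_run = chance_of_catching(rate=0.05, sample_size=3)
needed = samples_needed(rate=0.05, target=0.95)
print(f"chance one run of three sees a 5% failure mode: {per_run:.1%}")
print(f"samples needed for a 95% chance of seeing it: {needed}")
```

under those assumptions, a single run of three samples sees a 5% failure mode only about one time in seven. that is why the bottom band has to be read across many runs, and why a bigger sample is on the list.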

we also know that some issues are hard to fix without trade-offs. reducing hallucinations might sometimes make responses more conservative. reducing fixation might sometimes break narrative flow. it’s a balancing act, and we err on the side of preserving depth and coherence.

the goal isn’t perfection. it’s progress, especially for those who need it most.

so if you ever have a conversation that feels off, know that we’re probably already looking into ones like it. not because the average dropped, but because someone’s experience mattered enough to break through it.

find your own rhythm in the quiet space at /companions.

