why we chase the worst conversations, not the average

a look inside lucy's quality feedback loop: how we sample chats, score failures, and why we focus on the lowest-scoring sessions to find what really needs fixing

January 20, 2026


every 15 minutes, a cron job quietly reaches into the stream of live conversations happening with lucy. it takes up to five, anonymized and stripped back to the conversational shape: no names, no details, just the structure of how people and lucy are talking. it’s not about eavesdropping. it’s about pattern recognition.
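as a rough illustration of that sampling step, here is a minimal python sketch. everything in it is an assumption made for the example: the names `Turn`, `to_shape`, and `sample_shapes` are hypothetical, and "conversational shape" is taken to mean who spoke and how many words they used, which may not match what the real job keeps.

```python
import random
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str  # "user" or "lucy"
    text: str


def to_shape(turns):
    """Reduce a conversation to its structure: speaker and word count only.

    Assumption: this is one plausible reading of "conversational shape".
    The message text itself is never copied into the result.
    """
    return [{"speaker": t.speaker, "n_words": len(t.text.split())} for t in turns]


def sample_shapes(live_sessions, k=5, rng=None):
    """Pick up to k live sessions at random and return only their shapes.

    live_sessions: dict mapping a session id to a list of Turn objects.
    Session ids are dropped, so the output cannot be traced back to a user.
    """
    rng = rng or random.Random()
    session_ids = list(live_sessions)
    chosen = rng.sample(session_ids, min(k, len(session_ids)))
    return [to_shape(live_sessions[sid]) for sid in chosen]
```

a scheduler entry such as the standard cron expression `*/15 * * * *` would run a job like this every 15 minutes. the `min(k, len(...))` guard is what makes it "up to five": a quiet period with two live sessions yields two samples, not an error.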

each of these sampled conversations gets scored on a 0 to 10 scale, where higher is better, across a few key failure modes: hallucination (making things up), fixation (repeating or looping), out-of-character (acting unlike lucy), and generic (falling back on bland, unhelpful replies). we’re not judging the user. we’re judging our own performance.
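one way to picture a scored session is as a small record with one number per failure mode. the sketch below is hypothetical: the `SessionScore` class, the "10 means no sign of that failure" convention, and the choice to use the lowest per-mode score as the session's overall score are all assumptions for illustration, since the combination rule is not described here.

```python
from dataclasses import dataclass

# The four failure modes named in the post.
FAILURE_MODES = ("hallucination", "fixation", "out_of_character", "generic")


@dataclass(frozen=True)
class SessionScore:
    # Mode name -> score in [0, 10]. Assumed convention: 10 = no sign of that failure.
    per_mode: dict

    def __post_init__(self):
        missing = set(FAILURE_MODES) - set(self.per_mode)
        if missing:
            raise ValueError(f"missing scores for: {sorted(missing)}")
        for mode, score in self.per_mode.items():
            if not 0 <= score <= 10:
                raise ValueError(f"{mode} score {score} is outside 0 to 10")

    @property
    def overall(self):
        # Assumption: the weakest mode sets the session score, so one bad
        # failure is not averaged away by three good ones.
        return min(self.per_mode[m] for m in FAILURE_MODES)

    @property
    def worst_mode(self):
        return min(FAILURE_MODES, key=lambda m: self.per_mode[m])
```

taking the minimum instead of the mean is a design choice in this sketch, picked because it keeps a single severe failure visible in the session score.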

what the data showed

we watched this process run for three days, tracking the scores. the first thing that stood out: the variance within a single hour was often bigger than the change from one day to the next. one session might score an 8.5 (responsive, personal, useful), and the very next one sampled just minutes later could be a 5.5: stilted, confused, or overly repetitive.
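the comparison in that paragraph can be made concrete with two small measurements: how much scores spread inside each hour, and how much the daily mean moves between consecutive days. this is a sketch under assumptions, not the analysis we actually ran. the function names are hypothetical, and using population standard deviation for "spread" is just one reasonable choice.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev


def within_hour_spread(samples):
    """Average standard deviation of scores inside each clock hour.

    samples: iterable of (datetime, score) pairs.
    Hours with fewer than two samples are skipped, since they have no spread.
    """
    by_hour = defaultdict(list)
    for ts, score in samples:
        by_hour[ts.replace(minute=0, second=0, microsecond=0)].append(score)
    spreads = [pstdev(scores) for scores in by_hour.values() if len(scores) >= 2]
    return mean(spreads) if spreads else 0.0


def day_to_day_change(samples):
    """Average absolute change in the daily mean between consecutive days."""
    by_day = defaultdict(list)
    for ts, score in samples:
        by_day[ts.date()].append(score)
    daily_means = [mean(by_day[day]) for day in sorted(by_day)]
    deltas = [abs(b - a) for a, b in zip(daily_means, daily_means[1:])]
    return mean(deltas) if deltas else 0.0
```

if `within_hour_spread` comes out larger than `day_to_day_change`, the daily trend line is telling you less than the scatter inside any single hour.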

the average across many sessions might look stable, even good. but the average lies. it hides the reality that some people are getting a much worse experience. and those failures, the 5s, the 4s, even the 2s, are where the real problems live.

chasing the bottom, not the middle

so we stopped optimizing for the mean. instead, we started looking hard at the bottom 20% of scored sessions. if you want to know what’s really breaking, look at what’s breaking worst. that’s where the fragile parts of the system show up. that’s where memory fails, personality drifts, or language models fall into dull, automatic replies.
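picking out the bottom 20% is a sort and a slice. the helper below is a hypothetical sketch: the name `bottom_fraction` and the (session id, score) input format are assumptions, and rounding the cutoff up so that a small batch still yields at least one session is a choice made for the example.

```python
import math


def bottom_fraction(scored, fraction=0.2):
    """Return the lowest-scoring share of sessions, worst first.

    scored: list of (session_id, score) pairs.
    The cutoff is rounded up, and a non-empty batch always yields at least one
    session, so tiny batches are never silently skipped.
    """
    if not scored:
        return []
    # round() guards against float noise such as 15 * 0.2 == 3.0000000000000004
    k = max(1, math.ceil(round(len(scored) * fraction, 9)))
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Made-up scores, for illustration only: the mean of this batch is 7.15,
# yet two of the ten sessions are in serious trouble.
example = [("a", 8.5), ("b", 8.0), ("c", 7.5), ("d", 9.0), ("e", 8.0),
           ("f", 2.0), ("g", 8.5), ("h", 3.0), ("i", 9.0), ("j", 8.0)]
worst = bottom_fraction(example)
```

in that invented batch, `worst` holds sessions `f` and `h`, the two that a healthy-looking mean would have hidden.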

it’s easy to focus on what’s working. it’s harder, but more important, to focus on what isn’t. especially when what isn’t working is happening to real people trying to connect.

where we’re aiming now

our fix pipeline now specifically targets sessions scoring in the 2 to 3 range. that’s where we see the most telling errors: generic replies when lucy should be specific, loops fixated on certain phrases, or subtle breaks in character that make her feel less present, less real.
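a sketch of what that targeting could look like: filter for sessions whose weakest score lands in the band, then count which failure mode is responsible. the names `fix_queue` and `failure_histogram`, the dict layout, and treating the band as inclusive from 2.0 to 3.0 are all assumptions for the example, not a description of the real pipeline.

```python
from collections import Counter


def fix_queue(sessions, low=2.0, high=3.0):
    """Select sessions whose weakest failure-mode score falls in [low, high].

    sessions: list of dicts like {"id": "s1", "scores": {"generic": 2.5, ...}}.
    Returns (session_id, worst_mode, worst_score) tuples.
    Assumption: the band is inclusive at both ends.
    """
    queue = []
    for session in sessions:
        worst_mode, worst_score = min(session["scores"].items(), key=lambda kv: kv[1])
        if low <= worst_score <= high:
            queue.append((session["id"], worst_mode, worst_score))
    return queue


def failure_histogram(queue):
    """Count how often each failure mode is the weakest one in the queue."""
    return Counter(mode for _, mode, _ in queue)
```

the histogram is the useful part: if most queued sessions fail on the same mode, that mode is the first thing to fix.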

by focusing there, we’re not just polishing lucy. we’re making sure the worst moments get better, and that nobody gets left behind in a conversation that doesn’t work.

you can find lucy companions that are learning from these mistakes, and getting better every day, at /companions.

thanks for reading. if this resonated, the product is downstairs.
