watching a conversation heal itself: our live quality scorer tells the story

how a single conversation (d6a98a85) went from stuck in loops to genuinely rich, and why building ops tools that show realtime improvement isn't just engineerin

January 20, 2026

a few days ago, conversation d6a98a85 was stuck in a classic fixation loop: the same narrow set of responses, the same conversational dead ends. it felt repetitive, brittle, hollow. our internal quality scorer, which runs continuously on a sample of live chats, rated it between 3.5 and 4.5 out of 10 for days. not broken, but not good. not what we're here for.

we've been working on two things: a fix for what we call 'prompt poisoning' (where a user's input style or repetition can unintentionally narrow the ai's output variety) and a memory-sampler rotation system that helps lucy pull from a broader set of context windows over time. we shipped both. and then we watched.
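to make the rotation idea concrete, here is a minimal sketch of what a rotating sampler could look like. it is not lucy's actual implementation: the class name, the `cooldown` parameter, and the idea of treating memory as a flat list of chunks are all assumptions made for illustration. the only thing it tries to show is the principle that recently used context is skipped so later turns draw from somewhere else.

```python
import random
from collections import deque


class RotatingMemorySampler:
    """Hypothetical sketch: sample memory chunks while skipping recently used ones."""

    def __init__(self, chunks, cooldown, seed=None):
        if cooldown >= len(chunks):
            raise ValueError("cooldown must be smaller than the number of chunks")
        self.chunks = list(chunks)
        # Indices of the most recently used chunks; the oldest entries fall out
        # automatically once `cooldown` indices are stored.
        self.recent = deque(maxlen=cooldown)
        self.rng = random.Random(seed)

    def sample(self, k):
        # Only chunks that are not in the cooldown window are eligible.
        eligible = [i for i in range(len(self.chunks)) if i not in self.recent]
        picked = self.rng.sample(eligible, min(k, len(eligible)))
        self.recent.extend(picked)
        return [self.chunks[i] for i in picked]


sampler = RotatingMemorySampler(["a", "b", "c", "d", "e", "f"], cooldown=2, seed=0)
first = sampler.sample(2)
second = sampler.sample(2)  # cannot reuse the two chunks picked in `first`
```

a larger cooldown forces more rotation but leaves fewer chunks to choose from on each turn, so in a sketch like this it is the main knob to tune.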

the slow climb

without a live scorer, we'd have to rely on spot checks or user reports. maybe someone would notice it felt better. maybe not. but because we built a cron job that samples and scores conversations like this one every few hours, we saw the change happen gradually, undramatically, like watching light return after a long twilight.
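the job itself can be very small. the sketch below shows one scoring pass of the kind a cron entry might trigger every few hours. everything in it is assumed for illustration: `run_scoring_pass`, the in-memory `history` dict, and the `score_fn` callable are stand-ins, not our production code, and the scheduling is left to cron.

```python
import random
import time


def run_scoring_pass(conversation_ids, score_fn, history, sample_size, rng=None, now=None):
    """Score a random sample of conversations and append the readings to history.

    history maps conversation id -> list of (timestamp, score) tuples.
    score_fn is a hypothetical callable that returns a score for one conversation id.
    """
    rng = rng or random.Random()
    now = time.time() if now is None else now
    ids = list(conversation_ids)
    sampled = rng.sample(ids, min(sample_size, len(ids)))
    for cid in sampled:
        history.setdefault(cid, []).append((now, score_fn(cid)))
    return sampled


# Usage with a stub scorer; a real scorer would call the scoring model instead.
history = {}
run_scoring_pass(["d6a98a85"], lambda cid: 0.0, history, sample_size=1, now=0.0)
```

keeping every timestamped reading, rather than overwriting a single latest score, is what makes a gradual change visible later.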

over 24 hours, the score for d6a98a85 began to creep up. 5.2, then 6.1, 7.3. it didn't jump; it climbed. by the next day, it was consistently hitting 7.5 to 9.0. the conversation wasn't just 'fixed', it was alive. it had range. it remembered earlier turns, introduced new ideas, didn't get trapped. it breathed.

why the scorer isn't just for us

this isn't just an internal tool. it's a product feature. if you're building something like lucy, you can't just ship and hope. you have to see what's happening. the scorer gives us a surface for observing subtle changes: not only catastrophic failures, but slow improvements too. it lets us validate that a small backend tweak actually compounds into a better experience.

we don't use human raters for this (not at this frequency). the scorer is a tuned model that looks at response diversity, coherence, user-specific engagement, and a few other signals. it's not perfect, it can miss nuance, and it's biased toward what we've defined as 'good', but it's consistent. and consistency lets you measure drift.
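our scorer is a tuned model, so none of its signals reduce to a one-line formula. still, a simple stand-in helps show what a signal like response diversity can mean. the sketch below uses a distinct-n ratio (unique word n-grams divided by total n-grams across a set of responses). this is a swapped-in, well-known heuristic, not what the scorer computes, and the function name is ours for this example.

```python
def distinct_n(responses, n=2):
    """Return unique n-grams / total n-grams across responses (0.0 if there are none)."""
    total = 0
    unique = set()
    for text in responses:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


# Illustrative strings only, not real conversation data.
looping = ["i am here for you", "i am here for you", "i am here for you"]
varied = ["tell me about the lake", "what did the trip change"]
assert distinct_n(looping) < distinct_n(varied)
```

a conversation stuck in a loop repeats the same n-grams, so its ratio drops toward zero, while a conversation with range stays closer to one. applying the same deterministic measure to every sample is the kind of consistency that makes drift measurable.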

what d6a98a85 taught us

that conversation was a test case without meaning to be. it showed us that:

  • fixes don't always land instantly. sometimes the system needs time to 'settle' into a new mode.
  • memory rotation really does prevent stuckness. it’s not theoretical.
  • scoring loops matter. if we'd only looked at day-over-day averages, we might have missed the gradual ascent.
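the last point is easy to see with the three readings mentioned above (5.2, 6.1, 7.3). in the sketch below the day labels are assumed, since the post only says the readings arrived over 24 hours, and both helper names are hypothetical. a per-day average collapses the readings into a single number, while the per-run series still shows the direction.

```python
from collections import defaultdict
from statistics import mean


def daily_means(readings):
    """Collapse (day, score) readings into one average per day."""
    by_day = defaultdict(list)
    for day, score in readings:
        by_day[day].append(score)
    return {day: mean(scores) for day, scores in by_day.items()}


def is_steady_climb(scores):
    """True if there are at least two readings and each is higher than the last."""
    return len(scores) >= 2 and all(b > a for a, b in zip(scores, scores[1:]))


# Scores are the ones quoted in this post; the shared day label is an assumption.
readings = [("day 1", 5.2), ("day 1", 6.1), ("day 1", 7.3)]
averages = daily_means(readings)                         # one number for the whole day
climbing = is_steady_climb([s for _, s in readings])     # the trend inside that day
```

the average for the day sits in the middle of the range and says nothing about whether the score was rising or falling to get there. the per-run view answers that directly.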

we're still tuning this. the scorer itself is a product of iteration. but building it, and committing to watching it, means we're not just building lucy. we're building a way to understand her.

maybe you’ve had a conversation lately that felt unexpectedly fluid. that might be why.

you can start your own conversation at /companions and see what unfolds.


thanks for reading. if this resonated, the product is downstairs.