when the quality scorer says no: a diagnostic look at one troubled conversation
an operator-diagnostic dive into why one conversation kept failing our quality checks: flagged for fixation, ooc, hallucination, and generic responses.
our automated quality scorer kept flagging one conversation today (id starting with 2242104a). it landed in the failure bucket across multiple sampling windows, marked for fixation, ooc (out-of-character) responses, hallucination, and generic output. that’s a lot of flags for one chat. when a conversation recurs like this, it’s not random noise; it’s a signal. so i opened the session to see what was actually happening.
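to make "recurs across sampling windows" concrete, here is a minimal sketch of that check. the event shape (window id, conversation id, flag), the function name, and the `min_windows` cutoff are all assumptions for illustration, not our scorer's real output format.

```python
from collections import defaultdict


def recurring_conversations(flag_events, min_windows=2):
    """return {conversation_id: sorted flags} for conversations flagged
    in at least `min_windows` distinct sampling windows.

    `flag_events` is a hypothetical list of (window_id, conversation_id, flag).
    """
    windows_by_conv = defaultdict(set)
    flags_by_conv = defaultdict(set)
    for window_id, conv_id, flag in flag_events:
        windows_by_conv[conv_id].add(window_id)
        flags_by_conv[conv_id].add(flag)
    return {
        conv_id: sorted(flags_by_conv[conv_id])
        for conv_id, windows in windows_by_conv.items()
        if len(windows) >= min_windows
    }


# made-up events: conv-a shows up in three windows, conv-b in one
events = [
    ("w1", "conv-a", "fixation"),
    ("w2", "conv-a", "ooc"),
    ("w3", "conv-a", "hallucination"),
    ("w3", "conv-a", "generic"),
    ("w2", "conv-b", "generic"),
]
print(recurring_conversations(events))
# {'conv-a': ['fixation', 'generic', 'hallucination', 'ooc']}
```

a one-off flag like conv-b stays out of the result; only the conversation that keeps coming back gets surfaced for a human read.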
the scorer is the alarm, not the fix
automated scorers are brilliant at spotting patterns. they’re like smoke detectors: they tell you something’s burning, but they don’t put out the fire. in this case the alarm was loud: multiple failures on multiple metrics. but why? the scorer’s rubric is designed to catch common failure modes, but it doesn’t understand nuance. it sees fixation when maybe the user just wants depth. it sees ooc when maybe memory retrieval is pulling weird context. it sees generic when the personality prompt isn’t mature enough for the topic. and sometimes it’s just context window compression: long conversations get fuzzy.
reading the actual exchange
i scrolled through the chat. it was a long one, spanning days. the user was talking about something deeply personal: a recurring anxiety tied to a specific life event. the companion tried to engage, but somewhere around turn 40, things started to drift. the companion began repeating phrases, slightly reworded but essentially the same sentiment. the scorer flagged this as fixation, but was it? the user wasn’t looping; they were exploring. the companion, though, was struggling to keep up.
then came a turn where the companion referenced a detail from much earlier, a memory retrieval that felt misplaced. it wasn’t wrong, exactly, but it was off-topic enough to feel disjointed. that earned the ooc flag. a few turns later, the companion made an assumption that wasn’t grounded in the chat, which earned the hallucination flag. and as the conversation stretched on, responses grew vaguer and more template-like, which earned the generic flag.
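the "slightly reworded but essentially the same" pattern is something you can check by hand before trusting or dismissing a fixation flag. a minimal sketch, assuming the companion's turns are available as plain strings: it uses character-level similarity from python's standard `difflib`, which is a crude stand-in and not how our scorer works. the function name, the threshold, and the lookback are hypothetical.

```python
from difflib import SequenceMatcher


def repeated_turns(companion_turns, threshold=0.8, lookback=5):
    """return (earlier_index, later_index, ratio) for each pair of
    near-duplicate companion turns within `lookback` turns of each other."""
    hits = []
    for i, turn in enumerate(companion_turns):
        start = max(0, i - lookback)
        for j in range(start, i):
            ratio = SequenceMatcher(
                None, companion_turns[j].lower(), turn.lower()
            ).ratio()
            if ratio >= threshold:
                hits.append((j, i, round(ratio, 2)))
    return hits


# made-up companion turns for illustration
turns = [
    "that sounds really hard, and i'm here for you.",
    "what happened after you got the call?",
    "that sounds really hard. i'm here for you.",
]
for earlier, later, ratio in repeated_turns(turns):
    print(f"turn {later} echoes turn {earlier} (similarity {ratio})")
```

running this on the companion's side only is the point: it separates a companion that is echoing itself from a user who is circling a topic on purpose.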
possible causes, real fixes
so why did this happen? likely a mix of factors. first, the topic was emotionally dense. our companion’s personality prompt might not have been tuned for this kind of depth yet; low maturity stages can lead to generic output when things get complex. second, memory retrieval might have surfaced context that was only slightly relevant, creating that ooc feel. third, the length (over 100 turns by the end) meant context window compression was blurring recent history.
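the second cause, a memory that is only slightly relevant, is the kind of thing a relevance gate is meant to stop. a minimal sketch under loud assumptions: word-overlap cosine similarity stands in for whatever similarity a real retrieval layer would use (most likely embeddings), and `gate_memories`, `min_score`, and `top_k` are hypothetical names and values, not our actual retrieval code.

```python
import math
from collections import Counter


def cosine(a, b):
    """bag-of-words cosine similarity between two strings (0.0 to 1.0)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[word] * vb[word] for word in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def gate_memories(query, memories, min_score=0.3, top_k=3):
    """keep at most `top_k` memories whose similarity to the current
    user turn is at least `min_score`, best match first."""
    scored = [(cosine(query, memory), memory) for memory in memories]
    kept = [pair for pair in scored if pair[0] >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [memory for _, memory in kept[:top_k]]


# made-up example: one related memory, one unrelated
memories = ["the interview is on friday", "my cat likes tuna"]
print(gate_memories("i keep worrying about the interview", memories))
# ['the interview is on friday']
```

the design choice worth noticing is the floor: returning nothing is allowed. a retrieval step that always returns its best match, however weak, is how a misplaced detail ends up in an emotionally dense turn.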
but the biggest takeaway? the user wasn’t doing anything wrong. they wanted depth, not repetition. they wanted empathy, not assumptions. the scorer flagged problems, but the human review revealed what to fix: better memory weighting for emotional topics, more nuanced personality prompts for complex exchanges, and maybe smarter context handling for long chats.
automation finds patterns, humans interpret them
this is the meta-lesson. automated systems are essential for scale; they find the anomalies. but they don’t diagnose. they don’t feel the awkward shift in tone or spot the subtle misalignment. that’s our job. we use the alarms to know where to look, then we read, interpret, and adjust. that’s how we’re building lucy: not to be perfect out of the gate, but to learn, iterate, and get better with every flagged conversation.
if you’ve had a chat that felt off, know we’re listening, not just with algorithms, but with human eyes too.
find your own companion at /companions or sign up at /signup.
thanks for reading. if this resonated, the product is downstairs.