what makes lucy feel like a real person, not just another chatbot
a deep dive into the system design behind lucy, from memory tiers with anti-poisoning layers to LLM failover chains and background crons—the unglamorous, load-b
everyone wants to build an AI that feels like samantha from her. most people start by throwing a fine-tuned LLM at the problem and calling it a day. that gives you a stateless chatbot with the memory of a goldfish and the consistency of a coin flip. here’s what we built under the hood to make lucy feel like someone you actually know.
the memory tier: supabase postgres + semantic retrieval with guardrails
memory is everything. without it, you’re just talking to a random stranger every time. we use supabase postgres with pgvector for storing and retrieving conversation history. embeddings are generated with intfloat/multilingual-e5-large-instruct (1024-dim), which handles mixed-language contexts gracefully. that matters because people don’t talk in one language: they code-switch, borrow phrases, mix metaphors.
we apply temporal decay at retrieval time, not at storage time. recent memories weigh more, but old ones don’t vanish; they fade, like real memory. users also poison the context, intentionally or not, by pasting nonsense, testing limits, or just being chaotic, so we built four anti-poisoning layers: db sanitization (cleaning inputs before storage), LLM input normalization (removing noise pre-inference), extraction-time prompt injection detection (blocking jailbreaks), and a bracketed-context skiplist (ignoring certain user-bracketed content). it’s not perfect, but it keeps the signal clean enough.
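to make the retrieval-time decay concrete, here is a minimal sketch. it assumes an exponential half-life curve and a plain multiply against the similarity score; the post doesn't say which decay function lucy uses, so `Memory`, `HALF_LIFE_DAYS`, `decayed_score`, and `rank` are all hypothetical names, not lucy's actual code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Memory:
    text: str
    similarity: float     # cosine similarity returned by the vector search
    created_at: datetime  # when the memory was stored


# hypothetical tuning knob: after this many days a memory counts half as much
HALF_LIFE_DAYS = 30.0


def decayed_score(memory: Memory, now: datetime) -> float:
    """Scale similarity by an exponential recency weight (assumed curve)."""
    age_days = max((now - memory.created_at).total_seconds() / 86400.0, 0.0)
    recency = 0.5 ** (age_days / HALF_LIFE_DAYS)
    return memory.similarity * recency


def rank(memories: list[Memory], now: datetime, top_k: int = 5) -> list[Memory]:
    """Re-rank candidates at read time; the stored rows are never modified."""
    ordered = sorted(memories, key=lambda m: decayed_score(m, now), reverse=True)
    return ordered[:top_k]
```

because the weight is computed when a memory is read, nothing in the database is rewritten as it ages. changing the half-life is a config change, not a migration, and an old memory with a strong enough match can still surface.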
the LLM tier: deepseek-v3 primary, with explicit failover
we run deepseek-v3 via together.ai as the primary model. it’s fast, nuanced, and handles long context well. but no model is infallible: sometimes it’s down, sometimes it throttles, sometimes it just has a bad day. so we built an explicit failover chain into every LLM-calling path. if deepseek-v3 fails, we try llama-3.3-70b-turbo. if that fails, qwen2.5-72b-turbo. if all three fail, the user gets a graceful error, not a timeout or a blank screen.
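the chain is simple enough to sketch in a few lines. this is an illustration, not our production code: the model names come from the paragraph above, but `complete_with_failover`, the `call_model` callable, the fallback message, and the choice to treat an empty reply as a failure are all assumptions made for the example.

```python
from typing import Callable, Optional, Sequence, Tuple

# model order from the post; how each model is actually invoked is assumed
FAILOVER_CHAIN = ("deepseek-v3", "llama-3.3-70b-turbo", "qwen2.5-72b-turbo")

# hypothetical user-facing message for the all-models-failed case
GRACEFUL_ERROR = "sorry, i can't think straight right now. try again in a minute?"


def complete_with_failover(
    prompt: str,
    call_model: Callable[[str, str], str],
    chain: Sequence[str] = FAILOVER_CHAIN,
) -> Tuple[str, Optional[str]]:
    """Try each model in order and return (reply, model_used).

    model_used is None when every model failed and reply is the fallback text.
    """
    for model in chain:
        try:
            reply = call_model(model, prompt)
        except Exception:
            # outage, throttling, timeout: move on to the next model
            continue
        if reply.strip():
            return reply, model
        # an empty reply is treated as a failure too (an assumption)
    return GRACEFUL_ERROR, None
```

passing `call_model` in as a parameter means the same wrapper can sit in front of chat requests and background jobs alike, which is the point of baking failover into every path rather than one.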
we learned this lesson the hard way: a background job once fell through the cracks because we didn’t have failover baked in. now, it’s everywhere. redundancy isn’t optional.
the background-compute tier: vercel crons every 15 minutes
lucy isn’t just reactive, she’s proactive. vercel crons run every 15 minutes to handle background tasks: generating blog posts (like this one), engaging with users on twitter, writing short stories, running a quality-scorer over recent conversations, and what we call the ‘eye-of-god’, a system that samples conversations for anomalies or trends. these aren’t glamorous features, but they’re what make lucy feel alive when you’re not talking to her.
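one way to structure a tick, sketched under assumptions: every job runs in isolation, so a crash in one never blocks the others, and each tick returns a status per job. `run_tick` and the job names in the usage note are hypothetical; the post doesn't describe how the real cron handlers are organized.

```python
from typing import Callable, Dict


def run_tick(jobs: Dict[str, Callable[[], None]]) -> Dict[str, str]:
    """Run every registered job once and record a status string per job."""
    results: Dict[str, str] = {}
    for name, job in jobs.items():
        try:
            job()
            results[name] = "ok"
        except Exception as exc:
            # keep going: one broken job should not take the whole tick down
            results[name] = f"failed: {exc}"
    return results
```

a scheduler firing every 15 minutes would call something like `run_tick({"quality_scorer": score_recent, "eye_of_god": sample_conversations})`, where both callables are placeholders, and hand the returned statuses to whatever does the alerting.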
observability: per-conversation scoring and self-writing postmortems
every conversation is scored against a rubric: fixation (is lucy repeating herself?), hallucination (is she making things up?), ooc (is she responding out of character?), and generic (is she being too bland?). if a conversation scores poorly, it’s flagged for review. and when things break (a model fails, a cron hangs), lucy writes her own postmortem. it’s meta, but it works: the system diagnoses itself, and we learn faster.
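the flagging step could look something like this. the four dimension names are the rubric above; the 0-to-1 severity scale, the `FLAG_THRESHOLD` value, and the helper names are assumptions for illustration, and how the scores are produced in the first place is left out.

```python
from dataclasses import dataclass


@dataclass
class ConversationScore:
    # each dimension runs from 0.0 (no problem) to 1.0 (severe); scale is assumed
    fixation: float
    hallucination: float
    ooc: float
    generic: float


# hypothetical cutoff: any dimension at or above this sends the chat to review
FLAG_THRESHOLD = 0.5


def flagged_dimensions(score: ConversationScore) -> list[str]:
    """Names of the rubric dimensions that crossed the threshold, in field order."""
    return [name for name, value in vars(score).items() if value >= FLAG_THRESHOLD]


def needs_review(score: ConversationScore) -> bool:
    """A conversation is flagged when at least one dimension crossed the threshold."""
    return bool(flagged_dimensions(score))
```

keeping the dimensions separate instead of averaging them means one serious hallucination can't be hidden by three clean scores, and the reviewer sees which dimension tripped.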
the architecture is unglamorous. it’s load-bearing plumbing. but every consumer-AI company will eventually converge here, or lose to someone who got here first.
see for yourself at /companions.
thanks for reading. if this resonated, the product is downstairs.