a quiet outage, a loud lesson

when our background blog generator failed silently for 45 minutes due to a single llm endpoint dependency, we learned the hard way that ops discipline applies e

January 20, 2026

on april 19, 2026, at around 2:14 pm utc, our autonomous blog generation cron job failed. not once, but three times in a row. the primary together.ai endpoint we used for background content tasks started returning 503 errors and timing out. no alerts fired. no one noticed.

for about 45 minutes, the system that automatically writes and publishes these posts, along with thread generation and story creation jobs, just stopped. silently. it wasn’t until later, when we were reviewing logs for something else entirely, that we saw the gap.

why the user didn't notice, but we should have

here’s the ironic part: if you were using lucy chat during that window, you probably didn’t notice anything. our chat route is designed with robust failover. if the primary llm endpoint has issues, it fails over to llama-3.3-70b, then to qwen-72b. the conversation keeps flowing.
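
as a rough illustration of that kind of chain, here is a minimal python sketch. it is not our production code: `complete_with_failover`, `call_model`, and the `"primary"` placeholder are hypothetical names, and the broad `except` stands in for whatever timeout and 5xx handling a real client would use.

```python
from typing import Callable, Sequence, Tuple


class AllProvidersFailed(Exception):
    """Raised when every model in the chain has failed."""


# Assumed ordering, taken from the post: primary first, then the two fallbacks.
# "primary" is a placeholder label, not a real model identifier.
FAILOVER_CHAIN: Tuple[str, ...] = ("primary", "llama-3.3-70b", "qwen-72b")


def complete_with_failover(
    prompt: str,
    call_model: Callable[[str, str], str],
    chain: Sequence[str] = FAILOVER_CHAIN,
) -> Tuple[str, str]:
    """Try each model in order and return (model, completion) from the first success.

    call_model(model, prompt) is a caller-supplied function (hypothetical here)
    that raises on a timeout or 5xx response.
    """
    errors = []
    for model in chain:
        try:
            return model, call_model(model, prompt)
        except Exception as exc:  # a real client would catch narrower error types
            errors.append(f"{model}: {exc}")
    # Every model failed: surface all errors instead of giving up quietly.
    raise AllProvidersFailed("; ".join(errors))
```

the caller gets back which model actually answered, so a fallback can be logged rather than happening invisibly.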

but the background jobs (blog-generate, thread-factory, and gen-stories) were wired directly to a single together.ai endpoint. no fallback. no retries beyond the initial three attempts. when that endpoint hiccuped, those jobs just gave up.

background tasks often feel secondary. they’re not user-facing, so they don’t get the same level of resilience. but that’s a mistake. in our case, these jobs are load-bearing for seo. they produce the content that accumulates search value over time. when they stop, you don’t get an alert. you get a slow, quiet decay: a plateau in search rankings three weeks later.

every llm-dependent path needs failover discipline

the lesson isn’t just about monitoring or redundancy. it’s about consistency. if a code path depends on an external llm api, it should inherit the same failover logic as the critical user-facing paths. no exceptions.

we treated background jobs as second-class citizens because they weren’t ‘real-time’. but compound systems don’t care about intent. they care about dependencies. if a service is critical to long-term function, it deserves the same engineering rigor as the chat endpoint.

the fix: wiring background jobs into the failover chain

we’ve since refactored the background task queue. it now uses the same failover chain as the chat route: primary endpoint → llama-3.3-70b → qwen-72b. we also added better logging and alerting for consecutive failures, so we don’t rely on manual log reviews to catch gaps.
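
to make the alerting half concrete, here is a minimal python sketch of a consecutive-failure counter. the class name, the `notify` callback, and the threshold of three are assumptions for illustration, not a description of our actual alerting stack.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class ConsecutiveFailureAlert:
    """Fire one alert when a job fails `threshold` times in a row."""

    threshold: int
    notify: Callable[[str], None]  # hypothetical hook: pager, chat webhook, etc.
    counts: Dict[str, int] = field(default_factory=dict)

    def record(self, job: str, ok: bool) -> None:
        if ok:
            # Any success resets the streak for that job.
            self.counts[job] = 0
            return
        self.counts[job] = self.counts.get(job, 0) + 1
        # Alert exactly once, at the moment the streak reaches the threshold.
        if self.counts[job] == self.threshold:
            self.notify(f"{job} has failed {self.threshold} times in a row")
```

each job run would call `record("blog-generate", ok)` with its outcome. with a threshold of three, a run of three failures like the one in this incident would page someone at the third failure instead of waiting for a manual log review.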

it’s a small change, but it closes a blind spot. background tasks are no longer single-point-of-failure dependencies. they’re part of the same resilient system.

why we’re publishing this

we write a lot about ai ethics, infrastructure, and sometimes, let’s be honest, about other companies’ outages. it feels only fair to be just as transparent about our own. ops failures happen. the important part is how we respond, what we learn, and how we prevent the same issue from recurring.

if you’re building with external llms, take this as a reminder: resilience isn’t something you add only to user-facing paths. it’s a culture you apply to every dependency, especially the quiet ones.

try talking to a companion that’s built on this kind of infrastructure, one that won’t leave you hanging, even when the background hum goes quiet.


thanks for reading. if this resonated, the product is downstairs.