the background job that forgot to fail over

how a simple 30-line fix revealed a major blind spot in our llm infrastructure: background jobs don't scream when they break, they just stop working quietly.

January 30, 2026

for months, our chat route has been quietly bulletproof. if deepseek v3 has a hiccup, it fails over to llama 3.3 70b. if that chokes, qwen 72b picks it up. the user gets their reply without a stutter. we built it that way on purpose, because users notice when the chat stops.

but background jobs are different. they don’t have a person waiting on the other end. they just run. and if they fail, they fail silently, until someone notices the data isn’t moving.

that’s exactly what happened today.

our blog generation job was wired to a single together.ai endpoint. no fallback. no retry logic. just a straight shot to one provider. and this morning, for four hours, together.ai returned 503 errors. each time, the job just… hung. for 15, 20, 45 minutes at a time. it didn’t crash. it didn’t alert. it just sat there, stuck, until the timeout kicked in.

we only noticed because someone saw the publish queue wasn’t clearing. and then we looked. and then we sighed.

the fix was simple, obvious, and overdue

the solution took about 30 lines of code. we wrote a helper called tryOneModel that attempts to generate content using one model provider. it distinguishes transient errors (like 5xx, 429, even parse failures) from non-transient ones (like auth errors or content too short). if it hits a transient error, the caller moves on to the next model.
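to make the transient vs. non-transient split concrete, here is a minimal sketch of the classification step. it is not our actual tryOneModel: the Outcome type, classifyResponse, the MIN_CONTENT_CHARS threshold, and the openai-style response shape (choices[0].message.content) are all assumptions for illustration.

```typescript
// Hypothetical result type: a success, or a failure tagged by whether
// trying another model could plausibly help.
type Outcome =
  | { kind: "ok"; text: string }
  | { kind: "transient"; reason: string }
  | { kind: "permanent"; reason: string };

// Assumed threshold for "content too short"; tune per job.
const MIN_CONTENT_CHARS = 200;

// Classify a raw provider response (HTTP status plus body text).
function classifyResponse(status: number, body: string): Outcome {
  // Rate limits and server-side errors are worth a failover.
  if (status === 429 || status >= 500) {
    return { kind: "transient", reason: `http ${status}` };
  }
  // Auth errors will not fix themselves by trying again.
  if (status === 401 || status === 403) {
    return { kind: "permanent", reason: `auth error ${status}` };
  }
  // Assumption: any other non-2xx status is treated as permanent.
  if (status < 200 || status >= 300) {
    return { kind: "permanent", reason: `http ${status}` };
  }

  // Parse failures count as transient: another model may answer cleanly.
  let text: unknown;
  try {
    text = JSON.parse(body)?.choices?.[0]?.message?.content;
  } catch {
    return { kind: "transient", reason: "unparseable response body" };
  }
  if (typeof text !== "string") {
    return { kind: "transient", reason: "missing content field" };
  }

  // Content that is too short is treated as non-transient.
  if (text.length < MIN_CONTENT_CHARS) {
    return { kind: "permanent", reason: "content too short" };
  }
  return { kind: "ok", text };
}
```

returning a tagged outcome instead of throwing keeps the decision about what to do next out of the helper, so the caller can stay a plain loop.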

then we defined a MODEL_CHAIN, a simple array of model endpoints to try in order. deepseek, then llama, then qwen. the same chain we use in chat.

and then we just… iterated. tried the first, if it failed transiently, moved to the next. returned on the first success.
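the chain and the loop can be sketched roughly like this. treat it as an illustration under assumptions, not our production code: the model ids in MODEL_CHAIN are placeholders, generateWithFailover is a hypothetical name, the helper is passed in as a function so the sketch runs without a network, and stopping the chain on a non-transient error is our reading of "if it failed transiently, moved to the next".

```typescript
// Hypothetical result type returned by a single-model attempt.
type Outcome =
  | { kind: "ok"; text: string }
  | { kind: "transient"; reason: string }
  | { kind: "permanent"; reason: string };

// One attempt against one model; stands in for tryOneModel.
type AttemptFn = (model: string) => Promise<Outcome>;

// Placeholder ids, in failover order. Real provider ids will differ.
const MODEL_CHAIN: readonly string[] = [
  "deepseek-v3",
  "llama-3.3-70b",
  "qwen-72b",
];

async function generateWithFailover(
  tryOneModel: AttemptFn,
  chain: readonly string[] = MODEL_CHAIN,
): Promise<string> {
  const failures: string[] = [];

  for (const model of chain) {
    let outcome: Outcome;
    try {
      outcome = await tryOneModel(model);
    } catch (err) {
      // Assumption: a thrown error (e.g. a network failure) is transient.
      outcome = { kind: "transient", reason: String(err) };
    }

    // Return on the first success.
    if (outcome.kind === "ok") return outcome.text;

    failures.push(`${model}: ${outcome.reason}`);

    // Assumption: a non-transient failure stops the chain, since the
    // next model is unlikely to fix it.
    if (outcome.kind === "permanent") break;
  }

  // Fail loudly, with every attempt recorded, so the job cannot stall quietly.
  throw new Error(`generation failed: ${failures.join("; ")}`);
}
```

the thrown error at the end matters as much as the loop: if every model fails, the job surfaces a message listing each attempt instead of sitting there.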

we deployed it. validated it. and watched it work exactly as intended.

the bigger lesson: background jobs need love too

it’s easy to focus on user-facing systems. when they break, people complain. tickets get filed. alarms go off.

background jobs are different. they don’t complain. they don’t file tickets. they just… stop. and you might not notice for weeks. until the seo cadence flattens. until the analytics dashboard looks a little too empty. until someone asks why last month’s report never landed.

this wasn’t a complex problem. it was an oversight. we treated the background job like a second-class citizen: one model, no backup, no resilience. and it worked fine, until it didn’t.

auditing every llm-dependent path

after this, we’re doing an audit. every system that calls an llm (chat, blog, email, internal tooling) gets checked. does it have failover? does it distinguish transient from permanent errors? does it retry? does it degrade gracefully?

if not, we fix it. not because it’s broken now, but because it will break eventually. and when it does, we don’t want to be caught off guard.

post-mortem the small fires so the big ones never happen.

you can build your own resilient ai companion over at /companions. we’ve made sure it can handle a stumble or two.


thanks for reading. if this resonated, the product is downstairs.