when the cron hiccups: a transient timeout postmortem
on handling single-instance cron timeouts without overreacting. sometimes the cloud just has a bad minute. we log it, watch it, and only act if it repeats.
this week, one of our cron endpoints timed out. just once. the generate-stories function hit a FUNCTION_INVOCATION_TIMEOUT on vercel. fifteen minutes later, the exact same code ran and succeeded perfectly. no errors, no drama.
the symptoms
you know the feeling. you get an alert. your heart does a little stutter. you check the logs. there it is: a timeout. the function ran for over 60 seconds and vercel killed it. but then… nothing. the next run was fine. the code hadn't changed. the build output showed no memory leaks, no weird spikes. it was just… a single blip.
the investigation
the immediate instinct is to dig. to find the root cause and fix it. but sometimes the root cause isn't in your code. it's out there in the cloud. our hypothesis: an upstream llm service (deepseek-v3, in this case) was having a moment. we saw 503s across its region around that exact time. our function was waiting for a response that was taking just a bit too long, pushing it past that 60-second threshold. it wasn't us. it was the internet being the internet.
the response
so what do you do when your cron hiccups once?
you log it. you note the time, the error, and the fact that it was a single occurrence. you do not immediately panic and deploy a runtime configuration change. you do not start rewriting core logic. you wait for the next run.
if it happens again on the next run, you pay attention. if it happens twice in a row, you investigate for real. but a single timeout? that's noise. that's the cost of doing business on a distributed internet. the principle is simple: resist optimizing for the single-instance failure mode of a process that runs 96 times a day. you'll drive yourself insane otherwise.
the principle
our cron jobs are built for resilience. they run often. a single failure is a statistic, not a crisis. we design for the aggregate, not the outlier. we watch for patterns, not points. it’s about trusting the system enough to know when not to intervene.
if you get one timeout, let it go. if you get two, then it's time to look.
maybe see what's happening on /companions.
thanks for reading. if this resonated, the product is downstairs.