when our cron job timed out once and we didn't panic
a one-off timeout in our story-generation cron job taught us when to act and when to wait. sometimes the right response is to do nothing at all.
the blip
yesterday, at 3:15 pm, our cron job that generates story snippets timed out. the function, let's call it generate-stories, hit vercel's 60-second runtime limit and returned a FUNCTION_INVOCATION_TIMEOUT. the job runs every 15 minutes, and by 3:30, it was back to normal. no errors, no fuss.
this isn't the kind of thing we usually write about. it was a single blip. but it's exactly the kind of blip that can make you nervous if you're not sure what's happening under the hood. so we looked into it, and what we found was... nothing. at least, nothing wrong on our end.
the investigation (or lack thereof)
no code had changed. the logs showed no memory leaks or weird spikes. the function didn't even get close to the memory limit. it just... ran out of time.
that's when we checked our upstream dependencies. we use deepseek-v3 for some of our generation work, and around that exact window, their status page showed intermittent 503s in our region. it wasn't just us; it was a brief, widespread blip. our function was waiting on a response that took too long to come back, and vercel's runtime did what it's designed to do: it stopped the function to prevent a runaway process.
the principle: don't overreact to noise
this job runs every 15 minutes, which works out to 96 times a day. that's 96 opportunities for something to go wrong. if we panicked and increased the timeout every time one invocation failed, we'd be constantly tweaking the system for what is usually just background noise.
the reality is, in a distributed system with external dependencies, timeouts happen. they're not always a sign of a problem in your code. sometimes the internet just has a bad minute.
our policy is simple: if a cron job times out once, we log it and wait for the next run. if it times out twice in a row, then we start investigating more deeply. but a single timeout? that's almost always transient. it's not worth deploying a runtime change, rolling back code, or waking anyone up.
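the policy above can be sketched in a few lines. this is a minimal illustration, not our production code: the names RunOutcome, Action, and decideAction are hypothetical, and it assumes you keep an ordered history of recent run outcomes somewhere.

```typescript
// hypothetical types and names, for illustration only
type RunOutcome = "ok" | "timeout";
type Action = "none" | "log-and-wait" | "investigate";

// history is ordered oldest to newest; the last entry is the run that just finished
function decideAction(history: RunOutcome[]): Action {
  const latest = history[history.length - 1];

  // the latest run succeeded (or there is no history yet): nothing to do
  if (latest !== "timeout") return "none";

  // one timeout on its own is treated as transient; two in a row is a signal
  const previous = history[history.length - 2];
  return previous === "timeout" ? "investigate" : "log-and-wait";
}

console.log(decideAction(["ok", "ok", "timeout"])); // "log-and-wait"
console.log(decideAction(["ok", "timeout", "timeout"])); // "investigate"
```

the sketch only looks at the last two runs, so a timeout followed by a clean run resets the count. that matches the "twice in a row" rule rather than counting timeouts over a longer window.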
the trade-off
of course, this means that once in a while, a story might not generate on schedule. but the system is designed to handle that. the next run will pick up the work, and it's unlikely that anyone, you or your companion, will even notice. it's a trade-off we're comfortable making: a tiny, infrequent bit of delayed work in exchange for not over-engineering our response to flukes.
we could set the timeout higher, but that just masks the problem. if an upstream service is truly struggling, a longer timeout might let our function hang for minutes, consuming resources and potentially causing other issues. the 60-second limit is a good guardrail. it forces failures to be swift and contained.
the takeaway
not every failure requires a fix. sometimes the right response is to watch and wait. to trust that the system will recover on its own. to resist the urge to optimize for the one-in-a-hundred case.
if you're building something that runs frequently, ask yourself: what's the cost of a single missed execution? if it's low, maybe you don't need to build a complex retry system or panic at the first sign of trouble. sometimes, the most resilient thing you can do is nothing at all.
you can meet the companions who benefit from this resilient, low-drama approach over at /companions.
thanks for reading. if this resonated, the product is downstairs.