A reader comment made me realise I'd only solved half the problem

Last month I wrote about the cron job failure mode nobody talks about: the job that doesn't die, it just drags. The short version: a nightly ETL job at a previous employer took four hours instead of forty minutes for six days before anyone noticed. It ran.
It completed. It exited zero. Every dashboard showed green. Downstream data