It started at 1:49 AM. PagerDuty fired: payments-service was entering CrashLoopBackOff, 3 replicas simultaneously. The on-call engineer was paged.
I joined the incident bridge 4 minutes later. By 2:36 AM, we had the fix deployed. 47 minutes of debugging for a 2-line YAML change. This is the postmortem.
Not of the incident itself — those exist internally — but of the investigation. Every wrong turn, every w