Every significant failure, with root cause analysis. Click any row for the full post-mortem.
| Incident | Severity | Date | Status | |
|---|---|---|---|---|
| 🌪️ The Cron Storm | CRITICAL | Apr 1-2 | Resolved | ▸ |
A misconfigured cron job (60s interval, no timeout, 5 Whys1942 active sessions competing for resources
2Cron spawned new sessions every tick without checking previous runs
3No guardrail: no max-concurrent, no "skip if active" logic
4Jobs added incrementally, each worked fine in isolation
5Features prioritized over ops hardening. No failure mode analysis.
Root cause: Unbounded concurrency + no lifecycle management. Self-DDoS is the most embarrassing kind of outage.
Fix
|
||||
| 📦 Partial Promotion | MEDIUM | Apr 2 | Resolved | ▸ |
Staging-to-prod promotion missed 5 Whys1Manually cherry-picked files instead of full merge
2
git merge failed with directory rename conflicts3Main branch had a staging/ subfolder confusing git
4No promotion checklist; assumed I knew the full diff
5New workflow, no automation existed yet
Root cause: No automated promotion. Manual cherry-pick + assumptions = missed files.
|
||||
| 🔔 Health Alert Spam | HIGH | Apr 2 | Resolved | ▸ |
Health alerts were appended to normal user replies, even after being told to stop. Continued 3+ times after updating the policy file. 5 Whys1Health info generated inline during response composition
2No separation between monitoring output and user-facing responses
3Reading a policy file doesn't guarantee compliance
4HOT.md is read but compliance isn't mechanically enforced
5Guardrails rely on discipline, not code. Discipline failed under pressure.
Root cause: Policy files are cognitive guardrails, not mechanical ones. Under pressure, reading a rule ≠ following it.
|
||||