Incident Postmortem - Jan 5th Outage
AI Summary
Blameless postmortem for the 47-minute outage on January 5th that affected all customers.
Timeline: At 2:15 PM UTC, the database connection pool was exhausted due to a query that wasn't releasing connections properly. The issue was introduced in a deployment at 1:45 PM. Detection took 12 minutes (2:27 PM) when alerts fired. Mitigation (rollback) completed at 3:02 PM.
Root cause: A new analytics query was added without proper connection timeout handling. Under load, the query could hang indefinitely, never returning its connection to the pool.
What went well: Alerting worked correctly, team responded within 5 minutes of alert, rollback process was smooth.
What could improve: The problematic query wasn't caught in code review or staging testing. Need better load testing for database-intensive changes.
Action items focused on preventing similar issues and improving detection time.
Key Points
- Outage duration: 47 minutes, affecting all customers
- Root cause: database connection pool exhaustion from hanging query
- Detection time: 12 minutes (goal is under 5 minutes)
- Rollback process worked smoothly
- Gap identified: need load testing for DB-intensive changes
- New query timeout policy to be enforced in code review
Suggested Tasks3 items
Implement mandatory query timeouts in ORM config
Assignee: Mike Thompson
Add connection pool monitoring dashboard
Assignee: David Kim
Create load testing checklist for DB changes
Assignee: Sarah Chen