AI Summary

Blameless postmortem for the 47-minute outage on January 5th that affected all customers.

Timeline: At 2:15 PM UTC, the database connection pool was exhausted due to a query that wasn't releasing connections properly. The issue was introduced in a deployment at 1:45 PM. Detection took 12 minutes (2:27 PM) when alerts fired. Mitigation (rollback) completed at 3:02 PM.

Root cause: A new analytics query was added without proper connection timeout handling. Under load, the query could hang indefinitely, never returning its connection to the pool.

What went well: Alerting worked correctly, team responded within 5 minutes of alert, rollback process was smooth.

What could improve: The problematic query wasn't caught in code review or staging testing. Need better load testing for database-intensive changes.

Action items focused on preventing similar issues and improving detection time.

Key Points

Outage duration: 47 minutes, affecting all customers
Root cause: database connection pool exhaustion from hanging query
Detection time: 12 minutes (goal is under 5 minutes)
Rollback process worked smoothly
Gap identified: need load testing for DB-intensive changes
New query timeout policy to be enforced in code review

Incident Postmortem - Jan 5th Outage

AI Summary

Key Points

Suggested Tasks3 items

Command Palette

Incident Postmortem - Jan 5th Outage

AI Summary

Key Points

Suggested Tasks3 items