Command Palette

Search for a command to run...

Incident Postmortem - Jan 5th Outage

Sunday, January 4, 202660 minutes4 participants
David KimDavid Kim
Sarah ChenSarah Chen
Mike ThompsonMike Thompson
John MartinezJohn Martinez

AI Summary

Blameless postmortem for the 47-minute outage on January 5th that affected all customers.

Timeline: At 2:15 PM UTC, the database connection pool was exhausted due to a query that wasn't releasing connections properly. The issue was introduced in a deployment at 1:45 PM. Detection took 12 minutes (2:27 PM) when alerts fired. Mitigation (rollback) completed at 3:02 PM.

Root cause: A new analytics query was added without proper connection timeout handling. Under load, the query could hang indefinitely, never returning its connection to the pool.

What went well: Alerting worked correctly, team responded within 5 minutes of alert, rollback process was smooth.

What could improve: The problematic query wasn't caught in code review or staging testing. Need better load testing for database-intensive changes.

Action items focused on preventing similar issues and improving detection time.

Key Points

  • Outage duration: 47 minutes, affecting all customers
  • Root cause: database connection pool exhaustion from hanging query
  • Detection time: 12 minutes (goal is under 5 minutes)
  • Rollback process worked smoothly
  • Gap identified: need load testing for DB-intensive changes
  • New query timeout policy to be enforced in code review

Suggested Tasks3 items

Implement mandatory query timeouts in ORM config

Assignee: Mike Thompson

HIGH

Add connection pool monitoring dashboard

Assignee: David Kim

HIGH

Create load testing checklist for DB changes

Assignee: Sarah Chen

MEDIUM