What it is
The Database Outage Runbook is the ordered, no-panic sequence an on-call DBA follows when the database is down or degraded — detect and declare, triage the cause, stabilize and restore service, recover and verify integrity, and close out and learn. The whole point is to pre-decide these steps now so the incident is a checklist you execute, not a decision tree you improvise at 3am. It is structured around five phases, plus a table of common failure modes with their immediate actions and a 'decide these before the incident' Q&A.
The first phase is disciplined detection. It confirms the alert is real — checking the dashboard, running a trivial SELECT 1, testing from a second host to rule out a network or DNS issue — then classifies severity (total outage, partial, or degraded), declares the incident, assigns an Incident Commander, and starts a timestamped log. Triage then checks the obvious killers first — disk full, connection pool exhausted, CPU or memory pinned, primary unreachable — and looks at recent changes, because a deploy, migration, or config push in the last hour is the prime suspect.
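The confirm-then-classify step can be sketched in a few lines. This is a minimal illustration, not the runbook's own tooling: the `Probe` record, the `classify` function, and the 500 ms slowness threshold are all hypothetical, and each probe is assumed to be the result of running SELECT 1 from a different host.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    NONE = "no incident"
    DEGRADED = "degraded"
    PARTIAL = "partial outage"
    TOTAL = "total outage"


@dataclass
class Probe:
    host: str          # host the SELECT 1 check ran from
    ok: bool           # True if SELECT 1 returned
    latency_ms: float  # round-trip time; ignored when ok is False


def classify(probes: list[Probe], slow_ms: float = 500.0) -> Severity:
    """Classify severity from SELECT 1 probes (thresholds are assumptions)."""
    if len(probes) < 2:
        # A single vantage point cannot rule out a network or DNS problem.
        raise ValueError("probe from at least two hosts")
    failures = [p for p in probes if not p.ok]
    if len(failures) == len(probes):
        return Severity.TOTAL
    if failures:
        # Mixed results may also mean a network or DNS fault on the failing
        # host; treat as partial until that is ruled out.
        return Severity.PARTIAL
    if any(p.latency_ms > slow_ms for p in probes):
        return Severity.DEGRADED
    return Severity.NONE
```

Requiring at least two probes encodes the second-host check directly, so a single failing vantage point cannot trigger a declaration on its own.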
Stabilization is where the runbook prevents well-meaning mistakes. If disk is full, free space before anything else, and never delete data files. If connections are exhausted, kill idle-in-transaction sessions and runaway queries first, then raise the connection pooler's limit before touching max_connections. If the primary is dead, promote a healthy replica and confirm writes succeed. If a bad deploy caused it, roll back rather than patch forward under pressure. Recovery then handles data loss via point-in-time recovery, re-establishes replication, and verifies integrity, before a blameless postmortem turns the incident into durable fixes.
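For the exhausted-connections case, the selection of sessions to kill might look like the sketch below. It assumes PostgreSQL, where the pg_stat_activity view reports a state of 'active', 'idle', or 'idle in transaction' for each backend. The `Session` record, the function name, and both time limits are hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass
class Session:
    pid: int
    state: str               # value of pg_stat_activity.state
    seconds_in_state: float  # e.g. now() - state_change, computed by the caller


def sessions_to_terminate(sessions, idle_tx_limit_s=60.0, query_limit_s=300.0):
    """Return pids of idle-in-transaction and runaway sessions, in input order.

    Plain idle sessions are left alone; both limits are assumptions and
    should be tuned to the workload.
    """
    victims = []
    for s in sessions:
        if s.state == "idle in transaction" and s.seconds_in_state > idle_tx_limit_s:
            victims.append(s.pid)
        elif s.state == "active" and s.seconds_in_state > query_limit_s:
            victims.append(s.pid)
    return victims
```

On PostgreSQL, each returned pid could then be passed to pg_terminate_backend(pid). Keeping the selection separate from the kill makes it easy to review the list before acting.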
What it's used for
Teams use the outage runbook to make incident response fast, ordered, and mistake-resistant when the database is down and the pressure is highest. The concrete jobs it does:
- ✓ Confirming the incident is real and scoping it — checking dashboards, running SELECT 1, testing from a second host to rule out network/DNS, then classifying severity as total, partial, or degraded.
- ✓ Declaring and coordinating — opening the incident channel, assigning an Incident Commander, starting a timestamped log, and noting blast radius (which services and customers, read or write).
- ✓ Triaging cause systematically — checking the obvious killers (disk full, pool exhausted, CPU/memory pinned, primary unreachable) and recent changes first, since a deploy or migration in the last hour is the prime suspect.
- ✓ Stabilizing safely — freeing disk before anything else (never deleting data files), killing idle-in-transaction and runaway sessions before raising max_connections, and promoting a healthy replica if the primary is dead.
- ✓ Rolling back bad changes — reverting a deploy or migration rather than patching forward under pressure, then confirming recovery by exercising the real user path, not just SELECT 1.
- ✓ Recovering data integrity — restoring from backup plus WAL to a point in time before the incident (PITR), re-establishing replication, and reconciling row counts against a known-good reference.
- ✓ Closing out and learning — writing a blameless postmortem within 48 hours with a timeline and root cause, and filing concrete, owned action items so the same failure cannot recur silently.
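The row-count reconciliation mentioned in the recovery job above can be sketched as a simple comparison. This is a minimal illustration under stated assumptions: the function name is hypothetical, and both inputs are assumed to be per-table counts already collected from the known-good reference and from the restored database.

```python
def reconcile_row_counts(reference: dict[str, int], restored: dict[str, int]) -> dict[str, tuple]:
    """Return {table: (expected, actual)} for every table whose count differs.

    A table missing from one side is reported with None on that side.
    """
    mismatches = {}
    for table in sorted(set(reference) | set(restored)):
        expected = reference.get(table)
        actual = restored.get(table)
        if expected != actual:
            mismatches[table] = (expected, actual)
    return mismatches
```

An empty result means the counts agree. It does not prove that row contents match, so it is a first check rather than a full integrity verification.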
Who uses it
An outage is an all-hands moment, so the runbook is written for the on-call responders and the coordinators who run the incident, with every role knowing its part in advance.
Context & good to know
The defining principle of an outage runbook is that decisions made calmly in advance beat decisions made in panic at 3am. Under the pressure of a live outage, judgment degrades and the temptation to take shortcuts grows (delete a file to free disk, raise max_connections without addressing the leak, patch forward instead of rolling back), exactly when those shortcuts do the most damage. By sequencing the response into detect, triage, stabilize, recover, and learn, the runbook keeps the responder on rails when instinct would steer them wrong.
The 'check recent changes first' instinct is one of the runbook's highest-value habits. Many database outages are triggered by a change (a deploy, a migration, a config push, a schema change) made in the hour before the incident. Training responders to look there first, rather than starting from first principles, shortens time to root cause. The corresponding rule, roll back the change rather than patch forward under pressure, prevents a single failure from compounding into several.
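As a rough sketch of that habit, the function below filters a change log down to the prime suspects. The function name and the (timestamp, description) tuple format are hypothetical; the one-hour default window comes from the rule of thumb above.

```python
from datetime import datetime, timedelta


def recent_changes(changes, incident_start, window=timedelta(hours=1)):
    """Return changes made within `window` before the incident, newest first.

    `changes` is a list of (timestamp, description) tuples. Changes made
    after the incident started are excluded.
    """
    earliest = incident_start - window
    suspects = [c for c in changes if earliest <= c[0] <= incident_start]
    return sorted(suspects, key=lambda c: c[0], reverse=True)
```

Sorting newest first puts the change closest to the incident at the top of the list, which is usually the first candidate for rollback.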
Failover and point-in-time recovery are the runbook's most consequential and most dangerous moves, which is why it pushes teams to decide them before the incident. Failover under pressure with an undocumented process can cause split-brain and double outages, so the runbook insists the procedure be copy-pasteable and the authority pre-assigned. PITR only works if WAL archiving is healthy and the team has rehearsed the replay. A backup that has never been restored is unproven, and an outage is the worst time to discover that.
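A WAL archiving health check could be sketched as follows, assuming PostgreSQL, whose pg_stat_archiver view exposes last_archived_time and last_failed_time. The function name and the 15-minute staleness limit are hypothetical, and a quiet database may legitimately go longer between archived segments unless archive_timeout is set.

```python
from datetime import datetime, timedelta
from typing import Optional


def archiving_healthy(
    last_archived_time: Optional[datetime],
    last_failed_time: Optional[datetime],
    now: datetime,
    max_age: timedelta = timedelta(minutes=15),
) -> bool:
    """Judge WAL archiving health from pg_stat_archiver timestamps.

    Unhealthy if nothing was ever archived, if the most recent attempt
    failed, or if the last success is older than `max_age` (an assumption).
    """
    if last_archived_time is None:
        return False
    if last_failed_time is not None and last_failed_time > last_archived_time:
        return False
    return now - last_archived_time <= max_age
```

A check like this belongs in routine monitoring rather than in the incident itself, since the point is to learn that PITR is unavailable before it is needed.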
Spotsaas offers this runbook in its database-management resources because incident readiness is a real, often-decisive dimension of operating a database in production. Whether a team runs PostgreSQL, MySQL, MongoDB, or a managed engine like Amazon Aurora with automated failover, the human sequence — confirm, coordinate, stabilize safely, recover, and learn blamelessly — is what determines whether an outage is a contained 20-minute event or a cascading, multi-hour disaster. A written runbook is the difference.