Database Outage Runbook (2026)

Spotsaas · 2026

✓ Incident response sequence

✓ Common failure modes and immediate action

✓ Decide these before the incident

What Is a Database Outage Runbook?

The Database Outage Runbook is the ordered, no-panic sequence an on-call DBA follows when the database is down or degraded — detect and declare, triage the cause, stabilize and restore service, recover and verify integrity, and close out and learn. The whole point is to pre-decide these steps now so the incident is a checklist you execute, not a decision tree you improvise at 3am. It is structured around five phases, plus a table of common failure modes with their immediate actions and a 'decide these before the incident' Q&A.

The first phase is disciplined detection. It confirms the alert is real — checking the dashboard, running a trivial SELECT 1, testing from a second host to rule out a network or DNS issue — then classifies severity (total outage, partial, or degraded), declares the incident, assigns an Incident Commander, and starts a timestamped log. Triage then checks the obvious killers first — disk full, connection pool exhausted, CPU or memory pinned, primary unreachable — and looks at recent changes, because a deploy, migration, or config push in the last hour is the prime suspect.
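The confirm-then-classify step can be sketched in a few lines of Python. This is a minimal illustration, not part of the runbook itself: the `Probe` shape, the 500 ms slowness threshold, and the exact mapping from probe results to total, partial, or degraded are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Severity(Enum):
    NONE = "alert not confirmed"
    DEGRADED = "degraded"
    PARTIAL = "partial outage"
    TOTAL = "total outage"


@dataclass
class Probe:
    """Result of one SELECT 1 attempt from one host (hypothetical shape)."""
    host: str          # where the probe ran, e.g. an app server or a bastion
    ok: bool           # did SELECT 1 return a row?
    latency_ms: float  # round-trip time; ignored when ok is False


def classify(probes: List[Probe], slow_ms: float = 500.0) -> Severity:
    """Classify severity from probes taken on at least two different hosts."""
    hosts = {p.host for p in probes}
    if len(hosts) < 2:
        # One vantage point cannot rule out a network or DNS problem.
        raise ValueError("probe from a second host before classifying")
    failed = [p for p in probes if not p.ok]
    if len(failed) == len(probes):
        return Severity.TOTAL
    if failed:
        # Some hosts succeed: either a partial outage or a network/DNS
        # issue local to the failing hosts. Treated as PARTIAL here.
        return Severity.PARTIAL
    if any(p.latency_ms > slow_ms for p in probes):
        return Severity.DEGRADED
    return Severity.NONE
```

Requiring two distinct hosts before returning any answer mirrors the second-host check: a single failing probe says nothing about whether the database or the network path is at fault.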

Stabilization is where the runbook prevents well-meaning mistakes. If disk is full, free space before anything else, and never delete data files. If connections are exhausted, kill idle-in-transaction sessions and runaway queries, then add a connection pooler or raise its pool size before raising max_connections. If the primary is dead, promote a healthy replica and confirm writes succeed. If a bad deploy caused it, roll back rather than patch forward under pressure. Recovery then handles data loss via point-in-time recovery, re-establishes replication, and verifies integrity, before a blameless postmortem turns the incident into durable fixes.

What a Database Outage Runbook Is Used For

Teams use the outage runbook to make incident response fast, ordered, and mistake-resistant when the database is down and the pressure is highest. The concrete jobs it does:

✓ Confirming the incident is real and scoping it — checking dashboards, running SELECT 1, testing from a second host to rule out network/DNS, then classifying severity as total, partial, or degraded.
✓ Declaring and coordinating — opening the incident channel, assigning an Incident Commander, starting a timestamped log, and noting blast radius (which services and customers, read or write).
✓ Triaging cause systematically — checking the obvious killers (disk full, pool exhausted, CPU/memory pinned, primary unreachable) and recent changes first, since a deploy or migration in the last hour is the prime suspect.
✓ Stabilizing safely — freeing disk before anything else (never deleting data files), killing idle-in-transaction and runaway sessions before raising max_connections, and promoting a healthy replica if the primary is dead.
✓ Rolling back bad changes — reverting a deploy or migration rather than patching forward under pressure, then confirming recovery by exercising the real user path, not just SELECT 1.
✓ Recovering data integrity — restoring from backup plus WAL to a point in time before the incident (PITR), re-establishing replication, and reconciling row counts against a known-good reference.
✓ Closing out and learning — writing a blameless postmortem within 48 hours with a timeline and root cause, and filing concrete, owned action items so the same failure cannot recur silently.
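The row-count reconciliation mentioned in the recovery step is simple to automate. The sketch below assumes you already have two mappings of table name to row count, one from a known-good reference and one from the restored database; how those counts are collected is outside the example, and the function name is hypothetical.

```python
from typing import Dict, Optional, Tuple

# (expected, actual); None means the table is missing on that side.
Mismatch = Tuple[Optional[int], Optional[int]]


def reconcile_row_counts(
    reference: Dict[str, int], restored: Dict[str, int]
) -> Dict[str, Mismatch]:
    """Return every table whose row count differs between the two sides."""
    mismatches: Dict[str, Mismatch] = {}
    for table in sorted(set(reference) | set(restored)):
        expected = reference.get(table)
        actual = restored.get(table)
        if expected != actual:
            mismatches[table] = (expected, actual)
    return mismatches


reference = {"orders": 1200, "users": 340, "invoices": 75}
restored = {"orders": 1200, "users": 338}
print(reconcile_row_counts(reference, restored))
# {'invoices': (75, None), 'users': (340, 338)}
```

An empty result is necessary but not sufficient evidence of integrity: matching counts do not prove matching contents, so this check complements rather than replaces exercising the real user path.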

Who Uses a Database Outage Runbook

An outage is an all-hands moment, so the runbook is written for the on-call responders and the coordinators who run the incident, with every role knowing its part in advance.

On-call database administrators (DBAs): They execute the technical sequence (triage, stabilize, failover, and PITR), and the runbook turns those high-pressure actions into a copy-pasteable checklist rather than 3am improvisation.

Incident Commanders: They own coordination (declaring the incident, assigning roles, maintaining the timestamped log, and making the call on failover or rollback), which the runbook makes explicit so leadership is never ambiguous.

Site reliability engineers (SREs): They drive stabilization and verification, confirm recovery against the real user path, and watch for the 30-minute stability window before the incident is declared resolved.

Backend and platform engineers: They identify and roll back the deploy, migration, or config change that triggered the outage, since recent changes are the prime suspect the triage phase checks first.

Engineering leadership: They authorize failover where required and own the blameless postmortem and its action items, ensuring the lessons translate into guardrails rather than blame.

Database Outage Runbook: Context & Good to Know

The defining principle of an outage runbook is that decisions made calmly in advance beat decisions made in panic at 3am. Under the pressure of a live outage, judgment degrades and the temptation to take shortcuts — delete a file to free disk, raise max_connections without addressing the leak, patch forward instead of rolling back — is exactly when those shortcuts do the most damage. By sequencing the response into detect, triage, stabilize, recover, and learn, the runbook keeps the responder on rails when instinct would steer them wrong.

The 'check recent changes first' instinct is one of the runbook's highest-value habits. Many database outages are triggered by a change (a deploy, a migration, a config push, or a schema change) made in the hour before the incident. Training responders to look there first, rather than starting from first principles, shortens time to root cause. The corresponding rule, roll back the change rather than patch forward under pressure, prevents a single failure from compounding into several.
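Listing the changes from the hour before the incident is easy to script if deploys, migrations, and config pushes are recorded somewhere with timestamps. The sketch below assumes such a change log exists; the `Change` record and `prime_suspects` function are hypothetical names for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Change:
    kind: str             # e.g. "deploy", "migration", "config", "schema"
    description: str
    applied_at: datetime


def prime_suspects(
    changes: List[Change],
    incident_start: datetime,
    window: timedelta = timedelta(hours=1),
) -> List[Change]:
    """Return changes applied within `window` before the incident, newest first."""
    earliest = incident_start - window
    recent = [c for c in changes if earliest <= c.applied_at <= incident_start]
    return sorted(recent, key=lambda c: c.applied_at, reverse=True)
```

Sorting newest first puts the most recent change at the top of the list, which is usually the first rollback candidate to evaluate.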

Failover and point-in-time recovery are the runbook's most consequential and most dangerous moves, which is why it pushes teams to decide them before the incident. Failover under pressure with an undocumented process causes split-brain and double outages, so the runbook insists the procedure be copy-pasteable and the authority pre-assigned. PITR only works if WAL archiving is healthy and the team has rehearsed the replay — a backup never restored is unproven, and an outage is the worst time to discover it.

Spotsaas offers this runbook in its database-management resources because incident readiness is a real, often-decisive dimension of operating a database in production. Whether a team runs PostgreSQL, MySQL, MongoDB, or a managed engine like Amazon Aurora with automated failover, the human sequence — confirm, coordinate, stabilize safely, recover, and learn blamelessly — is what determines whether an outage is a contained 20-minute event or a cascading, multi-hour disaster. A written runbook is the difference.

✓ Independent · vendors can't pay to rank

Built on verified data, not vendor spin

Every Spotsaas resource draws on the SpotScore — a blend of verified review ratings, review volume, and feature depth across 89 database management software tools. Refreshed regularly; data as of June 2026.

FAQ

Questions, answered

What is a database outage runbook?

It is a pre-written, ordered sequence an on-call responder follows when the database is down or degraded: detect and declare, triage the cause, stabilize and restore service, recover and verify integrity, then close out and learn. Its purpose is to make the incident a checklist you execute under pressure rather than a set of decisions you improvise at 3am, when judgment is at its worst.

What should you check first during a database outage?

First confirm the alert is real — check the dashboard, run SELECT 1, and test from a second host to rule out a network or DNS problem. Then check the obvious killers (disk full, connection pool exhausted, CPU or memory pinned, primary unreachable) and any recent changes, because a deploy, migration, or config push in the last hour is the prime suspect for most outages.

What do you do when a database's disk is full?

Free space before doing anything else: delete or archive old server log files, or expand the volume. Never delete data files, which causes corruption and data loss. A full disk halts writes and can cascade into replication and backup failures, so reclaiming space is the immediate priority. Afterward, address why it filled and add capacity or a trend-based alert to prevent recurrence.
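A trend-based alert fires on how fast the disk is filling rather than on a fixed percentage. The sketch below is one simple way to do that, assuming you have periodic samples of used space; it fits a straight line through the first and last sample, and the function names and the 24-hour horizon are illustrative assumptions.

```python
from typing import List, Optional, Tuple


def hours_until_full(
    samples: List[Tuple[float, float]], capacity_gb: float
) -> Optional[float]:
    """Estimate hours until full from (hour, used_gb) samples in time order.

    Returns None when usage is flat or shrinking, since no fill time exists.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to see a trend")
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    growth_per_hour = (used1 - used0) / (t1 - t0)
    if growth_per_hour <= 0:
        return None
    return max(0.0, (capacity_gb - used1) / growth_per_hour)


def should_alert(
    samples: List[Tuple[float, float]],
    capacity_gb: float,
    horizon_hours: float = 24.0,
) -> bool:
    """Alert when the projected fill time falls inside the horizon."""
    remaining = hours_until_full(samples, capacity_gb)
    return remaining is not None and remaining <= horizon_hours


# 400 GB -> 450 GB over 10 hours on a 500 GB volume: 5 GB/hour, 50 GB left.
print(hours_until_full([(0, 400), (10, 450)], capacity_gb=500))  # 10.0
```

A two-point slope is deliberately crude; a production alert would more likely smooth over many samples, but the principle of alerting on projected time to full is the same.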

How do you handle connection pool exhaustion?

Kill idle-in-transaction sessions and any runaway queries holding connections, then add a connection pooler or raise its pool size before raising max_connections directly. Raising max_connections first treats the symptom and can worsen memory pressure; clearing the leaking or stuck sessions and fixing the pooler addresses the cause. Confirm recovery by exercising the real user path, not just SELECT 1.
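Choosing which sessions to kill is worth deciding before the incident. The sketch below assumes PostgreSQL-style session states (such as "idle in transaction", as reported in pg_stat_activity) already fetched into plain records; the `Session` shape and both time limits are assumptions for illustration. On PostgreSQL, the returned pids would then be passed to pg_terminate_backend.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Session:
    pid: int
    state: str               # e.g. "active", "idle", "idle in transaction"
    seconds_in_state: float  # how long the session has been in this state
    query_seconds: float     # runtime of the current query; 0 when not active


def sessions_to_terminate(
    sessions: List[Session],
    idle_tx_limit_s: float = 300.0,
    query_limit_s: float = 600.0,
) -> List[int]:
    """Return pids of stuck idle-in-transaction sessions and runaway queries."""
    victims = []
    for s in sessions:
        stuck = s.state == "idle in transaction" and s.seconds_in_state > idle_tx_limit_s
        runaway = s.state == "active" and s.query_seconds > query_limit_s
        if stuck or runaway:
            victims.append(s.pid)
    return victims
```

Plain idle sessions are left alone on purpose: they hold a connection but no open transaction, so they are the pooler's problem to solve rather than something to kill mid-incident.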

When should you fail over to a replica?

Fail over when the primary is dead or unrecoverable in the time your RTO allows — promote a healthy replica, repoint the application, and confirm writes succeed. Because failover under pressure can cause split-brain, the runbook insists you decide in advance who can authorize it and document the procedure as a copy-pasteable sequence, so the promotion is executed cleanly rather than improvised.
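The failover criteria above can be written down as an explicit check so the decision is not argued from scratch mid-incident. This is a sketch under stated assumptions: the function name, its inputs, and the rule of comparing downtime plus estimated repair time against the RTO are one possible reading of "unrecoverable in the time your RTO allows", not a prescribed formula.

```python
from typing import Optional


def should_fail_over(
    minutes_down: float,
    estimated_repair_minutes: Optional[float],  # None = primary unrecoverable
    rto_minutes: float,
    replica_healthy: bool,
    authorized: bool,
) -> bool:
    """Decide whether to promote a replica now."""
    # Never promote an unhealthy replica or act without pre-assigned authority.
    if not (replica_healthy and authorized):
        return False
    if estimated_repair_minutes is None:
        return True
    # Fail over when repairing the primary would blow through the RTO.
    return minutes_down + estimated_repair_minutes > rto_minutes


print(should_fail_over(10, 25, 30, replica_healthy=True, authorized=True))  # True
print(should_fail_over(10, 15, 30, replica_healthy=True, authorized=True))  # False
```

Making replica health and authorization hard preconditions reflects the split-brain risk: a promotion that is fast but unsanctioned or onto a lagging replica is the kind of double outage the runbook is meant to prevent.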

What is point-in-time recovery in an outage context?

If data was lost or corrupted, point-in-time recovery restores from a base backup plus the transaction log (WAL/binlog/oplog) to a moment just before the incident, rather than to the latest backup. It only works if log archiving is healthy and the team has rehearsed the replay, which is why the runbook treats an untested backup as unproven and asks when you last tested a restore.
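Planning a point-in-time recovery comes down to two choices: which base backup to start from and what moment to replay the log to. The sketch below assumes a list of base backup completion times and a known incident start; the `plan_pitr` name and the one-minute safety margin are illustrative assumptions, and the actual restore and log replay are engine-specific and not shown.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def plan_pitr(
    backup_times: List[datetime],
    incident_start: datetime,
    safety_margin: timedelta = timedelta(minutes=1),
) -> Tuple[datetime, datetime]:
    """Return (base_backup_time, recovery_target_time).

    The target sits just before the incident; the base backup is the most
    recent one that finished at or before the target, so the log replay
    covers the gap between the two.
    """
    target = incident_start - safety_margin
    usable = [b for b in backup_times if b <= target]
    if not usable:
        raise ValueError("no base backup predates the recovery target")
    return max(usable), target
```

Note that the latest backup overall is not necessarily the right one: a backup taken after the incident began already contains the damage, which is why the function filters on the target time first.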

Why roll back instead of patching forward during an incident?

Under outage pressure, patching forward means writing new, untested code or changes on top of a broken system, which frequently compounds the failure. Rolling back to a known-good state — reverting the deploy, migration, or config that triggered the incident — is more predictable and faster to verify. The runbook's rule is to prefer reverting the change over fixing forward when the system is down.

What is an example of database software with built-in failover?

Managed services like Amazon Aurora and MongoDB Atlas provide automated failover to a standby; self-managed PostgreSQL uses tools like Patroni or repmgr; MySQL uses replication with orchestrators; and Oracle Database offers Data Guard. Even with automation, the runbook's human steps — confirming recovery, re-establishing replication, verifying integrity — still apply, because automation handles the promotion but not the full incident.

What goes in a blameless postmortem?

Write it within 48 hours with a timestamped timeline, the root cause, what worked and what did not, and concrete action items with named owners — the alert that should have fired earlier, the guardrail that would have prevented it. Blameless means it focuses on systemic fixes rather than individual fault, so people report honestly and the same failure cannot recur silently.
