What it is
The Major-Incident Runbook is a pre-defined playbook for the worst moments in support: production is down, a security event is unfolding, or a customer-facing outage is spreading. The middle of such an incident is the worst possible time to invent a process. The runbook defines major-incident roles, severity criteria for deciding whether to declare, the declare-to-resolve flow, the communication cadence to customers and stakeholders, and the post-incident review that turns an outage into a permanent fix. Aligned to ITIL major-incident management and modern incident-command practice, it's designed to be filled in with your contacts and channels and then rehearsed before you need it.
The template is a PDF structured around the phases of a major incident. A severity classification table defines SEV1 through lower tiers with criteria, examples, response actions, and update cadence, so the declare-or-not decision is a rule, not a debate. An ICS-style roles table assigns who owns what, with a named backup for each role; the Incident Commander coordinates and decides but delegates hands-on troubleshooting. A five-step declare-to-resolve flow runs from detect and declare through assemble, mitigate, recover, and review. A communication cadence table specifies who tells which audience, through which channel, by when. And a post-incident review checklist captures the timeline, impact, root cause, and follow-ups.
It exists because major incidents are high-stakes, high-pressure, and rare enough that no one has the process memorized. Without a runbook, teams improvise roles, communicate inconsistently, and lose precious time deciding who's in charge while customers and revenue bleed. By pre-deciding the roles, the severity thresholds, the comms cadence, and the recovery flow, the runbook lets a team respond quickly and in a coordinated way under pressure, and its blameless post-incident review ensures the system comes out of each outage more resilient instead of merely having survived it.
What it's used for
Teams use a major-incident runbook to respond to critical, customer-facing incidents in a fast, coordinated, role-clear way — and to learn from each one. It's applied to:
- ✓ Classifying severity to decide whether to declare a major incident — SEV1 for a full outage or data/security breach with broad impact and no workaround, lower severities for narrower or worked-around issues — so the decision is rule-based, not improvised.
- ✓ Assigning ICS-style roles before the incident: an Incident Commander who decides and coordinates (but delegates hands-on troubleshooting), a Comms Lead, technical responders, and named backups for each.
- ✓ Running the declare-to-resolve flow (detect and declare; assemble and stabilize; mitigate, communicate, and escalate; recover and confirm; review and learn) so the response has a known sequence.
- ✓ Driving a disciplined communication cadence — first external update within 15 minutes of a SEV1 via status page and in-app banner, then every 30 minutes until mitigated — so customers aren't left in silence.
- ✓ Coordinating internal and external audiences in parallel, with a clear owner for each channel, so engineering, leadership, and customers all get the right information at the right interval.
- ✓ Paging the right on-call responders and opening a war room or incident channel immediately on a SEV1, so the response assembles fast instead of waiting on someone to figure out who to call.
- ✓ Running a blameless post-incident review — reconstructing the timeline with timestamps, quantifying impact, identifying root cause and contributing factors, and assigning follow-up actions — so the outage produces a permanent fix.
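The five-step declare-to-resolve flow listed above can be pictured as an ordered sequence of phases. The following is a minimal sketch, not part of the template: the names `Phase`, `next_phase`, and `advance` are hypothetical, and the rule that phases may not be skipped is an assumption made for illustration.

```python
from enum import Enum
from typing import Optional


class Phase(Enum):
    """The five steps of the declare-to-resolve flow, in order."""
    DETECT_AND_DECLARE = 1
    ASSEMBLE_AND_STABILIZE = 2
    MITIGATE_COMMUNICATE_ESCALATE = 3
    RECOVER_AND_CONFIRM = 4
    REVIEW_AND_LEARN = 5


def next_phase(current: Phase) -> Optional[Phase]:
    """Return the phase that follows `current`, or None after the last one."""
    if current is Phase.REVIEW_AND_LEARN:
        return None
    return Phase(current.value + 1)


def advance(current: Phase, target: Phase) -> Phase:
    """Move to `target` only if it is the immediate next step.

    Assumption: the flow is strictly sequential. A real response may
    loop back (for example, from recover to mitigate) if a fix fails.
    """
    if next_phase(current) is not target:
        raise ValueError(f"cannot jump from {current.name} to {target.name}")
    return target
```

Encoding the order explicitly is one way to make "what comes next" a lookup rather than a judgment call during the incident.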
Who uses it
A major incident pulls in responders, communicators, and decision-makers across support and engineering, and the runbook gives each a pre-assigned role so no one is improvising under pressure.
Context & good to know
A runbook (or playbook) is the defining artifact of mature incident response, and the major-incident version is the highest-stakes of all. It's grounded in ITIL major-incident management and the incident-command system (ICS) used in emergency response, which is where the Incident Commander role comes from. The single most important idea the runbook encodes is role separation: the IC coordinates and decides but does not put their hands on the keyboard. When the most senior person dives into debugging, no one is left holding the overall picture — communication stalls, decisions queue up, and the incident drags. Separating coordination from troubleshooting is what keeps a high-pressure response fast and clear.
Severity classification is what makes the declare-or-not decision a rule rather than a debate held while the clock runs. A SEV1 (full outage or data/security breach, no workaround, broad customer impact) triggers an immediate declaration: on-call and the IC are paged, a war room is opened, and customers get a first update within 15 minutes and further updates every 30 minutes until mitigated. Lower severities get proportionate responses. Pre-defining these thresholds means a responder at 2 a.m. doesn't have to argue about whether something 'counts'; they check the criteria and act. This connects directly to the priority matrix and SLA policy: a SEV1 incident is the operational response to what those frameworks would classify as the most critical, broadest-impact tickets.
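To show what "a rule rather than a debate" might look like, here is a minimal sketch of the SEV1 criteria as a boolean check. The field names, `is_sev1`, and `SEV1_ACTIONS` are hypothetical. The template's criteria for the lower tiers are not reproduced here, so the sketch only answers whether an incident is a SEV1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncidentSignal:
    """Facts a responder checks before declaring (hypothetical field names)."""
    full_outage: bool
    data_or_security_breach: bool
    workaround_available: bool
    broad_customer_impact: bool


# Immediate actions on a SEV1, as described in the text above.
SEV1_ACTIONS = (
    "declare major incident",
    "page on-call and Incident Commander",
    "open war room or incident channel",
)


def is_sev1(signal: IncidentSignal) -> bool:
    """SEV1: full outage or data/security breach, no workaround, broad impact."""
    critical_event = signal.full_outage or signal.data_or_security_breach
    return (
        critical_event
        and not signal.workaround_available
        and signal.broad_customer_impact
    )
```

A full outage with broad impact and no workaround returns `True`; the same outage with a workaround available returns `False` and would fall to a lower tier under whatever criteria your own table defines.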
Communication cadence is where incidents are won or lost in the eyes of customers. The runbook specifies that external customers get a first update within 15 minutes of a SEV1 via status page and in-app banner, then every 30 minutes until mitigated, with a named Comms Lead owning it. Silence during an outage is what turns a technical problem into a trust problem and floods the support queue with anxious contacts. A disciplined cadence, even when the update is 'we're still investigating', reassures customers that the team is on it and tends to reduce inbound volume, which lets agents focus on the customers who are actually affected rather than on anxious status inquiries.
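The SEV1 cadence above (first update within 15 minutes, then every 30 minutes until mitigated) can be turned into a list of deadlines for the Comms Lead. This is a minimal sketch under two assumptions: the 30-minute interval is counted from the previous deadline rather than from when the previous update was actually posted, and the clock starts at declaration. The function and constant names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List

FIRST_UPDATE_WITHIN = timedelta(minutes=15)  # first external update
UPDATE_INTERVAL = timedelta(minutes=30)      # every update after that


def update_deadlines(declared_at: datetime, mitigated_at: datetime) -> List[datetime]:
    """Return the external-update deadlines that fall before mitigation.

    Assumption: intervals are measured deadline-to-deadline, starting
    15 minutes after the incident is declared.
    """
    deadlines = []
    due = declared_at + FIRST_UPDATE_WITHIN
    while due < mitigated_at:
        deadlines.append(due)
        due += UPDATE_INTERVAL
    return deadlines
```

For an incident declared at 02:00 and mitigated at 03:20, this yields deadlines at 02:15, 02:45, and 03:15. An incident mitigated in under 15 minutes yields no deadlines, although a closing "resolved" update is still good practice.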
The post-incident review is what separates teams that survive incidents from teams that get better because of them. The runbook's review checklist is explicitly blameless — focused on systems, not people — and reconstructs the timeline with timestamps, quantifies customer and business impact, identifies root cause and contributing factors, and captures what worked and what slowed the response, ending in assigned follow-up actions. Blamelessness matters because a culture of blame makes responders hide information, which is exactly what you can't afford during the next incident. Rehearsing the runbook before it's needed — a tabletop exercise or drill — is the final piece: a runbook no one has practiced is just a document, and the goal is muscle memory that holds up when production is actually down.