Major-Incident Runbook (2026)

A peek inside

See exactly what you're getting

Free PDF

Spotsaas · 2026

Major-Incident Runbook

✓ Severity classification (declare or not)
✓ Major-incident roles (ICS-style)
✓ Declare-to-resolve flow
✓ Communication cadence & channels

What Is a Major-Incident Runbook?

The Major-Incident Runbook is a pre-defined playbook for the worst moments in support: production is down, a security event is unfolding, or a customer-facing outage is spreading. The middle of an incident is the worst possible time to invent a process. The runbook defines major-incident roles, severity criteria for deciding whether to declare, the declare-to-resolve flow, the communication cadence to customers and stakeholders, and the post-incident review that turns an outage into a permanent fix. Aligned to ITIL major-incident management and modern incident-command practice, it's designed to be filled in with your contacts and channels and then rehearsed before you need it.

The template is a PDF structured around the phases of a major incident. A severity classification table defines SEV1 through lower tiers with criteria, examples, response actions, and update cadence, so the declare-or-not decision is a rule, not a debate. An ICS-style roles table assigns who owns what and names a backup for each role; the Incident Commander coordinates and decides but delegates hands-on troubleshooting. A five-step declare-to-resolve flow runs from detect and declare through assemble, mitigate, recover, and review. A communication cadence table specifies who tells which audience, through which channel, by when. A post-incident review checklist captures the timeline, impact, root cause, and follow-ups.

It exists because major incidents are high-stakes, high-pressure, and rare enough that no one has the process memorized. Without a runbook, teams improvise roles, communicate inconsistently, and lose precious time deciding who's in charge while customers and revenue bleed. By pre-deciding the roles, the severity thresholds, the comms cadence, and the recovery flow, the runbook lets a team respond quickly and in a coordinated way under pressure, and its blameless post-incident review ensures the team does not merely survive each outage but makes the system more resilient because of it.

What a Major-Incident Runbook Is Used For

Teams use a major-incident runbook to respond to critical, customer-facing incidents in a fast, coordinated, role-clear way — and to learn from each one. It's applied to:

✓ Classifying severity to decide whether to declare a major incident — SEV1 for a full outage or data/security breach with broad impact and no workaround, lower severities for narrower or worked-around issues — so the decision is rule-based, not improvised.
✓ Assigning ICS-style roles before the incident: an Incident Commander who decides and coordinates (but delegates hands-on troubleshooting), a Comms Lead, technical responders, and named backups for each.
✓ Running the declare-to-resolve flow (detect and declare; assemble and stabilize; mitigate, communicate, and escalate; recover and confirm; review and learn) so the response has a known sequence.
✓ Driving a disciplined communication cadence — first external update within 15 minutes of a SEV1 via status page and in-app banner, then every 30 minutes until mitigated — so customers aren't left in silence.
✓ Coordinating internal and external audiences in parallel, with a clear owner for each channel, so engineering, leadership, and customers all get the right information at the right interval.
✓ Paging the right on-call responders and opening a war room or incident channel immediately on a SEV1, so the response assembles fast instead of waiting on someone to figure out who to call.
✓ Running a blameless post-incident review — reconstructing the timeline with timestamps, quantifying impact, identifying root cause and contributing factors, and assigning follow-up actions — so the outage produces a permanent fix.

Who Uses a Major-Incident Runbook

A major incident pulls in responders, communicators, and decision-makers across support and engineering, and the runbook gives each a pre-assigned role so no one is improvising under pressure.

Incident Commander (IC): They own the incident overall (decisions, coordination, declaring and resolving) and explicitly do not do hands-on troubleshooting, delegating it so they keep the whole picture in view.

Support managers and leads: They often serve as IC or Comms Lead for customer-facing incidents and own the decision to declare based on the severity criteria.

On-call engineers and technical responders: They do the hands-on diagnosis and mitigation; the runbook ensures they're paged fast and freed to focus on the fix while the IC handles coordination.

Communications lead: They own customer and stakeholder updates (status page, in-app banner, internal channels) on the cadence the runbook prescribes, so the team speaks with one consistent voice.

Support agents: They field the surge of customer contacts during an outage and rely on the runbook's comms cadence and status updates to give customers consistent, accurate information.

Engineering and leadership stakeholders: They receive structured internal updates and participate in the blameless post-incident review that turns the outage into durable system improvements.

Major-Incident Runbook: Context & Good to Know

A runbook (or playbook) is the defining artifact of mature incident response, and the major-incident version is the highest-stakes of all. It's grounded in ITIL major-incident management and the incident-command system (ICS) used in emergency response, which is where the Incident Commander role comes from. The single most important idea the runbook encodes is role separation: the IC coordinates and decides but does not put their hands on the keyboard. When the most senior person dives into debugging, no one is left holding the overall picture — communication stalls, decisions queue up, and the incident drags. Separating coordination from troubleshooting is what keeps a high-pressure response fast and clear.

Severity classification is what makes the declare-or-not decision a rule rather than a debate held while the clock runs. A SEV1 (full outage or data/security breach, no workaround, broad customer impact) triggers an immediate declaration: page on-call and the IC, open a war room, send the first customer update within 15 minutes, and update every 30 minutes after that until the issue is mitigated. Lower severities get proportionate responses. Pre-defining these thresholds means a responder at 2 a.m. doesn't have to argue about whether something 'counts'; they check the criteria and act. This connects directly to the priority matrix and SLA policy: a SEV1 incident is the operational response to what those frameworks would classify as the most critical, broadest-impact tickets.
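To make the "rule, not a debate" idea concrete, here is a minimal Python sketch of the SEV1 test described above. The class name, field names, and the two-level "SEV1"/"LOWER" result are illustrative assumptions, not part of the template; a real runbook would likely define several lower tiers with their own criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncidentSignal:
    """Hypothetical inputs a responder checks against the severity table."""
    full_outage: bool
    data_or_security_breach: bool
    broad_customer_impact: bool
    workaround_available: bool


def classify_severity(signal: IncidentSignal) -> str:
    """Return "SEV1" when the SEV1 criteria are all met, otherwise "LOWER"."""
    critical_event = signal.full_outage or signal.data_or_security_breach
    if (critical_event
            and signal.broad_customer_impact
            and not signal.workaround_available):
        return "SEV1"
    return "LOWER"


def should_declare(signal: IncidentSignal) -> bool:
    """Assumption: only a SEV1 triggers a full major-incident declaration."""
    return classify_severity(signal) == "SEV1"
```

A responder (or an alerting hook) would fill in the four fields and call `should_declare`; the point of the sketch is that the answer comes from the criteria, not from a discussion.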

Communication cadence is where incidents are won or lost in the eyes of customers. The runbook specifies that external customers get a first update within 15 minutes of a SEV1 via status page and in-app banner, then every 30 minutes until mitigated, with a named Comms Lead owning it. Silence during an outage is what turns a technical problem into a trust problem and floods the support queue with anxious contacts. A disciplined cadence — even when the update is 'we're still investigating' — reassures customers that the team is on it and dramatically reduces inbound volume, which lets agents focus on genuinely affected customers rather than the panicked majority.
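As a rough illustration of that cadence, the sketch below lists the deadlines for external updates: the first 15 minutes after the start, then one every 30 minutes until mitigation. It assumes the clock starts at declaration, which the text does not state explicitly, and the function and constant names are hypothetical.

```python
from datetime import datetime, timedelta

FIRST_UPDATE_AFTER = timedelta(minutes=15)  # first external update deadline
UPDATE_INTERVAL = timedelta(minutes=30)     # spacing of later updates


def external_update_deadlines(declared_at: datetime,
                              mitigated_at: datetime) -> list[datetime]:
    """Return the time by which each external update is due, up to mitigation.

    Assumption: the 15-minute clock starts when the SEV1 is declared.
    If mitigation happens before the first deadline, the list is empty.
    """
    deadlines = []
    due = declared_at + FIRST_UPDATE_AFTER
    while due <= mitigated_at:
        deadlines.append(due)
        due += UPDATE_INTERVAL
    return deadlines
```

For an incident declared at 02:00 and mitigated at 03:20, this yields deadlines at 02:15, 02:45, and 03:15. A Comms Lead could use such a list as a checklist of status-page posts.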

The post-incident review is what separates teams that survive incidents from teams that get better because of them. The runbook's review checklist is explicitly blameless — focused on systems, not people — and reconstructs the timeline with timestamps, quantifies customer and business impact, identifies root cause and contributing factors, and captures what worked and what slowed the response, ending in assigned follow-up actions. Blamelessness matters because a culture of blame makes responders hide information, which is exactly what you can't afford during the next incident. Rehearsing the runbook before it's needed — a tabletop exercise or drill — is the final piece: a runbook no one has practiced is just a document, and the goal is muscle memory that holds up when production is actually down.

FAQ

Questions, answered

What is a major-incident runbook?

A major-incident runbook is a pre-defined playbook for responding to the most critical incidents — a production outage, a security event, or a spreading customer-facing failure. It defines severity criteria for declaring an incident, the roles and who owns what, the step-by-step declare-to-resolve flow, the communication cadence to customers and stakeholders, and the post-incident review. Its whole purpose is to pre-decide everything you can't afford to improvise in the middle of a high-pressure incident, so the team responds fast and coordinated instead of figuring out who's in charge while customers bleed.

What does an Incident Commander do?

The Incident Commander (IC) owns the incident overall — decisions, coordination, and declaring and resolving it — but explicitly does not do hands-on troubleshooting, which they delegate to technical responders. This separation is deliberate: if the most senior person dives into debugging, no one is left holding the overall picture, communication stalls, and decisions queue up. The IC's job is to keep the whole response moving — directing responders, approving comms, and deciding when the incident is mitigated and resolved.

When should I declare a major incident?

Declare based on severity criteria, not gut feel. A SEV1 — a full outage or data/security breach with broad customer impact and no workaround — warrants immediate declaration: page on-call and the Incident Commander, open a war room, and start customer updates. Lower-severity issues with narrower impact or a viable workaround get proportionate responses without a full declaration. Pre-defining these thresholds means a responder doesn't have to debate whether something 'counts' while the clock is running — they check the criteria and act.

How fast and how often should we communicate during an outage?

For a SEV1, send the first external update within about 15 minutes via your status page and in-app banner, then update every 30 minutes until the issue is mitigated, with a named Comms Lead owning it. Silence during an outage turns a technical problem into a trust problem and floods your support queue with anxious contacts. A disciplined cadence — even when the update is just 'we're still investigating' — reassures customers and dramatically reduces inbound volume, freeing agents to help the genuinely affected.

What is a blameless post-incident review?

A blameless post-incident review analyzes what happened with the focus on systems and processes, not on assigning fault to individuals. It reconstructs the timeline with timestamps (detection, declaration, key actions, resolution), quantifies customer and business impact, identifies root cause and contributing factors, notes what worked and what slowed the response, and assigns follow-up actions. Blamelessness is practical, not just kind: a culture of blame makes responders hide information, which is exactly what you can't afford during the next incident.
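One way to start the timeline reconstruction is to turn the key timestamps into durations. The sketch below is a minimal example under assumptions: the event keys `detected`, `declared`, and `resolved` and the output names are invented for illustration, and a real review would also record the key actions in between.

```python
from datetime import datetime, timedelta


def timeline_durations(events: dict[str, datetime]) -> dict[str, timedelta]:
    """Compute how long each stretch of the incident took.

    Expects hypothetical keys "detected", "declared", and "resolved".
    """
    detected = events["detected"]
    declared = events["declared"]
    resolved = events["resolved"]
    if not detected <= declared <= resolved:
        raise ValueError("timeline events are out of order")
    return {
        "detect_to_declare": declared - detected,
        "declare_to_resolve": resolved - declared,
        "detect_to_resolve": resolved - detected,
    }
```

Numbers like these describe how the system and process behaved, which fits the blameless framing: a long detect-to-declare gap points at unclear criteria or alerting, not at a person.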

What roles are needed for major-incident response?

At minimum an Incident Commander (coordinates and decides, doesn't troubleshoot), a Communications Lead (owns customer and stakeholder updates), and technical responders (do the hands-on diagnosis and fix), each with a named backup. The runbook uses an ICS-style (incident-command system) role structure borrowed from emergency response. Assigning these roles before an incident — and naming backups — means the response assembles instantly instead of losing time deciding who does what under pressure.
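The value of named backups can be shown with a small roster lookup. This is only a sketch: the `RoleAssignment` type, the function names, and the idea of passing a set of unavailable people are assumptions made for illustration, not something the template prescribes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RoleAssignment:
    """One pre-assigned role with its primary owner and named backup."""
    role: str
    primary: str
    backup: str


def resolve_owner(assignment: RoleAssignment,
                  unavailable: set[str]) -> Optional[str]:
    """Pick the primary if reachable, else the backup, else None."""
    for person in (assignment.primary, assignment.backup):
        if person not in unavailable:
            return person
    return None


def assemble(roster: list[RoleAssignment],
             unavailable: set[str]) -> dict[str, Optional[str]]:
    """Map each role to whoever should be paged right now."""
    return {a.role: resolve_owner(a, unavailable) for a in roster}
```

A role that resolves to `None` is a gap worth finding in a rehearsal rather than during a real SEV1.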

What are the steps of the declare-to-resolve flow?

Five steps: (1) detect and declare — recognize the incident and formally declare it; (2) assemble and stabilize — page the IC and responders, open the war room, and stop the bleeding; (3) mitigate, communicate, and escalate — work the fix while keeping customers and stakeholders updated on cadence; (4) recover and confirm — restore service and verify it's genuinely working; (5) review and learn — run the blameless post-incident review. Following a known sequence keeps a chaotic situation organized.
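The five steps can be modeled as an ordered sequence so that tooling (an incident bot, a checklist) always knows what comes next. The following is a minimal sketch; the enum and function names are hypothetical, and it assumes a strictly linear flow with no skipping or looping back.

```python
from enum import IntEnum
from typing import Optional


class Phase(IntEnum):
    """The five declare-to-resolve steps, numbered as in the answer above."""
    DETECT_AND_DECLARE = 1
    ASSEMBLE_AND_STABILIZE = 2
    MITIGATE_COMMUNICATE_ESCALATE = 3
    RECOVER_AND_CONFIRM = 4
    REVIEW_AND_LEARN = 5


def next_phase(current: Phase) -> Optional[Phase]:
    """Return the step that follows `current`, or None after the review."""
    if current is Phase.REVIEW_AND_LEARN:
        return None
    return Phase(current + 1)
```

Real incidents sometimes fall back from recovery to mitigation when a fix does not hold; a fuller model would allow that transition explicitly.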

How is a major incident different from a regular high-priority ticket?

A regular P1 ticket follows the normal SLA and escalation path within the support team. A major incident is a coordinated, cross-functional response declared when impact is broad and severe enough to need an Incident Commander, a war room, parallel customer communication, and engineering on-call — it transcends a single ticket. The severity criteria draw the line: a single customer's critical issue is a P1, but a full outage affecting all users is a major incident that triggers the runbook. The two frameworks connect, but the response scale is different.

Why rehearse a runbook before an incident?

A runbook no one has practiced is just a document. Major incidents are rare and high-pressure, so the process won't be in anyone's muscle memory unless you drill it — a tabletop exercise or simulated incident reveals gaps, unclear roles, and missing contacts while the stakes are zero. Rehearsing also builds the team's confidence to act fast when a real SEV1 hits. The runbook explicitly recommends filling in your contacts and channels and then rehearsing it before you need it.

How does the runbook reduce customer impact during an outage?

In two ways. First, the structured response — clear roles, a known flow, and fast assembly — gets to mitigation faster, shortening the outage itself. Second, the disciplined communication cadence keeps customers informed, which prevents the trust damage and queue flood that silence causes. A status page update every 30 minutes reassures the anxious majority so they don't all contact support, letting agents focus on customers who are genuinely blocked. Faster resolution plus consistent communication is what turns an outage from a crisis into a managed event.
