Incident Command
Lead incidents to restore service fast and turn every outage into a learning moment.
Unstructured incidents create confusion: Slack floods with messages, responsibilities are unclear, and customers wait for answers.
During incidents, leaders watch to see if you can bring order, clarity, and momentum.
This is where strong incident command stands out.
In this lesson, we’ll cover:
The 4 key steps of effective incident response
A communication style that earns trust
An exercise to build your own runbook and templates
Let’s get started.
John loves chaos because it makes him a hero, but he’s really the anti-hero—the system doesn’t improve, and the team stays dependent on him. You can do better.
Running incidents
A Staff+ engineer is expected to:
Coordinate clear roles and structure from the first minute of an incident.
Apply fast mitigation without blocking on the root cause.
Communicate clear, timely updates to stakeholders.
Capture learnings to improve systems.
Let’s look at a repeatable approach to check off these boxes and bring systems back up with confidence.
Step 1: Contain the chaos
Your first job is to bring order. Unclear roles and overlapping efforts will stall progress and make you look reactive instead of reliable.
Do this immediately:
Open a channel and name it clearly (e.g., #inc-sev1-2025-09-24).
Assign roles:
Incident commander (IC): Runs the incident.
Comms lead: Handles status page, exec updates, and customer comms.
Scribe: Captures the timeline and actions.
Ops lead: Has hands-on keyboard.
Everyone else: Subject Matter Experts (SMEs)—not drive-by commenters.
Pin a single incident doc in the channel—the source of truth for all updates and decisions.
Set severity to guide urgency:
SEV-1: Many customers can’t do core tasks → page now.
SEV-2: Degraded but usable → on-call + comms.
SEV-3: Low impact → handle during work hours.
Use both the incident channel and the incident doc to keep everything connected and visible.
Channel: For real-time updates, decisions, and coordination.
Doc: For roles, impact summary, timeline, and follow-ups.
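If you want these conventions to be one keystroke away, a small helper can generate the channel name and keep the severity table next to whatever pages people. Here’s a minimal sketch in Python; the policy text and naming scheme are assumptions to adapt to your org:

```python
from datetime import date

# Illustrative only: this severity policy and naming scheme are assumptions,
# not an official standard. Adapt the wording and paging rules to your org.
SEVERITY_POLICY = {
    "SEV-1": "Many customers can't do core tasks -> page the on-call now",
    "SEV-2": "Degraded but usable -> on-call + comms lead",
    "SEV-3": "Low impact -> handle during work hours",
}

def incident_channel_name(severity: int, opened: date) -> str:
    """Builds a channel name like #inc-sev1-2025-09-24."""
    return f"#inc-sev{severity}-{opened.isoformat()}"

print(incident_channel_name(1, date(2025, 9, 24)))  # -> #inc-sev1-2025-09-24
```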
Step 2: Stop the bleeding
When something is on fire, don’t waste time figuring out who lit the match—just put it out.
Use pre-approved playbooks:
Roll back
Flip the feature flag
Fail over
Throttle traffic
Trigger a circuit breaker
Don’t block mitigation on the root cause. You’ll have time for that in the retro.
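The faster a mitigation is to run, the more likely it gets used under pressure. One option is to wrap your pre-approved playbooks behind a single command. A minimal sketch, assuming hypothetical scripts (deploy.sh, flags.sh, and so on) that stand in for your own vetted tooling:

```python
import subprocess

# A hypothetical "pre-approved playbooks" runner. The scripts and arguments
# below are placeholders; substitute the rollback, flag, failover, and
# throttling tooling your team has already vetted.
PLAYBOOKS = {
    "rollback":  ["./deploy.sh", "rollback", "--to", "previous"],
    "kill_flag": ["./flags.sh", "disable", "new-checkout-flow"],
    "failover":  ["./db.sh", "failover", "--to", "replica"],
    "throttle":  ["./edge.sh", "rate-limit", "--percent", "50"],
}

def run_playbook(name: str) -> None:
    """Run a vetted mitigation immediately, without waiting on root cause."""
    cmd = PLAYBOOKS[name]
    print(f"Running playbook '{name}': {' '.join(cmd)}")
    subprocess.run(cmd, check=True)

# Example: the IC asks the ops lead to roll back the last deploy.
# run_playbook("rollback")
```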
Step 3: Communicate like a pro
Every 15–30 minutes, post a short, timestamped update in the channel:
12:24 — Impact: ~8% checkouts failing. Action: rolled back 1.42. ETA to mitigation: 20m.
Keep updates concise and structured:
Impact: Who’s affected and how
Scope: Estimated reach
Action: What’s been tried
ETA: When resolution is expected
Use threads for technical discussion; keep the main channel clear for decisions and summaries.
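Updates are easier to keep consistent when the format is generated rather than retyped. A minimal sketch of a formatter for the Impact/Scope/Action/ETA shape; the example values are made up:

```python
from datetime import datetime

def incident_update(impact: str, scope: str, action: str, eta: str) -> str:
    """Formats a timestamped update in the Impact/Scope/Action/ETA shape."""
    stamp = datetime.now().strftime("%H:%M")
    return (f"{stamp} - Impact: {impact}. Scope: {scope}. "
            f"Action: {action}. ETA: {eta}.")

# Hypothetical values for illustration.
print(incident_update(
    impact="~8% of checkouts failing",
    scope="US web traffic",
    action="rolled back 1.42",
    eta="20m to mitigation",
))
```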
Step 4: Close and capture
When the incident is stable:
Mark it resolved in Slack with a /resolve or a clear message.
Thank participants and hand out credit.
Schedule a blameless postmortem within 5 business days.
Turn findings into action items with owners/dates.
Your goal isn’t just to restore service. It’s to make the system more resilient next time.
Maintaining the incident channel
You want to keep your communication clean and useful while the pressure’s on.
Here’s how:
Start messages with time + status: Keep them to one or two sentences.
Threads for deep dives: If you post an update in the main channel, put technical back-and-forth in its thread.
Use canned commands/templates:
/declare, /update, /resolve.
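The canned commands only help if the templates behind them already exist. Here’s a minimal sketch of templates that could sit behind /declare, /update, and /resolve; the field names are assumptions, and how you wire them to your chat tool is up to you:

```python
# Hypothetical canned templates behind /declare, /update, and /resolve.
# Field names are assumptions; wire them to your chat tool's slash commands
# or a small bot however your org prefers.
DECLARE = (
    "SEV-{sev} declared: {summary}\n"
    "Impact: {impact} | IC: {ic} | Comms: {comms} | Scribe: {scribe}\n"
    "Incident doc: {doc_url}"
)
UPDATE = "{time} - Impact: {impact}. Action: {action}. ETA: {eta}."
RESOLVE = (
    "Resolved at {time} (duration: {duration}).\n"
    "Postmortem scheduled for {postmortem_date}. Thanks, everyone."
)

# Example declare message with made-up values.
print(DECLARE.format(
    sev=1, summary="checkout failures", impact="~8% of checkouts failing",
    ic="@dana", comms="@lee", scribe="@sam",
    doc_url="https://example.com/inc-doc",
))
```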
John Quest: Get ready for mitigation
Build your own lightweight tooling so you’re ready when things go sideways.
1. Incident runbook
A one-pager with:
Role definitions
Severity table
“First five moves” for mitigation
Sample update cadence
Closeout checklist
2. A Slack/Teams template
/declare message (impact, scope, IC, comms, scribe)
/update format
/resolve message
3. Postmortem template
Timeline
Impact
Contributing factors
What worked / what didn’t
Action items + DRIs (directly responsible individuals)
“Revisit when” trigger
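If it helps to start each postmortem from the same skeleton, a few lines of code can stamp one out. A minimal sketch mirroring the template above; the section names and file path are assumptions to adapt:

```python
# Generates a postmortem skeleton that mirrors the template above.
POSTMORTEM = """# Postmortem: {title}

## Timeline
## Impact
## Contributing factors
## What worked / what didn't
## Action items (owner, due date)
## Revisit when
"""

# Hypothetical title and file name for illustration.
with open("postmortem-draft.md", "w") as f:
    f.write(POSTMORTEM.format(title="Checkout failures (SEV-1)"))
```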
Start with a 20-minute tabletop using a past incident. Assign roles, practice updates, then tweak your templates to fit your org.