Incident Command

Lead incidents to restore service fast and turn every outage into a learning moment.

Unstructured incidents create confusion: Slack floods with messages, responsibilities are unclear, and customers wait for answers.

During incidents, leaders watch to see if you can bring order, clarity, and momentum.

This is where strong incident command stands out.

In this lesson, we’ll cover:

  • The 4 key steps of effective incident response

  • A communication style that earns trust

  • An exercise to build your own runbook and templates

Let’s get started.

John loves chaos because it makes him a hero, but he’s really the anti-hero—the system doesn’t improve, and the team stays dependent on him. You can do better.

Running incidents

A Staff+ engineer is expected to:

  1. Coordinate clear roles and structure from the first minute of an incident.

  2. Apply fast mitigation without blocking on the root cause.

  3. Communicate timely, structured updates to stakeholders.

  4. Capture learnings to improve systems.

Let’s look at a repeatable approach to check off these boxes and bring systems back up with confidence.

Step 1: Contain the chaos

Your first job is to bring order. Unclear roles and overlapping efforts will stall progress and make you look reactive instead of reliable.

Do this immediately:

  • Open a channel and name it clearly (e.g., #inc-sev1-2025-09-24).

  • Assign roles:

    • Incident commander (IC): Runs the incident.

    • Comms lead: Handles status page, exec updates, and customer comms.

    • Scribe: Captures the timeline and actions.

    • Ops lead: The hands-on-keyboard engineer executing the fix.

    • Everyone else: Subject Matter Experts (SMEs)—not drive-by commenters.

  • Pin a single incident doc in the channel—the source of truth for all updates and decisions.

  • Set severity to guide urgency:

    • SEV-1: Many customers can’t do core tasks → page now

    • SEV-2: Degraded but usable → on-call + comms

    • SEV-3: Low impact → handle during work hours

Use both the incident channel and the incident doc to keep everything connected and visible.

  • Channel: For real-time updates, decisions, and coordination.

  • Doc: For roles, impact summary, timeline, and follow-ups.
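
If you want to make this kickoff mechanical, the sketch below shows one way to script it in Python. The channel-name format and severity descriptions come from this lesson; the declare_incident function, its fields, and the example names are illustrative, not part of any real incident tool.

# Minimal sketch: build the pinned kickoff message for a new incident.
# Roles and severities mirror this lesson; all names here are illustrative.
from datetime import date

SEVERITIES = {
    "SEV-1": "Many customers can't do core tasks -> page now",
    "SEV-2": "Degraded but usable -> on-call + comms",
    "SEV-3": "Low impact -> handle during work hours",
}

def declare_incident(severity: str, impact: str, roles: dict) -> str:
    # e.g., #inc-sev1-2025-09-24
    channel = f"#inc-{severity.replace('-', '').lower()}-{date.today().isoformat()}"
    lines = [
        f"Channel: {channel}",
        f"Severity: {severity} ({SEVERITIES[severity]})",
        f"Impact: {impact}",
    ]
    lines += [f"{role}: {person}" for role, person in roles.items()]
    return "\n".join(lines)

print(declare_incident("SEV-1", "~8% of checkouts failing",
                       {"IC": "@maya", "Comms": "@raj", "Scribe": "@lee", "Ops": "@sam"}))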

Step 2: Stop the bleeding

When something is on fire, don’t waste time figuring out who lit the match—just put it out.

Use pre-approved playbooks:

  • Roll back

  • Flip the feature flag

  • Fail over

  • Throttle traffic

  • Trigger a circuit breaker

Don’t block mitigation on the root cause. You’ll have time for that in the retro.
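
If your mitigations already live in scripts, a thin wrapper like the sketch below keeps each play one command away. Every script name and argument here is a placeholder for whatever deploy, flag, and traffic tooling your team actually uses.

# Rough sketch: map pre-approved plays to the commands the ops lead runs.
# The command strings are placeholders for your own tooling.
import subprocess

PLAYBOOKS = {
    "rollback":      ["./deploy.sh", "--rollback", "--to", "last-good"],
    "flag-off":      ["./flags.sh", "disable", "new-checkout-flow"],
    "failover":      ["./db.sh", "failover", "--to", "replica-2"],
    "throttle":      ["./edge.sh", "rate-limit", "--percent", "50"],
    "circuit-break": ["./edge.sh", "circuit-breaker", "open", "payments"],
}

def mitigate(play: str) -> None:
    # Run the pre-approved play; save root-cause digging for the retro.
    subprocess.run(PLAYBOOKS[play], check=True)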


Step 3: Communicate like a pro

Every 15–30 minutes, post a short, timestamped update in the channel:

12:24 — Impact: ~8% checkouts failing. Action: rolled back 1.42. ETA to mitigation: 20m.

Keep updates concise and structured:

  • Impact: Who’s affected and how

  • Scope: Estimated reach

  • Action: What’s been tried

  • ETA: When resolution is expected

Use threads for technical discussion; keep the main channel clear for decisions and summaries.
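
A tiny formatter, sketched below, makes it easy to keep every update in the same Impact / Scope / Action / ETA shape; the function name and example wording are illustrative.

# Minimal sketch: enforce a consistent timestamped update format.
from datetime import datetime

def format_update(impact: str, scope: str, action: str, eta: str) -> str:
    stamp = datetime.now().strftime("%H:%M")
    return (f"{stamp} — Impact: {impact}. Scope: {scope}. "
            f"Action: {action}. ETA: {eta}.")

# e.g. "12:24 — Impact: checkouts failing. Scope: ~8% of traffic. ..."
print(format_update("checkouts failing", "~8% of traffic",
                    "rolled back 1.42", "20m to mitigation"))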


Step 4: Close and capture

When the incident is stable:

  • Mark it resolved in Slack with /resolve or a clear closing message.

  • Thank participants and hand out credit.

  • Schedule a blameless postmortem within 5 business days.

  • Turn findings into action items with owners and due dates (see the sketch below).

Your goal isn’t just to restore service. It’s to make the system more resilient next time.
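
One illustrative way to keep action items honest is to give each an explicit owner and due date, as in the sketch below; the field names and the example item are assumptions, not a standard.

# Illustrative shape for postmortem action items with owners and due dates.
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    title: str
    owner: str   # the DRI
    due: date
    done: bool = False

followups = [
    ActionItem("Add an automated rollback check for checkout deploys", "@maya", date(2025, 10, 10)),
]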

Maintaining the incident channel

You want to keep your communication clean and useful while the pressure’s on.

Here’s how:

  • Start messages with time + status: Keep them to one or two sentences.

  • Threads for deep dives: If you post an update in the main channel, keep the technical back-and-forth in its thread.

  • Use canned commands/templates: /declare, /update, /resolve.
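
If you want real slash commands rather than copy-paste templates, the sketch below wires /declare and /update to a bot using Slack's Bolt for Python framework; the environment variables, emoji, and message wording are assumptions to adapt to your workspace.

# Rough sketch of canned commands using slack_bolt; adjust wording and
# credentials to your workspace before relying on it.
import os
from slack_bolt import App

app = App(
    token=os.environ["SLACK_BOT_TOKEN"],
    signing_secret=os.environ["SLACK_SIGNING_SECRET"],
)

@app.command("/declare")
def declare(ack, respond, command):
    ack()  # Slack expects an acknowledgement within 3 seconds
    respond(f":rotating_light: Incident declared by <@{command['user_id']}>: "
            f"{command['text']}")

@app.command("/update")
def update(ack, respond, command):
    ack()
    respond(f"Update from <@{command['user_id']}>: {command['text']}")

if __name__ == "__main__":
    app.start(port=3000)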

John Quest: Get ready for mitigation

Build your own lightweight tooling so you’re ready when things go sideways.

1. Incident runbook

A one-pager with:

  • Role definitions

  • Severity table

  • “First five moves” for mitigation

  • Sample update cadence

  • Closeout checklist

2. A Slack/Teams template

  • /declare message (impact, scope, IC, comms, scribe)

  • /update format

  • /resolve message

3. Postmortem template

  • Timeline

  • Impact

  • Contributing factors

  • What worked / what didn’t

  • Action items + DRIs

  • “Revisit when” trigger

Start with a 20-minute tabletop using a past incident. Assign roles, practice updates, then tweak your templates to fit your org.
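
To seed the tabletop, here is a rough sketch of the /declare, /update, and /resolve message bodies from item 2 above, written as fill-in-the-blank Python templates; the placeholder fields are illustrative.

# Rough sketch: fill-in-the-blank message templates for the exercise.
DECLARE = ("Declaring {severity}: {impact}. Scope: {scope}. "
           "IC: {ic}. Comms: {comms}. Scribe: {scribe}.")
UPDATE = "{time} | Impact: {impact}. Action: {action}. ETA: {eta}."
RESOLVE = "Resolved at {time}. Duration: {duration}. Postmortem: {postmortem_link}."

print(DECLARE.format(severity="SEV-2", impact="search latency spikes",
                     scope="EU traffic", ic="@you", comms="@them", scribe="@other"))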
