Incident Command

Lead incidents to restore service fast and turn every outage into a learning moment.

Unstructured incidents create confusion: Slack floods with messages, responsibilities are unclear, and customers wait for answers.

During incidents, leaders watch to see if you can bring order, clarity, and momentum.

This is where strong incident command stands out.

In this lesson, we’ll cover:

  • The 4 key steps of effective incident response

  • A communication style that earns trust

  • An exercise to build your own runbook and templates

Let’s get started.

John loves chaos because it makes him a hero, but he’s really the anti-hero—the system doesn’t improve, and the team stays dependent on him. You can do better.

Running incidents

A Staff+ engineer is expected to:

  1. Coordinate clear roles and structure from the first minute of an incident.

  2. Apply fast mitigation without blocking on the root cause.

  3. Communicate timely, structured updates to stakeholders.

  4. Capture learnings to improve systems.

Let’s look at a repeatable approach to check off these boxes and bring systems back up with confidence.

Step 1: Contain the chaos

Your first job is to bring order. Unclear roles and overlapping efforts will stall progress and make you look reactive instead of reliable.

Do this immediately:

  • Open a channel and name it clearly (e.g., #inc-sev1-2025-09-24).

  • Assign roles:

    • Incident commander (IC): Runs the incident.

    • Comms lead: Handles status page, exec updates, and customer comms.

    • Scribe: Captures the timeline and actions.

    • Ops lead: The hands-on-keyboard engineer executing the fix.

    • Everyone else: Subject Matter Experts (SMEs)—not drive-by commenters.

  • Pin a single incident doc in the channel—the source of truth for all updates and decisions.

  • Set severity to guide urgency:

    • SEV-1: Many customers can’t do core tasks → page now

    • SEV-2: Degraded but usable → on-call + comms

    • SEV-3: Low impact → handle during work hours

Use both the incident channel and the incident doc to keep everything connected and visible.

  • Channel: For real-time updates, decisions, and coordination.

  • Doc: For roles, impact summary, timeline, and follow-ups.
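
If you want to make this kickoff mechanical, the sketch below shows one way to script it in Python. The channel-name format and severity descriptions come from this lesson; the declare_incident function, its fields, and the example names are illustrative, not part of any real incident tool.

# Minimal sketch: build the pinned kickoff message for a new incident.
# Roles and severities mirror this lesson; all names here are illustrative.
from datetime import date

SEVERITIES = {
    "SEV-1": "Many customers can't do core tasks -> page now",
    "SEV-2": "Degraded but usable -> on-call + comms",
    "SEV-3": "Low impact -> handle during work hours",
}

def declare_incident(severity: str, impact: str, roles: dict) -> str:
    # e.g., #inc-sev1-2025-09-24
    channel = f"#inc-{severity.replace('-', '').lower()}-{date.today().isoformat()}"
    lines = [
        f"Channel: {channel}",
        f"Severity: {severity} ({SEVERITIES[severity]})",
        f"Impact: {impact}",
    ]
    lines += [f"{role}: {person}" for role, person in roles.items()]
    return "\n".join(lines)

print(declare_incident("SEV-1", "~8% of checkouts failing",
                       {"IC": "@maya", "Comms": "@raj", "Scribe": "@lee", "Ops": "@sam"}))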

Step 2: Stop the bleeding

When something is on fire, don’t waste time figuring out who lit the match—just put it out.

Use pre-approved playbooks:

  • Roll back

  • Flip the feature flag

  • Fail over

  • Throttle traffic

  • Trigger a circuit breaker

Don’t block mitigation on the root cause. You’ll have time for that in the retro.
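
If your mitigations already live in scripts, a thin wrapper like the sketch below keeps each play one command away. Every script name and argument here is a placeholder for whatever deploy, flag, and traffic tooling your team actually uses.

# Rough sketch: map pre-approved plays to the commands the ops lead runs.
# The command strings are placeholders for your own tooling.
import subprocess

PLAYBOOKS = {
    "rollback":      ["./deploy.sh", "--rollback", "--to", "last-good"],
    "flag-off":      ["./flags.sh", "disable", "new-checkout-flow"],
    "failover":      ["./db.sh", "failover", "--to", "replica-2"],
    "throttle":      ["./edge.sh", "rate-limit", "--percent", "50"],
    "circuit-break": ["./edge.sh", "circuit-breaker", "open", "payments"],
}

def mitigate(play: str) -> None:
    # Run the pre-approved play; save root-cause digging for the retro.
    subprocess.run(PLAYBOOKS[play], check=True)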


Step 3: Communicate like a pro

Every 15–30 minutes, post a short, timestamped update in the channel:

12:24 — Impact: ~8% checkouts failing. Action: rolled back 1.42. ETA to mitigation: 20m.

Keep updates concise and structured:

  • Impact: Who’s affected and how

  • Scope: Estimated reach

  • Action: What’s been tried

  • ETA: When resolution is expected

Use threads for technical discussion; keep the main channel clear for decisions and summaries.
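
A tiny formatter, sketched below, makes it easy to keep every update in the same Impact / Scope / Action / ETA shape; the function name and example wording are illustrative.

# Minimal sketch: enforce a consistent timestamped update format.
from datetime import datetime

def format_update(impact: str, scope: str, action: str, eta: str) -> str:
    stamp = datetime.now().strftime("%H:%M")
    return (f"{stamp} — Impact: {impact}. Scope: {scope}. "
            f"Action: {action}. ETA: {eta}.")

# e.g. "12:24 — Impact: checkouts failing. Scope: ~8% of traffic. ..."
print(format_update("checkouts failing", "~8% of traffic",
                    "rolled back 1.42", "20m to mitigation"))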


Step 4: Close and capture

When the incident is stable:

  • Mark it resolved in Slack with /resolve or a clear closing message.

  • Thank participants and hand out credit.

  • Schedule a blameless postmortem within 5 business days.

  • Turn findings into action items with owners and due dates (see the sketch below).

Your goal isn’t just to restore service. It’s to make the system more resilient next time.
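
One illustrative way to keep action items honest is to give each an explicit owner and due date, as in the sketch below; the field names and the example item are assumptions, not a standard.

# Illustrative shape for postmortem action items with owners and due dates.
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    title: str
    owner: str   # the DRI
    due: date
    done: bool = False

followups = [
    ActionItem("Add an automated rollback check for checkout deploys", "@maya", date(2025, 10, 10)),
]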

Maintaining the incident channel

You want to keep your communication clean and useful while the pressure’s on.

Here’s how:

  • Start messages with time + status: Keep them to one or two sentences.

  • Threads for deep dives: If you post an update in the main channel, keep the technical back-and-forth in its thread.

  • Use canned commands/templates: /declare, /update, /resolve.
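
If you want real slash commands rather than copy-paste templates, the sketch below wires /declare and /update to a bot using Slack's Bolt for Python framework; the environment variables, emoji, and message wording are assumptions to adapt to your workspace.

# Rough sketch of canned commands using slack_bolt; adjust wording and
# credentials to your workspace before relying on it.
import os
from slack_bolt import App

app = App(
    token=os.environ["SLACK_BOT_TOKEN"],
    signing_secret=os.environ["SLACK_SIGNING_SECRET"],
)

@app.command("/declare")
def declare(ack, respond, command):
    ack()  # Slack expects an acknowledgement within 3 seconds
    respond(f":rotating_light: Incident declared by <@{command['user_id']}>: "
            f"{command['text']}")

@app.command("/update")
def update(ack, respond, command):
    ack()
    respond(f"Update from <@{command['user_id']}>: {command['text']}")

if __name__ == "__main__":
    app.start(port=3000)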

John Quest: Get ready for mitigation

Build your own lightweight tooling so you’re ready when things go sideways.

1. Incident runbook

A one-pager with:

  • Role definitions

  • Severity table

  • “First five moves” for mitigation

  • Sample update cadence

  • Closeout checklist

2. A Slack/Teams template

  • /declare message (impact, scope, IC, comms, scribe)

  • /update format

  • /resolve message

3. Postmortem template

  • Timeline

  • Impact

  • Contributing factors

  • What worked / what didn’t

  • Action items + DRIs

  • “Revisit when” trigger

Start with a 20-minute tabletop using a past incident. Assign roles, practice updates, then tweak your templates to fit your org.
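
To seed the tabletop, here is a rough sketch of the /declare, /update, and /resolve message bodies from item 2 above, written as fill-in-the-blank Python templates; the placeholder fields are illustrative.

# Rough sketch: fill-in-the-blank message templates for the exercise.
DECLARE = ("Declaring {severity}: {impact}. Scope: {scope}. "
           "IC: {ic}. Comms: {comms}. Scribe: {scribe}.")
UPDATE = "{time} | Impact: {impact}. Action: {action}. ETA: {eta}."
RESOLVE = "Resolved at {time}. Duration: {duration}. Postmortem: {postmortem_link}."

print(DECLARE.format(severity="SEV-2", impact="search latency spikes",
                     scope="EU traffic", ic="@you", comms="@them", scribe="@other"))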
