Runbooks and Game Days

Explore how runbooks make mitigation fast and consistent, and how game days turn it into muscle memory.

At the Staff+ level, you’re expected to scale reliability beyond yourself. That means building the processes, playbooks, and muscle memory your team needs to handle outages—fast, calmly, and without waiting for you to land from Bali.

That’s what runbooks and game days are for.

  • Runbooks are short, tactical guides that turn 2 a.m. confusion into fast, repeatable recovery.

  • Game days simulate real incidents to build team readiness before it matters most.

In this lesson, you’ll learn:

  • What makes a runbook useful and reliable.

  • How to run game days that actually improve performance.

  • How to integrate both into your team’s on-call and incident process.

Let’s get into it.

Runbooks

Runbooks are short, focused guides that say: “If X breaks, try Y—and here’s how to know it worked.”

Best practice: If your on-call can execute it half-asleep in under 10 minutes, it’s good.

What a good runbook looks like 

Here’s what you need:

  • Title and scope: The system and scenario this covers.

  • Symptoms → Impact: What you see (alerts/logs) and what users feel.

  • Quick checks (≤2 mins): Dashboards, queries, and flags that confirm the root symptom.

  • Mitigations (ordered): Rollback → feature flag/stop switch → failover → throttle → circuit breaker.

  • Verification: The dials must go green (SLO, error rate, queue depth).

  • Comms snippets: One-paragraph internal + status-page update.

  • After it’s stable: Capture timeline, file follow-ups (DRIs/dates).

  • Links: SLO dashboard, tracing view, design doc, ADR, previous incidents.

  • Owner and last reviewed: Freshness check—stale runbooks are fake safety.
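The "owner and last reviewed" freshness check is easy to automate. The following is a minimal sketch, assuming runbook metadata is available as structured data. The `Runbook` class, the `stale_runbooks` helper, the example runbooks, and the 90-day review window are all illustrative assumptions, not part of any particular tool.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

# Assumed review window; pick whatever your team agrees on.
MAX_AGE = timedelta(days=90)


@dataclass
class Runbook:
    title: str
    owner: str
    last_reviewed: date
    # Ordered mitigations, first thing to try listed first.
    mitigations: list[str] = field(default_factory=list)

    def is_stale(self, today: date) -> bool:
        """True when the last review is older than MAX_AGE."""
        return today - self.last_reviewed > MAX_AGE


def stale_runbooks(runbooks: list[Runbook], today: date) -> list[str]:
    """Return the titles of runbooks that are overdue for review."""
    return [rb.title for rb in runbooks if rb.is_stale(today)]


if __name__ == "__main__":
    books = [
        Runbook("Checkout API: dependency timeout", "payments-oncall",
                date(2024, 1, 10), ["rollback", "feature flag off", "failover"]),
        Runbook("Order queue: overload", "orders-oncall",
                date(2024, 5, 20), ["throttle producers"]),
    ]
    # Only the first runbook is older than 90 days on this date.
    print(stale_runbooks(books, today=date(2024, 6, 1)))
```

A check like this could run on a schedule and ping the owner, so a stale runbook gets noticed before an incident rather than during one.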

Tooling tip: incident.io is a Slack-native tool that covers the whole flow: spin up incidents (with roles, timelines, and update templates), run game day “exercises” to practice, and attach/link runbooks so on-call can act fast, all in one place.


Game days

Game days are rehearsal outages: controlled simulations where the team practices detection, diagnosis, and recovery.

Done right, they build confidence and uncover weak points in your recovery plans.

Running an effective game day

Let’s break it down into 7 steps.

1. Choose a realistic scenario

Pick a common, high-impact failure, such as a dependency timeout, a queue overload, or a bad deploy that needs a rollback.

2. Define success

Decide what “good” looks like:

  • Fast detection and mitigation (MTTD, MTTR, SLO burn)

  • Calm execution using the runbook

  • Runbook updated with real lessons
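To make "fast detection and mitigation" measurable, you can compute MTTD and MTTR from the scribe's timeline. This is a rough sketch under stated assumptions: both durations are measured from the moment the fault was injected (some teams measure recovery from detection instead), and the `Drill` class, function names, and timestamps are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Drill:
    fault_injected: datetime  # when the simulated failure started
    detected: datetime        # when the team noticed (alert or human)
    recovered: datetime       # when the dials were green again


def mttd_minutes(drills: list[Drill]) -> float:
    """Mean time to detect, in minutes, across drills."""
    return mean((d.detected - d.fault_injected).total_seconds() for d in drills) / 60


def mttr_minutes(drills: list[Drill]) -> float:
    """Mean time to recover, in minutes, measured from fault injection."""
    return mean((d.recovered - d.fault_injected).total_seconds() for d in drills) / 60


if __name__ == "__main__":
    drills = [
        Drill(datetime(2024, 6, 3, 10, 0), datetime(2024, 6, 3, 10, 4),
              datetime(2024, 6, 3, 10, 20)),
        Drill(datetime(2024, 7, 1, 14, 0), datetime(2024, 7, 1, 14, 2),
              datetime(2024, 7, 1, 14, 12)),
    ]
    print(f"MTTD: {mttd_minutes(drills):.1f} min")  # MTTD: 3.0 min
    print(f"MTTR: {mttr_minutes(drills):.1f} min")  # MTTR: 16.0 min
```

Tracking these numbers across game days shows whether the drills are actually improving performance.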

3. Assign roles

Make it feel like a real incident:

  • IC: Makes decisions

  • Ops lead: Runs commands

  • Scribe: Logs timeline

  • Observer: Notes friction points

4. Simulate safely

Run the scenario in staging or a prod-like sandbox.

Use only the runbook to recover. No side-channel hints.

5. Recover and observe

Watch the same dials you protect in prod:

  • SLO burn

  • Error rate

  • p95 latency

  • Queue depth

  • Saturation
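Of these dials, SLO burn is the least self-explanatory. One common definition is the burn rate: the observed error ratio divided by the error ratio the SLO allows. The sketch below assumes a simple request-based availability SLO; the function name and the sample numbers are illustrative.

```python
def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """Error-budget burn rate over some window.

    Observed error ratio divided by the allowed error ratio (1 - SLO target).
    A value of 1.0 means the budget is being spent exactly as fast as the
    SLO allows; higher values mean it will run out early.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if requests == 0:
        return 0.0  # no traffic in the window, so nothing was burned
    error_budget = 1.0 - slo_target
    return (errors / requests) / error_budget


if __name__ == "__main__":
    # 50 failed requests out of 10,000 against a 99.9% SLO:
    # observed ratio 0.5%, allowed ratio 0.1%, so roughly 5x burn.
    rate = burn_rate(errors=50, requests=10_000, slo_target=0.999)
    print(f"burn rate: {rate:.1f}x")
```

During a game day, a burn rate that drops back toward 1.0 or below after mitigation is a useful signal that recovery worked.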

6. Debrief immediately

Discuss:

  • What slowed us down?

  • Where did the runbook fall short?

  • Who owns the follow-up changes?

7. Schedule the next one

Repeat monthly for critical systems and quarterly for everything else. Keep them short (≤45 mins) and blameless.

Pro tip: Treat game days as “incident insurance.” Every drill pays off when the pager goes off for real.

John Quest: Get ready for game day

You don’t need to overhaul everything. Just start here:

  1. Write a one-page runbook for your riskiest failure path.

  2. Schedule a 30-minute game day next week to test it.

  3. Link the runbook in your service README and incident channel.

Runbook template

You can use the template below to put everything into action.
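As a rough starting point, the sections from this lesson can be turned into a blank one-page skeleton. The sketch below is one possible layout, not a canonical template; the `blank_runbook` function, the `checkout-api` service name, and the TODO placeholders are illustrative assumptions.

```python
# Section names follow the "What a good runbook looks like" list in this lesson.
SECTIONS = [
    "Title and scope",
    "Symptoms and impact",
    "Quick checks (2 minutes or less)",
    "Mitigations (ordered)",
    "Verification",
    "Comms snippets",
    "After it's stable",
    "Links",
    "Owner and last reviewed",
]


def blank_runbook(service: str) -> str:
    """Return a plain-text runbook skeleton with one TODO per section."""
    lines = [f"Runbook: {service}", ""]
    for number, section in enumerate(SECTIONS, start=1):
        lines.append(f"{number}. {section}")
        lines.append("   TODO")
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    print(blank_runbook("checkout-api"))
```

Fill in each TODO for your riskiest failure path, then test the result in your first game day.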
