Runbooks and Game Days
Explore how runbooks make mitigation fast and consistent, and how game days turn it into muscle memory.
At the Staff+ level, you’re expected to scale reliability beyond yourself. That means building the processes, playbooks, and muscle memory your team needs to handle outages—fast, calmly, and without waiting for you to land from Bali.
That’s what runbooks and game days are for.
Runbooks are short, tactical guides that turn 2 a.m. confusion into fast, repeatable recovery.
Game days simulate real incidents to build team readiness before it matters most.
In this lesson, you’ll learn:
What makes a runbook useful and reliable.
How to run game days that actually improve performance.
How to integrate both into your team’s on-call and incident process.
Let’s get into it.
Runbooks
Runbooks are short, focused guides that say: “If X breaks, try Y—and here’s how to know it worked.”
Best practice: If your on-call engineer can execute it half-asleep in under 10 minutes, it’s good.
What a good runbook looks like
Here’s what you need:
Title and scope: The system and scenario this covers.
Symptoms → Impact: What you see (alerts/logs) and what users feel.
Quick checks (≤ 2 mins): Dashboards, queries, and flags that confirm the root symptom.
Mitigations (ordered): Rollback → feature flag/stop switch → failover → throttle → circuit breaker.
Verification: The dials that must go green (SLO, error rate, queue depth).
Comms snippets: A one-paragraph internal update and a status-page update.
After it’s stable: Capture the timeline and file follow-ups with DRIs and dates.
Links: SLO dashboard, tracing view, design doc, ADR, previous incidents.
Owner and last reviewed: Freshness check—stale runbooks are fake safety.
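One way to make the freshness check concrete is a small script that flags runbooks with missing sections or an old review date. The sketch below is only an illustration: the Runbook class, the problems function, and the 90-day MAX_AGE window are assumptions made for this example, not part of any tool or policy named in this lesson.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical review window; pick whatever your team agrees on.
MAX_AGE = timedelta(days=90)


@dataclass
class Runbook:
    title: str
    symptoms: str
    quick_checks: list[str]
    mitigations: list[str]   # ordered, try the first entry first
    verification: list[str]  # the dials that must go green
    owner: str
    last_reviewed: date


def problems(rb: Runbook, today: date) -> list[str]:
    """Return the reasons this runbook is not ready for on-call use."""
    found = []
    if not rb.owner:
        found.append("no owner")
    if not rb.mitigations:
        found.append("no mitigations")
    if not rb.verification:
        found.append("no verification step")
    if today - rb.last_reviewed > MAX_AGE:
        found.append("stale: last reviewed " + rb.last_reviewed.isoformat())
    return found


# Example runbook with made-up values.
example = Runbook(
    title="Checkout API: dependency timeout",
    symptoms="Timeout alerts fire; users see failed payments",
    quick_checks=["Check the dependency latency dashboard"],
    mitigations=["Rollback", "Feature flag off", "Failover"],
    verification=["Error rate back under the SLO threshold"],
    owner="payments-oncall",
    last_reviewed=date(2024, 1, 10),
)

print(problems(example, today=date(2024, 6, 1)))
# prints ['stale: last reviewed 2024-01-10']
```

A check like this could run on a schedule and post its findings to the team channel, so a stale runbook gets noticed before an incident rather than during one.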
Tooling tip: incident.io is a Slack-native tool that covers the whole flow: spin up incidents (with roles, timelines, and update templates), run game day “exercises” to practice, and attach/link runbooks so on-call can act fast, all in one place.
Game days
Game days are rehearsal outages: controlled simulations where the team practices detection, diagnosis, and recovery.
Done right, they build confidence and uncover weak points in your recovery plans.
Running an effective game day
Let’s break it down into 7 steps.
1. Choose a realistic scenario
Choose a common, high-impact scenario, such as a dependency timeout, a queue overload, or a bad deploy that needs a rollback.
2. Define success
Decide what “good” looks like:
Fast detection and mitigation (MTTD, MTTR, SLO burn)
Calm execution using the runbook
Runbook updated with real lessons
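To track detection and mitigation speed across drills, you can compute MTTD and MTTR from the scribe's timeline. This is a minimal sketch under stated assumptions: the Drill class and its field names are hypothetical, and both metrics are measured from the moment the fault was injected (some teams measure recovery from detection instead).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Drill:
    injected_at: datetime   # when the fault was introduced
    detected_at: datetime   # when the team noticed
    mitigated_at: datetime  # when the dials went green


def mean(deltas: list[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty list of durations."""
    return sum(deltas, timedelta()) / len(deltas)


def mttd(drills: list[Drill]) -> timedelta:
    """Mean time to detect, measured from fault injection."""
    return mean([d.detected_at - d.injected_at for d in drills])


def mttr(drills: list[Drill]) -> timedelta:
    """Mean time to recover, measured from fault injection."""
    return mean([d.mitigated_at - d.injected_at for d in drills])


# Two drills with made-up timestamps.
drills = [
    Drill(datetime(2024, 5, 1, 10, 0), datetime(2024, 5, 1, 10, 4), datetime(2024, 5, 1, 10, 20)),
    Drill(datetime(2024, 6, 1, 14, 0), datetime(2024, 6, 1, 14, 2), datetime(2024, 6, 1, 14, 12)),
]

print(mttd(drills))  # prints 0:03:00
print(mttr(drills))  # prints 0:16:00
```

Comparing these numbers from one game day to the next shows whether the runbook updates are actually paying off.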
3. Assign roles
Make it feel like a real incident:
IC (incident commander): Makes decisions
Ops lead: Runs commands
Scribe: Logs timeline
Observer: Notes friction points
4. Simulate safely
Run the scenario in staging or a prod-like sandbox.
Use only the runbook to recover. No side-channel hints.
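For a dependency-timeout scenario, one lightweight way to simulate the failure is to wrap the dependency call so it fails on a predictable schedule. The sketch below is an assumption-heavy illustration: InjectedTimeout and the environment string are hypothetical names, and a real setup might use a dedicated fault-injection tool instead.

```python
class InjectedTimeout:
    """Wrap a callable and raise TimeoutError on every `every`-th call."""

    def __init__(self, func, every: int, environment: str):
        # Safety guard: this drill helper must never run against production.
        if environment == "production":
            raise RuntimeError("refusing to inject faults in production")
        if every < 1:
            raise ValueError("every must be at least 1")
        self._func = func
        self._every = every
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        if self.calls % self._every == 0:
            raise TimeoutError("injected by game day drill")
        return self._func(*args, **kwargs)


def fetch_price(sku: str) -> int:
    """Stand-in for a real dependency call (hypothetical)."""
    return 42


# Every second call to the dependency now times out.
flaky_fetch = InjectedTimeout(fetch_price, every=2, environment="staging")
```

The deterministic schedule keeps the drill reproducible, which makes it easier to compare how quickly the team detects the same fault across sessions.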
5. Recover and observe
Watch the same dials you protect in prod:
SLO burn
Error rate
p95 latency
Queue depth
Saturation
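Two of these dials are easy to compute by hand during a drill. The sketch below assumes a common definition of burn rate (observed error rate divided by the error budget, where the budget is 1 minus the SLO target) and the nearest-rank method for p95. The function names and sample numbers are made up for illustration.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    error_budget = 1.0 - slo_target
    if error_budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    if total == 0:
        return 0.0
    return (errors / total) / error_budget


def p95(samples_ms: list[float]) -> float:
    """95th percentile latency using the nearest-rank method."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


# 50 failed requests out of 10,000 against a 99.9% SLO: roughly 5x burn.
print(round(burn_rate(errors=50, total=10_000, slo_target=0.999), 2))

# Twenty latency samples: 10 ms, 20 ms, ..., 200 ms.
print(p95(list(range(10, 201, 10))))  # prints 190
```

In practice your monitoring stack will compute these for you. Working through the arithmetic once simply helps the team agree on what "green" means before the drill starts.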
6. Debrief immediately
Discuss:
What slowed us down?
Where did the runbook fall short?
Who owns the follow-up changes?
7. Schedule the next one
Repeat monthly for critical systems, quarterly for anything else. Keep them short (≤ 45 mins) and blameless.
Pro tip: Treat game days as “incident insurance.” Every drill pays off when the pager goes off for real.
John Quest: Get ready for game day
You don’t need to overhaul everything. Just start here:
Write a one-page runbook for your riskiest failure path.
Schedule a 30-minute game day next week to test it.
Link the runbook in your service README and incident channel.
Runbook template
You can use the template below to put everything into action.
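If you want a starting point you can generate per service, the sketch below prints a blank one-page skeleton with the nine sections described earlier in this lesson. The SECTIONS list and render_template function are illustrative names, and the layout is just one possible choice.

```python
# Section names and hints mirror the runbook checklist in this lesson.
SECTIONS = [
    ("Title and scope", "system and scenario this covers"),
    ("Symptoms and impact", "what you see in alerts/logs and what users feel"),
    ("Quick checks", "dashboards, queries, flags; two minutes or less"),
    ("Mitigations (ordered)", "rollback, feature flag, failover, throttle, circuit breaker"),
    ("Verification", "SLO, error rate, queue depth back to green"),
    ("Comms snippets", "internal update and status-page update"),
    ("After it's stable", "timeline and follow-ups with DRIs and dates"),
    ("Links", "SLO dashboard, tracing view, design doc, ADR, previous incidents"),
    ("Owner and last reviewed", "team or person, and review date"),
]


def render_template(service: str) -> str:
    """Return a plain-text runbook skeleton for the given service name."""
    lines = [f"RUNBOOK: {service}", ""]
    for name, hint in SECTIONS:
        lines.append(f"{name}:")
        lines.append(f"  <{hint}>")
        lines.append("")
    return "\n".join(lines)


print(render_template("checkout-api"))
```

Fill in each placeholder for your riskiest failure path first, then test the result in your next game day.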