Introduction
Build the reliability mindset of a Staff+ engineer: design systems that detect failures, recover quickly, and stay calm when everything breaks at 2 a.m.
Staff+ engineers don’t rely on luck. They build systems that tell them what is broken, signal which problems matter, and recover fast enough that customers barely notice.
John’s idea of observability is watching everything burn down while sipping coffee. We’ll teach you to actually prevent the fires.
In this module, we’ll build real reliability muscles by covering:
Observability: Signals you can trust
Incident command: Calm beats chaos
Runbooks and game days: Scale knowledge, not heroics
Because we’ve already covered SLOs, you’ll know how to decide which signals deserve your attention, how to measure reliability the way users actually experience it, and how to use those numbers to focus effort where it matters most.
Let’s get started.