Reading the Room

Learn how to translate product needs into architecture.

Every product makes unspoken promises, like:

  • Payments should be instant.

  • Videos should start right away.

  • 2FA codes should arrive fast enough to actually log in.

At the Staff+ level, your job is to turn those unspoken expectations into measurable guarantees that drive your architecture decisions.

If you skip this step, you’ll either over-engineer for hypothetical scale or under-engineer (and firefight every bug).

SLOs help you prioritize issues through error budgets: the small, agreed-upon amount of “bad” you can afford.

Example: If your SLO is 99.9% over 30 days, your error budget is ~43 minutes of downtime. You can:

  • Spend it on experiments and speed when you’re healthy.

  • Slow down and fix reliability when you’re burning too fast.
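The error-budget arithmetic above is simple enough to sketch directly. This is a minimal illustration (the function name is mine, not from any SLO library):

```python
# Error budget: the downtime a given availability SLO allows over its window.
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed 'bad' time for an availability SLO."""
    total_minutes = window_days * 24 * 60
    return (1 - slo_target) * total_minutes

# 99.9% over 30 days leaves ~43.2 minutes of budget.
print(round(error_budget_minutes(0.999), 1))   # 43.2
# One more nine shrinks the budget tenfold, to ~4.3 minutes.
print(round(error_budget_minutes(0.9999), 1))  # 4.3
```

Notice how each extra “nine” cuts the budget by 10x: that is why a 99.99% target forces very different engineering decisions than 99.9%.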

From vague ask to design

Take the PM request: “Can you make notifications fast?”

A Senior Engineer says, “Sure, I’ll cache it.”

But a Staff+ Engineer asks:

  • “Fast” for whom? Global users, or just one region?

  • What does success mean? Delivery time? Click-throughs? Reliability?

  • How critical is failure? Annoyance vs. business loss.

They then translate those answers into measurable promises.

Example SLOs

  • Marketing emails: 95% delivered within 5 s—good enough; failures don’t end the company.

  • 2FA codes: 99.99% delivered within 500 ms—failures = account lockouts, lost users.

Two wildly different architectures come from these two SLOs.

How to write a good SLO

State the target clearly: include the metric, threshold, and time window.

Here’s the lightweight template:

  1. SLI (What you measure): For example, p95 latency of POST /payments

  2. SLO (Target): For example, 99.99% ≤ 2s over 30 days

  3. Error budget policy: <50% burn: deploy as usual; 100% burn: freeze and fix

  4. How you’ll measure: Metric source, sampling, and windows

  5. Alerting rules:

    1. Fast burn (Page): “We’ll burn the whole budget in 1 hour”

    2. Slow burn (Ticket): “Budget burns in 12h — triage in #reliability”

  6. Dashboard link: One chart everyone—PMs, SREs, execs—can check without asking
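The six-part template can be written down as structured data so it lives next to your code instead of in a wiki. A hypothetical sketch (the field names and URL are illustrative, not a real monitoring schema):

```python
from dataclasses import dataclass

# Illustrative SLO "one-pager" as data; fields mirror the template above.
@dataclass
class SLO:
    sli: str                 # 1. what you measure
    target: float            # 2. the target, e.g., 0.9999
    threshold: str           # 2. the success condition per request
    window_days: int         # 2. evaluation window
    budget_policy: str       # 3. what happens as the budget burns
    measured_by: str         # 4. metric source and sampling
    fast_burn_alert: str     # 5a. pages on-call
    slow_burn_alert: str     # 5b. files a ticket
    dashboard_url: str       # 6. the one chart everyone checks

payments_slo = SLO(
    sli="p95 latency of POST /payments",
    target=0.9999,
    threshold="<= 2s",
    window_days=30,
    budget_policy="<50% burn: deploy as usual; 100% burn: freeze and fix",
    measured_by="load-balancer metrics, 1-minute windows",
    fast_burn_alert="page: whole budget would burn in 1h",
    slow_burn_alert="ticket: budget burns in 12h, triage in #reliability",
    dashboard_url="https://dashboards.example.com/payments-slo",
)
print(payments_slo.sli)
```

Keeping the SLO as data makes it reviewable in pull requests, the same way product and on-call review any other contract.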

Examples:

  • Payments API: “99.99% of POST /payments complete ≤ 2s over 30 days.”

  • 2FA delivery: “99.99% of OTP codes delivered ≤ 500 ms over 7 days.”

  • Dashboard: “99.5% of page loads return 200 with TTFB < 300ms over 30 days.”

SLOs to architecture trade-offs

Once you pin down SLOs, you can weigh trade-offs with clarity:

  1. Consistency:

    1. Strong consistency: Payments, balances, authentication.

    2. Eventual consistency: Likes, follower counts, analytics dashboards.

  2. Replication vs. latency:

    1. Multi-region replication improves durability but slows down writes.

    2. Cache some data close to the user (profiles), centralize others for correctness (ledgers).

  3. Queues vs. streams:

    1. Queues (RabbitMQ, SQS): Reliability and retries.

    2. Streams (Kafka, Pulsar): Ordered, replayable: great for analytics or event-driven state.

  4. Throughput vs. tail latency:

    1. Optimize for p50 (average) or p99/p999 (worst case).

    2. At Google scale, those rare slow requests affect millions.
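The p50 vs. p99 trade-off is easy to see with synthetic data. In this sketch (the traffic shape is invented for illustration), 98% of requests are fast and 2% hit a slow path; the median looks healthy while the tail tells the real story:

```python
import random

random.seed(42)
# 98% of requests take ~50 ms; 2% hit a slow path at ~2000 ms.
latencies_ms = sorted(
    random.gauss(50, 5) if random.random() < 0.98 else random.gauss(2000, 100)
    for _ in range(10_000)
)

def percentile(sorted_values, p):
    """Nearest-rank percentile over an already-sorted list."""
    idx = min(len(sorted_values) - 1, int(p / 100 * len(sorted_values)))
    return sorted_values[idx]

print(f"p50: {percentile(latencies_ms, 50):.0f} ms")  # median looks fine
print(f"p99: {percentile(latencies_ms, 99):.0f} ms")  # tail reveals the slow path
```

If your SLI only tracked the median, this system would look healthy while 1 in 50 users waited two full seconds.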

Anyone can memorize patterns. Staff+ engineers explain why they chose them, and tie every trade-off to business impact.

That’s how promotions happen.

Example: Strong SLO in action

Friday night. Customers complain: “Never got our 2FA codes.”

  • Old you: Guess at queues, carriers, or providers.

  • New you: Open the SLO dashboard.

The SLI shows only 82% of deliveries completing in under 500 ms, and the burn-rate alert is paging: at this pace, you’ll exhaust the error budget tonight. (Burn rate is the speed at which a service consumes its error budget; a high burn rate means reliability is degrading quickly.)

Mitigation:

  • Fail over to a secondary SMS provider.

  • Enable WebAuthn fallback.

  • Delivery jumps to 97% within minutes.

Follow up: Add provider health checks, log failover times, update dashboard.

You didn’t need a war room: the system told you what mattered.
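The burn-rate math behind that page is a one-liner. A sketch, using the incident’s numbers (82% delivered against a 99.99% target) and an illustrative fast-burn policy:

```python
# Burn rate = observed error rate / error rate the SLO allows.
def burn_rate(observed_good_fraction: float, slo_target: float) -> float:
    allowed_error = 1 - slo_target            # e.g., 0.0001 for 99.99%
    observed_error = 1 - observed_good_fraction
    return observed_error / allowed_error

# The incident: only 82% of 2FA codes delivered < 500 ms vs. a 99.99% SLO.
rate = burn_rate(0.82, 0.9999)
print(f"burn rate: {rate:.0f}x")  # 1800x the sustainable error rate

# Illustrative policy: page when the whole 30-day budget would burn
# within an hour, i.e., burn rate > 24 * 30 = 720.
if rate > 24 * 30:
    print("PAGE: fast burn, budget exhausted within the hour")
```

A burn rate of 1x means you’d spend exactly your budget over the window; 1800x means the budget that should last 30 days is gone in about 24 minutes, which is why this pages immediately instead of filing a ticket.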

Quest: Create an SLO one-pager

Pick one critical flow and define:

  • SLI: What you measure (e.g., “p95 latency of POST /payments”)

  • SLO target and window: (“99.99% ≤ 2s over 30 days”)

  • Error budget policy: (“<50% burn: deploy as usual; 100% burn: freeze and fix”)

  • How we measure: Source, percentile, sampling method

  • Alerts: Fast and slow burn rules, who gets paged

  • Dashboard link: The graph everyone can find

Share it in your team channel and pin it. You’ve just turned reliability from vibes into a contract that product, on-call, and leadership can actually use.

AI assessment: Search SLO

Your PM says: “Search results should be fast and reliable.” You’re the engineer responsible for defining what “fast” and “reliable” actually mean.

Write one clear SLO that captures this requirement and click “Evaluate” to get your AI assessment below:


Let’s move over to the concept of “Goldilocks designs.”
