Observability
Turn debugging from guesswork into clarity by building observability that shows what’s slow, what’s broken, and why—so fixes take minutes, not meetings.
At the Staff+ level, your reputation is built on how you handle production issues—not just the fixes but also the speed, clarity, and calm with which you find them.
Observability can be the difference between a business thriving and a business failing. This level of operational maturity builds trust across engineering, support, and leadership. When things go wrong (and they will), you’re the person people want on call—because you get answers fast and avoid drama (and aren’t OOO like John).
This lesson walks you through what production-grade observability looks like at the Staff+ level (and how to build it):
The three golden signals that catch most issues.
The core telemetry every service needs.
Let’s get started.
Observability checklist
Three golden signals (per key service endpoint):
Latency (p95/p99)
Errors (% and top reasons)
Saturation (CPU/mem, DB/queue lag)
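If a service doesn’t already emit these, the sketch below shows one way to record the latency and error signals in application code, assuming the OpenTelemetry Python API; the metric and attribute names are illustrative, and saturation usually comes from host/database monitoring rather than app code.

```python
# Sketch: record latency and error signals in a request handler.
# Assumes the OpenTelemetry Python API; an exporter/backend is configured elsewhere.
import time
from opentelemetry import metrics

meter = metrics.get_meter("checkout-service")

# Latency as a histogram so the backend can compute p95/p99.
request_duration = meter.create_histogram(
    "http.server.duration", unit="ms", description="Request latency"
)
# Errors as a counter tagged with a reason, for "% and top reasons".
request_errors = meter.create_counter(
    "http.server.errors", description="Failed requests by reason"
)

def handle_checkout(request):
    start = time.monotonic()
    status = "ok"
    try:
        ...  # business logic would go here
    except Exception as exc:
        status = "error"
        request_errors.add(1, {"endpoint": "/checkout", "reason": type(exc).__name__})
        raise
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000
        request_duration.record(elapsed_ms, {"endpoint": "/checkout", "status": status})
```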
These three golden signals cover 90% of what you need. But to reach the full 100%, you’ll want to cover the bases below:
Distributed tracing:
One unique trace ID per request for end-to-end tracking.
Spans for service interactions (database and queues) for latency and flow.
Enrich spans with attributes for context and filtering (endpoint, tenant_id, operation name, status, error_reason).
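As a hedged sketch (again assuming the OpenTelemetry Python API; the span and attribute names are illustrative), an enriched span around a database call might look like this:

```python
# Sketch: wrap a dependency call in a span and enrich it with attributes
# for filtering. Assumes the OpenTelemetry Python API; names are illustrative.
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")

def query_inventory(sku: str, tenant_id: str):
    with tracer.start_as_current_span("inventory-db.query") as span:
        span.set_attribute("endpoint", "/checkout")
        span.set_attribute("tenant_id", tenant_id)
        span.set_attribute("operation", "SELECT inventory")
        try:
            rows = []  # placeholder for the real database call
            span.set_attribute("status", "ok")
            return rows
        except Exception as exc:
            span.set_attribute("status", "error")
            span.set_attribute("error_reason", type(exc).__name__)
            span.record_exception(exc)
            raise
```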
Structured logs (JSON):
One line per request with trace_id, user_id, resource_id, action, result, and duration_ms. No personally identifiable information (PII).
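For example, such a log line could be emitted with just the standard library; in this sketch the field values are made up for illustration:

```python
# Sketch: one JSON log line per request, standard library only.
# Field values below are made up for illustration; no PII is logged.
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("checkout-service")

def log_request(trace_id, user_id, resource_id, action, result, duration_ms):
    logger.info(json.dumps({
        "trace_id": trace_id,
        "user_id": user_id,          # opaque ID, not an email or name
        "resource_id": resource_id,
        "action": action,
        "result": result,
        "duration_ms": duration_ms,
    }))

log_request("4bf92f3577b34da6", "u_1042", "order_789", "checkout", "ok", 214)
```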
Link metrics to traces (exemplars):
From a latency spike, click straight into the slowest traces—no hunting.
One lean dashboard per service:
Rate/Errors/Duration for the endpoint, dependency latency/errors, top slow traces, and a small “what this panel answers” note.
Observability in action
Imagine someone in sales pings you and says, “checkout feels slow.”
You check the dashboard: traffic normal, errors flat, but the slowest 1% doubled.
Click into traces → inventory-db.query latency jumped from 80ms → 600ms.
The CPU chart shows the DB at 90% after the noon deploy.
The deploy added a heavy filter to the inventory query. You flip it off with a feature flag; latency is back under target, and you file a task to add an index.
Total time: 6 minutes. No guessing. The tools pointed to the fix.
John Quest: Enabling observability
Add a request middleware that creates/propagates trace_id and emits one structured log per request (user_id, resource_id, action, result, duration_ms, trace_id); see the sketch after this list.
Build one dashboard with six widgets: p95 latency, error rate, request rate, dependency latency, dependency errors, and the top slow traces linked to your tracer.
Pin the dashboard in the service README and team channel; add a one-page “how to use this dashboard” cheat sheet at the top.
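Here is a minimal sketch of that middleware, assuming a FastAPI/Starlette service; the x-trace-id header, request.state.user_id, and the "id" path parameter are assumptions to adapt to your stack:

```python
# Sketch of the request middleware, assuming FastAPI/Starlette.
# The x-trace-id header, request.state.user_id, and the "id" path parameter
# are assumptions for illustration; error handling is omitted for brevity.
import json
import logging
import time
import uuid

from fastapi import FastAPI, Request

app = FastAPI()
logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("request-log")

@app.middleware("http")
async def observe_request(request: Request, call_next):
    # Propagate an incoming trace_id if present; otherwise create one.
    trace_id = request.headers.get("x-trace-id") or uuid.uuid4().hex
    start = time.monotonic()
    response = await call_next(request)
    logger.info(json.dumps({
        "trace_id": trace_id,
        "user_id": getattr(request.state, "user_id", None),  # set by an auth layer, if any
        "resource_id": request.path_params.get("id"),
        "action": f"{request.method} {request.url.path}",
        "result": response.status_code,
        "duration_ms": round((time.monotonic() - start) * 1000, 1),
    }))
    response.headers["x-trace-id"] = trace_id  # pass the ID along to the caller
    return response
```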
Tooling tip: Use Datadog. It’s the most widely used all-in-one option (traces, metrics, logs, dashboards, alerts) with solid docs and easy setup across languages. You’ll get one place to see p95 latency, error %, saturation, and slow traces, and you can wire fast/slow burn SLO alerts on the same board.
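If you go that route, manual spans in Python take only a few lines with Datadog’s ddtrace library; the sketch below assumes a reachable Datadog Agent, and the span and service names are illustrative.

```python
# Sketch: a custom Datadog APM span via ddtrace (names are illustrative).
# Services are typically started with `ddtrace-run python app.py` so that
# supported web frameworks and DB clients are auto-instrumented as well.
from ddtrace import tracer

def apply_discounts(cart):
    with tracer.trace("checkout.apply_discounts", service="checkout-service"):
        ...  # business logic would go here
```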