Technical article
A Production Debugging Playbook for Backend Incidents
A practical sequence for moving from symptom to root cause across logs, metrics, traces, database state, and network behavior.
Start with the user-visible symptom
Before opening every dashboard, define what is actually broken: endpoint, workflow, region, customer segment, job type, or dependency. A precise symptom keeps the investigation narrow.
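One way to make the symptom precise is to break failures down by each candidate dimension and see where they concentrate. The sketch below is illustrative only: the record fields (endpoint, region, status), the sample data, and the choice to count any 5xx status as a failure are all assumptions, not part of any particular logging system.

```python
from collections import Counter


def failure_breakdown(requests, dimension):
    """Return {value: (failures, total)} for one dimension of the records."""
    totals = Counter(r[dimension] for r in requests)
    # Assumption: a 5xx status marks a failed request.
    failures = Counter(r[dimension] for r in requests if r["status"] >= 500)
    return {value: (failures[value], totals[value]) for value in totals}


# Hypothetical sample of request records.
requests = [
    {"endpoint": "/checkout", "region": "eu", "status": 500},
    {"endpoint": "/checkout", "region": "us", "status": 502},
    {"endpoint": "/checkout", "region": "us", "status": 200},
    {"endpoint": "/search", "region": "eu", "status": 200},
    {"endpoint": "/search", "region": "us", "status": 200},
]

for dimension in ("endpoint", "region"):
    print(dimension, failure_breakdown(requests, dimension))
```

In this sample the failures cluster on one endpoint but are spread across both regions, which suggests stating the symptom in terms of the endpoint rather than the region.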
Build a timeline
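Collect the times of the most recent deployment, the first alert, the first user report, the change in error rate, the change in latency, any infrastructure event, and any sign of database pressure. Placing these events in order on one timeline shows which change came before the first symptom, and that narrows the list of plausible causes.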
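A timeline can be as simple as a sorted list of timestamped events gathered from different sources. The following is a minimal sketch; the Event type, the source labels, and the sample events (including the date) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    at: datetime   # when the event happened, in UTC
    source: str    # where the record came from (hypothetical labels)
    note: str      # short human-readable description


def build_timeline(events):
    """Return the events sorted oldest first."""
    return sorted(events, key=lambda e: e.at)


def format_timeline(events):
    """Render each event with its offset in seconds from the first event."""
    ordered = build_timeline(events)
    if not ordered:
        return []
    start = ordered[0].at
    return [
        f"+{int((e.at - start).total_seconds()):>5}s  {e.source:<8} {e.note}"
        for e in ordered
    ]


def utc(hour, minute):
    # Illustrative date only; not taken from any real incident.
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


events = [
    Event(utc(10, 12), "alert", "error rate above threshold"),
    Event(utc(10, 5), "deploy", "release rolled out"),
    Event(utc(10, 20), "support", "first user report"),
    Event(utc(10, 9), "metrics", "p99 latency starts climbing"),
]

for line in format_timeline(events):
    print(line)
```

Normalizing every timestamp to UTC before sorting avoids misordering events that were recorded in different time zones.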
Follow the request path
Trace the path from edge to application, database, queue, cache, and third-party services. If traces are missing, use request identifiers in logs and compare timestamps manually.
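When traces are unavailable, the manual comparison can be sketched as grouping log lines by request identifier and measuring the gap between consecutive records. The log format below (ISO timestamp, service, request ID, message, separated by spaces) and the sample lines are assumptions for illustration; real logs will need their own parser. The comparison also assumes that clocks on the hosts involved are reasonably well synchronized.

```python
from collections import defaultdict
from datetime import datetime


def parse_line(line):
    """Parse 'ISO_TIMESTAMP SERVICE REQUEST_ID MESSAGE' (an assumed format)."""
    ts, service, request_id, message = line.split(" ", 3)
    return datetime.fromisoformat(ts), service, request_id, message


def group_by_request(lines):
    """Map each request ID to its log records, sorted by timestamp."""
    grouped = defaultdict(list)
    for line in lines:
        ts, service, request_id, message = parse_line(line)
        grouped[request_id].append((ts, service, message))
    for records in grouped.values():
        records.sort(key=lambda r: r[0])
    return dict(grouped)


def hop_gaps(records):
    """Return (from_service, to_service, seconds) for consecutive records."""
    return [
        (a[1], b[1], (b[0] - a[0]).total_seconds())
        for a, b in zip(records, records[1:])
    ]


# Hypothetical log lines collected from several services.
logs = [
    "2024-01-01T10:05:02.520+00:00 db req-1 query finished",
    "2024-01-01T10:05:00.000+00:00 edge req-1 received GET /orders",
    "2024-01-01T10:05:00.010+00:00 edge req-2 received GET /health",
    "2024-01-01T10:05:00.020+00:00 app req-1 handler start",
]

for request_id, records in group_by_request(logs).items():
    print(request_id, hop_gaps(records))
```

In this sample, the largest gap for req-1 falls between the application and the database record, which is where a closer look would start.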
Close with a system change
The best incident review produces a change in code, infrastructure, dashboards, alerts, or runbooks. If nothing changes, the same failure will be just as hard to debug the next time it occurs.