Incident Response Simulation: Triage a Cascading Failure

You inherit a live system that is already in trouble: the database is overloaded, the error rate is climbing, and latency is spiking. Diagnose the bottleneck and stabilize the system before the SLA is breached.

Briefing

Incident response & live triage

Incident response is the practice of diagnosing and stabilizing a live degraded system under time pressure.

During an incident, the fastest path to stability matters more than the optimal architecture.

Stop the bleeding first, then fix the root cause.

Find the first bottleneck, not only the loudest symptom.
Use fast interventions before perfect redesigns.
Separate root cause fixes from temporary stabilization.
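One simple heuristic for finding the first bottleneck is to compare how much load each component receives against how much it can sustain, then look at the most saturated one rather than the one producing the most alerts. The sketch below assumes you can estimate both figures; the `Component` type, the `first_bottleneck` helper, and every number are hypothetical and not taken from the scenario.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    offered_rps: float   # load arriving at this component (hypothetical)
    capacity_rps: float  # load it can sustain (hypothetical)

    @property
    def utilization(self) -> float:
        return self.offered_rps / self.capacity_rps


def first_bottleneck(components: list[Component]) -> Component:
    """Return the component with the highest utilization."""
    return max(components, key=lambda c: c.utilization)


# Illustrative numbers only.
stack = [
    Component("server", offered_rps=900, capacity_rps=1200),
    Component("postgres", offered_rps=700, capacity_rps=500),
    Component("redis", offered_rps=200, capacity_rps=5000),
]

worst = first_bottleneck(stack)
print(f"{worst.name}: {worst.utilization:.0%} utilized")
```

In this made-up stack the servers may be the ones timing out, but the database is the component running past its capacity, so that is where a fast intervention would pay off first.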

Contract

Uptime: 99.5%

P95 latency: 300 ms

Budget: $800/mo

Traffic shape

Sustained pressure with little relief. Baseline is 1,800 users and peak is also around 1,800 users, held over 36 hours.
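To see how little slack the uptime target leaves, it helps to convert it into minutes of allowed downtime. The sketch below assumes the 99.5% target is measured over the 36-hour run, which the contract does not state explicitly; the function name is hypothetical.

```python
def downtime_budget_minutes(uptime_target: float, window_hours: float) -> float:
    """Minutes of downtime allowed within the window at the given target."""
    return (1.0 - uptime_target) * window_hours * 60.0


# Assumption: the target applies to the 36-hour scenario window.
budget = downtime_budget_minutes(0.995, 36)
print(f"{budget:.1f} minutes of allowed downtime")
```

Under that assumption the budget is about 10.8 minutes, so a single full-fleet restart could consume most of it.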

Available components

Server

HTTP request handler. Every web app needs at least one server. More servers let you handle more simultaneous requests before latency starts climbing.

Postgres

Primary data store. Without a database, your app has no memory. Most dynamic requests eventually depend on it.

Redis

In-memory cache layer. Popular pages, profiles, and product data often get requested again and again. Serving those from memory is much faster and cheaper.
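A common way to put a cache in front of the database is the cache-aside pattern: check the cache, fall back to the database on a miss, then store the result. The sketch below uses a small in-process `TTLCache` as a stand-in for Redis so it runs without a server; `TTLCache`, `get_profile`, and `fake_db` are hypothetical names, and a real deployment would use a Redis client instead.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """Tiny in-process stand-in for a cache such as Redis."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: dict[str, tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._store[key] = (self._clock() + self._ttl, value)


def get_profile(user_id: str, cache: TTLCache,
                load_from_db: Callable[[str], str]) -> str:
    """Cache-aside read: try the cache, fall back to the database."""
    key = f"profile:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = load_from_db(user_id)
    cache.set(key, value)
    return value


db_calls: list[str] = []


def fake_db(user_id: str) -> str:
    """Hypothetical database read that records each call."""
    db_calls.append(user_id)
    return f"profile-of-{user_id}"


cache = TTLCache(ttl_seconds=60)
get_profile("42", cache, fake_db)
get_profile("42", cache, fake_db)
print(len(db_calls))  # only the first read reached the database
```

The second read is served from memory, which is the load the database no longer has to carry.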

LB

Load balancer. If you run more than one server, something needs to decide where each request goes. That is the load balancer.

Replica

Read-only DB copy. Many applications read far more often than they write. Replicas let you spread those reads across more machines.

Queue

Async job buffer. Moving background work out of the request path keeps the app responsive even when extra processing is needed.

Worker

Background job processor. Separating background work keeps checkout, page loads, and other user actions from competing with batch processing.

CDN

Static asset edge cache. Images, scripts, stylesheets, and some API responses do not need to travel all the way back to origin every time.

Rate limiter

Request throttle. During abuse events, legitimate traffic competes with junk traffic for server capacity. Filtering noisy traffic at the edge protects the rest of the stack.
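One widely used throttling technique is the token bucket: each client gets a bucket that refills at a steady rate, and a request is admitted only if a token is available. This is a minimal sketch of that idea, not necessarily how the rate limiter in this scenario works; the class name, the parameters, and the injected fake clock are all assumptions made for illustration.

```python
import time
from typing import Callable


class TokenBucket:
    """Admit up to `burst` requests at once, refilling at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, burst: int,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate_per_sec
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# A fake clock keeps the example deterministic.
fake_now = [0.0]
bucket = TokenBucket(rate_per_sec=2, burst=3, clock=lambda: fake_now[0])

burst_results = [bucket.allow() for _ in range(5)]
fake_now[0] = 1.0  # one second later, two tokens have been refilled
after_wait = [bucket.allow() for _ in range(3)]

print(burst_results)
print(after_wait)
```

The burst size controls how much spiky but legitimate traffic gets through, while the refill rate caps what a noisy client can sustain.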

Common mistakes

Connection pools can fail before raw database capacity reaches 100%.
Rolling recovery preserves availability better than restarting the whole fleet.
Query-heavy paths need indexes, replicas, or search offload before peak traffic.
Caches need warm-up, jittered expiry, and fallback capacity.
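The first item above, on connection pools, can be made concrete with Little's Law: the average number of connections in use is roughly the request rate multiplied by how long each request holds a connection. The sketch below uses hypothetical numbers (400 requests per second, a pool of 50) to show how slower queries can exhaust a pool even when traffic has not grown.

```python
def connections_in_use(requests_per_sec: float, hold_time_sec: float) -> float:
    """Little's Law: average concurrency = arrival rate x time held."""
    return requests_per_sec * hold_time_sec


POOL_SIZE = 50  # hypothetical pool limit

# Same traffic in both cases; only the query time changes.
healthy = connections_in_use(400, 0.05)   # queries hold a connection for 50 ms
degraded = connections_in_use(400, 0.20)  # queries slow down to 200 ms

for label, needed in (("healthy", healthy), ("degraded", degraded)):
    status = "ok" if needed <= POOL_SIZE else "pool exhausted"
    print(f"{label}: ~{needed:.0f} connections needed, {status}")
```

With these assumed figures, a fourfold increase in query time pushes demand from about 20 connections to about 80, so requests start queuing for a connection while the database itself may still report spare capacity.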

Interview adjacency

Debug production latency
Triage an outage
Explain root cause analysis