On-Call Simulation: Diagnose a Queue Backlog Failure

A fulfillment service moved its writes behind a queue, but only one worker is draining it. The UI still looks mostly green while the backlog climbs.

Briefing

Backpressure before the outage

Backpressure is the signal that producers are outpacing consumers: work arrives faster than it is processed, so it piles up.

A green frontend can hide a failing async path until retries and dead-letter queues grow large enough that the failures reach users.

A queue is a shock absorber, not an engine.

  • Watch queue depth and growth direction, not only error rate.
  • Add workers when queued work cannot drain.
  • Use streams or analytics sinks only when the bottleneck is event throughput, not worker capacity.
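
The first two bullets can be turned into arithmetic. The sketch below is a minimal illustration, not part of the scenario: it estimates backlog growth from depth samples, the worker count needed to keep up, and the time to drain. The function names, the 1.2 headroom factor, and the sample rates (15 jobs/s arriving, 5 jobs/s per worker) are assumptions made for the example.

```python
import math


def backlog_growth_per_s(samples):
    """Average change in queue depth (jobs/second) across (time_s, depth) samples."""
    (t0, d0), (t1, d1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    return (d1 - d0) / (t1 - t0)


def workers_needed(arrival_per_s, per_worker_per_s, headroom=1.2):
    """Smallest worker count whose combined rate covers arrivals plus headroom."""
    if per_worker_per_s <= 0:
        raise ValueError("per-worker rate must be positive")
    return math.ceil(arrival_per_s * headroom / per_worker_per_s)


def drain_time_s(depth, arrival_per_s, workers, per_worker_per_s):
    """Seconds until the backlog is empty, or infinity if it cannot drain."""
    net_rate = workers * per_worker_per_s - arrival_per_s
    return math.inf if net_rate <= 0 else depth / net_rate


# Hypothetical readings: depth sampled once a minute while one worker runs.
samples = [(0, 1000), (60, 1600), (120, 2200)]
print(backlog_growth_per_s(samples))   # positive means the backlog is growing
print(workers_needed(15, 5))           # workers to keep up with 15 jobs/s
print(drain_time_s(2200, 15, 4, 5))    # seconds to drain with 4 workers
```

With a single worker in this example the net rate is negative, so drain_time_s returns infinity: the queue absorbs the surge but never empties until capacity is added.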

Contract

Uptime: 99.5%

P95 latency: 260 ms

Budget: $650/mo
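
To make the contract concrete, the sketch below converts the 99.5% uptime target into a downtime allowance and checks a latency sample against the 260 ms P95 target. It assumes a 30-day month and the nearest-rank percentile method, neither of which the scenario specifies; the function names and any latency samples passed in are hypothetical.

```python
def downtime_budget_minutes(uptime_percent, days=30):
    """Minutes of allowed downtime in a window of `days` days (30 is an assumption)."""
    total_minutes = days * 24 * 60
    return (100 - uptime_percent) / 100 * total_minutes


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("values must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer form of ceil(pct * n / 100)
    return ordered[rank - 1]


def meets_p95(latencies_ms, target_ms=260):
    """True when the P95 of the sample is at or under the target."""
    return percentile_nearest_rank(latencies_ms, 95) <= target_ms
```

Under these assumptions, 99.5% uptime leaves about 216 minutes (3.6 hours) of downtime per 30-day month, so a backlog that goes unnoticed for a morning can consume most of the allowance.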

Traffic shape

Morning surge that tests capacity during a narrow peak. Baseline 220 users; peak around 5,200 users over 36 hours.

Available components

Server

HTTP request handler. Every web app needs at least one server. More servers let you handle more simultaneous requests before latency starts climbing.

Postgres

Primary data store. Without a database, your app has no memory. Most dynamic requests eventually depend on it.

LB

Load balancer. If you run more than one server, something needs to decide where each request goes. That is the load balancer.

Queue

Async job buffer. Moving background work out of the request path keeps the app responsive even when extra processing is needed.

Worker

Background job processor. Separating background work keeps checkout, page loads, and other user actions from competing with batch processing.

Stream

Durable event stream. Streams are shock absorbers for high-volume ingestion and fan-out systems where one producer feeds many downstream consumers.

Analytics

Async metrics sink. Analytics traffic can be high-volume and noisy. Keeping it async prevents dashboards and event collection from hurting core product latency.

Redis

In-memory cache layer. Popular pages, profiles, and product data often get requested again and again. Serving those from memory is much faster and cheaper.

Replica

Read-only DB copy. Many applications read far more often than they write. Replicas let you spread those reads across more machines.

Common mistakes

  • Queues need dead-letter handling and retry limits.
  • Retries need backoff, priority, and enough worker capacity to avoid self-inflicted load.
  • Telemetry should be isolated behind async sinks before it grows noisy.
  • Count writes per user action. Batch, queue, or sample non-critical writes before they dominate the primary path.
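
The first two points can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the scenario's implementation: the function names, the in-memory dead_letters list, the limit of 4 attempts, and the 0.5 s base and 30 s cap are all hypothetical. The delay uses exponential backoff with full jitter, one common choice among several. Priority and worker capacity are not modeled here.

```python
import random
import time


def backoff_delay_s(attempt, base_s=0.5, cap_s=30.0, rng=random.random):
    """Full-jitter exponential backoff: a random delay up to min(cap, base * 2**attempt)."""
    return rng() * min(cap_s, base_s * 2 ** attempt)


def process_with_retries(job, handler, dead_letters, max_attempts=4, sleep=time.sleep):
    """Run handler(job); retry with backoff, then dead-letter after max_attempts failures."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(max_attempts):
        try:
            return handler(job)
        except Exception as exc:  # a real worker would catch narrower error types
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(backoff_delay_s(attempt))
    # Retry limit reached: park the job instead of re-queueing it forever.
    dead_letters.append((job, repr(last_error)))
    return None
```

Capping attempts bounds the extra load a failing job can generate, and the jitter keeps many failing jobs from retrying at the same instant. Passing sleep=lambda s: None makes the function easy to exercise in tests without waiting.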

Interview adjacency

  • Debug async backlog
  • Explain backpressure
  • Handle production handover