Scenario / The Friday deploy
On-Call Simulation: Diagnose a Cache Blindspot Failure
A mid-size SaaS is entering the evening peak. The previous engineer scaled servers repeatedly, but every read still lands on the primary database.
Briefing
Brownfield bottleneck diagnosis
Brownfield diagnosis means reading the current topology and metrics before adding capacity.
Scaling the wrong tier can make an incident more expensive without removing the bottleneck.
Adding more of the healthy tier does not fix the saturated tier.
- Compare server load with database load before adding compute.
- Remove redundant capacity when it blocks the budget for the actual bottleneck.
- Use cache when repeated reads are saturating storage.
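The checklist above can be sketched as a small comparison of per-tier utilization. This is only an illustration: the `TierMetrics` type, the thresholds, and the sample utilization and cost numbers are all hypothetical and do not come from the scenario.

```python
from dataclasses import dataclass


@dataclass
class TierMetrics:
    name: str
    utilization: float   # fraction of capacity in use, 0.0 to 1.0 (hypothetical reading)
    monthly_cost: float  # dollars per month (hypothetical)


def find_bottleneck(tiers, saturation_threshold=0.85):
    """Return the busiest tier if it is at or above the threshold, else None."""
    worst = max(tiers, key=lambda t: t.utilization)
    return worst if worst.utilization >= saturation_threshold else None


def find_overprovisioned(tiers, idle_threshold=0.30):
    """Return tiers that are mostly idle and could give budget back."""
    return [t for t in tiers if t.utilization <= idle_threshold]


# Assumed readings: many idle servers in front of one saturated primary.
tiers = [
    TierMetrics("servers", utilization=0.22, monthly_cost=240.0),
    TierMetrics("postgres-primary", utilization=0.97, monthly_cost=90.0),
]

bottleneck = find_bottleneck(tiers)
idle = find_overprovisioned(tiers)

print("bottleneck:", bottleneck.name if bottleneck else "none")
print("candidates to scale down:", [t.name for t in idle])
```

Under these assumed readings, the server tier is the one to shrink and the database tier is the one to relieve, which matches the order of the checklist.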
Contract
98%
500ms
$380/mo
Traffic shape
Daily traffic curve with a predictable high-traffic window. Baseline 260 users; peak around 900 users over 36 hours.
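One way to picture this traffic shape is as a flat baseline with a single smooth bump. The sketch below uses only the numbers stated above (260, 900, 36 hours). The raised-cosine shape and the 18:00 to 24:00 window are assumptions made for illustration, not part of the scenario.

```python
import math

BASELINE_USERS = 260
PEAK_USERS = 900
SIM_HOURS = 36

# Assumed evening window (hour of day); the scenario does not state exact hours.
WINDOW_START = 18
WINDOW_END = 24


def users_at(hour):
    """Concurrent users at a given simulation hour (raised-cosine bump, assumed shape)."""
    hour_of_day = hour % 24
    if not (WINDOW_START <= hour_of_day < WINDOW_END):
        return float(BASELINE_USERS)
    phase = (hour_of_day - WINDOW_START) / (WINDOW_END - WINDOW_START)
    bump = 0.5 * (1 - math.cos(2 * math.pi * phase))  # 0 at the edges, 1 mid-window
    return BASELINE_USERS + (PEAK_USERS - BASELINE_USERS) * bump


curve = [users_at(h) for h in range(SIM_HOURS)]
peak_hour = max(range(SIM_HOURS), key=lambda h: curve[h])
print(f"peak of {curve[peak_hour]:.0f} users at simulation hour {peak_hour}")
```

Because the window is predictable, any capacity or cache warm-up can be scheduled before it opens rather than in reaction to it.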
Available components
Server
HTTP request handler. Every web app needs at least one server. More servers let you handle more simultaneous requests before latency starts climbing.
Postgres
Primary data store. Without a database, your app has no memory. Most dynamic requests eventually depend on it.
Redis
In-memory cache layer. Popular pages, profiles, and product data often get requested again and again. Serving those from memory is much faster and cheaper.
LB
Load balancer. If you run more than one server, something needs to decide where each request goes. That is the load balancer.
Replica
Read-only DB copy. Many applications read far more often than they write. Replicas let you spread those reads across more machines.
Queue
Async job buffer. Moving background work out of the request path keeps the app responsive even when extra processing is needed.
Worker
Background job processor. Separating background work keeps checkout, page loads, and other user actions from competing with batch processing.
Rate limiter
Request throttle. During abuse events, legitimate traffic competes with junk traffic for server capacity. Filtering noisy traffic at the edge protects the rest of the stack.
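For this scenario the component that matters most is the cache, because every read currently lands on the primary. A minimal cache-aside sketch shows the effect. A plain dictionary stands in for Redis and a counter-wrapped dictionary stands in for Postgres; `FakeDatabase` and `CacheAsideReader` are hypothetical names, not real client APIs.

```python
class FakeDatabase:
    """Stand-in for the primary database; counts every read it serves."""

    def __init__(self, rows):
        self.rows = rows
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self.rows.get(key)


class CacheAsideReader:
    """Check the cache first; on a miss, read the database and store the result."""

    def __init__(self, db):
        self.db = db
        self.cache = {}  # stand-in for Redis; a real cache would also set an expiry
        self.hits = 0

    def get(self, key):
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        value = self.db.get(key)
        if value is not None:
            self.cache[key] = value
        return value


db = FakeDatabase({"profile:1": "Ada", "profile:2": "Grace"})
reader = CacheAsideReader(db)

# 100 reads of the same two popular keys.
for _ in range(50):
    reader.get("profile:1")
    reader.get("profile:2")

print(f"database reads: {db.reads}, cache hits: {reader.hits}")
```

With repeated reads of the same keys, only the first read of each key reaches the database. That is the relief that adding more servers cannot provide.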
Common mistakes
- Connection pools can fail before raw database capacity reaches 100%.
- Query-heavy paths need indexes, replicas, or search offload before peak traffic.
- Rolling recovery preserves availability better than restarting the whole fleet.
- Caches need warm-up, jittered expiry, and fallback capacity.
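Two of the points above lend themselves to a short sketch: why a connection pool can run out while the database still has headroom, and what jittered expiry looks like. The pool estimate uses Little's law (connections in use is roughly arrival rate times time held). The query rate, query time, pool size, and TTL values are hypothetical.

```python
import math
import random


def connections_needed(queries_per_second, avg_query_ms):
    """Little's law: average connections in use = arrival rate * time each is held."""
    return math.ceil(queries_per_second * avg_query_ms / 1000)


def jittered_ttl(base_ttl_seconds, jitter_fraction=0.2, rng=random):
    """Spread expiries around the base TTL so hot keys do not all expire together."""
    spread = base_ttl_seconds * jitter_fraction
    return base_ttl_seconds + rng.uniform(-spread, spread)


# Assumed numbers: 400 queries/s at 80 ms each against a pool of 20 connections.
POOL_SIZE = 20
needed = connections_needed(400, 80)
print(f"connections needed: {needed}, pool size: {POOL_SIZE}")
if needed > POOL_SIZE:
    print("pool is exhausted regardless of how much CPU the database has left")

# Assumed base TTL of 300 s with 20% jitter; a seeded RNG keeps the run repeatable.
rng = random.Random(42)
ttls = [jittered_ttl(300, 0.2, rng) for _ in range(1000)]
print(f"TTL range: {min(ttls):.0f}s to {max(ttls):.0f}s")
```

The pool check depends only on request rate and query duration, so slow queries can exhaust connections long before CPU or disk is the limit. The jitter keeps a batch of keys cached at the same moment from expiring at the same moment and sending a burst of misses to the database.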
Interview adjacency
- Debug database latency
- Diagnose missing cache
- Handle production handover