Scenario / When things break anyway

Guided System Design Lesson: Incident Response

A decent architecture still has to survive one production incident.

Briefing

Incident response

Architecture is the half you control. Incidents are the other half, and the post-mortem is how you learn from both.

  • Watch incident choices in the simulator.
  • Watch tradeoffs in the simulator.
  • Watch post-mortems in the simulator.

Contract

Uptime

95%

P95 latency

320ms

Budget

$760/mo
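
To make the contract concrete, the sketch below turns the 95% uptime target into a downtime budget and checks a set of latency samples against the 320 ms P95 limit. The 30-day month, the nearest-rank percentile method, and the sample latencies are assumptions made for illustration; the simulator may measure these differently.

```python
def allowed_downtime_hours(uptime_target: float, period_hours: float) -> float:
    """Hours of downtime the uptime target tolerates over the period."""
    return (1.0 - uptime_target) * period_hours


def p95(samples_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of the samples."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    # Integer ceiling of 0.95 * n, giving a 1-based rank.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Contract values from the briefing.
UPTIME_TARGET = 0.95
P95_LIMIT_MS = 320

# Assumption: a 30-day month, i.e. 720 hours.
budget_hours = allowed_downtime_hours(UPTIME_TARGET, 30 * 24)
print(f"Downtime budget: {budget_hours:.1f} h per 30-day month")

# Hypothetical latency samples, invented for illustration only.
samples = [100.0] * 94 + [300.0] * 4 + [900.0] * 2
observed = p95(samples)
print(f"P95 = {observed} ms, within limit: {observed <= P95_LIMIT_MS}")
```

Under these assumptions the budget works out to 36 hours per month. Note that P95 ignores the slowest 5% of requests, so the two 900 ms outliers in the sample do not break the latency target on their own.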

Traffic shape

Daily traffic curve over 24 hours with a predictable high-traffic window. Baseline is 260 users; peak is around 1,900 users.
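
One way to reason about this shape is to model it and ask how many servers each part of the day needs. The sketch below is a rough model, not the simulator's actual curve: the Gaussian bump, the 20:00 peak hour, the window width, and the per-server capacity of 500 users are all assumptions. Only the baseline and peak figures come from the briefing.

```python
import math

BASELINE_USERS = 260   # from the briefing
PEAK_USERS = 1900      # from the briefing


def users_at(hour: float, peak_hour: float = 20.0, width_hours: float = 2.5) -> float:
    """Hypothetical daily curve: baseline plus a Gaussian bump at peak_hour."""
    # Wrap around midnight so 23:00 and 01:00 count as 2 hours apart.
    gap = abs(hour - peak_hour)
    delta = min(gap, 24 - gap)
    bump = math.exp(-0.5 * (delta / width_hours) ** 2)
    return BASELINE_USERS + (PEAK_USERS - BASELINE_USERS) * bump


def servers_needed(users: float, users_per_server: int) -> int:
    """Smallest server count that covers the load, never fewer than one."""
    return max(1, math.ceil(users / users_per_server))


USERS_PER_SERVER = 500  # assumption, not a figure from the lesson

for hour in range(0, 24, 4):
    load = users_at(hour)
    count = servers_needed(load, USERS_PER_SERVER)
    print(f"{hour:02d}:00  ~{load:6.0f} users  -> {count} server(s)")
```

The gap between baseline and peak is roughly sevenfold, so a fleet sized for the peak sits mostly idle off-peak, while a fleet sized for the baseline fails during the high-traffic window. That tension is what the budget forces you to resolve.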

Available components

Server

HTTP request handler. Every web app needs at least one server. More servers let you handle more simultaneous requests before latency starts climbing.

Postgres

Primary data store. Without a database, your app has no memory. Most dynamic requests eventually depend on it.

LB

Load balancer. If you run more than one server, something needs to decide where each request goes. That is the load balancer.

Redis

In-memory cache layer. Popular pages, profiles, and product data often get requested again and again. Serving those from memory is much faster and cheaper.

Replica

Read-only DB copy. Many applications read far more often than they write. Replicas let you spread those reads across more machines.

Queue

Async job buffer. Moving background work out of the request path keeps the app responsive even when extra processing is needed.

Worker

Background job processor. Separating background work keeps checkout, page loads, and other user actions from competing with batch processing.

CDN

Static asset edge cache. Images, scripts, stylesheets, and some API responses do not need to travel all the way back to origin every time.

Object Store

Media and blob storage. Binary media does not belong in your primary relational database if you want scale and predictable cost.

Common mistakes

  • Restarting the whole fleet at once. Rolling recovery preserves availability better.
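
The difference between a rolling recovery and a full-fleet restart can be sketched as a small simulation. It assumes a hypothetical fleet of four servers and assumes each batch comes back healthy before the next one goes down; real recovery also depends on health checks and on the load balancer draining connections, which this sketch leaves out.

```python
def restart_in_batches(fleet_size: int, batch_size: int) -> list[int]:
    """Return how many servers are still serving during each restart step."""
    if fleet_size < 1 or batch_size < 1:
        raise ValueError("fleet_size and batch_size must be positive")
    serving_per_step = []
    remaining = fleet_size
    while remaining > 0:
        batch = min(batch_size, remaining)
        # Only the current batch is down; the rest of the fleet keeps serving.
        serving_per_step.append(fleet_size - batch)
        remaining -= batch
    return serving_per_step


FLEET = 4  # hypothetical fleet size

rolling = restart_in_batches(FLEET, batch_size=1)     # one server at a time
full = restart_in_batches(FLEET, batch_size=FLEET)    # everything at once

print(f"Rolling: worst-case capacity {min(rolling) / FLEET:.0%} over {len(rolling)} steps")
print(f"Full restart: worst-case capacity {min(full) / FLEET:.0%} over {len(full)} step")
```

The rolling approach takes more steps but never drops below three of four servers in this model, while the full restart reaches zero capacity and counts directly against the uptime target. The trade-off is that the remaining servers must absorb the load of the batch that is down.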

Interview adjacency

  • Triage production incidents
  • Explain post-mortems
  • Handle memory leaks