High Availability & Redundancy: Designing for 99.99% Uptime
Health monitoring platform. Lives depend on uptime. No excuses, no downtime, no exceptions.
Briefing
High availability & redundancy
High availability uses redundant components and failover paths so one failure does not take the entire service down.
A 99.99% uptime target leaves almost no room for mistakes or late fixes. Resilience has to be designed into the first architecture, not added after an incident.
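To make "almost no room" concrete, the arithmetic below converts an uptime percentage into a downtime budget. It is a minimal sketch: the function name `allowed_downtime_minutes` is illustrative, and it assumes a 365-day year and a 30-day month.

```python
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600, ignoring leap years
MINUTES_PER_MONTH = 30 * 24 * 60   # 43,200, assuming a 30-day month


def allowed_downtime_minutes(uptime_percent, period_minutes=MINUTES_PER_YEAR):
    """Return how many minutes of downtime a given uptime target permits."""
    return period_minutes * (1 - uptime_percent / 100)


for target in (99.9, 99.99):
    per_year = allowed_downtime_minutes(target)
    per_month = allowed_downtime_minutes(target, MINUTES_PER_MONTH)
    print(f"{target}%: {per_year:.2f} min/year, {per_month:.2f} min/month")
```

At 99.99% the budget is about 52.6 minutes per year, or roughly 4.3 minutes in a 30-day month, so a single slow manual recovery can consume it.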
- Remove single points of failure from compute, ingress, and data access.
- Use replicas and queues to reduce pressure on primary systems.
- Spend enough budget on resilience before incidents begin.
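The effect of removing a single point of failure can be estimated with basic probability. The sketch below assumes component failures are independent, which real systems only approximate (shared networks, shared deploys, and shared bugs all correlate failures). The function names and the 99% figures are illustrative, not part of the scenario.

```python
def redundant_availability(single, copies):
    """Availability of `copies` independent instances when any one suffices."""
    return 1 - (1 - single) ** copies


def chained_availability(*parts):
    """Availability of components that must all be up at the same time."""
    result = 1.0
    for part in parts:
        result *= part
    return result


# One server that is up 99% of the time, versus two behind a load balancer.
print(f"{redundant_availability(0.99, 1):.4f}")
print(f"{redundant_availability(0.99, 2):.4f}")

# A request that needs the server tier AND a single 99% database:
# the unreplicated database caps the whole chain.
print(f"{chained_availability(redundant_availability(0.99, 2), 0.99):.4f}")
```

The last line is the reason redundancy has to cover compute, ingress, and data access together: the weakest unreplicated tier bounds the availability of the whole request path.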
Contract
Uptime target: 99.99%
Latency target: 100ms
Budget: $600/mo
Traffic shape
Daily traffic curve with a predictable high-traffic window. Baseline 500 users; peak around 1,450 users over 48 hours.
Available components
Server
HTTP request handler. Every web app needs at least one server. More servers let you handle more simultaneous requests before latency starts climbing.
Postgres
Primary data store. Without a database, your app has no memory. Most dynamic requests eventually depend on it.
Redis
In-memory cache layer. Popular pages, profiles, and product data often get requested again and again. Serving those from memory is much faster and cheaper.
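One common way to use such a cache is the cache-aside pattern: check the cache first, fall back to the database on a miss, then store the result. The sketch below uses a small in-process `TTLCache` as a stand-in for Redis and a `FakeDatabase` as a stand-in for Postgres; both classes, the 30-second TTL, and the profile shape are assumptions made for illustration.

```python
import time


class TTLCache:
    """Tiny in-process stand-in for a Redis-style cache with expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (self._clock() + self._ttl, value)


class FakeDatabase:
    """Hypothetical primary store that counts how often it is read."""

    def __init__(self):
        self.reads = 0

    def load_profile(self, user_id):
        self.reads += 1
        return {"id": user_id, "name": f"user-{user_id}"}


def get_profile(user_id, cache, db):
    """Cache-aside read: try the cache, then the database, then fill the cache."""
    cached = cache.get(user_id)
    if cached is not None:
        return cached
    profile = db.load_profile(user_id)
    cache.set(user_id, profile)
    return profile


db = FakeDatabase()
cache = TTLCache(ttl_seconds=30)
get_profile(7, cache, db)
get_profile(7, cache, db)
print(db.reads)  # 1: the second call was served from the cache
```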
LB
Load balancer. If you run more than one server, something needs to decide where each request goes. That is the load balancer.
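For availability, the routing decision matters most when a server fails. The following is a minimal sketch of round-robin selection that skips backends marked unhealthy. The class `RoundRobinBalancer` and its `mark_down`/`mark_up` methods are hypothetical; a real load balancer would drive them from periodic health checks.

```python
class RoundRobinBalancer:
    """Rotate through backends, skipping any that are marked unhealthy."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._next = 0

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        # Try each backend at most once per call, starting from the cursor.
        for _ in range(len(self.backends)):
            candidate = self.backends[self._next]
            self._next = (self._next + 1) % len(self.backends)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends available")


lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])
print([lb.pick() for _ in range(3)])  # ['app-1', 'app-2', 'app-3']
lb.mark_down("app-2")
print([lb.pick() for _ in range(4)])  # ['app-1', 'app-3', 'app-1', 'app-3']
```

Traffic keeps flowing after `app-2` is marked down, which is the failover behavior the briefing asks for. Note that a single load balancer is itself a single point of failure unless it is also made redundant.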
Queue
Async job buffer. Moving background work out of the request path keeps the app responsive even when extra processing is needed.
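As a rough illustration of taking work off the request path, the sketch below uses Python's standard `queue.Queue` and one worker thread in place of a real message broker. The `handle_request` and `worker` functions, the queue size of 100, and the uppercase "processing" step are all placeholders.

```python
import queue
import threading


def handle_request(jobs, payload):
    """Enqueue the slow work and answer immediately instead of blocking."""
    try:
        jobs.put_nowait(payload)
        return "accepted"
    except queue.Full:
        # A bounded queue sheds load instead of growing without limit.
        return "busy"


def worker(jobs, processed):
    """Drain jobs in the background; a None payload tells the worker to stop."""
    while True:
        payload = jobs.get()
        try:
            if payload is None:
                return
            processed.append(payload.upper())  # stand-in for real processing
        finally:
            jobs.task_done()


jobs = queue.Queue(maxsize=100)
processed = []
thread = threading.Thread(target=worker, args=(jobs, processed), daemon=True)
thread.start()

statuses = [handle_request(jobs, p) for p in ("report-1", "report-2")]
jobs.join()     # wait until both jobs are processed
jobs.put(None)  # ask the worker to exit
thread.join()
print(statuses, processed)
```

The bounded `maxsize` is a deliberate choice in this sketch: when the backlog is full, the handler reports "busy" rather than letting memory grow until the process fails.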
Replica
Read-only DB copy. Many applications read far more often than they write. Replicas let you spread those reads across more machines.
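Spreading reads usually means routing statements by type: writes go to the primary, reads rotate across replicas. The sketch below shows that idea with a hypothetical `ReplicaRouter`. Detecting reads by a `SELECT` prefix is a simplifying assumption; production routers also have to account for transactions and for replication lag when a read must see a write that was just made.

```python
class ReplicaRouter:
    """Send writes to the primary and rotate reads across replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = list(replicas)
        self._next = 0

    def route(self, sql):
        # Naive classification: anything that is not a SELECT is a write.
        is_read = sql.lstrip().lower().startswith("select")
        if not is_read or not self.replicas:
            return self.primary  # writes, or reads when no replica exists
        target = self.replicas[self._next % len(self.replicas)]
        self._next += 1
        return target


router = ReplicaRouter("primary", ["replica-1", "replica-2"])
print(router.route("SELECT * FROM vitals WHERE patient_id = 1"))  # replica-1
print(router.route("INSERT INTO vitals VALUES (1, 72)"))          # primary
print(router.route("SELECT count(*) FROM vitals"))                # replica-2
```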
WS
WebSocket server. Chat, multiplayer games, and live dashboards need open connections instead of one request at a time.
CDN
Static asset edge cache. Images, scripts, stylesheets, and some API responses do not need to travel all the way back to origin every time.
Rate limiter
Request throttle. During abuse events, legitimate traffic competes with junk traffic for server capacity. Filtering noisy traffic at the edge protects the rest of the stack.
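A token bucket is one widely used way to implement this kind of throttle; the scenario does not say which algorithm its rate limiter uses, so treat the following as an illustrative sketch. The `TokenBucket` class, its `rate` and `capacity` parameters, and the injectable `clock` are assumptions made so the example is easy to test.

```python
import time


class TokenBucket:
    """Allow short bursts up to `capacity`, refilling at `rate` tokens/second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self._clock = clock
        self._last = clock()

    def allow(self):
        now = self._clock()
        # Refill for the time elapsed since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self._last) * self.rate)
        self._last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


fake_now = [0.0]  # a controllable clock keeps the demo deterministic
bucket = TokenBucket(rate=1, capacity=2, clock=lambda: fake_now[0])
print([bucket.allow() for _ in range(3)])  # [True, True, False]
fake_now[0] = 1.0                          # one second later: one token refilled
print(bucket.allow())                      # True
```

In practice one bucket is usually kept per client key (for example per IP or API token), so one noisy client exhausts its own budget without affecting others.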
Common mistakes
- Connection pools can be exhausted before raw database capacity reaches 100%.
- DDoS response works best before traffic reaches app servers.
- Rolling recovery preserves availability better than restarting the whole fleet.
- Decoupling and fallback paths keep third-party failures from becoming total outages.
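The rolling-recovery point can be sketched as batching: restart only a few servers at a time so the rest keep serving traffic. The helper `rolling_batches` and its `max_unavailable` parameter are hypothetical names, and the server names are made up.

```python
def rolling_batches(servers, max_unavailable):
    """Yield groups of servers to restart, never more than max_unavailable at once."""
    if max_unavailable < 1:
        raise ValueError("max_unavailable must be at least 1")
    for start in range(0, len(servers), max_unavailable):
        yield servers[start:start + max_unavailable]


fleet = ["app-1", "app-2", "app-3", "app-4", "app-5"]
for batch in rolling_batches(fleet, max_unavailable=2):
    still_serving = len(fleet) - len(batch)
    print(f"restart {batch}; {still_serving} of {len(fleet)} keep serving")
```

A real rollout would also wait for each batch to pass health checks before starting the next one. Restarting the whole fleet at once corresponds to `max_unavailable=len(fleet)`, which leaves zero servers available during the restart.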
Interview adjacency
- Design a highly available service
- Explain failover
- Design health monitoring