High Availability & Redundancy: Designing for 99.99% Uptime

Health monitoring platform. Lives depend on uptime. No excuses, no downtime, no exceptions.

Briefing

High availability & redundancy

High availability uses redundant components and failover paths so one failure does not take the entire service down.

A 99.99% uptime target leaves almost no room for mistakes or late fixes. Resilience has to be designed into the first architecture, ahead of any incident.
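To make "almost no room" concrete, the downtime budget implied by a 99.99% target can be worked out directly. The sketch below is plain arithmetic. The function name and the comparison windows (the 48-hour window from this scenario, plus 30 and 365 days) are illustrative assumptions.

```python
def allowed_downtime_seconds(uptime_target: float, window_seconds: float) -> float:
    """Seconds of downtime permitted in a window at the given uptime fraction."""
    return (1.0 - uptime_target) * window_seconds


HOUR = 3600
DAY = 24 * HOUR

# Windows chosen for illustration only.
for label, window in [("48 hours", 48 * HOUR), ("30 days", 30 * DAY), ("365 days", 365 * DAY)]:
    budget = allowed_downtime_seconds(0.9999, window)
    print(f"{label}: {budget:.1f} s of downtime allowed ({budget / 60:.2f} min)")
```

Over a 48-hour window the budget comes to roughly 17 seconds, which is shorter than many manual restarts. That is why failover has to be automatic rather than something an operator does during the incident.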

Remove single points of failure from compute, ingress, and data access.
Use replicas and queues to reduce pressure on primary systems.
Spend enough budget on resilience before incidents begin.
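One rough way to see why single points of failure matter is to compare a single-path design with a redundant one. The sketch below assumes component failures are independent and failover is instantaneous, which real systems only approximate. The per-component availability figures are made-up illustrative values, not measurements.

```python
from math import prod


def in_series(availabilities):
    """Every component on the path must be up, so availabilities multiply."""
    return prod(availabilities)


def redundant(availability, copies):
    """At least one of `copies` identical, independent components is up."""
    return 1 - (1 - availability) ** copies


# Illustrative numbers only (assumptions).
single_path = in_series([0.999, 0.99, 0.999])  # one LB -> one server -> one database

redundant_path = in_series([
    redundant(0.999, 2),  # two load balancers
    redundant(0.99, 3),   # three servers
    redundant(0.999, 2),  # primary plus a promotable replica
])

print(f"single path:    {single_path:.5f}")
print(f"redundant path: {redundant_path:.6f}")
```

Under these assumptions the single path falls well short of 99.99%, because the chain is weaker than its weakest link, while the redundant path clears it. Correlated failures (a shared zone, a bad deploy) would reduce the real benefit.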

Contract

Uptime

99.99%

P95 latency

100ms

Budget

$600/mo

Traffic shape

Daily traffic curve with a predictable high-traffic window. Over 48 hours, traffic runs from a baseline of 500 users to a peak of around 1,450 users.

Available components

Server

HTTP request handler. Every web app needs at least one server. More servers let you handle more simultaneous requests before latency starts climbing.

Postgres

Primary data store. Without a database, your app has no memory. Most dynamic requests eventually depend on it.

Redis

In-memory cache layer. Popular pages, profiles, and product data often get requested again and again. Serving those from memory is much faster and cheaper.
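The usual way to serve repeat requests from memory is the cache-aside pattern: check the cache first, and only go to the database on a miss. The sketch below uses a plain dictionary as a stand-in for Redis so that it runs on its own. `TTLCache`, `get_or_load`, and the loader callback are hypothetical names, not part of any Redis client.

```python
import time


class TTLCache:
    """Cache-aside with expiry; a dict stands in for the real cache server."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested
        self._store = {}             # key -> (value, expires_at)

    def get_or_load(self, key, loader):
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]          # hit: no database work
        value = loader(key)          # miss or expired: fall through to the source
        self._store[key] = (value, now + self._ttl)
        return value
```

A short TTL bounds how stale a cached value can get. For a health monitoring platform, which readings are safe to cache at all is a product decision this sketch does not make.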

LB

Load balancer. If you run more than one server, something needs to decide where each request goes. That is the load balancer.
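For availability, the routing decision matters most when a server is down. The following is a minimal sketch of round-robin selection that skips backends marked unhealthy. The class and method names are hypothetical, and a real load balancer would drive `mark_down` and `mark_up` from periodic health checks.

```python
class RoundRobinBalancer:
    """Round-robin over backends, skipping any that are marked unhealthy."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._next = 0

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        # Try each backend at most once, starting from the rotation pointer.
        for _ in range(len(self.backends)):
            backend = self.backends[self._next]
            self._next = (self._next + 1) % len(self.backends)
            if backend in self.healthy:
                return backend
        raise RuntimeError("no healthy backends")


lb = RoundRobinBalancer(["server-a", "server-b", "server-c"])
lb.mark_down("server-b")                 # simulate a failed health check
print([lb.pick() for _ in range(4)])     # traffic alternates between a and c
```

With one server out, the remaining servers absorb its share of the load, so capacity at peak has to be planned with at least one server missing.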

Queue

Async job buffer. Moving background work out of the request path keeps the app responsive even when extra processing is needed.
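As a sketch of that idea, the request path below only enqueues work and returns, while a background thread does the slow part. It uses Python's standard `queue` and `threading` modules in a single process. A production queue would be a separate service, and `JobBuffer` and `submit` are hypothetical names.

```python
import queue
import threading


class JobBuffer:
    """Bounded in-process queue with one background worker."""

    def __init__(self, max_pending=100):
        self._jobs = queue.Queue(maxsize=max_pending)
        self.processed = []
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, reading):
        """Request path: enqueue and return at once; shed load if the buffer is full."""
        try:
            self._jobs.put_nowait(reading)
            return "accepted"
        except queue.Full:
            return "busy"

    def _run(self):
        while True:
            reading = self._jobs.get()
            try:
                if reading is None:              # sentinel: stop the worker
                    return
                self.processed.append(reading)   # stand-in for slow processing
            finally:
                self._jobs.task_done()

    def close(self):
        self._jobs.put(None)
        self._worker.join()
```

The bound on the queue is deliberate: when the worker falls behind, callers get a fast "busy" answer instead of waiting on an ever-growing backlog.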

Replica

Read-only DB copy. Many applications read far more often than they write. Replicas let you spread those reads across more machines.
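Spreading reads can be sketched as a small router that sends writes to the primary and rotates reads across replicas. The names below are hypothetical, and the `SELECT` prefix check is a deliberate simplification: statements such as `SELECT ... FOR UPDATE`, or reads that must see a just-committed write, would still need the primary because replicas can lag behind it.

```python
import itertools


class ReadWriteRouter:
    """Send writes to the primary and rotate reads across replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # With no replicas configured, every query goes to the primary.
        self._replicas = itertools.cycle(replicas) if replicas else None

    def target_for(self, sql):
        is_read = sql.lstrip().upper().startswith("SELECT")
        if is_read and self._replicas is not None:
            return next(self._replicas)
        return self.primary


router = ReadWriteRouter("pg-primary", ["replica-1", "replica-2"])
print(router.target_for("SELECT * FROM readings WHERE patient_id = 7"))
print(router.target_for("INSERT INTO readings (patient_id, bpm) VALUES (7, 72)"))
```

Besides offloading reads, a replica gives the data tier a second copy that can be promoted if the primary fails, which is the availability benefit in this scenario.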

WS

WebSocket server. Chat, multiplayer games, and live dashboards need open connections instead of one request at a time.

CDN

Static asset edge cache. Images, scripts, stylesheets, and some API responses do not need to travel all the way back to origin every time.

Rate limiter

Request throttle. During abuse events, legitimate traffic competes with junk traffic for server capacity. Filtering noisy traffic at the edge protects the rest of the stack.
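One common way to throttle requests is a token bucket, sketched below. This is a general technique, not necessarily how the rate limiter in this scenario works. The class name, the parameters, and the injectable clock are assumptions made for illustration.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock              # injectable so refill can be tested
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self):
        now = self._clock()
        # Refill for the time elapsed since the last call, capped at capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False
```

In practice there would be one bucket per client key (for example per IP address or API token), so a noisy client drains only its own bucket.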

Common mistakes

Connection pools can run out of connections well before the database itself reaches 100% capacity.
DDoS response works best before traffic reaches app servers.
Rolling recovery preserves availability better than restarting the whole fleet.
Decoupling and fallback paths keep third-party failures from becoming total outages.
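The last point, keeping a third-party failure from becoming a total outage, is often handled with a circuit breaker and a fallback. The sketch below is one minimal version under assumed names and thresholds: after repeated failures it stops calling the dependency for a cool-down period and serves the fallback instead.

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency for a while and use a fallback."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None           # None means the circuit is closed

    def call(self, func, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                return fallback()        # open: skip the dependency entirely
            self._opened_at = None       # cool-down over: allow one trial call
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            return fallback()
        self._failures = 0               # success closes the circuit fully
        return result
```

While the circuit is open, requests return quickly with degraded data instead of tying up servers and connections waiting on a dependency that is already known to be failing.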

Interview adjacency

Design a highly available service
Explain failover
Design health monitoring