System Design

Consistency

Availability

Availability Percentages versus Service Downtime

| Availability % | Downtime per Year | Downtime per Month | Downtime per Week |
| --- | --- | --- | --- |
| 90% (1 nine) | 36.5 days | 72 hours | 16.8 hours |
| 99% (2 nines) | 3.65 days | 7.20 hours | 1.68 hours |
| 99.5% (2.5 nines) | 1.83 days | 3.60 hours | 50.4 minutes |
| 99.9% (3 nines) | 8.76 hours | 43.2 minutes | 10.1 minutes |
| 99.99% (4 nines) | 52.56 minutes | 4.32 minutes | 1.01 minutes |
| 99.999% (5 nines) | 5.26 minutes | 25.9 seconds | 6.05 seconds |
| 99.9999% (6 nines) | 31.5 seconds | 2.59 seconds | 0.605 seconds |
| 99.99999% (7 nines) | 3.15 seconds | 0.259 seconds | 0.0605 seconds |
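
The downtime figures follow directly from the availability percentage: downtime is the unavailable fraction multiplied by the length of the period. The following is a minimal sketch of that arithmetic, assuming a 365-day year, a 30-day month, and a 7-day week. The function name `downtime_seconds` and the `PERIOD_SECONDS` mapping are illustrative choices, not part of any library.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Assumed period lengths: 365-day year, 30-day month, 7-day week.
PERIOD_SECONDS = {
    "year": 365 * SECONDS_PER_DAY,
    "month": 30 * SECONDS_PER_DAY,
    "week": 7 * SECONDS_PER_DAY,
}


def downtime_seconds(availability_percent: float) -> dict[str, float]:
    """Return the allowed downtime in seconds per period for a given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return {
        period: seconds * unavailable_fraction
        for period, seconds in PERIOD_SECONDS.items()
    }


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        budget = downtime_seconds(pct)
        # Print minutes so the values are easy to compare across rows.
        print(pct, {period: round(s / 60, 2) for period, s in budget.items()})
```

Changing the assumed month length (for example to 365/12 days) shifts the monthly column slightly, which is why published tables do not always agree to the last digit.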

Reliability

Reliability measures how well a system performs its intended operations (its functional requirements). It is usually expressed with averages such as Mean Time to Failure (MTTF) and Mean Time to Repair (MTTR).

Availability measures the percentage of time a system accepts requests and responds to clients.

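Example 1: A system may be 90% available but reliable only 80% of the time. That is, it accepts and answers requests 90% of the time, yet performs its intended operations correctly only 80% of the time.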

Example 2: Suppose we define our “system” as everything inside a data center (hardware and software). Assume this data center suffers a network failure so that no outside traffic can come in and no internal traffic can go out. In this case, instantaneous availability is zero, because clients cannot reach the service, even though every system inside the data center is functioning perfectly (instantaneous reliability of 100%).

We use both measures, reliability and availability, in different contexts. Storage vendors often quote MTTF for their disks, while most online services state uptime (a measure of availability) in their SLAs. For example, the uptime of EC2 virtual machines is 99.95%.
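
The two kinds of figures can be related. A commonly used steady-state approximation is availability = MTTF / (MTTF + MTTR). The sketch below assumes that formula. The function name and the example numbers are hypothetical and do not come from any vendor.

```python
def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up, assuming availability = MTTF / (MTTF + MTTR)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR must be non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical numbers: one failure every 1,000 hours, 1 hour to repair.
    availability = steady_state_availability(1000, 1)
    print(f"{availability:.4%}")  # roughly three nines
```

Under this approximation, shortening the repair time raises availability just as lengthening the time between failures does, which is why both averages are tracked.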

Why

Companies share their engineering work in public blogs to excite the technical community about the complex problems they are solving. They also hope to motivate more people to join the company. Such blogs can also advertise company products to B2B customers. Additionally, the material helps potential future employees train on their own.

Important Latencies

| Component | Time (nanoseconds) |
| --- | --- |
| L1 cache reference | 0.9 |
| L2 cache reference | 2.8 |
| L3 cache reference | 12.9 |
| Main memory reference | 100 |
| Compress 1 KB with Snzip | 3,000 (3 microseconds) |
| Read 1 MB sequentially from memory | 9,000 (9 microseconds) |
| Read 1 MB sequentially from SSD | 200,000 (200 microseconds) |
| Round trip within same datacenter | 500,000 (500 microseconds) |
| Read 1 MB sequentially from SSD with speed ~1 GB/sec | 1,000,000 (1 millisecond) |
| Disk seek | 4,000,000 (4 milliseconds) |
| Read 1 MB sequentially from disk | 2,000,000 (2 milliseconds) |
| Send packet SF->NYC | 71,000,000 (71 milliseconds) |
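
These numbers are mostly useful for back-of-the-envelope estimates. The following is a rough sketch that scales the 1 MB sequential-read figures linearly to a larger read. The dictionary keys and the function name are made up for illustration, and linear scaling is itself a simplifying assumption that ignores seeks, caching, and contention.

```python
# Latencies in nanoseconds, copied from the table above.
# The key names are illustrative labels, not a standard naming scheme.
LATENCY_NS = {
    "read_1mb_memory": 9_000,
    "read_1mb_ssd": 200_000,
    "read_1mb_disk": 2_000_000,
    "datacenter_round_trip": 500_000,
    "packet_sf_nyc": 71_000_000,
}

NS_PER_MS = 1_000_000


def sequential_read_ms(size_mb: float, per_mb_key: str) -> float:
    """Estimate the time in milliseconds to read size_mb sequentially.

    Assumes the cost grows linearly with the 1 MB figure for that medium.
    """
    return size_mb * LATENCY_NS[per_mb_key] / NS_PER_MS


if __name__ == "__main__":
    # Estimate a 1 GB (1,024 MB) sequential read on each medium.
    for medium in ("read_1mb_memory", "read_1mb_ssd", "read_1mb_disk"):
        print(medium, sequential_read_ms(1024, medium), "ms")
```

Comparing the results against the datacenter round trip or the SF->NYC packet time gives a quick sense of whether a design is bound by storage or by the network.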

Thoughts 🤔 by Soumendra Kumar Sahoo is licensed under CC BY 4.0