61.26 System Design

Consistency

Availability

Availability Percentages versus Service Downtime
Availability %	Downtime per Year	Downtime per Month	Downtime per Week
90% (1 nine)	36.5 days	72 hours	16.8 hours
99% (2 nines)	3.65 days	7.20 hours	1.68 hours
99.5% (2 nines)	1.83 days	3.60 hours	50.4 minutes
99.9% (3 nines)	8.76 hours	43.8 minutes	10.1 minutes
99.99% (4 nines)	52.56 minutes	4.32 minutes	1.01 minutes
99.999% (5 nines)	5.26 minutes	25.9 seconds	6.05 seconds
99.9999% (6 nines)	31.5 seconds	2.59 seconds	0.605 seconds
99.99999% (7 nines)	3.15 seconds	0.259 seconds	0.0605 seconds

Reliability

Reliability measures how well a system performs its intended operations (functional requirements). We use averages for that (Mean Time to Failure, Mean Time to Repair, etc.)

Availability measures the percentage of time a system accepts requests and responds to clients.

Example 1: A certain system may be 90% available but only reliable 80% of the time.

Example 2: Suppose we consider our “system” the stuff inside a data center (hardware + software). Let’s assume this data center suffers a network failure such that no outsider traffic is coming in and no insider traffic is going out. In this case, instantaneous availability might be zero (because clients cannot reach the service) even though inside the data center, all systems are perfectly functioning (instantaneous reliability 100%).

We use both of them (reliability and availability) in different contexts. For example, storage vendors often quote MTTF for their disks. Most online services use uptime (as a measure of availability) in their SLAs. For example, the uptime of EC2 virtual machines is 99.95%.

Why

The reason for such sharing is to excite the technical community that the company is solving complex problems. They also hope to motivate more people to join their company. Such public blogs can also help to advertise company products to B2B customers. Additionally, such material helps the company train potential future workers independently.

Important Latencies

Component	Time (nanoseconds)
L1 cache reference	0.9
L2 cache reference	2.8
L3 cache reference	12.9
Main memory reference	100
Compress 1KB with Snzip	3,000 (3 microseconds)
Read 1 MB sequentially from memory	9,000 (9 microseconds)
Read 1 MB sequentially from SSD	200,000 (200 microseconds)
Round trip within same datacenter	500,000 (500 microseconds)
Read 1 MB sequentially from SSD with speed ~1GB/sec SSD	1,000,000 (1 milliseconds)
Disk seek	4,000,000 (4 milliseconds)
Read 1 MB sequentially from disk	2,000,000 (2 milliseconds)
Send packet SF->NYC	71,000,000 (71 milliseconds)

Types of Requirements

Requirements will have two sub-categories:

Functional requirements: These represent the features a user of the designed system can use. For example, the system will allow users to search for content using the search bar.
Non-functional requirements (NFRs): The non-functional requirements are criteria based on which a system user will consider the system usable. NFR may include requirements like high availability, low latency, scalability, etc.

Further reads

Latency vs Throughput: https://lnkd.in/gSBsmijw
CAP Theorem: https://lnkd.in/gV7NunUD
ACID Transactions: https://lnkd.in/gpQMxV9u
Consistent Hashing: https://lnkd.in/gaCVWBJM
Rate Limiting: https://lnkd.in/gjkrHkGu
Microservices Architecture: https://lnkd.in/gy3kRzep
API Design: https://lnkd.in/ghcbQySg
Strong vs Eventual Consistency: https://lnkd.in/g2ACr56Q
Synchronous vs asynchronous communications: https://lnkd.in/gYZ8Acth
REST vs RPC: https://lnkd.in/gs7htCMG
Batch Processing vs Stream Processing: https://lnkd.in/gBKHzqAe
Fault Tolerance: https://lnkd.in/ggzdZVhM
Consensus Algorithms: https://lnkd.in/gUcVEhUx
Gossip Protocol: https://lnkd.in/gvkckQGY
Serverless Architecture: https://lnkd.in/g3EYA3nz
Service Discovery: https://lnkd.in/gt84khQG
Disaster Recovery: https://lnkd.in/grpEFGfD
Distributed Tracing: https://lnkd.in/ga5FJuH2
- Horizontal vs Vertical Scaling: https://www.spiceworks.com/tech/cloud/articles/horizontal-vs-vertical-cloud-scaling/

Content Delivery Network (CDN): https://www.cloudflare.com/learning/cdn/what-is-a-cdn/
Domain Name System (DNS): https://www.cloudflare.com/learning/dns/what-is-dns/
Caching: https://medium.com/must-know-computer-science/system-design-caching-acbd1b02ca01
Distributed Caching: https://redis.com/glossary/distributed-caching/
Load Balancing: https://aws.amazon.com/what-is/load-balancing/
SQL vs NoSQL: https://www.integrate.io/blog/the-sql-vs-nosql-difference/
API Gateway: https://www.nginx.com/learn/api-gateway/
Distributed Locking: https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html
Checksum: https://www.lifewire.com/what-does-checksum-mean-2625825

Tree: https://lnkd.in/g2v9qf87
To Queue Or Not To Queue: https://lnkd.in/gh5tigTk
Hash Tables: https://lnkd.in/gsmg6XSA
Heaps: https://lnkd.in/g4xAGQa8
Linked List: https://lnkd.in/gN7fUxbJ
Recursion: https://lnkd.in/gvMiZWb8
Tries: https://lnkd.in/gbfm2DVR
Stacks and Overflows: https://lnkd.in/gKhNktj6
Binary Search: https://lnkd.in/gv_rDTUa
Dynamic Programming: https://lnkd.in/g_AYf32w
BFS Traversal - Going Broad In A Graph: https://lnkd.in/gVirya_Q
Introduction To Graph Theory: https://lnkd.in/geXpetJH
Substring problems: https://lnkd.in/gfV2PeeR
DFS Traversal - Deep Dive through a Graph: https://lnkd.in/gy-4mbgN
Finding The Shortest Path: https://lnkd.in/gA4Zz425