System Design Fundamentals notes

7 min readSep 28, 2021

<<Personal Notes, Using Medium as a cloud repository>>

I am a Data scientist at Amazon and have worked across science, product, business teams and most recently joined tech team in Prime video. Though I have built ML and Deep learning models in the past, I would want to lead deployment efforts and starting off with learning CS fundamentals.

I have started with system design and following few medium blogs and Youtube courses/Interviews to get a grasp of the subject.

Wring down my key takeaways so far:

First question in sysis Functional and Non-Functional requirements. Functional requirement would cover core function of the system i.e for TikTok it would be uploading video, building user feed and ability of users to like, comment and share. Non-Functional requirements cover availability, reliability, scalability, bandwidth and latency requirements.

There may be monolithic/single machine and distributed systems. Benefits of distributed systems are reliability/fault tolerance, scalability, cost effectiveness and lower latency.

Glossary:

Scalability-Ability of system to grow and manage increased traffic
Reliability-Probability a system will fail during a period of time, Measured by Mean time between Failure
Availability- Amount of time a system is operational during a period of time. Poorly designed software requires downtime for updates.
Efficiency-How well the system performs with increase in traffic, metric is latency and throughput
Manageability-Speed and difficulty involved with maintaining systems. Difficulty in deploying updates.

For a design interview, engineers should have Traffic(DAU * Reads or writes per second), Memory(To include 20% as cache), Storage and bandwidth calculations as back of head.

Scaling an application:

Horizontal scaling: Multiple servers, complexity up front but efficient long term. Redundancy or backup server built-in and need load balancer to manage traffic. Kubernetes, Docker and Hadoop can be used.When it comes to horizontal scaling, one of the more common techniques is to break up your services into partitions, or shards. The partitions can be distributed such that each logical set of functionality is separate; this could be done by geographic boundaries, or by another criteria like non-paying versus paying users. The advantage of these schemes is that they provide a service or data store with added capacity.
Vertical scaling: Single server, Becomes slow with increase in traffic. Easiest way to scale an application with diminishing returns(Bigger CPUs wont be cheaper) and single point of failure. Consistent data and fast inter process communication vs horizontal.

Latency: Communication happens via Fiber optic cable at speed of light however there is a lag in ms if 2 players are playing in Aus and US.

System Design Components:

Load Balancer- Balances incoming traffic to multiple servers, can perform dynamic scaling to add or subtract servers. Increases reliability as no Single Point Of Failure. Types of load balancers are Round Robin(Cycle through server),Least connection(Chat app), Least Response Time and IP Hash.Located between client and Server. Elastic Load Balancer(ELB) is managed by AWS and guarantees availability and reliability
Caching- Located between App server and database and prevents continuous hitting of DB. Reading from memory is 50X faster than from database, Can serve traffic with high efficiency. Can pre-compute and serve data as used by Twitter timeline feed generation. Reads have more use case for caching than Writes. Example- Redis Cache. Redis can be deployed as standalone docker container. Cache runs on SSD and is expensive which prohibits all the data to be stored in cache. Cache policy determines cache entry and cache eviction with examples as LRU,LFU, Sliding window policy. Distributed/Global cache such as Redis is closer to database and is resilient and ensures data consistency.
Caching layers- DNS, Content Distribution Network(Pictures and Videos), Distributed cache and cache eviction i.e cleaning up stale data. TTL(Time to live) decides on time when cache memory would be cleared lets say front page from NYT. Caching can be cleared by Least Recently used or Least Frequently used .
Database scaling- Most webapps are generally reads i.e 95% read:write ratio. Most used scaling techniques are Indexes(E.g lookup for popular celebrities), Denormalisation, Connection Pooling(Allows multiple application threads to use same DB i.e JB content if accessed by one user in SF can be shared to others), Caching(Most relevant) and Vertical Scaling.
Synchronous requests-When the server receives more requests than it can handle, then each client is forced to wait for the other clients’ requests to complete before a response can be generated. Degrades performance.
Solution is Queue i.e a task comes in, is added to the queue and then workers pick up the next task as they have the capacity to process it.Queues enable clients to work in an asynchronous manner, providing a strategic abstraction of a client’s request and its response. While the client is waiting for an asynchronous request to be completed it is free to perform other work, even making asynchronous requests of other services.

Tradeoffs for Database scaling:

Replication( Create replica servers to handle reads)
Partitioning, Horizontal Partitioning using Sharding- Schema of table remains same but split across multiple DBs. Can result in uneven traffic e.g XYZ names have lower hits than ABC.
Partitioning, Vertical Partitioning-Divide schema into multiple tables. Insta split like count on photos vs other user attributes for high performance.
Relational Database vs NoSQL: More structured, If a single user has multiple videos , Relational DB is better. Free form searching is better for Non SQL. Relational DB is space and speed efficient. Advantages of NoSQL databases are insertions and deletions, schema is changeable and built for Scale with problems of non-read optimized, non-guaranteed ACID as compared to SQL DB.

Design a Youtube clone-

Functional requirement is uploading video, being able to view video, search functionality, comment and video stats.

2. Estimates: Capacity(Storage and Bandwidth)
Database Design: Video table (VideoID,User, Likes, Dislikes) , User Table(User, emailid) and Comments table(user, VideoID)

3. System Architecture-

CDN should store top 20% of video
Webserver: Distributing traffic across web servers.

Design TikTok:

Functional Requirement is user should be able to upload videos, view feed and perform user actions such as like, share etc.
Non-Functional- It should be scalable to 1Mn DAU with 99.9% availability time and a storage of 10Mb per user.

Database should have Blobs(10Mb per Video) and Relational(User, Content id etc). Caching is an important consideration to pre-load the feed for every user id on app launch, caching should connect to RDB read only data.

Key principles that influence the design of large-scale web systems:

Availability: The uptime of a website is absolutely critical to the reputation and functionality of many companies. For some of the larger online retail sites, being unavailable for even minutes can result in thousands or millions of dollars in lost revenue, so designing their systems to be constantly available and resilient to failure is both a fundamental business and a technology requirement. High availability in distributed systems requires the careful consideration of redundancy for key components, rapid recovery in the event of partial system failures, and graceful degradation when problems occur.
Performance: Website performance has become an important consideration for most sites. The speed of a website affects usage and user satisfaction, as well as search engine rankings, a factor that directly correlates to revenue and retention. As a result, creating a system that is optimized for fast responses and low latency is key.
Reliability: A system needs to be reliable, such that a request for data will consistently return the same data. In the event the data changes or is updated, then that same request should return the new data. Users need to know that if something is written to the system, or stored, it will persist and can be relied on to be in place for future retrieval.
Scalability: When it comes to any large distributed system, size is just one aspect of scale that needs to be considered. Just as important is the effort required to increase capacity to handle greater amounts of load, commonly referred to as the scalability of the system. Scalability can refer to many different parameters of the system: how much additional traffic can it handle, how easy is it to add more storage capacity, or even how many more transactions can be processed.
Manageability: Designing a system that is easy to operate is another important consideration. The manageability of the system equates to the scalability of operations: maintenance and updates. Things to consider for manageability are the ease of diagnosing and understanding problems when they occur, ease of making updates or modifications, and how simple the system is to operate. (I.e., does it routinely operate without failure or exceptions?)
Cost: Cost is an important factor. This obviously can include hardware and software costs, but it is also important to consider other facets needed to deploy and maintain the system. The amount of developer time the system takes to build, the amount of operational effort required to run the system, and even the amount of training required should all be considered. Cost is the total cost of ownership.

Sources:

https://www.youtube.com/watch?v=MbjObHmDbZo — Primer Course on System Design
https://www.youtube.com/watch?v=Z-0g_aJL5Fw- Design Tiktok
https://www.youtube.com/watch?v=REB_eGHK_P4
https://www.youtube.com/watch?v=Z3SYDTMP3ME&ab_channel=AWSTrainingCenter- AWS services for system design
https://gist.github.com/vasanthk/485d1c25737e8e72759f- System Design Cheatsheet
http://www.aosabook.org/en/distsys.html- Book on system design

System Design Fundamentals notes

Written by Ravi Shankar