The System Design Cheat Sheet

Studying system design can be daunting. It helps to have a table that shows which problems require which components, what caveats those components bring, and how to mitigate them. This repository provides a table for exactly that purpose.

Your ⭐ on the repo would bring me joy, validate the effort and motivate me to do more like this. ✨

Use Cases/Problems	System Design Questions	⭐ Component	What it solves	Caveats/Issues	Mitigations	Examples of Tools
- Unified API access: Centralizes client requests. - Security: Manages authentication and authorization.	- Design an API gateway for microservices. - Implement secure and scalable API access.	API Gateway	Single entry point, manages authentication and routing.	Can become a bottleneck, adds latency.	- Use multiple gateways with load balancing. - Implement rate limiting and caching. - Use circuit breakers and retries.	Kong, Apigee, AWS API Gateway
- High traffic websites: Ensures uptime and balances load. - Scalable APIs: Distributes incoming requests.	- Design a scalable web application. - Build a highly available online service.	Load Balancer across multiple redundant workers	Distributes traffic across workers, improves reliability and availability.	Single point of failure, adds complexity.	- Use multiple load balancers in different regions. - Implement health checks. - Use DNS-based load balancing.	Nginx, HAProxy, AWS ELB
- Financial transactions: Requires ACID compliance. - Complex queries: Needs structured and relational data.	- Design a financial transaction system. - Create a scalable relational database.	SQL Database	Strong ACID properties, structured data, complex queries.	Limited scalability, schema management.	- Implement sharding. - Use read replicas. - Employ clustering and partitioning.	MySQL, PostgreSQL, MS SQL Server
- Large-scale data: Supports horizontal scaling. - Unstructured data: Flexible schema adapts to changes.	- Design a large-scale user profile store. - Create a scalable data storage solution.	NoSQL Database	Flexible schema, horizontal scalability, high performance.	Eventual consistency, limited transaction support.	- Use consistency settings (e.g., quorum reads/writes). - Design for idempotent operations. - Implement conflict resolution strategies.	MongoDB, Cassandra, DynamoDB
- High availability: Ensures data is replicated and available.	- Design a data replication strategy. - Implement a highly available database system.	Data Replication	Ensures data durability and system availability.	Increases costs, consistency issues.	- Use asynchronous replication. - Implement conflict resolution. - Use multi-master replication.	AWS RDS standby (synchronous), AWS RDS Read Replicas (asynchronous), MongoDB Replica Set (asynchronous)
- High read load: Reduces latency for frequent reads. - Session storage: Speeds up access to session data.	- Design a high-performance caching layer. - Optimize read-heavy workload.	Cache	Reduces latency, decreases load on databases.	Cache consistency issues, potential for stale data.	- Implement cache invalidation strategies. - Use Time-to-Live (TTL) settings. - Employ write-through or write-back caching.	Redis, Memcached
- Real-time analytics: Requires fast data access. - Leaderboards: High-speed data retrieval is crucial.	- Design a real-time analytics system. - Create a fast leaderboard service.	In-Memory Database	Extremely fast data retrieval, reduces latency.	Volatile storage, high memory cost.	- Enable persistence options. - Use hybrid storage models (in-memory + disk). - Implement data backup strategies.	Redis, Memcached
- Event streaming: Manages high-throughput data streams. - Real-time processing: Facilitates real-time data flows.	- Design a real-time event streaming platform. - Implement a reliable messaging system.	Message Broker	Facilitates message exchange, supports multiple patterns.	Bottleneck potential, delivery guarantees.	- Use scalable brokers with partitions. - Implement backpressure handling. - Monitor message broker performance.	Apache Kafka, RabbitMQ, ActiveMQ
- Event-driven systems: Manages asynchronous events. - Microservices: Decouples service communication.	- Design an event-driven architecture. - Create a reliable task processing system.	Distributed Queue	Manages asynchronous communication, decouples components.	Message ordering and delivery guarantees.	- Use message brokers with strong ordering guarantees. - Implement idempotent message processing. - Use message deduplication techniques.	Apache Kafka, RabbitMQ, AWS SQS
- Large applications: Enhances modularity and scalability. - Continuous delivery: Facilitates independent deployment.	- Design a scalable microservices architecture. - Build a modular, independently deployable system.	Microservices	Improves modularity, independent deployment.	Increased communication complexity.	- Use service meshes. - Implement standardized APIs. - Use centralized logging and monitoring.	Docker, Kubernetes, Istio
- Microservices: Enables service discovery. - Dynamic environments: Tracks changing service instances.	- Design a service discovery mechanism. - Implement dynamic service registration.	Service Registry	Tracks services and their instances.	High availability required, consistency issues.	- Use distributed service registries. - Implement regular health checks. - Use consensus algorithms for consistency.	Consul, Eureka, Zookeeper
- Content-heavy sites: Improves load times for users. - Global reach: Distributes content across regions.	- Design a content delivery system. - Optimize a global website’s performance.	CDN (Content Delivery Network)	Reduces latency, improves load times.	Cache invalidation complexity, cost.	- Implement cache purging strategies. - Use regional CDNs. - Monitor CDN performance and hit rates.	Cloudflare, Akamai, AWS CloudFront
- Business intelligence: Centralizes analytics data. - Historical analysis: Supports complex querying over large datasets.	- Design a data warehouse for analytics. - Build a scalable business intelligence platform.	Data Warehouse	Centralizes data, supports complex queries.	High storage and maintenance costs.	- Use data compression and partitioning. - Implement data lifecycle management. - Use cloud-based, scalable data warehouses.	Amazon Redshift, Snowflake, Google BigQuery
- E-commerce sites: Provides fast product search. - Large datasets: Enables full-text search over extensive data.	- Design a product search system. - Implement a scalable search solution.	Search Engine	Enables fast search over large datasets.	Indexing and maintenance required.	- Implement efficient indexing strategies. - Use distributed search architectures. - Optimize search queries and relevance.	Elasticsearch, Solr, Algolia
- Media storage: Handles large files like images and videos. - Backup solutions: Stores and retrieves backups.	- Design a scalable file storage system. - Implement a reliable backup solution.	File Storage	Scales with data growth, handles unstructured data.	Backup and redundancy required, retrieval latency.	- Use distributed file systems. - Implement multi-region replication. - Use lifecycle policies for data management.	AWS S3, Google Cloud Storage, HDFS
- Data warehousing: Prepares data for analysis. - Data migration: Transforms data from multiple sources.	- Design an ETL pipeline for a data warehouse. - Build a reliable data integration system.	ETL Pipeline	Facilitates data integration and analysis.	Complex to build and maintain.	- Use managed ETL services. - Implement monitoring and error handling. - Use data validation and transformation tools.	Apache Nifi, AWS Glue, Talend
- System reliability: Monitors uptime and performance. - Issue detection: Alerts for anomalies and failures.	- Design a system monitoring solution. - Implement an alerting and dashboard system.	Monitoring System	Tracks system health, enables alerting.	High overhead, potential noise.	- Use threshold tuning and anomaly detection. - Implement efficient data collection. - Use centralized monitoring dashboards.	Prometheus, Grafana, Datadog
- Debugging: Captures logs for issue diagnosis. - Compliance: Maintains audit trails.	- Design a centralized logging system. - Implement a scalable logging and analysis solution.	Logging System	Aids in auditing and troubleshooting.	Large data volumes, storage and querying.	- Use log rotation and retention policies. - Implement centralized logging. - Optimize log storage and indexing.	ELK Stack, Splunk, Fluentd
- Secure applications: Manages user identity and access. - Single sign-on: Centralizes authentication across services.	- Design a secure authentication system. - Implement a single sign-on solution.	Authentication Service	Enhances security, manages user authentication.	Single point of failure, security measures needed.	- Use multi-factor authentication. - Implement redundancy and failover. - Use secure token storage and management.	OAuth, Okta, Auth0
- Containerized apps: Automates container management. - Microservices: Coordinates service deployments.	- Design a container orchestration system. - Implement a CI/CD pipeline for microservices.	Orchestration Tool	Automates deployment and management.	Adds complexity, learning curve.	- Use managed orchestration services. - Implement robust CI/CD pipelines. - Use monitoring and scaling tools.	Kubernetes, Docker Swarm, Mesos
- Dynamic applications: Centralizes config changes. - Large systems: Manages configurations across services.	- Design a configuration management system. - Implement dynamic configuration updates.	Configuration Service	Centralizes configuration management.	Single point of failure, secure access needed.	- Use distributed configuration stores. - Implement encryption for sensitive data. - Use versioning and rollback mechanisms.	Consul, etcd, Spring Cloud Config
- Real-time dashboards: Aggregates live data feeds. - Monitoring: Provides instant insights from data streams.	- Design a real-time analytics system. - Implement a live data aggregation platform.	Real-Time Data Aggregation	Enables real-time analytics and monitoring.	High complexity, data velocity issues.	- Use stream processing frameworks. - Implement windowing and aggregation techniques. - Monitor and scale processing infrastructure.	Apache Flink, Apache Storm, AWS Kinesis
- Microservices: Tracks requests across services. - Performance tuning: Identifies bottlenecks and delays.	- Design a distributed tracing system. - Implement performance monitoring for microservices.	Distributed Tracing	Aids in debugging and performance monitoring.	High overhead, integration required.	- Use sampling to reduce overhead. - Implement efficient trace storage. - Use correlation IDs for request tracking.	Jaeger, Zipkin, OpenTracing
- Fault tolerance: Prevents system overloads. - Resilient services: Isolates failures in microservices.	- Design a fault-tolerant microservices system. - Implement circuit breakers for service reliability.	Circuit Breaker	Protects services from cascading failures.	Adds complexity, tuning needed.	- Use monitoring tools to detect failures. - Implement fallback strategies. - Use retries and exponential backoff.	Hystrix, Resilience4j, Istio
- API management: Protects against request floods. - Fair resource allocation: Ensures fair usage policies.	- Design an API rate limiting system. - Implement a fair resource allocation mechanism.	Rate Limiter	Controls request rate, prevents abuse.	Can impact user experience.	- Use dynamic rate limiting. - Implement user-based quotas. - Use monitoring to adjust limits.	Kong, Envoy, Nginx
- Periodic tasks: Automates recurring jobs. - Batch processing: Manages large data processing tasks.	- Design a job scheduling system. - Implement a reliable task processing system.	Scheduler	Manages background jobs and tasks.	Requires monitoring, can become bottleneck.	- Use distributed schedulers. - Implement job prioritization. - Use monitoring and retry mechanisms.	Apache Airflow, Celery, Kubernetes CronJobs
- Microservices: Handles inter-service communication. - Observability: Provides insights into service interactions.	- Design a service mesh for microservices. - Implement observability for service interactions.	Service Mesh	Manages microservices communication.	Adds operational complexity.	- Use managed service meshes. - Implement automation tools. - Use monitoring and observability tools.	Istio, Linkerd, Consul Connect
- Disaster recovery: Ensures data is safe and recoverable. - Data integrity: Maintains backups for compliance.	- Design a backup and recovery system. - Implement a reliable disaster recovery solution.	Data Backup and Recovery	Ensures data durability, protects against data loss.	Resource-intensive and costly, regular testing needed. Out-of-date backups mean data loss on restore. Backups may not be accessible when needed.	- Use automated backup solutions. - Implement multi-region storage. - Regularly test backup and recovery processes.	Native backup capabilities of the data store, or centralized backup products like AWS Backup, Google Cloud Backup, Veeam
- Social networks: Models complex relationships. - Recommendation engines: Analyzes connected data.	- Design a social network graph database. - Implement a recommendation engine.	Graph Database	Efficiently handles graph-based data and relationships.	Steep learning curve, non-graph query inefficiency.	- Use graph-specific optimizations. - Implement hybrid models for different data types. - Use indexing and caching for performance.	Neo4j, Amazon Neptune, OrientDB
- Big data analytics: Stores and processes vast data. - Data warehousing: Prepares raw data for analytics.	- Design a big data analytics platform. - Implement a data lake for diverse data types.	Data Lake	Supports diverse data types and analytics.	Governance required, risk of becoming data swamp.	- Use metadata management. - Implement data cataloging. - Use data lifecycle policies.	AWS Lake Formation, Azure Data Lake, Hadoop
- Event-driven architectures: Processes data streams in real-time. - Analytics: Real-time insights from continuous data flow.	- Design a real-time data streaming system. - Implement an event-driven architecture.	Data Streaming Platform	Facilitates real-time data processing.	High operational complexity.	- Use managed streaming services. - Implement scaling strategies. - Monitor and optimize processing.	Apache Kafka, AWS Kinesis, Google Pub/Sub
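
Rate limiting appears in the table both as a component (Rate Limiter) and as a mitigation for API Gateway bottlenecks. As a rough illustration of how "controls request rate" can work, here is a minimal token-bucket sketch in Python. The class name `TokenBucket`, its parameters, and the injectable `clock` are hypothetical choices for this example, not the API of Kong, Envoy, or Nginx; in practice you would usually configure the limiter those tools already provide.

```python
import time
from typing import Callable


class TokenBucket:
    """Token-bucket limiter (illustrative sketch, not a production implementation).

    Allows bursts of up to `capacity` requests and refills at `rate` tokens per second.
    """

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.clock = clock          # injectable so the limiter can be tested without sleeping
        self.tokens = capacity      # start with a full bucket
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and consume `cost` tokens if the request may proceed."""
        now = self.clock()
        # Refill for the time elapsed since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Example: 2 requests/second on average, bursts of up to 3.
limiter = TokenBucket(rate=2.0, capacity=3.0)
decisions = [limiter.allow() for _ in range(4)]
```

The "user-based quotas" mitigation from the Rate Limiter row could be approximated by keeping one bucket per user ID in a dictionary; a multi-node deployment would need shared state (for example in a cache), which this sketch does not cover.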

Other tips

Memorize non-functional requirements and be quick at back-of-the-envelope calculations. Use this link. QPS and storage numbers are often more important than others. To arrive at QPS, go from DAU for the product -> DAU for the feature -> reads/writes per user -> seconds in the day -> QPS.
Going for an L4/E4/intermediate role: feel free to ask the interviewer as many questions as you want. Going for an L5/E5/senior role: lead the interview, make choices yourself, justify them, and just 'confirm' with the interviewer that you're on the right track.
First provide a high-level design overview with all the components you would use, then deep dive into each component: why it is useful and what its pros and cons are.
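
The QPS chain from the first tip can be sketched as a small calculation. The function name `estimate_qps`, its parameters, and all input numbers below are made up for illustration; in an interview you would state your own assumptions and round aggressively.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400; often rounded to 100,000 for mental math


def estimate_qps(product_dau, feature_share, reads_per_user, writes_per_user):
    """Product DAU -> feature DAU -> reads/writes per day -> seconds in a day -> average QPS."""
    feature_dau = product_dau * feature_share          # users who touch this feature daily
    read_qps = feature_dau * reads_per_user / SECONDS_PER_DAY
    write_qps = feature_dau * writes_per_user / SECONDS_PER_DAY
    return {"read_qps": read_qps, "write_qps": write_qps}


# Hypothetical inputs: 10M product DAU, 20% use the feature,
# each of those users does 10 reads and 1 write per day.
estimate = estimate_qps(10_000_000, 0.20, reads_per_user=10, writes_per_user=1)
print(f"reads: ~{estimate['read_qps']:.0f} QPS, writes: ~{estimate['write_qps']:.0f} QPS")
```

These are daily averages. Peak traffic is typically higher, so it is common to multiply by a peak factor that you state explicitly as an assumption.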

Contributions

Community contributions, both issues and PRs, are welcome.

Found it useful?

Thanks. :) Feel free to leave a star on the repo, or if you can afford it, .

Feel free to connect to stay updated.

LinkedIn :
X :
