A common reason companies like to hire Big Tech engineers (at least before the recent hiring-market downturn) is that they are presumed to be good at scaling large systems. While Big Tech today might be too big to assume that any given engineer there is good at scaling systems, it’s still important to understand what system scalability is and how to scale a system properly.
Systems tend to slow down as they grow unless proactively adjusted to handle the increased demands.
Scalability is the ability to handle more load by adding resources.
A truly scalable system can adapt and evolve to consistently manage a growing workload.
This article will examine various dimensions of system growth and explore common strategies for achieving scalability.
How can a system grow?
A system can grow in different ways. Here are the most common:
1. More Users: A larger user base creates a greater number of requests.
- Example: A social media platform experiencing a surge in new users.
2. More Features: Adding new features to the system increases its capabilities.
- Example: An e-commerce website adding support for a new payment method.
3. More Data: The system stores and manages more data because of user activity or logging.
- Example: A video streaming platform like YouTube storing more video content over time.
4. More Complexity: The system’s architecture evolves to handle new features and scale, adding more parts and connections.
- Example: A system that started as a simple application is broken into smaller, independent systems.
5. More Locations: The system serves users in new regions or countries.
- Example: An e-commerce company launching websites and distribution in new international markets.
How to Scale a Software System
Here are 10 common ways to make a system scalable:
1. Vertical Scaling (Scaling Up)
This means adding more power to your existing machines by upgrading a server with more RAM, faster (or more) CPUs, or additional storage.
It’s a good approach for simpler architectures but has limitations in how far you can go.
```mermaid
graph LR
    subgraph Before["Before Scaling"]
        direction TB
        CPU1[2 CPU Cores]
        RAM1[8GB RAM]
        SSD1[100GB SSD]
    end
    subgraph After["After Vertical Scaling"]
        direction TB
        CPU2[8 CPU Cores]
        RAM2[32GB RAM]
        SSD2[500GB SSD]
    end
    Before --> After
    style CPU1 fill:#f9f,stroke:#333
    style RAM1 fill:#bbf,stroke:#333
    style SSD1 fill:#bfb,stroke:#333
    style CPU2 fill:#f9f,stroke:#333
    style RAM2 fill:#bbf,stroke:#333
    style SSD2 fill:#bfb,stroke:#333
    classDef default fill:#fff,stroke:#333,stroke-width:2px
```
2. Horizontal Scaling (Scaling Out)
This means adding more machines to your system to spread the workload across multiple servers.
This is usually the most effective way to scale a system in the long run, though it requires the application to be designed so that work can be spread across machines.
```mermaid
graph LR
    subgraph Before["Before Scaling"]
        direction TB
        SERVER1[Server<br/>2 CPU Cores<br/>8GB RAM]
    end
    subgraph After["After Horizontal Scaling"]
        direction TB
        SERVER2[Server<br/>2 CPU Cores<br/>8GB RAM]
        SERVER3[Server<br/>2 CPU Cores<br/>8GB RAM]
        SERVER4[Server<br/>2 CPU Cores<br/>8GB RAM]
    end
    Before --> After
    style SERVER1 fill:#bbf,stroke:#333
    style SERVER2 fill:#bbf,stroke:#333
    style SERVER3 fill:#bbf,stroke:#333
    style SERVER4 fill:#bbf,stroke:#333
    classDef default fill:#fff,stroke:#333,stroke-width:2px
```
Example: Netflix uses horizontal scaling for its streaming service, adding more servers to their clusters to handle the growing number of users and data traffic.
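Horizontal scaling only works well if any server can handle any request, which usually means keeping servers stateless. Below is a minimal sketch of that pattern; the dictionary standing in for a shared session store (something like Redis in practice) and all names are illustrative.

```python
# Horizontal scaling works best when each server is stateless: any instance
# can serve any request because shared state lives outside the process.
# The dict below is a stand-in for an external store such as Redis.

shared_session_store: dict[str, dict] = {}

class AppServer:
    """One of N identical, stateless application servers (hypothetical)."""

    def __init__(self, name: str):
        self.name = name  # no user state is kept on the instance itself

    def handle_request(self, session_id: str, action: str) -> str:
        # Read/write session state via the shared store, so the next
        # request can land on a *different* server without losing context.
        session = shared_session_store.setdefault(session_id, {"history": []})
        session["history"].append(action)
        return f"{self.name} handled action {len(session['history'])} for {session_id}"

# Any server can pick up any request:
servers = [AppServer("server-1"), AppServer("server-2"), AppServer("server-3")]
print(servers[0].handle_request("user-42", "view cart"))
print(servers[2].handle_request("user-42", "checkout"))  # state survives the switch
```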
3. Load Balancing
Load balancing is the process of distributing traffic across multiple servers to ensure no single server becomes overwhelmed.
```mermaid
graph LR
    subgraph Before["Before Scaling"]
        direction TB
        LB1[Load Balancer]
        SERVER1[Server 1<br/>2 CPU Cores<br/>8GB RAM]
    end
    subgraph After["After Horizontal Scaling"]
        direction TB
        LB2[Load Balancer]
        SERVER2[Server 1<br/>2 CPU Cores<br/>8GB RAM]
        SERVER3[Server 2<br/>2 CPU Cores<br/>8GB RAM]
        SERVER4[Server 3<br/>2 CPU Cores<br/>8GB RAM]
        LB2 --> SERVER2
        LB2 --> SERVER3
        LB2 --> SERVER4
    end
    LB1 --> SERVER1
    Before --> After
    style LB1 fill:#f96,stroke:#333
    style LB2 fill:#f96,stroke:#333
    style SERVER1 fill:#bbf,stroke:#333
    style SERVER2 fill:#bbf,stroke:#333
    style SERVER3 fill:#bbf,stroke:#333
    style SERVER4 fill:#bbf,stroke:#333
    classDef default fill:#fff,stroke:#333,stroke-width:2px
```
Example: Google employs load balancing extensively across its global infrastructure to distribute search queries and traffic evenly across its massive server farms.
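To make the distribution step concrete, here is a minimal round-robin balancer sketch. Round robin is just one strategy; production load balancers such as NGINX or HAProxy also offer least-connections, weighted routing, and health checks. All names here are illustrative.

```python
import itertools

class RoundRobinBalancer:
    """Toy load balancer: hand out servers in rotation."""

    def __init__(self, servers: list[str]):
        self._cycle = itertools.cycle(servers)  # rotate through servers forever

    def route(self, request: str) -> str:
        server = next(self._cycle)  # pick the next server in rotation
        return f"{request} -> {server}"

lb = RoundRobinBalancer(["server-1", "server-2", "server-3"])
for i in range(5):
    print(lb.route(f"request-{i}"))
# request-0 -> server-1, request-1 -> server-2, ... then wraps back to server-1
```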
4. Caching
Caching is a technique for storing frequently accessed data in memory (e.g., in RAM) to reduce load on servers and databases. Caching can dramatically improve response times.
```mermaid
graph TD
    C[Client]
    CACHE[Cache Layer<br/>Response: ~1ms]
    DB[(Database<br/>Response: ~100ms)]
    C -->|Request Data| CACHE
    CACHE -->|Cache Hit| C
    CACHE -->|Cache Miss| DB
    DB -->|Fetch & Store| CACHE
    CACHE -->|Return Data| C
    style C fill:#f9f,stroke:#333
    style CACHE fill:#bbf,stroke:#333
    style DB fill:#bfb,stroke:#333
    classDef default fill:#fff,stroke:#333,stroke-width:2px
```
Example: Reddit uses caching to store frequently accessed content like hot posts and comments so that they can be served quickly without querying the database each time.
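The most common way to apply this is the cache-aside pattern: check the cache first, fall back to the database on a miss, then populate the cache. A minimal sketch follows, with a dictionary and a sleep standing in for a real cache (e.g., Redis) and a real database; the TTL and timings are illustrative.

```python
import time

cache: dict[str, tuple[float, str]] = {}  # key -> (expiry_timestamp, value)
TTL_SECONDS = 60

def query_database(key: str) -> str:
    time.sleep(0.1)  # simulate a ~100ms database round trip
    return f"value-for-{key}"

def get(key: str) -> str:
    entry = cache.get(key)
    if entry and entry[0] > time.time():
        return entry[1]                       # cache hit: near-instant
    value = query_database(key)               # cache miss: pay the DB cost once
    cache[key] = (time.time() + TTL_SECONDS, value)
    return value

get("hot-post")  # slow: goes to the database
get("hot-post")  # fast: served from the cache until the TTL expires
```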
5. Content Delivery Networks (CDNs)
A CDN distributes static assets (images, videos, etc.) to servers located closer to users, which reduces latency and results in faster load times.
Example: Cloudflare provides CDN services, speeding up website access for users worldwide by caching content in servers located close to users.
```mermaid
graph TD
    OS[Origin Server<br/>New York]
    EDGE1[Edge Server<br/>London]
    EDGE2[Edge Server<br/>Tokyo]
    EDGE3[Edge Server<br/>Sydney]
    U1[User<br/>Europe]
    U2[User<br/>Asia]
    U3[User<br/>Australia]
    OS --> EDGE1
    OS --> EDGE2
    OS --> EDGE3
    U1 --> EDGE1
    U2 --> EDGE2
    U3 --> EDGE3
    style OS fill:#f96,stroke:#333
    style EDGE1 fill:#bbf,stroke:#333
    style EDGE2 fill:#bbf,stroke:#333
    style EDGE3 fill:#bbf,stroke:#333
    style U1 fill:#bfb,stroke:#333
    style U2 fill:#bfb,stroke:#333
    style U3 fill:#bfb,stroke:#333
    classDef default fill:#fff,stroke:#333,stroke-width:2px
```
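Much of a CDN's caching behavior is driven by standard HTTP caching headers set by the origin server. The sketch below shows the kind of headers an origin might return for a fingerprinted static asset; the function and header values are illustrative, not a prescription.

```python
def static_asset_headers(content_hash: str) -> dict[str, str]:
    """Illustrative response headers for a content-addressed static asset."""
    return {
        # Edge servers and browsers may cache this for up to a year;
        # "immutable" is safe because the URL changes when the content does.
        "Cache-Control": "public, max-age=31536000, immutable",
        # ETag lets the CDN revalidate cheaply with the origin if needed.
        "ETag": f'"{content_hash}"',
    }

print(static_asset_headers("abc123"))
```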
6. Sharding/Partitioning
Partitioning means splitting data or functionality across multiple nodes/servers to distribute the workload and avoid bottlenecks.
```mermaid
graph TD
    APP[Application]
    subgraph "Shard Key: User ID"
        KEY1[ID: 1-1000]
        KEY2[ID: 1001-2000]
        KEY3[ID: 2001-3000]
    end
    subgraph "Database Shards"
        DB1[(Shard 1<br/>Users 1-1000)]
        DB2[(Shard 2<br/>Users 1001-2000)]
        DB3[(Shard 3<br/>Users 2001-3000)]
    end
    APP --> KEY1
    APP --> KEY2
    APP --> KEY3
    KEY1 --> DB1
    KEY2 --> DB2
    KEY3 --> DB3
    style APP fill:#f9f,stroke:#333
    style KEY1 fill:#bfb,stroke:#333
    style KEY2 fill:#bfb,stroke:#333
    style KEY3 fill:#bfb,stroke:#333
    style DB1 fill:#bbf,stroke:#333
    style DB2 fill:#bbf,stroke:#333
    style DB3 fill:#bbf,stroke:#333
    classDef default fill:#fff,stroke:#333,stroke-width:2px
```
Example: Amazon DynamoDB uses partitioning to distribute data and traffic for its NoSQL database service across many servers, ensuring fast performance and scalability.
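One common way to route requests to shards is hash-based routing on the shard key. The sketch below is a simplified illustration (the diagram above shows range-based routing, the other common approach); the shard names and key format are made up.

```python
import hashlib

SHARDS = ["shard-1", "shard-2", "shard-3"]

def shard_for(user_id: str) -> str:
    # A stable hash (not Python's per-process randomized hash()) ensures
    # every app server routes the same key to the same shard.
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("user-1001"))  # always the same shard for this user
print(shard_for("user-2042"))
# Caveat: with plain modulo, changing the shard count remaps most keys;
# consistent hashing is the usual fix for that.
```

Hashing spreads load evenly but makes range queries across users harder, which is why range-based partitioning (as in the diagram) is sometimes preferred.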
7. Asynchronous communication
Asynchronous communication means deferring long-running or non-critical tasks to background queues or message brokers.
This ensures your main application remains responsive to users.
```mermaid
graph LR
    A[App Server<br/>Instant Response]
    Q[Message Queue]
    W1[Worker 1]
    W2[Worker 2]
    W3[Worker 3]
    U1[User 1<br/>Sends Message] --> A
    U2[User 2<br/>Continues Using App] --> A
    A -->|1. Store Task| Q
    A -->|2. Return Success| U1
    Q -->|3a. Process Task| W1
    Q -->|3b. Process Task| W2
    Q -->|3c. Process Task| W3
    style A fill:#bbf,stroke:#333
    style Q fill:#f96,stroke:#333
    style W1 fill:#bfb,stroke:#333
    style W2 fill:#bfb,stroke:#333
    style W3 fill:#bfb,stroke:#333
    style U1 fill:#f9f,stroke:#333
    style U2 fill:#f9f,stroke:#333
    classDef default fill:#fff,stroke:#333,stroke-width:2px
```
Example: Slack uses asynchronous communication for messaging. When a message is sent, the sender’s interface doesn’t freeze; it continues to be responsive while the message is processed and delivered in the background.
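Here is a minimal producer/consumer sketch of the idea using Python's standard-library queue and threads; in production the queue would be an external broker (RabbitMQ, Kafka, SQS, etc.) and the workers would be separate processes or machines.

```python
import queue
import threading
import time

tasks: queue.Queue = queue.Queue()

def handle_request(message: str) -> str:
    tasks.put(message)     # enqueue the slow work...
    return "202 Accepted"  # ...and return to the user immediately

def worker(worker_id: int) -> None:
    while True:
        message = tasks.get()  # blocks until a task is available
        time.sleep(0.5)        # simulate slow processing (send email, etc.)
        print(f"worker-{worker_id} processed: {message}")
        tasks.task_done()

# Three background workers drain the queue in parallel.
for i in range(3):
    threading.Thread(target=worker, args=(i,), daemon=True).start()

print(handle_request("notify followers"))  # returns instantly
tasks.join()  # for demo purposes, wait for background work to finish
```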
8. Microservices Architecture
A microservices architecture breaks an application down into smaller, independent services that can be scaled independently.
This improves resilience and allows teams to work on specific components in parallel.
```mermaid
graph LR
    subgraph Monolith["Monolithic"]
        M[Monolithic App<br/>Auth + Orders<br/>Products + Cart<br/>Notifications]
    end
    subgraph Microservices["Microservices"]
        A[Auth Service]
        O[Orders Service]
        P[Products Service]
        C[Cart Service]
        N[Notifications Service]
        A --> O
        O --> P
        P --> C
        O --> N
    end
    Monolith --> Microservices
    style M fill:#f96,stroke:#333
    style A fill:#bbf,stroke:#333
    style O fill:#bbf,stroke:#333
    style P fill:#bbf,stroke:#333
    style C fill:#bbf,stroke:#333
    style N fill:#bbf,stroke:#333
    classDef default fill:#fff,stroke:#333,stroke-width:2px
```
Example: Uber has evolved its architecture into microservices to handle different functions like billing, notifications, and ride matching independently, allowing for efficient scaling and rapid development.
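As a toy illustration, the sketch below runs two tiny HTTP services that each own one responsibility and talk over the network. In a real system each would be its own deployment scaled independently; the threads, hard-coded ports, and data are only there to keep the demo self-contained.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class ProductService(BaseHTTPRequestHandler):
    def do_GET(self):  # owns product data and nothing else
        body = json.dumps({"id": 42, "name": "Widget", "price": 9.99}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence default request logging
        pass

class OrderService(BaseHTTPRequestHandler):
    def do_GET(self):  # composes its response by calling the product service
        with urllib.request.urlopen("http://localhost:8001/products/42") as r:
            product = json.loads(r.read())
        body = json.dumps({"order_id": 7, "product": product}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass

# Run each service on its own port, as if it were its own deployment.
for port, handler in [(8001, ProductService), (8002, OrderService)]:
    server = HTTPServer(("localhost", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()

with urllib.request.urlopen("http://localhost:8002/orders/7") as resp:
    print(resp.read().decode())  # order service composed data from products
```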
9. Auto-Scaling
Auto-Scaling means automatically adjusting the number of active servers based on the current load.
This ensures that the system can handle spikes in traffic without manual intervention.
```mermaid
graph TD
    subgraph Low["Low Load: 20% CPU"]
        L1[Server 1]
        L2[Server 2]
    end
    subgraph Medium["Medium Load: 60% CPU"]
        M1[Server 1]
        M2[Server 2]
        M3[Server 3]
        M4[Server 4]
    end
    subgraph High["High Load: 80% CPU"]
        H1[Server 1]
        H2[Server 2]
        H3[Server 3]
        H4[Server 4]
        H5[Server 5]
        H6[Server 6]
    end
    Low --> Medium
    Medium --> High
    style L1 fill:#bfb,stroke:#333
    style L2 fill:#bfb,stroke:#333
    style M1 fill:#f96,stroke:#333
    style M2 fill:#f96,stroke:#333
    style M3 fill:#f96,stroke:#333
    style M4 fill:#f96,stroke:#333
    style H1 fill:#f9f,stroke:#333
    style H2 fill:#f9f,stroke:#333
    style H3 fill:#f9f,stroke:#333
    style H4 fill:#f9f,stroke:#333
    style H5 fill:#f9f,stroke:#333
    style H6 fill:#f9f,stroke:#333
    classDef default fill:#fff,stroke:#333,stroke-width:2px
```
Example: AWS Auto Scaling monitors applications and automatically adjusts capacity to maintain steady, predictable performance at the lowest possible cost.
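At its core, auto-scaling is a control loop that compares a load metric against thresholds. The sketch below shows that loop in isolation; the thresholds and bounds are illustrative, and real systems like AWS Auto Scaling or the Kubernetes HPA add cooldowns, min/max policies, and smoothing so the fleet doesn't "flap".

```python
MIN_SERVERS, MAX_SERVERS = 2, 10
SCALE_UP_AT, SCALE_DOWN_AT = 0.70, 0.30  # average CPU utilization thresholds

def desired_server_count(current: int, avg_cpu: float) -> int:
    if avg_cpu > SCALE_UP_AT:
        return min(current + 1, MAX_SERVERS)  # scale out under pressure
    if avg_cpu < SCALE_DOWN_AT:
        return max(current - 1, MIN_SERVERS)  # scale in to save cost
    return current                            # within the comfort band

servers = 2
for cpu in [0.20, 0.65, 0.85, 0.90, 0.40, 0.10]:  # simulated load readings
    servers = desired_server_count(servers, cpu)
    print(f"avg CPU {cpu:.0%} -> {servers} servers")
```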
10. Multi-region Deployment
This means deploying the application in multiple data centers or cloud regions to reduce latency for users and improve redundancy.
```mermaid
graph TD
    subgraph "US Region"
        US_APP[App Servers]
        US_DB[(Database)]
        US_APP --> US_DB
    end
    subgraph "EU Region"
        EU_APP[App Servers]
        EU_DB[(Database)]
        EU_APP --> EU_DB
    end
    subgraph "Asia Region"
        ASIA_APP[App Servers]
        ASIA_DB[(Database)]
        ASIA_APP --> ASIA_DB
    end
    US_DB <-->|Sync| EU_DB
    EU_DB <-->|Sync| ASIA_DB
    ASIA_DB <-->|Sync| US_DB
    US_USER[US Users] --> US_APP
    EU_USER[EU Users] --> EU_APP
    ASIA_USER[Asia Users] --> ASIA_APP
    style US_APP fill:#bbf,stroke:#333
    style EU_APP fill:#bbf,stroke:#333
    style ASIA_APP fill:#bbf,stroke:#333
    style US_DB fill:#f96,stroke:#333
    style EU_DB fill:#f96,stroke:#333
    style ASIA_DB fill:#f96,stroke:#333
    style US_USER fill:#bfb,stroke:#333
    style EU_USER fill:#bfb,stroke:#333
    style ASIA_USER fill:#bfb,stroke:#333
    classDef default fill:#fff,stroke:#333,stroke-width:2px
```
Example: Spotify uses multi-region deployments to ensure their music streaming service remains highly available and responsive to users all over the world, regardless of where they are located.
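The routing decision behind multi-region deployments is usually made by DNS (latency- or geo-based routing policies) or anycast rather than application code, but the underlying logic is simple to sketch; the region names and latency numbers below are made up for illustration.

```python
REGION_LATENCY_MS = {  # hypothetical probe results from one user's location
    "us-east": 120,
    "eu-west": 15,
    "ap-southeast": 210,
}

def nearest_region(latencies: dict[str, int]) -> str:
    # Route the user to the region with the lowest measured latency.
    return min(latencies, key=latencies.get)

print(nearest_region(REGION_LATENCY_MS))  # -> "eu-west" for this user
```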