What exactly is scalability?

One reason companies like to hire Big Tech engineers (at least before the recent hiring market) is that those engineers are assumed to be good at scaling large systems. Big Tech today may be too big to assume that every engineer there has that skill, but it is still important to understand what system scalability is and how to scale a system properly.

Systems tend to slow down as they grow unless proactively adjusted to handle the increased demands.

Scalability is the ability to handle more load by adding resources.

A truly scalable system can adapt and evolve to consistently manage a growing workload.

This article will examine various dimensions of system growth and explore common strategies for achieving scalability.

How can a system grow?

A system can grow in different ways. Here are the most common:

1. More Users: A larger user base creates a greater number of requests.

Example: A social media platform experiencing a surge in new users.

2. More Features: Adding new features to the system increases its capabilities.

Example: An e-commerce website adding support for a new payment method.

3. More Data: The system stores and manages more data because of user activity or logging.

Example: A video streaming platform like YouTube storing more video content over time.

4. More Complexity: The system’s architecture evolves to handle new features and scale, adding more parts and connections.

Example: A system that started as a simple application is broken into smaller, independent systems.

5. More Locations: The system serves users in new regions or countries.

Example: An e-commerce company launching websites and distribution in new international markets.

How to Scale a Software System

Here are 10 common ways to make a system scalable:

1. Vertical Scaling (Scaling Up)

This means adding more power to your existing machines by upgrading a server with more RAM, faster (or more) CPUs, or additional storage.

It’s a good approach for simpler architectures but has limitations in how far you can go.

graph LR
    subgraph After["After Vertical Scaling"]
        direction TB
        CPU2[8 CPU Cores]
        RAM2[32GB RAM]
        SSD2[500GB SSD]
    end
    
    subgraph Before["Before Scaling"]
        direction TB
        CPU1[2 CPU Cores]
        RAM1[8GB RAM]
        SSD1[100GB SSD]
    end

    Before --> After
    
    style CPU1 fill:#f9f,stroke:#333
    style RAM1 fill:#bbf,stroke:#333
    style SSD1 fill:#bfb,stroke:#333
    
    style CPU2 fill:#f9f,stroke:#333
    style RAM2 fill:#bbf,stroke:#333
    style SSD2 fill:#bfb,stroke:#333

    classDef default fill:#fff,stroke:#333,stroke-width:2px

2. Horizontal Scaling (Scaling Out)

This means adding more machines to your system to spread the workload across multiple servers.

This is usually the most effective way to scale a system beyond what a single machine can handle, provided the workload can be split across servers.

graph LR
    subgraph Before["Before Scaling"]
        direction TB
        SERVER1[Server<br/>2 CPU Cores<br/>8GB RAM]
    end
    
    subgraph After["After Horizontal Scaling"]
        direction TB
        SERVER2[Server<br/>2 CPU Cores<br/>8GB RAM]
        SERVER3[Server<br/>2 CPU Cores<br/>8GB RAM]
        SERVER4[Server<br/>2 CPU Cores<br/>8GB RAM]
    end

    Before --> After
    
    style SERVER1 fill:#bbf,stroke:#333
    style SERVER2 fill:#bbf,stroke:#333
    style SERVER3 fill:#bbf,stroke:#333
    style SERVER4 fill:#bbf,stroke:#333

    classDef default fill:#fff,stroke:#333,stroke-width:2px

Example: Netflix uses horizontal scaling for its streaming service, adding more servers to its clusters to handle the growing number of users and data traffic.

3. Load Balancing

Load balancing is the process of distributing traffic across multiple servers to ensure no single server becomes overwhelmed.

graph LR
    subgraph Before["Before Scaling"]
        direction TB
        LB1[Load Balancer]
        SERVER1[Server 1<br/>2 CPU Cores<br/>8GB RAM]
    end
    
    subgraph After["After Horizontal Scaling"]
        direction TB
        LB2[Load Balancer]
        SERVER2[Server 1<br/>2 CPU Cores<br/>8GB RAM]
        SERVER3[Server 2<br/>2 CPU Cores<br/>8GB RAM]
        SERVER4[Server 3<br/>2 CPU Cores<br/>8GB RAM]
        
        LB2 --> SERVER2
        LB2 --> SERVER3
        LB2 --> SERVER4
    end

    LB1 --> SERVER1
    Before --> After
    
    style LB1 fill:#f96,stroke:#333
    style LB2 fill:#f96,stroke:#333
    style SERVER1 fill:#bbf,stroke:#333
    style SERVER2 fill:#bbf,stroke:#333
    style SERVER3 fill:#bbf,stroke:#333
    style SERVER4 fill:#bbf,stroke:#333

    classDef default fill:#fff,stroke:#333,stroke-width:2px

Example: Google employs load balancing extensively across its global infrastructure to distribute search queries and traffic evenly across its massive server farms.
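
As a rough illustration of the idea, here is a minimal round-robin balancer in Python. Round-robin is only one of several possible strategies, and the class name and server names are hypothetical, chosen just for this sketch. Production load balancers also track server health and current load, which is left out here.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out backend servers in a fixed, rotating order."""

    def __init__(self, servers):
        servers = list(servers)
        if not servers:
            raise ValueError("at least one server is required")
        # cycle() repeats the server list endlessly.
        self._rotation = cycle(servers)

    def next_server(self):
        # Each call returns the next server in the rotation.
        return next(self._rotation)


# Hypothetical server names, matching the three servers in the diagram.
balancer = RoundRobinBalancer(["server-1", "server-2", "server-3"])
assignments = [balancer.next_server() for _ in range(6)]
print(assignments)
```

With three servers and six requests, each server receives exactly two requests, so no single server takes the whole load.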

4. Caching

Caching is a technique that stores frequently accessed data in fast storage, typically memory (RAM), to reduce the load on the server or database. A cache hit can be served far faster than a database query (roughly 1 ms versus 100 ms in the diagram below).

graph TD
    C[Client]
    CACHE[Cache Layer<br/>Response: ~1ms]
    DB[(Database<br/>Response: ~100ms)]
    
    C --> |Request Data| CACHE
    CACHE --> |Cache Hit| C
    CACHE --> |Cache Miss| DB
    DB --> |Fetch & Store| CACHE
    CACHE --> |Return Data| C
    
    style C fill:#f9f,stroke:#333
    style CACHE fill:#bbf,stroke:#333
    style DB fill:#bfb,stroke:#333
    
    classDef default fill:#fff,stroke:#333,stroke-width:2px

Example: Reddit uses caching to store frequently accessed content like hot posts and comments so that they can be served quickly without querying the database each time.
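
The flow in the diagram (check the cache first, fall back to the database on a miss, then store the result) is often called cache-aside. Below is a minimal in-process sketch in Python. The `CacheAside` class, the `slow_db_lookup` function, and the 60-second expiry are all assumptions made for illustration. A real deployment would more likely use a shared cache server and would need an invalidation strategy for data that changes.

```python
import time


class CacheAside:
    """Serves values from memory and loads from the database only on a miss."""

    def __init__(self, load_from_db, ttl_seconds=60.0):
        self._load_from_db = load_from_db
        self._ttl_seconds = ttl_seconds
        self._entries = {}  # key -> (value, expiry timestamp)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        now = time.monotonic()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            self.hits += 1  # cache hit: no database call
            return entry[0]
        self.misses += 1  # cache miss or expired entry: fetch and store
        value = self._load_from_db(key)
        self._entries[key] = (value, now + self._ttl_seconds)
        return value


db_calls = []


def slow_db_lookup(post_id):
    # Hypothetical stand-in for a real database query.
    db_calls.append(post_id)
    return f"post body for {post_id}"


cache = CacheAside(slow_db_lookup)
first = cache.get("post-42")   # miss: goes to the "database"
second = cache.get("post-42")  # hit: served from memory
print(cache.hits, cache.misses, len(db_calls))
```

The second read never reaches the database, which is where the load reduction comes from.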

5. Content Delivery Networks (CDNs)

A CDN distributes static assets (images, videos, etc.) to servers located closer to users. This reduces latency and results in faster load times.

Example: Cloudflare provides CDN services, speeding up website access for users worldwide by caching content in servers located close to users.

graph TD
    OS[Origin Server<br/>New York]
    
    EDGE1[Edge Server<br/>London]
    EDGE2[Edge Server<br/>Tokyo]
    EDGE3[Edge Server<br/>Sydney]
    
    U1[User<br/>Europe]
    U2[User<br/>Asia]
    U3[User<br/>Australia]
    
    OS --> EDGE1
    OS --> EDGE2
    OS --> EDGE3
    
    U1 --> EDGE1
    U2 --> EDGE2
    U3 --> EDGE3
    
    style OS fill:#f96,stroke:#333
    style EDGE1 fill:#bbf,stroke:#333
    style EDGE2 fill:#bbf,stroke:#333
    style EDGE3 fill:#bbf,stroke:#333
    style U1 fill:#bfb,stroke:#333
    style U2 fill:#bfb,stroke:#333
    style U3 fill:#bfb,stroke:#333
    
    classDef default fill:#fff,stroke:#333,stroke-width:2px

6. Sharding/Partitioning

Partitioning means splitting data or functionality across multiple nodes/servers to distribute workload and avoid bottlenecks.

graph TD
    APP[Application]
    
    subgraph "Shard Key: User ID"
        KEY1[ID: 1-1000]
        KEY2[ID: 1001-2000]
        KEY3[ID: 2001-3000]
    end
    
    subgraph "Database Shards"
        DB1[(Shard 1<br/>Users 1-1000)]
        DB2[(Shard 2<br/>Users 1001-2000)]
        DB3[(Shard 3<br/>Users 2001-3000)]
    end
    
    APP --> KEY1
    APP --> KEY2
    APP --> KEY3
    
    KEY1 --> DB1
    KEY2 --> DB2
    KEY3 --> DB3
    
    style APP fill:#f9f,stroke:#333
    style KEY1 fill:#bfb,stroke:#333
    style KEY2 fill:#bfb,stroke:#333
    style KEY3 fill:#bfb,stroke:#333
    style DB1 fill:#bbf,stroke:#333
    style DB2 fill:#bbf,stroke:#333
    style DB3 fill:#bbf,stroke:#333
    
    classDef default fill:#fff,stroke:#333,stroke-width:2px

Example: Amazon DynamoDB uses partitioning to distribute data and traffic for its NoSQL database service across many servers, ensuring fast performance and scalability.
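
To make the diagram concrete, here is a small Python sketch of range-based shard routing that uses the same user ID ranges. The shard names and the `shard_for_user` function are hypothetical. Range-based routing is only one option; many systems hash the key instead to spread load more evenly.

```python
from bisect import bisect_left

# Inclusive upper bound of each shard's user ID range, as in the diagram.
SHARD_UPPER_BOUNDS = [1000, 2000, 3000]
SHARD_NAMES = ["shard-1", "shard-2", "shard-3"]  # hypothetical names


def shard_for_user(user_id):
    """Returns the name of the shard that stores the given user ID."""
    if user_id < 1 or user_id > SHARD_UPPER_BOUNDS[-1]:
        raise ValueError(f"user ID {user_id} is outside every shard range")
    # bisect_left finds the first range whose upper bound is >= user_id.
    return SHARD_NAMES[bisect_left(SHARD_UPPER_BOUNDS, user_id)]


for user_id in (1, 1000, 1001, 2500):
    print(user_id, shard_for_user(user_id))
```

The application only needs the shard key (here the user ID) to decide which database to query, so each shard handles a fraction of the total data and traffic.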

7. Asynchronous Communication

Asynchronous communication means deferring long-running or non-critical tasks to background queues or message brokers.

This ensures your main application remains responsive to users.

graph LR
    A[App Server<br/>Instant Response]
    Q[Message Queue]
    W1[Worker 1]
    W2[Worker 2]
    W3[Worker 3]
    
    U1[User 1<br/>Sends Message] --> A
    U2[User 2<br/>Continues Using App] --> A
    
    A -->|1 Store Task| Q
    A -->|2 Return Success| U1
    
    Q -->|3a. Process Task| W1
    Q -->|3b. Process Task| W2
    Q -->|3c. Process Task| W3
    
    style A fill:#bbf,stroke:#333
    style Q fill:#f96,stroke:#333
    style W1 fill:#bfb,stroke:#333
    style W2 fill:#bfb,stroke:#333
    style W3 fill:#bfb,stroke:#333
    style U1 fill:#f9f,stroke:#333
    style U2 fill:#f9f,stroke:#333
    
    classDef default fill:#fff,stroke:#333,stroke-width:2px

Example: Slack uses asynchronous communication for messaging. When a message is sent, the sender’s interface doesn’t freeze; it continues to be responsive while the message is processed and delivered in the background.
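
The pattern in the diagram can be sketched with Python's standard `queue` and `threading` modules. This is a single-process illustration only: a real system would typically use a separate message broker and worker processes, and the `handle_request` and `worker` functions are hypothetical names for this example.

```python
import queue
import threading

tasks = queue.Queue()
results = []
results_lock = threading.Lock()


def worker():
    """Pulls tasks off the queue and processes them in the background."""
    while True:
        task = tasks.get()
        if task is None:  # sentinel value: tells this worker to stop
            tasks.task_done()
            break
        with results_lock:
            results.append(f"delivered: {task}")
        tasks.task_done()


def handle_request(message):
    """Stores the task and returns right away, without waiting for delivery."""
    tasks.put(message)
    return "accepted"


workers = [threading.Thread(target=worker) for _ in range(3)]
for thread in workers:
    thread.start()

responses = [handle_request(f"message {i}") for i in range(5)]

tasks.join()  # wait until every queued message has been processed
for _ in workers:
    tasks.put(None)  # one stop sentinel per worker
for thread in workers:
    thread.join()

print(responses)
print(sorted(results))
```

The request handler returns as soon as the task is queued, so its response time does not depend on how long the background work takes.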

8. Microservices Architecture

Microservices architecture breaks an application down into smaller, independent services that can be scaled independently.

This improves resilience and allows teams to work on specific components in parallel.

graph LR
    subgraph "Monolithic"
        M[Monolithic App<br/>Auth + Orders<br/>Products + Cart<br/>Notifications]
    end
    
    subgraph "Microservices"
        A[Auth Service]
        O[Orders Service]
        P[Products Service]
        C[Cart Service]
        N[Notifications Service]
        
        A --> O
        O --> P
        P --> C
        O --> N
    end
    
    Monolithic --> Microservices
    
    style M fill:#f96,stroke:#333
    style A fill:#bbf,stroke:#333
    style O fill:#bbf,stroke:#333
    style P fill:#bbf,stroke:#333
    style C fill:#bbf,stroke:#333
    style N fill:#bbf,stroke:#333
    
    classDef default fill:#fff,stroke:#333,stroke-width:2px

Example: Uber has evolved its architecture into microservices to handle different functions like billing, notifications, and ride matching independently, allowing for efficient scaling and rapid development.

9. Auto-Scaling

Auto-Scaling means automatically adjusting the number of active servers based on the current load.

This ensures that the system can handle spikes in traffic without manual intervention.

graph TD
    subgraph Low["Low Load: 20% CPU"]
        L1[Server 1]
        L2[Server 2]
    end
    
    subgraph Medium["Medium Load: 60% CPU"]
        M1[Server 1]
        M2[Server 2]
        M3[Server 3]
        M4[Server 4]
    end
    
    subgraph High["High Load: 80% CPU"]
        H1[Server 1]
        H2[Server 2]
        H3[Server 3]
        H4[Server 4]
        H5[Server 5]
        H6[Server 6]
    end
    
    Low --> Medium
    Medium --> High
    
    style L1 fill:#bfb,stroke:#333
    style L2 fill:#bfb,stroke:#333
    
    style M1 fill:#f96,stroke:#333
    style M2 fill:#f96,stroke:#333
    style M3 fill:#f96,stroke:#333
    style M4 fill:#f96,stroke:#333
    
    style H1 fill:#f9f,stroke:#333
    style H2 fill:#f9f,stroke:#333
    style H3 fill:#f9f,stroke:#333
    style H4 fill:#f9f,stroke:#333
    style H5 fill:#f9f,stroke:#333
    style H6 fill:#f9f,stroke:#333
    
    classDef default fill:#fff,stroke:#333,stroke-width:2px

Example: AWS Auto Scaling monitors applications and automatically adjusts capacity to maintain steady, predictable performance at the lowest possible cost.
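
One simple way to decide how many servers to run is a proportional rule: scale the current server count by the ratio of observed CPU usage to a target CPU usage. The Python sketch below illustrates that rule only. The function name, the 60% target, and the minimum and maximum limits are assumptions for this example, and it is not a description of how any particular cloud provider implements auto-scaling. Real auto-scalers also add cooldown periods and smoothing to avoid reacting to brief spikes.

```python
import math


def desired_servers(current_servers, avg_cpu_percent,
                    target_cpu_percent=60.0, min_servers=2, max_servers=10):
    """Returns how many servers are needed to bring average CPU near the target."""
    if current_servers < 1:
        raise ValueError("current_servers must be at least 1")
    if target_cpu_percent <= 0:
        raise ValueError("target_cpu_percent must be positive")
    # Proportional rule: more load than the target means more servers.
    wanted = math.ceil(current_servers * avg_cpu_percent / target_cpu_percent)
    # Clamp to the configured limits so the fleet never shrinks or grows too far.
    return max(min_servers, min(max_servers, wanted))


print(desired_servers(4, 90))   # load above target: scale out
print(desired_servers(4, 20))   # load below target: scale in
print(desired_servers(8, 100))  # capped at max_servers
```

Running a check like this on a schedule, and adding or removing servers to match the result, is the core loop that removes the need for manual intervention.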

10. Multi-region Deployment

Multi-region deployment means running the application in multiple data centers or cloud regions to reduce latency and improve redundancy.

graph TD
    subgraph "US Region"
        US_APP[App Servers]
        US_DB[(Database)]
        US_APP --> US_DB
    end
    
    subgraph "EU Region"
        EU_APP[App Servers]
        EU_DB[(Database)]
        EU_APP --> EU_DB
    end
    
    subgraph "Asia Region"
        ASIA_APP[App Servers]
        ASIA_DB[(Database)]
        ASIA_APP --> ASIA_DB
    end
    
    US_DB <-->|Sync| EU_DB
    EU_DB <-->|Sync| ASIA_DB
    ASIA_DB <-->|Sync| US_DB
    
    US_USER[US Users] --> US_APP
    EU_USER[EU Users] --> EU_APP
    ASIA_USER[Asia Users] --> ASIA_APP
    
    style US_APP fill:#bbf,stroke:#333
    style EU_APP fill:#bbf,stroke:#333
    style ASIA_APP fill:#bbf,stroke:#333
    
    style US_DB fill:#f96,stroke:#333
    style EU_DB fill:#f96,stroke:#333
    style ASIA_DB fill:#f96,stroke:#333
    
    style US_USER fill:#bfb,stroke:#333
    style EU_USER fill:#bfb,stroke:#333
    style ASIA_USER fill:#bfb,stroke:#333
    
    classDef default fill:#fff,stroke:#333,stroke-width:2px

Example: Spotify uses multi-region deployments to ensure their music streaming service remains highly available and responsive to users all over the world, regardless of where they are located.
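
A small Python sketch can show how the two benefits (lower latency and redundancy) fit together: send each user to the lowest-latency region that is currently healthy. The region names, latency figures, and `pick_region` function are all hypothetical. In practice this decision is usually made by DNS or a global load balancer rather than application code.

```python
def pick_region(latencies_ms, healthy_regions):
    """Returns the healthy region with the lowest measured latency."""
    candidates = {
        region: latency
        for region, latency in latencies_ms.items()
        if region in healthy_regions
    }
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)


# Hypothetical latencies measured from a user in Europe.
latencies = {"us": 120, "eu": 25, "asia": 210}

normal = pick_region(latencies, {"us", "eu", "asia"})
failover = pick_region(latencies, {"us", "asia"})  # the EU region is down
print(normal, failover)
```

When the nearest region is unavailable, the user is routed to the next-closest one instead of losing service.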