How to Choose the Right Database

In system design interviews, the quality of your design and its ability to scale depend heavily on the database you choose. Choosing the wrong database often leads to pain later, because migrating a database with active users and live data is time-consuming and must be done with care.

In this article, we will cover common use cases that come up often in system design interviews and explore the best databases for each scenario.

The decision process can be sketched as a flowchart (in Mermaid syntax):

flowchart TD
    A[Start: What type of data?] --> B{Primarily Structured?}
    B -->|Yes| C{Need ACID?}
    B -->|No| D{What type?}
    
    C -->|Yes| E[Consider: PostgreSQL/MySQL]
    C -->|No| F{High Write Load?}
    
    F -->|Yes| G[Consider: Cassandra]
    F -->|No| H[Consider: MongoDB]
    
    D -->|Files/Binary| I[Consider: Object Storage\ne.g. S3, Azure Blob]
    D -->|Time Series| J[Consider: InfluxDB/TimescaleDB]
    D -->|Graph Data| K[Consider: Neo4j/Neptune]
    D -->|Search Text| L[Consider: Elasticsearch/Solr]
    
    %% Additional decision points
    L --> M{Need Real-time?}
    M -->|Yes| N[Add Redis Cache]
    M -->|No| O[Standard Setup]
    
    E --> P{Scale Requirements?}
    P -->|Horizontal| Q[Consider: Vitess/CockroachDB]
    P -->|Vertical| R[Standard Setup]
    
    %% Performance considerations
    H --> S{Query Complexity?}
    S -->|High| T[Add Indexes]
    S -->|Low| U[Standard Setup]

1. Structured Data Requiring ACID Compliance

Take a look at large-scale e-commerce platforms like Amazon that handle vast numbers of transactions each day.

Every purchase requires several dependent operations to execute:

  • Choosing an item from inventory
  • Adjusting stock levels
  • Processing customer payment
  • Logging transaction details for business records

These interconnected steps demand strong consistency and ACID properties to prevent data inconsistencies and failed transactions.

  • Atomicity means a purchase won’t ship without successful payment processing
  • Consistency maintains accurate inventory counts, preventing negative stock levels
  • Isolation ensures simultaneous purchases of limited stock items are handled properly
  • Durability guarantees completed transactions remain recorded even if systems fail immediately after

Recommended Solution – Relational Database

For applications requiring rigid data consistency and structured schemas, relational databases such as MySQL or PostgreSQL typically provide the optimal solution.

These systems are engineered specifically to maintain data relationships, enforce business rules, and provide robust transaction handling.
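The all-or-nothing behavior described above can be demonstrated with SQLite (Python's stdlib database, so it runs anywhere); the table and column names here are invented for the example, and a production system would use MySQL or PostgreSQL as recommended:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (item TEXT PRIMARY KEY, stock INTEGER CHECK (stock >= 0))")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT, qty INTEGER)")
conn.execute("INSERT INTO inventory VALUES ('widget', 3)")
conn.commit()

def place_order(item, qty):
    """Decrement stock and record the order atomically; roll back on any failure."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on exception
            conn.execute("UPDATE inventory SET stock = stock - ? WHERE item = ?", (qty, item))
            conn.execute("INSERT INTO orders (item, qty) VALUES (?, ?)", (item, qty))
        return True
    except sqlite3.IntegrityError:  # CHECK constraint blocks negative stock
        return False

place_order("widget", 2)       # succeeds: stock 3 -> 1, order recorded
ok = place_order("widget", 5)  # would drive stock negative: entire transaction rolled back
```

The failed order leaves both tables untouched — stock stays at 1 and no second order row appears — which is exactly the atomicity and consistency guarantee the section describes.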

2. Flexible Schema

A flexible schema enables data storage and management without requiring the strict, predefined data structures typically found in relational systems.

Using a flexible schema approach, individual entries can contain different fields, hierarchical information, and varying data types – perfect for systems where data structures are diverse and change often.

Think about a social networking platform with highly variable user information:

  • Some profiles might include only a few interests and basic biographical details
  • Others could contain many interests, several locations, and preferred activities

Attempting to store this varied information in traditional relational tables would require multiple table structures and frequent schema modifications whenever new fields are introduced.

In contrast, flexible schemas allow immediate structural updates as needed. This requires using a database system capable of rapid field adjustments.
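To make this concrete, here is a minimal sketch of the document model using plain Python dicts as a stand-in for a real document store; the profile fields and the tiny `find` helper are illustrative, not an actual MongoDB API:

```python
# Documents in the same collection can carry different fields — no ALTER TABLE needed.
profiles = [
    {"user": "ana", "interests": ["hiking"], "bio": "Engineer in Lisbon"},
    {"user": "ben", "interests": ["chess", "jazz", "cooking"],
     "locations": ["Berlin", "Austin"], "activities": ["meetups"]},
]

def find(collection, field, value):
    """Match documents whose array field contains value; missing fields simply don't match."""
    return [doc for doc in collection if value in doc.get(field, [])]

chess_fans = find(profiles, "interests", "chess")
```

Adding a new field to future documents requires no migration at all — older documents simply lack it, and queries handle that gracefully.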

Recommended Solutions – Document Databases

  • MongoDB: Utilizes JSON-style document storage, offering developer-friendly flexibility. Perfect for applications like social platforms, content systems, and online retail sites.
  • Couchbase: Merges adaptable JSON document storage with efficient key-value operations. Features robust offline capability support, including built-in data synchronization.

3. Needs Caching

Caching serves as an effective method to enhance system performance by keeping commonly requested data in a rapid access storage tier—usually within memory.

This technique enables subsequent data requests to be handled with significantly reduced latency, eliminating the need to execute resource-intensive queries against slower storage systems or external services.

Recommended Solutions for Caching:

  • Redis: A widely-adopted, memory-based key-value store recognized for both its rapid performance and versatility.
    • Features rich data structure support including lists, sets, hashes, and ordered sets.
    • Offers data persistence capabilities, enabling disk-based backup storage.
    • Well-suited for implementations like scoreboard systems, live analytics, message routing, and user session management.
  • Memcached: A streamlined, distributed caching system operating in memory.
    • Features simple deployment and minimal resource requirements.
    • Particularly effective for applications with basic caching needs, like temporary data storage or query result caching.
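The cache-aside pattern behind both tools can be sketched in a few lines. Here a plain dict with TTL entries stands in for Redis, and `slow_query` is a placeholder for an expensive database call:

```python
import time

cache = {}  # stand-in for Redis: key -> (value, expiry timestamp)
TTL = 60.0

def slow_query(user_id):
    # Placeholder for a resource-intensive database lookup.
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    """Cache-aside: check the cache first, fall back to the slow store on a miss."""
    entry = cache.get(user_id)
    if entry and entry[1] > time.time():
        return entry[0]                       # cache hit: served from memory
    value = slow_query(user_id)               # cache miss: run the expensive query
    cache[user_id] = (value, time.time() + TTL)
    return value

first = get_user(42)   # miss: populates the cache
second = get_user(42)  # hit: no database call
```

With Redis the dict operations become `GET`/`SETEX` calls, and the TTL is enforced server-side instead of checked by the application.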

4. Searching Through Large Textual Data

Numerous modern applications depend on performing fast searches across substantial text-based content.

Consider these implementations:

  • Job Platforms: Services like LinkedIn, Indeed, and Glassdoor enable users to find positions using role-specific terms.
  • E-commerce: Platforms like Amazon, eBay, and Etsy let customers locate products through names, specifications, or groupings.
  • Content Platforms: Sites like Medium, YouTube, and Spotify help users discover content through keyword searches.
  • Documentation Sites: Platforms like Stack Overflow and GitHub allow developers to search through code and technical content.
  • Academic Databases: Services like Google Scholar and JSTOR provide research paper searches.
  • News Aggregators: Applications like Reddit and News APIs require fast article searching.

These applications require sophisticated text search functionality, incorporating features such as result ordering, contextual relevance, and approximate matching.

Traditional database systems, whether relational or NoSQL, typically struggle to deliver the necessary search performance and capabilities.

Recommended Solution – Text Search Engine

Dedicated search engines are designed to process complex text-based queries efficiently while providing advanced search features.

  • Elasticsearch: A scalable, open-source platform for search and analysis built on Apache Lucene. Delivers advanced capabilities including: comprehensive text search with relevance calculations, error-tolerant matching for misspellings, and result highlighting.
  • Apache Solr: Another Lucene-based search solution, recognized for its adaptability and scalability. Offers robust text search features comparable to Elasticsearch, but particularly suited for specialized or custom implementations.
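At their core, both engines build on an inverted index — a map from each term to the documents that contain it. Below is a toy version with naive term-count scoring; real engines layer tokenization, stemming, fuzzy matching, and relevance formulas such as BM25 on top:

```python
from collections import defaultdict

docs = {
    1: "senior backend engineer python postgres",
    2: "frontend engineer react typescript",
    3: "data engineer python spark",
}

# Inverted index: term -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(query):
    """Rank documents by how many query terms each one contains."""
    scores = defaultdict(int)
    for term in query.split():
        for doc_id in index.get(term, set()):
            scores[doc_id] += 1
    return sorted(scores, key=scores.get, reverse=True)

results = search("python engineer")
```

A relational `LIKE '%python%'` scan must read every row; the inverted index jumps straight to the matching documents, which is why dedicated engines stay fast at scale.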

5. File Storage / Object Storage

Contemporary applications frequently require efficient systems for storing and delivering various media assets—including images, video content, audio files, and other substantial binary data.

Consider these scenarios:

  • Streaming Services: Platforms like YouTube, Netflix, and Disney+ must manage and distribute vast video libraries.
  • Social Networks: Applications like Instagram, Facebook, and TikTok handle enormous volumes of user-generated visual content.
  • Media Sharing Platforms: Services like Imgur and Flickr need robust storage and quick delivery of uploaded materials.
  • Cloud Storage: Services like Dropbox and Google Drive store diverse file types.
  • Educational Platforms: Sites like Coursera and Udemy host educational videos and materials.

While conventional relational and NoSQL databases can technically store binary content as BLOBs (Binary Large Objects), they’re not ideal for these scenarios due to several limitations:

  1. Scalability: Large binary file storage in databases leads to increased costs and management complexity as content volumes expand.
  2. Performance: Direct media serving from databases creates unnecessary latency and degrades system responsiveness.
  3. Cost: Database pricing models based on storage and throughput make them financially inefficient for large file management.

Given these constraints, specialized object storage solutions, typically paired with Content Delivery Networks (CDNs), represent the optimal approach.
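Conceptually, object storage is a flat key-to-blob namespace rather than a table. The sketch below uses a dict as a stand-in for a bucket and an invented content-addressed key scheme; a real system would make the equivalent put/get calls through the S3 or Blob Storage SDK:

```python
import hashlib

bucket = {}  # stand-in for an object storage bucket: key -> raw bytes

def put_object(data: bytes, prefix: str = "uploads") -> str:
    """Store a blob under a content-derived key (illustrative naming scheme)."""
    key = f"{prefix}/{hashlib.sha256(data).hexdigest()[:16]}.bin"
    bucket[key] = data
    return key

def get_object(key: str) -> bytes:
    return bucket[key]

key = put_object(b"\x89PNG...fake image bytes")
blob = get_object(key)
```

Content-derived keys also deduplicate identical uploads for free; in production the returned key (or a CDN URL built from it) is what gets stored in the primary database, never the blob itself.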

Recommended Storage Solutions:

  • Object Storage (Primary Storage): Object storage is designed to store large amounts of unstructured data efficiently.
    • Amazon S3: The most widely used object storage service, offering eleven nines (99.999999999%) of durability and seamless integration with AWS services.
    • Azure Blob Storage: A comparable solution from Microsoft Azure, ideal for applications running in the Azure ecosystem.
    • Google Cloud Storage: Provides robust storage capabilities for applications hosted on Google Cloud.
  • Content Delivery Networks (Global Distribution): CDNs cache media files on servers distributed worldwide, ensuring fast and reliable delivery to users regardless of their location.
    • Amazon CloudFront: Integrated with Amazon S3, providing seamless delivery and support for dynamic and static content.
    • Cloudflare: Known for its high performance and robust security features, often used for websites and media-heavy applications.

6. Highly Connected Data

In modern systems, data often exists as a web of interconnected entities rather than simple tabular information.

These scenarios prioritize the connections between data points as much as, or even more than, the data points themselves – what we term highly connected data.

Consider these implementations:

  • Social Platforms: Services like Facebook, LinkedIn, and Twitter analyze user connections to recommend friends, detect groups, and map relationship networks.
  • Recommendation Engines: Sites like Amazon, Netflix, and Spotify suggest content by analyzing connections between users, products, and preferences.
  • Knowledge Graphs: Google, Wikipedia, and research databases link concepts to provide context-rich information.
  • Fraud Detection: Financial systems track transaction patterns and relationships to identify suspicious activities.
  • Supply Chain Systems: Organizations map complex networks of suppliers, products, and logistics.

While traditional relational databases can represent relationships through foreign keys and joins, they struggle with highly connected data for several reasons:

  1. Query Complexity: Multi-level relationship queries like “connections of connections of connections” become inefficient in standard RDBMS.
  2. Performance Issues: Analyzing deep relationship chains requires resource-intensive joins, leading to slower performance as data expands.
  3. Schema Limitations: The rigid structure of relational databases makes it challenging to adapt to evolving relationship patterns.
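The "connections of connections" query that strains an RDBMS is a natural graph traversal. Here is a small sketch over an in-memory adjacency list — the structure a graph database stores natively as nodes and edges; the names and the `follows` relation are invented:

```python
from collections import deque

# Adjacency list: node -> outgoing edges.
follows = {
    "ana": ["ben", "cho"],
    "ben": ["dia"],
    "cho": ["dia", "eli"],
    "dia": ["fay"],
    "eli": [],
    "fay": [],
}

def within_hops(graph, start, max_hops):
    """BFS: everyone reachable within max_hops — 'connections of connections'."""
    seen, frontier, reachable = {start}, deque([(start, 0)]), set()
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                reachable.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return reachable

two_hops = within_hops(follows, "ana", 2)
```

In SQL each extra hop means another self-join on the edges table; a graph database (or this in-memory traversal) just follows pointers, so cost grows with the neighborhood size rather than the table size.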

Recommended Solution – Graph Database

Graph databases are specifically engineered for connected data, utilizing nodes (entities) and edges (relationships) for efficient data organization and traversal.

  • Neo4j: A leading graph database optimized for connected data operations. Features Cypher, a specialized query language for efficient graph navigation.
  • Amazon Neptune: A cloud-based managed graph database service supporting various graph models, including property graphs and RDF.

7. Metrics Data and Time Series

Time series data represents chronological sequences of measurements, each with an associated timestamp, showing the progression of a system, metric, or process across time.

Common applications include monitoring server performance metrics, tracking financial indicators, recording environmental sensors, and analyzing user behavior patterns.

Such data typically flows in at rapid intervals, demanding systems capable of efficient data capture, storage, and access for both immediate system monitoring and long-term pattern analysis.

While conventional relational and NoSQL databases can technically store time-based data, they’re not typically ideal for these specific requirements.
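The characteristic time series query is aggregation over fixed time buckets (downsampling). Below is a minimal sketch with invented CPU samples — roughly what InfluxDB's `GROUP BY time(1m)` or TimescaleDB's `time_bucket` computes, though real TSDBs also compress and index the data for this access pattern:

```python
from collections import defaultdict

# (unix timestamp, cpu %) samples arriving every few seconds — values are invented.
samples = [(1000, 40.0), (1012, 44.0), (1055, 50.0), (1061, 90.0), (1119, 70.0)]

def downsample(points, bucket_seconds):
    """Average samples into fixed-width time buckets keyed by bucket start time."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}

per_minute = downsample(samples, 60)  # one averaged point per 60-second window
```

Downsampling like this is also how TSDBs keep long retention affordable: raw samples are rolled up into coarser buckets as they age.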

Recommended Solutions:

  • Time Series Databases (TSDBs): Specifically engineered for temporal data management, offering optimized mechanisms for storing, indexing, and querying time-stamped information.
    • InfluxDB: Purpose-built temporal database providing sophisticated querying capabilities.
    • TimescaleDB: Extends PostgreSQL’s functionality with specialized time series features and capabilities.
  • Wide-Column Databases: Suitable for scenarios demanding distributed architecture and high-speed write operations, these systems effectively handle temporal data streams.
    • Apache Cassandra: Engineered for rapid write operations and distributed deployment, well-suited for time series applications.

The database choice often depends on factors like data volume, query patterns, and scaling requirements.

8. Large-scale Analytics

When dealing with vast and continuously expanding datasets—reaching terabytes or petabytes in scale—you need database systems specifically optimized for large-scale analytical processing.

These implementations typically require handling intricate queries, extensive data aggregations, comprehensive reporting, and integration with AI/ML workflows.

Consider these scenarios:

  • Digital Retail Platforms examining customer patterns, sales metrics, and stock management.
  • Social Networks extracting insights from countless daily user engagements.
  • Connected Device Networks handling data streams from distributed sensor arrays.
  • Financial Systems analyzing market trends and transaction patterns.
  • Healthcare Networks processing patient data and treatment outcomes.

Managing such workloads demands database solutions engineered for rapid data retrieval and aggregation, capable of maintaining performance as data volumes expand dramatically.
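The core idea behind the columnar systems recommended below can be shown with a toy comparison: an aggregate over a row layout must touch every field of every row, while a column layout scans one contiguous array. The order data here is invented:

```python
# Row-oriented layout: each record is stored together, as in a typical OLTP database.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 200.0},
]

# Column-oriented layout: each column is a contiguous array. An aggregate reads
# only the columns it needs, and same-typed arrays compress far better.
columns = {
    "order_id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "amount": [120.0, 80.0, 200.0],
}

row_total = sum(r["amount"] for r in rows)   # reads whole rows to use one field
col_total = sum(columns["amount"])           # scans a single array
```

On three rows the difference is invisible; on billions of rows with wide schemas, reading one column instead of all of them is the reason Redshift, Snowflake, and BigQuery answer analytical queries quickly.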

Recommended Solutions:

  • Columnar Databases: Systems that organize data by columns rather than rows, optimized for analytical operations across large datasets.
    • Amazon Redshift: Managed service offering high scalability, featuring SQL analytics capabilities and AWS ecosystem integration.
    • Snowflake: Modern cloud data warehouse providing independent scaling of compute and storage resources.
    • Google BigQuery: Serverless analytics platform designed for massive-scale data processing and ad-hoc analysis.
  • Wide-Column Stores: Databases architected for analytical operations requiring substantial write capacity.
    • Apache Cassandra: Built for distributed operations and high-volume write handling, commonly deployed in real-time analytics.
    • HBase: Hadoop-based solution offering quick random access operations, frequently utilized within Hadoop analytics environments.

9. Spatial Data

Spatial data encompasses any information that describes a position, form, or region on Earth. This primarily involves recording geographic coordinates (latitude and longitude) for points or polygons to define areas.

When your application’s core functionality depends on location (where) just as much as characteristics (what), you’ll require a database with built-in support for geographic data types and location-based queries.

Consider these scenarios:

  • Companies like Uber providing ride-sharing services must identify available drivers nearby and determine travel distances instantly.
  • Platforms like Uber Eats or Zomato offering food delivery need to efficiently identify restaurants and customers in close proximity.

Working with spatial data demands specialized indexing and query capabilities to compute geometric relationships including distance measurements, area intersections, and proximity checks.

Standard indexing methods like B-trees or hash tables aren’t designed for handling multi-dimensional spatial information. Instead, specialized spatial structures including R-trees, Quadtrees, and Geohashes are employed to efficiently filter geographic search spaces.
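The primitive underneath queries like PostGIS's ST_DWithin or MongoDB's $near is a great-circle distance check. Here is the brute-force version with invented driver coordinates; in practice a spatial index (R-tree, Quadtree, or Geohash) prunes most candidates before any distance is computed:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km: mean Earth radius

# Find drivers within 5 km of a rider — the unindexed version of an
# ST_DWithin / $near query. Coordinates are invented (two in London, one in Berlin).
drivers = {"d1": (51.5074, -0.1278), "d2": (51.5155, -0.0922), "d3": (52.5200, 13.4050)}
rider = (51.5099, -0.1180)
nearby = [d for d, (lat, lon) in drivers.items()
          if haversine_km(rider[0], rider[1], lat, lon) <= 5.0]
```

Scanning every driver like this is O(n) per request; a spatial index first narrows the search to the rider's grid cells or bounding rectangles, then runs the exact distance check only on those few candidates.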

Database Recommendations:

  • PostgreSQL with PostGIS: A widely adopted solution for geographic data storage and querying.
    • Supports sophisticated geographic operations (such as ST_Distance, ST_Within, ST_Intersects).
    • Features ACID compliance for data consistency, perfect for applications needing both location-based queries and transaction guarantees.
  • MongoDB (Geospatial Indexing): Provides geographic indexing capabilities for circular and polygonal area searches.
    • Implements 2dsphere indexes for Earth-based queries and 2d indexes for planar coordinate systems.
    • Facilitates rapid proximity searches, like finding locations within specified distances ($near queries).
    • Particularly suitable for teams preferring schema flexibility with a NoSQL approach.
