How to Choose the Right Database

In system design interviews, the quality of your design and its ability to scale depend heavily on the database you choose. Choosing the wrong database often leads to pain later, because migrating a database with active users and live data is time-consuming and must be done with care.

In this article, we will cover common use cases that come up often in system design interviews and explore the best databases for each scenario.

The decision process can be sketched as a flowchart (in Mermaid syntax):

flowchart TD
    A[Start: What type of data?] --> B{Primarily Structured?}
    B -->|Yes| C{Need ACID?}
    B -->|No| D{What type?}
    
    C -->|Yes| E[Consider: PostgreSQL/MySQL]
    C -->|No| F{High Write Load?}
    
    F -->|Yes| G[Consider: Cassandra]
    F -->|No| H[Consider: MongoDB]
    
    D -->|Files/Binary| I[Consider: Object Storage\ne.g. S3, Azure Blob]
    D -->|Time Series| J[Consider: InfluxDB/TimescaleDB]
    D -->|Graph Data| K[Consider: Neo4j/Neptune]
    D -->|Search Text| L[Consider: Elasticsearch/Solr]
    
    %% Additional decision points
    L --> M{Need Real-time?}
    M -->|Yes| N[Add Redis Cache]
    M -->|No| O[Standard Setup]
    
    E --> P{Scale Requirements?}
    P -->|Horizontal| Q[Consider: Vitess/CockroachDB]
    P -->|Vertical| R[Standard Setup]
    
    %% Performance considerations
    H --> S{Query Complexity?}
    S -->|High| T[Add Indexes]
    S -->|Low| U[Standard Setup]

1. Structured Data Requiring ACID Compliance

Take a look at large-scale e-commerce platforms like Amazon that handle vast numbers of transactions each day.

Every purchase requires several dependent operations to execute:

  • Choosing an item from inventory
  • Adjusting stock levels
  • Processing customer payment
  • Logging transaction details for business records

These interconnected steps demand strong consistency and ACID properties to prevent data inconsistencies and failed transactions.

  • Atomicity means a purchase won’t ship without successful payment processing
  • Consistency maintains accurate inventory counts, preventing negative stock levels
  • Isolation ensures simultaneous purchases of limited stock items are handled properly
  • Durability guarantees completed transactions remain recorded even if systems fail immediately after

Recommended Solution – Relational Database

For applications requiring rigid data consistency and structured schemas, relational databases such as MySQL or PostgreSQL typically provide the optimal solution.

These systems are engineered specifically to maintain data relationships, enforce business rules, and provide robust transaction handling.
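The all-or-nothing behavior described above can be demonstrated with SQLite (Python's stdlib database, so it runs anywhere); the table and column names here are invented for the example, and a production system would use MySQL or PostgreSQL as recommended:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (item TEXT PRIMARY KEY, stock INTEGER CHECK (stock >= 0))")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT, qty INTEGER)")
conn.execute("INSERT INTO inventory VALUES ('widget', 3)")
conn.commit()

def place_order(item, qty):
    """Decrement stock and record the order atomically; roll back on any failure."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on exception
            conn.execute("UPDATE inventory SET stock = stock - ? WHERE item = ?", (qty, item))
            conn.execute("INSERT INTO orders (item, qty) VALUES (?, ?)", (item, qty))
        return True
    except sqlite3.IntegrityError:  # CHECK constraint blocks negative stock
        return False

place_order("widget", 2)       # succeeds: stock 3 -> 1, order recorded
ok = place_order("widget", 5)  # would drive stock negative: entire transaction rolled back
```

The failed order leaves both tables untouched — stock stays at 1 and no second order row appears — which is exactly the atomicity and consistency guarantee the section describes.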

2. Flexible Schema

A flexible schema enables data storage and management without requiring the strict, predefined data structures typically found in relational systems.

Using a flexible schema approach, individual entries can contain different fields, hierarchical information, and varying data types – perfect for systems where data structures are diverse and change often.

Think about a social networking platform with highly variable user information:

  • Some profiles might include only a few interests and basic biographical details
  • Others could contain many interests, several locations, and preferred activities

Attempting to store this varied information in traditional relational tables would require multiple table structures and frequent schema modifications whenever new fields are introduced.

In contrast, flexible schemas allow immediate structural updates as needed. This requires using a database system capable of rapid field adjustments.
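To make this concrete, here is a minimal sketch of the document model using plain Python dicts as a stand-in for a real document store; the profile fields and the tiny `find` helper are illustrative, not an actual MongoDB API:

```python
# Documents in the same collection can carry different fields — no ALTER TABLE needed.
profiles = [
    {"user": "ana", "interests": ["hiking"], "bio": "Engineer in Lisbon"},
    {"user": "ben", "interests": ["chess", "jazz", "cooking"],
     "locations": ["Berlin", "Austin"], "activities": ["meetups"]},
]

def find(collection, field, value):
    """Match documents whose array field contains value; missing fields simply don't match."""
    return [doc for doc in collection if value in doc.get(field, [])]

chess_fans = find(profiles, "interests", "chess")
```

Adding a new field to future documents requires no migration at all — older documents simply lack it, and queries handle that gracefully.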

Recommended Solutions – Document Databases

  • MongoDB: Utilizes JSON-style document storage, offering developer-friendly flexibility. Perfect for applications like social platforms, content systems, and online retail sites.
  • Couchbase: Merges adaptable JSON document storage with efficient key-value operations. Features robust offline capability support, including built-in data synchronization.

3. Needs Caching

Caching serves as an effective method to enhance system performance by keeping commonly requested data in a rapid access storage tier—usually within memory.

This technique enables subsequent data requests to be handled with significantly reduced latency, eliminating the need to execute resource-intensive queries against slower storage systems or external services.

Recommended Solutions for Caching:

  • Redis: A widely-adopted, memory-based key-value store recognized for both its rapid performance and versatility.
    • Features rich data structure support including lists, sets, hashes, and ordered sets.
    • Offers data persistence capabilities, enabling disk-based backup storage.
    • Well-suited for implementations like scoreboard systems, live analytics, message routing, and user session management.
  • Memcached: A streamlined, distributed caching system operating in memory.
    • Features simple deployment and minimal resource requirements.
    • Particularly effective for applications with basic caching needs, like temporary data storage or query result caching.
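The cache-aside pattern behind both tools can be sketched in a few lines. Here a plain dict with TTL entries stands in for Redis, and `slow_query` is a placeholder for an expensive database call:

```python
import time

cache = {}  # stand-in for Redis: key -> (value, expiry timestamp)
TTL = 60.0

def slow_query(user_id):
    # Placeholder for a resource-intensive database lookup.
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    """Cache-aside: check the cache first, fall back to the slow store on a miss."""
    entry = cache.get(user_id)
    if entry and entry[1] > time.time():
        return entry[0]                       # cache hit: served from memory
    value = slow_query(user_id)               # cache miss: run the expensive query
    cache[user_id] = (value, time.time() + TTL)
    return value

first = get_user(42)   # miss: populates the cache
second = get_user(42)  # hit: no database call
```

With Redis the dict operations become `GET`/`SETEX` calls, and the TTL is enforced server-side instead of checked by the application.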

4. Searching Through Large Textual Data

Numerous modern applications depend on performing fast searches across substantial text-based content.

Consider these implementations:

  • Job Platforms: Services like LinkedIn, Indeed, and Glassdoor enable users to find positions using role-specific terms.
  • E-commerce: Platforms like Amazon, eBay, and Etsy let customers locate products through names, specifications, or groupings.
  • Content Platforms: Sites like Medium, YouTube, and Spotify help users discover content through keyword searches.
  • Documentation Sites: Platforms like Stack Overflow and GitHub allow developers to search through code and technical content.
  • Academic Databases: Services like Google Scholar and JSTOR provide research paper searches.
  • News Aggregators: Applications like Reddit and News APIs require fast article searching.

These applications require sophisticated text search functionality, incorporating features such as result ordering, contextual relevance, and approximate matching.

Traditional database systems, whether relational or NoSQL, typically struggle to deliver the necessary search performance and capabilities.

Recommended Solution – Text Search Engine

Dedicated search engines are designed to process complex text-based queries efficiently while providing advanced search features.

  • Elasticsearch: A scalable, open-source platform for search and analysis built on Apache Lucene. Delivers advanced capabilities including: comprehensive text search with relevance calculations, error-tolerant matching for misspellings, and result highlighting.
  • Apache Solr: Another Lucene-based search solution, recognized for its adaptability and scalability. Offers robust text search features comparable to Elasticsearch, but particularly suited for specialized or custom implementations.
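At their core, both engines build on an inverted index — a map from each term to the documents that contain it. Below is a toy version with naive term-count scoring; real engines layer tokenization, stemming, fuzzy matching, and relevance formulas such as BM25 on top:

```python
from collections import defaultdict

docs = {
    1: "senior backend engineer python postgres",
    2: "frontend engineer react typescript",
    3: "data engineer python spark",
}

# Inverted index: term -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(query):
    """Rank documents by how many query terms each one contains."""
    scores = defaultdict(int)
    for term in query.split():
        for doc_id in index.get(term, set()):
            scores[doc_id] += 1
    return sorted(scores, key=scores.get, reverse=True)

results = search("python engineer")
```

A relational `LIKE '%python%'` scan must read every row; the inverted index jumps straight to the matching documents, which is why dedicated engines stay fast at scale.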

5. File Storage / Object Storage

Contemporary applications frequently require efficient systems for storing and delivering various media assets—including images, video content, audio files, and other substantial binary data.

Consider these scenarios:

  • Streaming Services: Platforms like YouTube, Netflix, and Disney+ must manage and distribute vast video libraries.
  • Social Networks: Applications like Instagram, Facebook, and TikTok handle enormous volumes of user-generated visual content.
  • Media Sharing Platforms: Services like Imgur and Flickr need robust storage and quick delivery of uploaded materials.
  • Cloud Storage: Services like Dropbox and Google Drive store diverse file types.
  • Educational Platforms: Sites like Coursera and Udemy host educational videos and materials.

While conventional relational and NoSQL databases can technically store binary content as BLOBs (Binary Large Objects), they’re not ideal for these scenarios due to several limitations:

  1. Scalability: Large binary file storage in databases leads to increased costs and management complexity as content volumes expand.
  2. Performance: Direct media serving from databases creates unnecessary latency and degrades system responsiveness.
  3. Cost: Database pricing models based on storage and throughput make them financially inefficient for large file management.

Given these constraints, specialized object storage solutions, typically paired with Content Delivery Networks (CDNs), represent the optimal approach.
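Conceptually, object storage is a flat key-to-blob namespace rather than a table. The sketch below uses a dict as a stand-in for a bucket and an invented content-addressed key scheme; a real system would make the equivalent put/get calls through the S3 or Blob Storage SDK:

```python
import hashlib

bucket = {}  # stand-in for an object storage bucket: key -> raw bytes

def put_object(data: bytes, prefix: str = "uploads") -> str:
    """Store a blob under a content-derived key (illustrative naming scheme)."""
    key = f"{prefix}/{hashlib.sha256(data).hexdigest()[:16]}.bin"
    bucket[key] = data
    return key

def get_object(key: str) -> bytes:
    return bucket[key]

key = put_object(b"\x89PNG...fake image bytes")
blob = get_object(key)
```

Content-derived keys also deduplicate identical uploads for free; in production the returned key (or a CDN URL built from it) is what gets stored in the primary database, never the blob itself.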

Recommended Storage Solutions:

  • Object Storage (Primary Storage): Object storage is designed to store large amounts of unstructured data efficiently.
    • Amazon S3: The most widely used object storage service, offering eleven nines (99.999999999%) of durability and seamless integration with AWS services.
    • Azure Blob Storage: A comparable solution from Microsoft Azure, ideal for applications running in the Azure ecosystem.
    • Google Cloud Storage: Provides robust storage capabilities for applications hosted on Google Cloud.
  • Content Delivery Networks (Global Distribution): CDNs cache media files on servers distributed worldwide, ensuring fast and reliable delivery to users regardless of their location.
    • Amazon CloudFront: Integrated with Amazon S3, providing seamless delivery and support for dynamic and static content.
    • Cloudflare: Known for its high performance and robust security features, often used for websites and media-heavy applications.

6. Highly Connected Data

In modern systems, data often exists as a web of interconnected entities rather than simple tabular information.

These scenarios prioritize the connections between data points as much as, or even more than, the data points themselves – what we term highly connected data.

Consider these implementations:

  • Social Platforms: Services like Facebook, LinkedIn, and Twitter analyze user connections to recommend friends, detect groups, and map relationship networks.
  • Recommendation Engines: Sites like Amazon, Netflix, and Spotify suggest content by analyzing connections between users, products, and preferences.
  • Knowledge Graphs: Google, Wikipedia, and research databases link concepts to provide context-rich information.
  • Fraud Detection: Financial systems track transaction patterns and relationships to identify suspicious activities.
  • Supply Chain Systems: Organizations map complex networks of suppliers, products, and logistics.

While traditional relational databases can represent relationships through foreign keys and joins, they struggle with highly connected data for several reasons:

  1. Query Complexity: Multi-level relationship queries like “connections of connections of connections” become inefficient in standard RDBMS.
  2. Performance Issues: Analyzing deep relationship chains requires resource-intensive joins, leading to slower performance as data expands.
  3. Schema Limitations: The rigid structure of relational databases makes it challenging to adapt to evolving relationship patterns.
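The "connections of connections" query that strains an RDBMS is a natural graph traversal. Here is a small sketch over an in-memory adjacency list — the structure a graph database stores natively as nodes and edges; the names and the `follows` relation are invented:

```python
from collections import deque

# Adjacency list: node -> outgoing edges.
follows = {
    "ana": ["ben", "cho"],
    "ben": ["dia"],
    "cho": ["dia", "eli"],
    "dia": ["fay"],
    "eli": [],
    "fay": [],
}

def within_hops(graph, start, max_hops):
    """BFS: everyone reachable within max_hops — 'connections of connections'."""
    seen, frontier, reachable = {start}, deque([(start, 0)]), set()
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                reachable.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return reachable

two_hops = within_hops(follows, "ana", 2)
```

In SQL each extra hop means another self-join on the edges table; a graph database (or this in-memory traversal) just follows pointers, so cost grows with the neighborhood size rather than the table size.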

Recommended Solution – Graph Database

Graph databases are specifically engineered for connected data, utilizing nodes (entities) and edges (relationships) for efficient data organization and traversal.

  • Neo4j: A leading graph database optimized for connected data operations. Features Cypher, a specialized query language for efficient graph navigation.
  • Amazon Neptune: A cloud-based managed graph database service supporting various graph models, including property graphs and RDF.

7. Metrics Data and Time Series

Time series data represents chronological sequences of measurements, each with an associated timestamp, showing the progression of a system, metric, or process across time.

Common applications include monitoring server performance metrics, tracking financial indicators, recording environmental sensors, and analyzing user behavior patterns.

Such data typically flows in at rapid intervals, demanding systems capable of efficient data capture, storage, and access for both immediate system monitoring and long-term pattern analysis.

While conventional relational and NoSQL databases can technically store time-based data, they’re not typically ideal for these specific requirements.
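The characteristic time series query is aggregation over fixed time buckets (downsampling). Below is a minimal sketch with invented CPU samples — roughly what InfluxDB's `GROUP BY time(1m)` or TimescaleDB's `time_bucket` computes, though real TSDBs also compress and index the data for this access pattern:

```python
from collections import defaultdict

# (unix timestamp, cpu %) samples arriving every few seconds — values are invented.
samples = [(1000, 40.0), (1012, 44.0), (1055, 50.0), (1061, 90.0), (1119, 70.0)]

def downsample(points, bucket_seconds):
    """Average samples into fixed-width time buckets keyed by bucket start time."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}

per_minute = downsample(samples, 60)  # one averaged point per 60-second window
```

Downsampling like this is also how TSDBs keep long retention affordable: raw samples are rolled up into coarser buckets as they age.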

Recommended Solutions:

  • Time Series Databases (TSDBs): Specifically engineered for temporal data management, offering optimized mechanisms for storing, indexing, and querying time-stamped information.
    • InfluxDB: Purpose-built temporal database providing sophisticated querying capabilities.
    • TimescaleDB: Extends PostgreSQL’s functionality with specialized time series features and capabilities.
  • Wide-Column Databases: Suitable for scenarios demanding distributed architecture and high-speed write operations, these systems effectively handle temporal data streams.
    • Apache Cassandra: Engineered for rapid write operations and distributed deployment, well-suited for time series applications.

The database choice often depends on factors like data volume, query patterns, and scaling requirements.

8. Large-scale Analytics

When dealing with vast and continuously expanding datasets—reaching terabytes or petabytes in scale—you need database systems specifically optimized for large-scale analytical processing.

These implementations typically require handling intricate queries, extensive data aggregations, comprehensive reporting, and integration with AI/ML workflows.

Consider these scenarios:

  • Digital Retail Platforms examining customer patterns, sales metrics, and stock management.
  • Social Networks extracting insights from countless daily user engagements.
  • Connected Device Networks handling data streams from distributed sensor arrays.
  • Financial Systems analyzing market trends and transaction patterns.
  • Healthcare Networks processing patient data and treatment outcomes.

Managing such workloads demands database solutions engineered for rapid data retrieval and aggregation, capable of maintaining performance as data volumes expand dramatically.
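The core idea behind the columnar systems recommended below can be shown with a toy comparison: an aggregate over a row layout must touch every field of every row, while a column layout scans one contiguous array. The order data here is invented:

```python
# Row-oriented layout: each record is stored together, as in a typical OLTP database.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 200.0},
]

# Column-oriented layout: each column is a contiguous array. An aggregate reads
# only the columns it needs, and same-typed arrays compress far better.
columns = {
    "order_id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "amount": [120.0, 80.0, 200.0],
}

row_total = sum(r["amount"] for r in rows)   # reads whole rows to use one field
col_total = sum(columns["amount"])           # scans a single array
```

On three rows the difference is invisible; on billions of rows with wide schemas, reading one column instead of all of them is the reason Redshift, Snowflake, and BigQuery answer analytical queries quickly.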

Recommended Solutions:

  • Columnar Databases: Systems that organize data by columns rather than rows, optimized for analytical operations across large datasets.
    • Amazon Redshift: Managed service offering high scalability, featuring SQL analytics capabilities and AWS ecosystem integration.
    • Snowflake: Modern cloud data warehouse providing independent scaling of compute and storage resources.
    • Google BigQuery: Serverless analytics platform designed for massive-scale data processing and ad-hoc analysis.
  • Wide-Column Stores: Databases architected for analytical operations requiring substantial write capacity.
    • Apache Cassandra: Built for distributed operations and high-volume write handling, commonly deployed in real-time analytics.
    • HBase: Hadoop-based solution offering quick random access operations, frequently utilized within Hadoop analytics environments.

9. Spatial Data

Spatial data encompasses any information that describes a position, form, or region on Earth. This primarily involves recording geographic coordinates (latitude and longitude) for points or polygons to define areas.

When your application’s core functionality depends on location (where) just as much as characteristics (what), you’ll require a database with built-in support for geographic data types and location-based queries.

Consider these scenarios:

  • Companies like Uber providing ride-sharing services must identify available drivers nearby and determine travel distances instantly.
  • Platforms like Uber Eats or Zomato offering food delivery need to efficiently identify restaurants and customers in close proximity.

Working with spatial data demands specialized indexing and query capabilities to compute geometric relationships including distance measurements, area intersections, and proximity checks.

Standard indexing methods like B-trees or hash tables aren’t designed for handling multi-dimensional spatial information. Instead, specialized spatial structures including R-trees, Quadtrees, and Geohashes are employed to efficiently filter geographic search spaces.
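The primitive underneath queries like PostGIS's ST_DWithin or MongoDB's $near is a great-circle distance check. Here is the brute-force version with invented driver coordinates; in practice a spatial index (R-tree, Quadtree, or Geohash) prunes most candidates before any distance is computed:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km: mean Earth radius

# Find drivers within 5 km of a rider — the unindexed version of an
# ST_DWithin / $near query. Coordinates are invented (two in London, one in Berlin).
drivers = {"d1": (51.5074, -0.1278), "d2": (51.5155, -0.0922), "d3": (52.5200, 13.4050)}
rider = (51.5099, -0.1180)
nearby = [d for d, (lat, lon) in drivers.items()
          if haversine_km(rider[0], rider[1], lat, lon) <= 5.0]
```

Scanning every driver like this is O(n) per request; a spatial index first narrows the search to the rider's grid cells or bounding rectangles, then runs the exact distance check only on those few candidates.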

Database Recommendations:

  • PostgreSQL with PostGIS: A widely adopted solution for geographic data storage and querying.
    • Supports sophisticated geographic operations (such as ST_Distance, ST_Within, ST_Intersects).
    • Features ACID compliance for data consistency, perfect for applications needing both location-based queries and transaction guarantees.
  • MongoDB (Geospatial Indexing): Provides geographic indexing capabilities for circular and polygonal area searches.
    • Implements 2dsphere indexes for Earth-based queries and 2d indexes for planar coordinate systems.
    • Facilitates rapid proximity searches, like finding locations within specified distances ($near queries).
    • Particularly suitable for teams preferring schema flexibility with a NoSQL approach.
