dashboard image

Objectives :

Databases

dashboard image
The Problem

As some point, if you're developing an application, you'll need to pause, and reflect on the right approach to store/retrieve/protect the data persisted.

As illustrated in the following diagram, before starting to implement anything, you need to understand the trade-offs, ask yourself these questions :

dashboard image

Choosing the right database will impact the performance of your application, so it's important to understand which type of databases are the most appropriate for your use case.

dashboard image

Which type of databases for my use case ?

Your application might be a monolith or microservices, either way, the structure of the data persisted might be more efficient for some services and not for others. You often have to choose between 2 main type of databases :

Relational Database (SQL)

Non-Relational Database (NoSQL)


Relational Database (SQL)

Relational databases are the most common database used by professionals. The main advantage is to provide strong consistency for your application.

Relational databases offer ACID properties : Atomicity, Consistency, Isolation and Durability.

One other aspect, is that relational database require a schema (fixed structure before storing data), this provides more control on the type of data your application persist.

If you need strong relationship (a query to join multiple tables, etc ..), relational database are the most appropriate choice.

To summarize, if your application require the following, you may choose a relational database (SQL):

ACID properties

Data needs to be structured and consistent

The transactions are a strong priority

You have to perform complex queries between data (joins, etc..)

You need in advance to set a fixed schema (structure of the data persisted)

            Customers table in a SQL database
            
            | CustomerID | FirstName | LastName | Email                  |
            |------------|-----------|----------|------------------------|
            | 101        | Bob       | Smith    | bob.smith@email.com  |
            | 102        | Alice     | Johnson  | alice.j@email.com      |
            | 103        | John      | Doe      | john.doe@email.com     |
            
            Students table
            
            | StudentID | Name         | Age | GPA  | GraduationYear |
            |-----------|--------------|-----|------|----------------|
            | S001      | Daniel Monto | 20  | 3.7  | 2026           |
            | S002      | Emma Brown   | 22  | 3.9  | 2024           |
            | S003      | Liz Brian    | 21  | 3.5  | 2025           |
              

Use-case 1 (Financial industry) : If you develop an application for the financial sector, there is chance that you may want to choose a relational database, for strong consistency , to prevent double spending or balances. You may also have to perform complex queries between customers, accounts and transactions.

Use-case 2 (Ecommerce platform) : If you develop an ecommerce application, you probably need to have a structured data to store your products, categories, customers, orders or reviews. You may need to perform complex queries between the customer and the orders, and you need to guarantee transactions (strong consistency) to ensure a reliable inventory management or checkout.

Use-case 3 (Healthcare systems) : If you develop a healthcare application, and as you store sensitive data, you probably need strong consistency, and integrity. Also, healthcare industries are heavily regulated, so having a relational database will ensure your data persisted follow the latest compliance and regulations.


Non-Relational Database (NoSQL)

Non-Relational databases offer more flexibility. You don't need to have strong consistency and you don't need a fixed schema before storing your data. NoSQL databases are more appropriate for use cases where you need to store unstructured or semi-structured data (JSON Document, Key Value), and you think about having lower latency.

Types of unstructured data Examples
Text documents Emails, chat messages, Word/PDF files
Media files Images, videos, audio recordings
Social media posts Tweets, comments, likes
Logs & event data Server logs, clickstreams, IoT device messages
Web pages HTML content with mixed text, links, metadata.
        JSON Document
      [
        {
          "user_id": 123,
          "name": "Alice",
          "purchases": ["Book", "Laptop", "Headphones"],
          "last_login": "2025-09-14"
        },
        {
          "user_id": 124,
          "name": "Bob",
          "purchases": ["Book", "Laptop", "Headphones"],
          "last_login": "2025-09-14"
        },
        {
          "user_id": 125,
          "name": "John",
          "purchases": ["Book", "Laptop", "Headphones"],
          "last_login": "2025-09-14"
        }
      ]
            

NoSQL databases are a good choice, if you don't want to deal with strict rules to manage your data and the transactions.

To summarize, if your application require the following, you may choose a non-relational database (NoSQL):

Unstructured or semi-structured data, and it's flexible

You expect a huge volume of data, and you don't want to worry about horizontal scaling

You need to have a simple design, with the potential to scale fast.


Use-case 1 (Real-time analytics and Big Data) : You develop an application, at some point you may need to understand more about the users behavior and activity, this will create a huge volume of data (logs, clickstreams, ...), the best way is to use NoSQL databases to handle these type of data (unstructured and semi-structured). Also choosing a non-relational database, is a better option for faster writes and horizontal scaling.

Use-case 2 (Content Management & Product Catalog) : You develop a content management application, or you need to store product catalogs. The content often changes (flexible schema), using JSON documents to store these type of data is more efficient.


Real world companies - use cases

Company Database used Use Case Estimated Daily Active User (DAU) Sources
Stripe PostgreSQL + Redis PostgreSQL for transactional payment data (ACID compliance critical), Redis for caching & real-time fraud detection ~1.3 million active websites using Stripe globally https://redstagfulfillment.com/how-many-payments-stripe-process-per-day/
Notion PostgreSQL + Redis PostgreSQL for document + collaboration metadata storage, Redis for session caching and quick lookups ~20M monthly active users https://www.boringbusinessnerd.com/top-startups?
Canva PostgreSQL + Redis PostgreSQL for design assets and user accounts, Redis for fast retrieval of frequently accessed templates ~100M monthly active users https://www.boringbusinessnerd.com/top-startups?
Slack MySQL, PostgreSQL, Redis MySQL/Postgres for message history & team data, Redis for ephemeral data (presence, typing indicators) ~42–47M daily active users https://www.demandsage.com/slack-statistics/?

How can I prevent failures ?

Your database is by default in a healthy state, all operations (connections, queries) return a successfull response. But you may experience unexpected downtime due to failures .

dashboard image

Unavailable access to your data can have significant impact on your business activities, revenue and reputation. It's important to understand the most common root cause of database failures.

Replication Lag & Failover Issues

Schema Migrations Gone Wrong

Misconfigurations (DNS, Networking, Permissions)

Lack of Backups & Disaster Recovery Planning

Planning for these scenarios during the design phase is essential. There are 2 main metrics to implement in order to lower the risks if an incident happens :

Recovery Time Objective (RTO) : How long it takes to restore the database in case of failure ?

Recovery Point Objective (RPO) : How much data can you loose (the retention of your backups : minutes, days, months, years) ?

There are several database recovery strategies :

Database recovery strategies RTO (Recovery Time Objective) RPO (Recovery Point Objective) Cost
Full Backup (Daily/Weekly) Hours to days (restore from backup) Up to last backup (24h+ data loss possible) $ (Cheap)
dashboard image
Database recovery strategies RTO (Recovery Time Objective) RPO (Recovery Point Objective) Cost
Asynchronous Replication (Active-Passive) Seconds to minutes Seconds to minutes (small lag possible) $$ (higher cost)
dashboard image
Database recovery strategies RTO (Recovery Time Objective) RPO (Recovery Point Objective) Cost
Synchronous Replication (Active-Standby) Seconds to minutes (automatic failover) Zero (no data loss) $$$ (Very expensive)
dashboard image

Here are the recommended RTO and RPO per industry :

Type of workload Recommended RTO (Recovery Time Objective) Recommended RPO (Recovery Point Objective) Sources
Mission-critical (tier-1) applications ≤ 15 minutes “near-zero” https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-of-on-premises-applications-to-aws/recovery-objectives.html
Important but not mission critical (tier-2)” apps ≤ 4 hours ≤ 2 hours https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-of-on-premises-applications-to-aws/recovery-objectives.html
Less critical applications 8-24 hours 4 hours https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-of-on-premises-applications-to-aws/recovery-objectives.html

Real-world cases where lack of RPO/RTO strategies cost companies revenue


Company Incident Description Revenue / Impact Root Cause / Lessons Sources
Ma.gnolia (2009) Production database lost; backups incomplete/unrecoverable. Complete data loss; company shut down. No RPO/RTO strategy; backups not tested, no recovery plan. More details
Code Spaces (2014) Attackers deleted databases and backups on AWS; recovery impossible. 100% business loss; company shut down within days. No tested disaster recovery plan, no offsite/immutable backups (RPO/RTO failure). More details
Delta Airlines (2016) Power outage + backup system failure caused DB corruption; IT systems down. 2,300+ flights canceled; global operations disrupted for 3 days; ~$150 million loss. Disaster recovery didn’t meet RTO; systems took too long to restore. More details
British Airways (2017) Data center power issue led to DB corruption for check-in/reservation systems. 75,000 passengers stranded; days of disruption; £80+ million in costs. Lack of resilient recovery plan to meet business-critical RTO for databases. More details

How can I limit the access to the database ?

The database stores all kind of data, you probably want to limit who can have access to the database, especially if you store sensitive and critical information. Understanding who should have access is a first step, is it a service account or a human user ?

dashboard image

To enhance the access to the database, it might be a good practice to limit the network access, authorizing only requests from specific subnet (usually backend subnet), and use security group to authorize incoming traffic on the database target TCP port from restricted sources (IP address, group of servers, etc..)

dashboard image

Once the user or service account authenticated, it's important to limit the actions on the database (CREATE DATABASE, CREATE TABLE, UPDATE, ...). Best practices include assigning the least privileges (roles to perform only restricted operations on the database). Credentials can be stored in a secrets store (secrets manager, key vault, ..), with password rotation to enhance security.

dashboard image

Breaches due to credentials

Company Incident Description Data Compromised / Key Metrics Root Cause / Lessons Sources
Uber Attackers accessed Uber’s AWS database using credentials found in a GitHub repository. 57 million users and 600,000 drivers’ personal information. Storing credentials in version control; importance of secret management. More details
Capital One Attacker exploited a misconfigured AWS IAM role using stolen credentials to access cloud storage. 106 million customer records in the U.S. and Canada. Excessive permissions in IAM roles; need for credential rotation and monitoring. More details
Dropbox Stolen employee password (from a LinkedIn breach) was reused to access Dropbox employee accounts, leading to data exposure. 68 million user accounts affected. Highlights the dangers of password reuse and lack of MFA. More details

Do I need to secure the data in transit ?

It's a good practice to secure the traffic between your applications and the database. By enabling SSL mode in your database, you lower the risks, and prevent anonymous connections from intercepting or tampering the traffic sent to the database.

Also if you need to follow these regulations (GDPR, HIPAA, PCI DSS, SOX, HDS), it's required to enable encryption to secure the data in transit.

dashboard image

GDPR (General Data Protection Regulation) : A regulation by the EU that governs how organizations collect, process, store, and share personal data of individuals in the EU

HIPAA (Health Insurance Portability and Accountability Act) : sets standards for how healthcare organizations, insurers, and their business partners handle Protected Health Information (PHI) — any data that can identify a patient and relates to their health, treatment, or payment for care

PCI DSS (Payment Card Industry Data Security Standard) : a global security standard created to protect cardholder data (credit/debit card information) and reduce payment fraud. Applies to any organization that stores, processes, or transmits payment card data.

HDS (Hébergement de Données de Santé) : french regulation, It requires that healthcare-related personal data (patient records, medical history, test results, etc.) be stored and processed only by certified Health Data Hosts


Enabling SSL mode on your database does add CPU overhead for encryption and decryption. One option to prevent more load on your CPU is to use connection pooling.

Connection pooling is cache of open database connections that your application can reuse instead of creating a new one each time, reducing latency and CPU load.

Using benchmark to measure if your database performs better with or without connection pooling is recommended.


Breaches due to not protecting data in transit

Company Incident Description Data Compromised / Key Metrics Root Cause / Lessons Sources
Heartbleed (OpenSSL) The Heartbleed vulnerability in OpenSSL allowed attackers to read memory, exposing data that was assumed to be encrypted in transit. Usernames, passwords, session keys, and sensitive application traffic (Yahoo confirmed user credential leaks). Flawed TLS implementation; always patch crypto libraries and enforce TLS properly. More details
TJX / TJ Maxx Attackers exploited weak WEP encryption between store POS systems and backend databases to intercept card data. 94 million credit card numbers stolen. Outdated WEP protocol; always use strong encryption standards (TLS 1.2/1.3, WPA2+). More details
Equifax Attackers exploited unpatched Apache Struts, intercepting unencrypted traffic between apps and backend databases. ~147 million personal records including SSNs and financial data. Lack of TLS enforcement and unpatched software; highlights need for TLS and regular patching. More details
Marriott / Starwood Hotels Attackers monitored unencrypted internal traffic between reservation systems and databases for years. 500 million guest records (passport numbers, PII, payment data). Legacy systems lacked encryption in transit; internal traffic must also be encrypted. More details

How can I secure the database at rest ?

You've just secured the traffic sent to your database. But how about the data stored in your database ? It's a good practice to encrypt the data at rest, at the disk level, using key management system. Backups should also be encrypted to prevent unauthorized access to the data.

dashboard image

Breaches due to not protecting databases at rest

Company Incident Description Data Compromised / Key Metrics Root Cause / Lessons Sources
Health Net Lost unencrypted hard drives containing sensitive medical data. ~1.9 million patient records. No encryption on physical storage; highlights importance of database encryption at rest. More details
US Veterans Affairs Stolen laptop contained an unencrypted database of veterans’ information. ~26.5 million personal records. No encryption on endpoint databases; lesson: enforce full-disk and DB encryption. More details
Ashley Madison Hackers stole entire unencrypted database, with weak password storage and payment details. ~32 million user records leaked. Poor hashing/encryption; highlights need for strong cryptography at rest. More details
Desjardins Group Employee exfiltrated internal unencrypted databases containing customer information. ~4.2 million customer records. Lack of encryption at rest made insider theft possible; highlights need for DB encryption and monitoring. More details

What happens if I need to scale the database ?

You may have to handle more load on your database (traffic, queries, ..), and to prevent any performance degradation, you have some options to scale your database. Usually there are 2 scaling strategies :

Vertical Scaling

Horizental Scaling


Vertical scaling is the first option if you want to handle more traffic, it's easier to implement, you just have to add more resources (more cpu, memory or faster disk and bandwitdh).

dashboard image

Horizental scaling is a little bit more complex to implement. You need to add replicas node which handle only read-only requests. All the write queries goes to the primary DB, and the read queries to the replicas nodes to distribute the traffic.

dashboard image

There are several other options you can implement to handle more traffic

Load balancers or databaes proxies : helping you redistribute the traffic (connection pooling)

Autoscaling : cloud managed databases offer options to scale automatically your database.

Sharding and Partionning : For heavy workloads and traffic, you may need at some point to use sharding and partionning, the goal is to split your database in multiple smaller "datasets". It's usually quite complex to implement.


Use cases with scaling-related database outage/revenue loss

Company Incident Description Revenue / Impact Root Cause / Lessons Sources
Twitter (Fail Whale Era) Frequent outages due to MySQL write bottlenecks as user growth exploded. Millions in lost ad revenue and user churn. Vertical scaling limits reached; need for sharding, caching, and horizontal scaling. More details
Amazon Prime Day (2018) Surge of traffic overloaded internal databases and order systems. Estimated $72–100 million in lost sales during downtime. Peak load exceeded DB scaling capacity; need for proactive load testing and auto-scaling. More details
Roblox (2021) A database growth issue caused cascading failure across backend (MongoDB-based). 3 days offline; tens of millions in lost revenue + developer payouts delayed. Poor handling of DB growth; need for partitioning, replication, and distributed DB design. More details
Facebook (early scaling) Multiple outages due to MySQL bottlenecks and insufficient caching in early years. Millions in lost ad revenue during downtime. Single-node bottlenecks; drove adoption of sharding and memcached at massive scale. More details
GitHub (2012) Database cluster failover led to prolonged downtime; writes bottlenecked on a single master. Lost paid subscriber revenue and SLA penalties to enterprise customers. Weak HA setup; need for resilient replication and multi-master designs. More details
Ticketmaster (various) Ticket sales portals crashed during high-demand events due to backend DB bottlenecks. Hundreds of millions in lost ticket sales and refunds (e.g., Taylor Swift 2022). Insufficient scaling for sudden spikes; need elastic DB scaling and load balancing. More details

How can I monitor the activity on the database ?

Let's assume your application is running smoothly, everything works just fine. Users have access to every features of your application and can perform all operations. Then suddenly, you start getting emails from 1 frustated user, furious about their experiences, seems like some features didn't work fine, or there is a huge latency when performing some actions. But it doesn't end there, more and more users start to send emails with their bad experiences.

You're shocked as you didn't notice any sign of performance degradation, your application seems to work just fine. You then realize by investigating, that there are a lot of warnings, and errors in the logs.

This experience shows the importance of monitoring the activity of your infrastucture, to anticipate and prevent in advance your users/customers from having bad experiences.

It's then crucial to understand which metrics can help your understand what's happening on your database and if you should get any alerts as soon as some threshold are met.

Monitoring Objectives Description Core metrics to collect Alerts and thresholds
Availability & health

Is the database up and reachable ?

Examples of metrics on PostGreSQL : pg_up, pg_connections, pg_stat_activity_count

Examples with PostgreSQL (pg_up == 0)

Performance

Is there any latency to reach the database ?


Are the queries slow ?

Total queries/sec and transactions/sec


95th/99th percentile query latencies


Slow queries count

connections >= 90% of max_connections (Connections near max)


query running > 5–15 min (Long running query)


slow_queries / total_queries > 1% over 5m (Slow query rate)

Resources & capacity

Is the database server capacity overloaded ?


What is the status of the resources consumption (CPU, Memory, Disk usage, Disk latency, network I/O)

CPU


Memory


Disk usage, Disk Latency

CPU > 85% for 5m


Memory > 85% for 5m


Disk usage > 80%

Errors & Fault

What are the errors in the logs ?


This should indicate the failed queries, connection errors

Error Logs

Error rate (failed queries / total queries)

Security and Audits

Logins (successful & failed)


Privilege changes


Suspicious exports

failed logins


new user creation


privilege escalations


unusual export activity

failed login attempts spike (possible brute force)


Replication and Bakcups

Replication lag


Backup success


WAL/redo lag

last successful backup time


backup duration


restore test results

last_backup_time older than expected retention window


lag > 10s (or > 60s depending on app)


Use cases with revenue loss from not monitoring DB activity

Company Incident Description Revenue / Impact Root Cause / Lessons Sources
Target (2013) Attackers exfiltrated payment card data; suspicious DB queries went unnoticed. ~$162 million in costs (settlements, fines, legal). Lack of real-time monitoring for abnormal database reads and exports. More details
Sony Pictures (2014) Attackers had weeks of DB access; leaked unreleased films, scripts, employee data. Hundreds of millions in lost revenue and IP theft. No database query monitoring or large export detection. More details
Equifax (2017) Attackers exploited Struts flaw, accessed DBs, and stole 147M personal records. >$1.4 billion in breach costs + stock value drop. No alerts on abnormal DB access to sensitive PII. More details
Capital One (2019) Misconfigured firewall allowed DB access; 100M+ records exfiltrated without detection. $190 million in settlements + regulatory fines. No alerts for massive data pulls from databases. More details
Desjardins Group (2019) Insider exfiltrated data of 9.7M members over months without being flagged. Hundreds of millions in costs, lawsuits, and brand damage. Lack of insider threat monitoring and alerts for unusual exports. More details
Shopify (2020) Rogue employees accessed merchant DBs and stole customer transaction details. Reputational harm, legal expenses, and merchant trust loss. Inadequate monitoring of privileged employee queries on live DBs. More details

Are there any hidden costs I should worry about ?

During the design phase, you should define a cost model and forecasting plan for operating your infrastructure in the cloud, so you understand expected expenses and the trade-offs of architectural decisions

Your cloud cost architecture plan often includes the following :

Expected baseline costs (compute, storage, networking)

Your assumptions on the growth and if your infrastructure scales

The cost tradeoffs of the design decisions (should you implement a multi-AZ, a single-AZ, reserved vs on-demand, open-source vs commercial , etc ...)

There are always hidden costs while operating your cloud database :

Cost Category Use Cases Prevention Sources
Backup / Snapshot Storage & Retention

You enabled snapshots on your database, and you don't have a clear retention policies


There are copies of your snapshots cross-region

Snapshot storage are charged per GB/month, unecessary snapshots should be deleted


Set retention policies

More details
Data Transfer (Cross-AZ / Cross-Region)

You application read from database (read-replicas) in another AZs or in another region


There is traffic (egress) sent to Internet, We tend to underestimate the volume of data (egress/ingress).

Collect metrics and setting alerts to prevent the billing to increase unexpectedly.

More details
Idle / Non-Production Environments & Orphaned Resources

You run multiple environnements (staging, dev, production), but not all the environnements needs to run 24/7


Snapshots remain after a DB deletion


The database instance size absorb peaks but it's usually in idle (underutilized)

Turn off resources in non critical environments (automatic shutdown off of instances, etc..)


Right size the database instance type, benchmark if possible, optimize the queries if necessary

Monitoring, Logging, Metrics Retention

You enabled multiple metrics, and detailed logs (audit logs, ddl events, etc ...)


Track the volume of the logs ingested and the storage capacity costs (increase)



Use cases with revenue loss due to hidden or unexpected costs

Company Incident Description Revenue / Impact Root Cause / Lessons Sources
Pinterest (2017) AWS costs spiked during holiday season due to massive unoptimized database and compute usage. Margins squeezed by unplanned cloud spend during peak traffic. Lack of monitoring and cost controls on scaling workloads. More details
Adobe (2018) Large unoptimized AWS workloads ran longer than intended, driving up cloud bills. Tens of thousands in unplanned charges. Lack of governance and auto-shutdown policies for workloads. More details
Basecamp (2014) AWS S3 costs surged due to inefficient data storage and backups. Direct margin squeeze until re-architecture. No cost monitoring on object storage growth. More details
NASA JPL (2019) AWS credentials exposed, attackers used resources for crypto-mining. ~$30,000 in stolen compute costs. No monitoring of abnormal API usage or sudden billing spikes. More details
Snapchat (2017) Google Cloud costs spiraled as user activity surged faster than infra optimization. Losses of $2.2 billion in IPO year; cloud costs grew faster than revenue. Poor forecasting and inefficient DB queries increased hidden costs. More details
Multiple Orgs (2018–2021) Misconfigured open databases exploited for crypto-mining workloads. Tens of thousands in surprise compute bills. No alerts on abnormal workloads or sudden cloud billing spikes. More details

Learning from real world experiences

Databases breaches and incidents happen all the time, 100% protection does not exist. Prestigious companies and firms have experienced security breaches despite having implemented tight security controls, it's important to learn from their experiences to understand how you can prevent these scenarios or at least lower the risks or the gravity if you experience a breach.

The goal is not to implement the latest security tools and think you're out of reach, it's better to implement a progressive and continuous protection, re-assess continuously your security posture, improve the protection of the weakest link in your infrastructure.

Here you can find the top root causes of database breaches :

Misconfiguration (Open Databases, Wrong ACLs,..)

Credentials (weak passwords, reused credentials, no password rotation, no MFA)

Zero-days /Supply Chain attacks

Provider/Partner attacks


dashboard image

Below is the summary of recent database breaches grouped by root cause :

Root cause (ranked by frequency) Example of breach Year Impact (record/people) Why It happened Sources
1. Misconfiguration (Unprotected DBs / Poor Access Controls)

Chinese Surveillance DB

2025

~4 Billions

Database left exposed, no password protection

More details
1. Misconfiguration (Unprotected DBs / Poor Access Controls)

Microsoft/Apple/Google/PayPal DB leak

2025

184 Millions

Elasticsearch DB exposed without auth

More details
1. Misconfiguration (Unprotected DBs / Poor Access Controls)

Texas General Land Office

2025

44,485 people

Access control misconfig allowed cross-user data view

More details
2. Credential Compromise / Weak Authentication

23andMe

2023-25

Dozens of orgs; AT&T, Ticketmaster, Santander, etc.

Weak MFA/credential hygiene; compromised access

More details
2. Credential Compromise / Weak Authentication

Kering / Gucci–Balenciaga

2025

Millions of customers

Hacker group “Shiny Hunters” stole customer DB creds

More details
3. Software Vulnerabilities / Supply Chain Exploits

MOVEit Transfer Zero-day

2023

~93–100 Millions people, 2700+ orgs of customers

Zero-day exploited in widely used file transfer system

More details
3. Software Vulnerabilities / Supply Chain Exploits

Latitude Financial

2023

~14 Millions records

Exploit in provider’s systems; details not fully disclosed

More details
4. Direct Cyberattacks / Service Provider Compromise

Qantas Airways

2025

~5.7 Millions customers

Hackers gained access to customer DB

More details
4. Direct Cyberattacks / Service Provider Compromise

Miljodata

2025

~1.5 Million people

IT provider cyberattack leaked sensitive records

More details

Top 10 recurrent database production issues

Everything breaks all the time. Databases can break in production, it's critical to understand the main sources of these issues and anticipate any actions to prevent and reduce their impact and cost.

# Database Issue Real-World Example Business Impact Estimated Loss Root Cause Sources
1 Connection Leaks Barclays IT Outage (Jan 2025) – mainframe software bug impacted core banking access. Widespread service unavailability, frustrated customers. Up to £12.5M compensation Software bug, improper connection handling Yahoo Finance
2 Slow Queries / Missing Indexes Asana Outages (Feb 2025) – excessive logging caused DB and downstream service slowdowns. Multi-hour service disruption, productivity losses Millions in lost productivity Unoptimized queries and logging infrastructure overload LinkedIn
3 Deadlocks Delta Air Lines Outage (Jul 2024) – software update failure triggered system crashes. Flight delays, operational disruption $500M estimated losses Faulty software update, unhandled concurrent operations New York Post
4 Lock Contention Co-op Cyberattack (Apr 2025) – operational systems disrupted across 2,300 stores. Store outages, supply chain delays £275M lost revenue Cyberattack causing DB and system resource contention Acronis
5 Schema Changes in Production GitLab (2017) – schema migration failure led to downtime. Service unavailable, partial data loss 6 hours of customer data lost, reputational damage Uncontrolled migration, no rollback strategy GitLab Blog
6 Data Corruption Delta Air Lines Outage (Jul 2024) – software errors corrupted system data. Operational disruption, loss of customer trust $500M estimated losses Faulty update, unvalidated data integrity New York Post
7 Replication Lag Asana Outages (Feb 2025) – excessive logging caused replica delays. Inconsistent UX, service errors Millions in productivity lost Async replica lag under heavy logging load LinkedIn
8 Unsecured Databases Co-op Cyberattack (Apr 2025) – exposed internal DBs led to operational disruptions. Data theft risk, system downtime £275M lost revenue Lack of authentication, open access points Acronis
9 Backup & Restore Failures GitLab (2017) – untested backups caused permanent data loss during outage. Loss of customer trust, reputational damage Priceless, indirect revenue loss Unverified backup processes, no RPO/RTO drills GitLab Blog
10 Capacity & Scaling Issues Co-op Cyberattack (Apr 2025) – DB overload caused operational outages. Store closures, supply chain disruption £275M lost revenue No auto-scaling, underestimated peak load Acronis