Reducing AWS Aurora Cost Without Sacrificing Performance

Content Overview

1. Abstract

2. Introduction

2.1 Instance resizing

2.2 Moving READ traffic to reader instances

2.3 Tuning queries to reduce database load and resource utilization

2.4 Using database reserved instances

2.5 Database Savings Plan

3. The Right Strategy

1. Abstract

As organizations scale their cloud infrastructure, database and storage costs often grow unchecked, and left unattended, this growth becomes exponential. What begins as a manageable monthly bill can compound rapidly as data volumes expand, traffic patterns evolve, and provisioned resources sit underutilized, quietly accumulating charges in the background. Without deliberate intervention, AWS database costs can spiral far beyond what the underlying business growth would justify.

This paper examines the mechanisms behind runaway database costs on AWS and presents a structured set of strategies to bring them under control. We explore right-sizing database instances to align compute and memory allocation with actual workload demands, and discuss traffic-based scaling techniques that shift load to smaller, more cost-efficient instances during off-peak periods. We then turn to query tuning as a lever for reducing I/O operations and overall database utilization, often one of the highest-impact and most overlooked optimizations available.

Beyond compute, we address storage bloat directly, demonstrating how offloading infrequently accessed data to Amazon S3 can dramatically reduce the storage footprint of live databases. A leaner database not only translates to lower storage costs but also simplifies schema management, where routine table maintenance, index additions, and structural changes become faster and less resource-intensive when operating on a smaller, well-scoped dataset. Furthermore, reducing the volume of active data has a compounding effect on database performance, as query planners operate more efficiently, buffer pools are utilized more effectively, and overall I/O pressure is significantly reduced.

Finally, we cover financial commitment strategies, including Reserved Instances and AWS Database Savings Plans, which can yield substantial discounts for predictable workloads.

Together, these approaches form a comprehensive framework for database cost governance on AWS, enabling engineering and finance teams to achieve meaningful savings without sacrificing performance or reliability.

Keywords: AWS Cost Optimization, Amazon RDS, Amazon Aurora, Cloud, Cost Management, FinOps, Cloud Financial Governance, Cloud Economics, Cost Efficiency

2. Introduction

Databases are one of those things teams set up, get working, and rarely revisit. There is always something more urgent to ship, and touching a production database feels risky. So they sit there, growing quietly, accumulating data, and generating an AWS bill that creeps up month after month. By the time someone stops to look at the numbers, the cost has already drifted well past what the actual business growth would justify.

The good news is that bringing it back under control is more straightforward than it seems. Through our own experience optimizing database infrastructure on AWS, we found that a handful of focused strategies (right-sizing instances, tuning queries, trimming storage, and leveraging savings plans) can make a significant dent without requiring a massive engineering effort or putting production at risk. This paper walks through what worked for us and how you can apply the same approach to your own setup.

2.1 Instance resizing

Most of the time, considering the criticality of databases, instance sizes are allocated generously to ensure stability and avoid any risk of performance degradation. As a result, many instances end up running well below their capacity for extended periods, silently burning through budget without delivering proportional value. This makes instance right-sizing one of the lowest-hanging fruits in any database cost optimization effort. In many cases, a single right-sizing exercise can drop your database compute cost by 50% or more, making it one of the highest-return optimizations you can make with relatively little effort.

The key is to let the data guide your decisions. If your instance traffic is predictable, CPU utilization is consistently below 20%, and DML latencies (INSERT, UPDATE, and DELETE) in Aurora MySQL, or read and write latencies in Aurora PostgreSQL, are comfortably under 5ms, that is a strong signal that your instance has more horsepower than your workload actually needs. In such cases, you can confidently cut the instance size in half. For example, if you are currently running on an r6g.4xlarge, stepping down to an r6g.2xlarge is a reasonable and low-risk move.

The important thing is not to rush the process. After resizing, monitor the instance closely for a few days, keeping a close eye on latency trends, CPU headroom, and any changes in query performance. If the metrics remain healthy and the workload continues to run smoothly, you can consider taking another step down. This incremental approach gives you the confidence to optimize aggressively while still maintaining a safety net at every stage.
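Those thresholds can be expressed as a small go/no-go check. The helper below is a minimal sketch; the function name and the 20% / 5ms cutoffs simply mirror the guidance above and are not an official AWS heuristic:

```shell
# Sketch of the downsizing heuristic: average CPU under 20% and
# worst-case DML latency under 5ms suggest the instance is oversized.
can_downsize () {
  local cpu_avg="$1" dml_latency_ms="$2"
  awk -v c="$cpu_avg" -v l="$dml_latency_ms" 'BEGIN { exit !(c < 20 && l < 5) }'
}

if can_downsize 14.2 3.1; then
  echo "Metrics look safe: step down one size (e.g. r6g.4xlarge -> r6g.2xlarge)"
else
  echo "Keep the current size and re-check after tuning"
fi
```

Feed it the averages from the monitoring window rather than point-in-time values, so a short idle period does not masquerade as sustained headroom.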

Here is a script that collects query throughput, CPU utilization, and latency metrics for every Aurora MySQL instance in a region over the past week:

#!/usr/bin/env bash
set -euo pipefail

# ===== CONFIG =====
REGION="us-west-1"        # <-- change to your region
PERIOD=600                # 10-minute datapoints (7d window -> 1008 points, under 1440 limit)
# ===================

# --- Cross-platform date handling (Linux/macOS) ---
if date --version >/dev/null 2>&1; then
  END=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
  START=$(date -u -d "-7 days" +"%Y-%m-%dT%H:%M:%SZ")
else
  END=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
  START=$(date -u -v-7d +"%Y-%m-%dT%H:%M:%SZ")
fi
STAMP=$(date -u +%Y%m%dT%H%M%SZ)
CSV_FILE="aurora_mysql_weekly_summary_${REGION}_${STAMP}.csv"

# ---- helpers ----
metric_max () {
  local instance_id="$1" metric="$2"
  aws cloudwatch get-metric-statistics \
    --region "$REGION" \
    --namespace "AWS/RDS" \
    --metric-name "$metric" \
    --dimensions Name=DBInstanceIdentifier,Value="$instance_id" \
    --start-time "$START" \
    --end-time "$END" \
    --period "$PERIOD" \
    --statistics Maximum \
    --output json \
  | jq -r '.Datapoints | if length==0 then "N/A" else (max_by(.Maximum).Maximum|tostring) end'
}

metric_avg () {
  local instance_id="$1" metric="$2"
  aws cloudwatch get-metric-statistics \
    --region "$REGION" \
    --namespace "AWS/RDS" \
    --metric-name "$metric" \
    --dimensions Name=DBInstanceIdentifier,Value="$instance_id" \
    --start-time "$START" \
    --end-time "$END" \
    --period "$PERIOD" \
    --statistics Average \
    --output json \
  | jq -r '
      if (.Datapoints|length)==0 then "N/A"
      else (([.Datapoints[].Average] | add / length) | tostring)
      end'
}

metric_min () {
  local instance_id="$1" metric="$2"
  aws cloudwatch get-metric-statistics \
    --region "$REGION" \
    --namespace "AWS/RDS" \
    --metric-name "$metric" \
    --dimensions Name=DBInstanceIdentifier,Value="$instance_id" \
    --start-time "$START" \
    --end-time "$END" \
    --period "$PERIOD" \
    --statistics Minimum \
    --output json \
  | jq -r '.Datapoints | if length==0 then "N/A" else (min_by(.Minimum).Minimum|tostring) end'
}
# ------------------

# Get Aurora MySQL instances only
INSTANCES=$(aws rds describe-db-instances --region "$REGION" \
  --query "DBInstances[?contains(Engine, 'aurora-mysql')].DBInstanceIdentifier" \
  --output text | tr '\t' '\n' | sed '/^$/d')

if [ -z "$INSTANCES" ]; then
  echo "No Aurora MySQL instances found in region: $REGION"
  exit 0
fi

# CSV header (one row per instance)
echo "instance_id,instance_class,Queries_max,CPUUtilization_max,CPUUtilization_avg,InsertLatency_max,UpdateLatency_max,DeleteLatency_max,CommitLatency_max,FreeableMemory_min" > "$CSV_FILE"

for IID in $INSTANCES; do
  echo "Processing: $IID"
  CLASS=$(aws rds describe-db-instances \
    --region "$REGION" \
    --db-instance-identifier "$IID" \
    --query 'DBInstances[0].DBInstanceClass' \
    --output text 2>/dev/null || echo "UNKNOWN")
  QRY=$(metric_max "$IID" "Queries")
  CPU_MAX=$(metric_max "$IID" "CPUUtilization")
  CPU_AVG=$(metric_avg "$IID" "CPUUtilization")
  INS=$(metric_max "$IID" "InsertLatency")
  UPD=$(metric_max "$IID" "UpdateLatency")
  DEL=$(metric_max "$IID" "DeleteLatency")
  COM=$(metric_max "$IID" "CommitLatency")
  FRE=$(metric_min "$IID" "FreeableMemory")
  # emit one CSV row per instance
  printf "%s,%s,%s,%s,%s,%s,%s,%s,%s,%s\n" \
    "$IID" "$CLASS" "$QRY" "$CPU_MAX" "$CPU_AVG" "$INS" "$UPD" "$DEL" "$COM" "$FRE" >> "$CSV_FILE"
done

echo "Wrote CSV: $CSV_FILE"
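Once the CSV is written, downsizing candidates can be flagged with a quick filter. This sketch assumes the column layout emitted above (CPUUtilization_avg is column 5) and reuses the script's $CSV_FILE variable; the 20% cutoff mirrors the guidance in this section:

```shell
# Flag instances whose 7-day average CPU stayed below 20%: candidates
# for stepping down one instance size.
awk -F',' 'NR > 1 && $5 != "N/A" && $5+0 < 20 {
  printf "%s (%s): avg CPU %.1f%% -> consider one size down\n", $1, $2, $5
}' "$CSV_FILE"
```

Treat the output as a shortlist to investigate, not an automatic action list; cross-check the latency and FreeableMemory columns before resizing anything.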

2.2 Moving READ traffic to reader instances

A pattern we frequently observe is that all application traffic gets routed to the writer instance, while reader instances sit largely idle, reserved only for failover. This is a significant waste of provisioned resources and one of the more overlooked inefficiencies in Aurora cluster configurations. The replication lag between an Aurora writer and its reader instances is typically under 20ms, which means read traffic can be safely redirected to reader instances without any meaningful impact on application latency. Queries that are latency-sensitive and require sub-millisecond to 2ms response times can remain on the writer instance, but the vast majority of read queries can be offloaded to readers without the application ever noticing a difference.

In our own experience, this rebalancing exercise delivered remarkable results. We had a large Aurora cluster serving over 40 applications, with approximately 55,000 queries per minute hitting the writer instance and only 10,000 queries per minute on the reader. The cluster was running on an r6g.24xlarge instance, and the traffic imbalance was putting unnecessary pressure on the writer while leaving the reader largely underutilized. Given the number of application teams involved, redirecting traffic was not a trivial task. It required reaching out to each team, understanding their latency requirements, and carefully determining which workloads could tolerate the reader’s replication lag. For teams that did not have strict sub-2ms latency needs, we migrated their read traffic to the reader instance.

The outcome exceeded our expectations. We successfully moved 25,000 queries per minute to the reader, resulting in a much healthier traffic distribution of 30,000 queries on the writer and 35,000 on the reader. CPU utilization on the writer dropped from 30% to 17%, while the reader rose from 5% to 18%, reflecting a far more balanced and efficient use of the cluster’s capacity. With both instances now operating well within comfortable utilization thresholds, we were able to right-size the cluster from an r6g.24xlarge down to an r6g.12xlarge, cutting the cost of the entire database cluster by 50%.

2.3 Tuning queries to reduce database load and resource utilization

Slow queries in a busy production database are far more damaging than they appear on the surface. Beyond the obvious consequences like lock contention, connection pool exhaustion, and elevated CPU and memory pressure, they carry a cost burden that often goes unnoticed until it shows up as a painful line item on the AWS bill. The most deceptive part is how slow queries distort resource utilization metrics. When teams look at CPU and I/O numbers to decide whether to right-size an instance, artificially inflated metrics make the database appear far busier than it actually is. This leads to a false conclusion that the instance is running at capacity, leaving cost savings on the table that query tuning alone could have unlocked.

The impact of fixing even a single slow query can be remarkable. A query taking 3 seconds that gets optimized with the right indexes and join structure can come down to 30ms, which is a healthy execution time for any busy production database. That is a 100x reduction in execution time, and the ripple effect across the database is immediate. CPU pressure drops, memory frees up, locks are held for a fraction of the time, and the instance suddenly has far more headroom than the metrics previously suggested.

Now imagine that effect multiplied across several poorly performing queries. In most production databases, it is rarely just one culprit. A handful of slow queries collectively pushing resource utilization into a range that seems to justify a larger instance class is more common than people realize. Tune those queries, and the utilization numbers tell a very different story. What looked like an instance running near its limit now has room to spare, and right-sizing becomes not just possible but obvious, delivering meaningful and lasting cost savings.

Here are the commands to find long-running queries in both MySQL and PostgreSQL:

MySQL / Aurora MySQL

SELECT id, user, host, db, command, time, state, info AS query
FROM information_schema.processlist
WHERE command != 'Sleep' AND time > 5
ORDER BY time DESC;

This shows all queries running for more than 5 seconds. You can adjust the time > 5 threshold to whatever suits your needs. The time column is in seconds.

For more detailed visibility including transaction and lock information:

SELECT p.id, p.user, p.host, p.db, p.time, p.state, p.info AS query,
       t.trx_started, t.trx_state, t.trx_rows_locked, t.trx_rows_modified
FROM information_schema.processlist p
LEFT JOIN information_schema.innodb_trx t ON p.id = t.trx_mysql_thread_id
WHERE p.command != 'Sleep' AND p.time > 5
ORDER BY p.time DESC;

PostgreSQL / Aurora PostgreSQL

SELECT pid, usename AS username, application_name, client_addr, state,
       wait_event_type, wait_event, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state != 'idle' AND query_start IS NOT NULL
  AND now() - query_start > interval '5 seconds'
ORDER BY duration DESC;

For blocking and lock related information alongside long running queries:

SELECT pid, usename AS username, application_name, state, wait_event_type,
       wait_event, now() - query_start AS duration,
       left(query, 200) AS query_snippet, pg_blocking_pids(pid) AS blocked_by
FROM pg_stat_activity
WHERE state != 'idle' AND now() - query_start > interval '5 seconds'
ORDER BY duration DESC;

2.4 Using database reserved instances

Reserved Instances make the most sense when your workload is stable and predictable, and the longer you commit, the more rewarding the discount becomes. A 1-year Standard Reserved Instance typically saves you around 40% to 42% over On-Demand pricing, while committing to a 3-year term can push those savings up to 60% to 62%, making the longer commitment a compelling option for mature, stable workloads that are unlikely to change significantly over time.

Before committing, you should be confident about your instance type and size, because switching to a different instance family mid-term does not automatically transfer your reservation. However, AWS does offer Convertible Reserved Instances that give you the flexibility to change instance types during the term, though at a slightly lower discount of around 30% to 54% depending on the term and payment option.

One important and often overlooked advantage of Reserved Instances is the flexibility in how the reservation is applied across instance sizes within the same family. For example, if you have purchased 4 reserved r6g.8xlarge instances, AWS normalizes these reservations and can apply them across equivalent smaller instances. This means those 4 reservations can effectively cover 16 r6g.2xlarge instances, giving you the freedom to right-size your fleet while still taking full advantage of your reserved pricing commitment.
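The normalization arithmetic behind that example is easy to verify. Each size within a family carries a normalization factor that scales linearly with size (xlarge = 8 units, 2xlarge = 16, 8xlarge = 64, and so on); a quick sketch:

```shell
# RDS Reserved Instances are size-flexible within a family: coverage is
# tracked in normalization units, which scale linearly with size.
reserved_count=4
units_per_8xlarge=64
units_per_2xlarge=16

total_units=$((reserved_count * units_per_8xlarge))    # 4 x 64 = 256
covered_2xlarge=$((total_units / units_per_2xlarge))   # 256 / 16 = 16

echo "${reserved_count} x r6g.8xlarge RIs = ${total_units} units, covering ${covered_2xlarge} r6g.2xlarge instances"
```

This is why right-sizing before or after purchase does not strand the reservation, as long as you stay within the same instance family and region.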

2.5 Database Savings Plan

In addition to Reserved Instances, AWS offers Database Savings Plans for Aurora, which work similarly but offer more flexibility. Rather than committing to a specific instance type, you commit to a fixed hourly spend on Aurora usage, and AWS applies the discount automatically across any Aurora instance type or size. This is particularly useful for teams that are actively right-sizing their clusters, as the savings plan adapts to changes in your instance configuration without requiring a new reservation.
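Before committing, it helps to translate the hourly figure into the monthly bill you are locking in. A quick sketch with hypothetical numbers (the $12/hour commitment is illustrative, not a recommendation):

```shell
# A Savings Plan commitment is an hourly spend applied 24x7, so multiply
# by ~730 hours to see the monthly spend you are committing to.
hourly_commit=12        # USD per hour (hypothetical)
hours_per_month=730
monthly_commit=$((hourly_commit * hours_per_month))

echo "A \$${hourly_commit}/hour commitment locks in ~\$${monthly_commit}/month of Aurora spend"
```

Set the commitment at or slightly below your post-right-sizing baseline usage, so the discount applies to spend you are confident will recur.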

3. The Right Strategy

The smartest approach is to first complete your right-sizing and query tuning exercises before committing to any reservation. Locking in a Reserved Instance on an oversized instance defeats the purpose of cost optimization. Right-size first, stabilize your workload, then commit to a reservation on your optimized configuration. This way, you get the full benefit of both operational efficiency and pricing discounts, stacking your savings in the most effective way possible.
