One of our clients, Skill Lync, a renowned edtech company, faced a hefty monthly storage bill of around $2000 for a single RDS Aurora PostgreSQL instance. They approached us to help optimize their storage costs.
Step 1: Initial Database Size Analysis
First, we checked the total database size using the pg_database_size
command:
SELECT pg_size_pretty(pg_database_size('db_name'));
We discovered the database size was approximately 1100 GB.
Step 2: Identifying High-Storage Tables
Next, we identified tables consuming significant storage space with the following query:
SELECT table_schema,
table_name,
pg_size_pretty(pg_total_relation_size(table_schema || '.' || table_name)) AS total_size
FROM information_schema.tables
ORDER BY pg_total_relation_size(table_schema || '.' || table_name) DESC;
We found a table with nearly a billion rows occupying around 600 GB of storage. Most of this data was 4-5 years old and no longer in use. To optimize, we moved this data to Amazon S3 using AWS DMS for potential future access.
After this migration, the database size dropped to around 420 GB. However, the daily cost remained high at approximately $63, translating to $1890 per month. This indicated around 8 TB of data usage at $0.248 per GB-month, since we were using an Aurora I/O optimized instance.
Step 3: Investigating Extra Storage Usage
Using Google and ChatGPT, we discovered we could monitor storage usage in Aurora using the VolumeBytesUsed
metric in CloudWatch, which confirmed our suspicions of excessive storage usage.
After a lot of investigation, we found out that the Write-Ahead Log (WAL) size was approximately 7200 GB. We checked this using:
SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/00000000')) AS wal_size;
Step 4: Addressing Unused Replication Slots
We found that unused replication slots in PostgreSQL can cause WAL records to grow indefinitely, which was the issue in our case.
We identified and deleted these unused replication slots using:
SELECT slot_name, active, restart_lsn
FROM pg_replication_slots
WHERE active = false;
SELECT pg_drop_replication_slot('slot_name');
Conclusion
After removing the unused replication slots, the CloudWatch metrics and billing dropped significantly. We confirmed the reduction in costs using AWS Cost Explorer, grouping by resource ID.
By implementing these optimizations, we were able to reduce Skill Lync's RDS Aurora storage costs by over 90%, showcasing our ability to deliver significant cost savings through meticulous analysis and targeted actions.
About Boopesh Mahendran
Boopesh is one of the Co-Founders of CyberMind Works and the Head of Engineering. An alum of Madras Institute of Technology with a rich professional background, he has previously worked at Adobe and Amazon. His expertise drives the innovative solutions at CyberMind Works.