Case Study: AWS Cost Optimization

The Situation

Our client, a high-growth transaction-processing platform, saw volume surge after its Series B funding round. AWS infrastructure spend grew even faster than transaction volume, rising from $85k/month to over $190k/month in less than two quarters. The management team was preparing for the next fiscal cycle and needed to restore gross margins without delaying the product development roadmap.

The primary issue was that engineering velocity had taken priority over structural efficiency. The infrastructure consisted of dozens of AWS accounts, over-provisioned Amazon Aurora databases, and custom ECS tasks running at an average of 4% CPU utilization. The client was also incurring large, unexplained cross-availability-zone (cross-AZ) data transfer charges that accounted for nearly 30% of the total EC2 bill.

What We Found

We began the project with a read-only audit of their entire AWS ecosystem. By tracing database connection pools, monitoring routing paths, and parsing AWS Cost and Usage Reports (CUR) using custom query models, we surfaced the following structural issues:

Cross-AZ Egress: Services deployed inside their ECS cluster were communicating across availability zones without considering network locality, incurring cross-AZ data transfer fees on every internal request that crossed a zone boundary.
Aurora Over-Provisioning: Their primary Aurora Postgres cluster ran on a large `db.r6g.8xlarge` instance sized for rare write spikes, yet average database CPU utilization rarely crossed 12%.
Unattached Volumes and Orphaned Snapshots: Years of rapid deployments had left hundreds of terabytes of unattached EBS volumes and abandoned manual RDS snapshots that were billing silently.

+--------------------------------------------------------------+
|                    AWS REGION (eu-west-1)                    |
|                                                              |
|   +-----------------------+      +-----------------------+   |
|   |        AZ - A         |      |        AZ - B         |   |
|   |                       |      |                       |   |
|   |   +---------------+   |      |   +---------------+   |   |
|   |   |  ECS Service  |<======X=====>|  ECS Service  |   |   |
|   |   |     Local     |   |Cross |   |    Remote     |   |   |
|   |   +---------------+   |  AZ  |   +---------------+   |   |
|   |                       |Egress|                       |   |
|   |   +---------------+   |      |   +---------------+   |   |
|   |   | Aurora Primary|   |      |   | Aurora Replica|   |   |
|   |   |    (Active)   |   |      |   |    (Active)   |   |   |
|   |   +---------------+   |      |   +---------------+   |   |
|   +-----------------------+      +-----------------------+   |
+--------------------------------------------------------------+
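As a rough illustration of how cross-AZ charges can be surfaced from a Cost and Usage Report, the sketch below totals EC2 line items by usage type and computes the share attributable to inter-AZ transfer. It assumes a CSV export with the standard `lineItem/ProductCode`, `lineItem/UsageType`, and `lineItem/UnblendedCost` columns, and that inter-AZ traffic appears under usage types ending in `DataTransfer-Regional-Bytes`. The sample rows and function names are made up for illustration; they are not the query models used in the engagement.

```python
import csv
import io
from collections import defaultdict

# Hypothetical CUR extract; real reports carry many more columns and rows.
SAMPLE_CUR = """lineItem/ProductCode,lineItem/UsageType,lineItem/UnblendedCost
AmazonEC2,EU-DataTransfer-Regional-Bytes,412.50
AmazonEC2,EU-BoxUsage:m5.xlarge,980.00
AmazonEC2,EU-DataTransfer-Regional-Bytes,137.50
AmazonRDS,EU-InstanceUsage:db.r6g.8xl,1500.00
"""

# Assumption: inter-AZ transfer is billed under usage types with this suffix.
CROSS_AZ_SUFFIX = "DataTransfer-Regional-Bytes"


def cost_by_usage_type(cur_csv: str, product: str) -> dict[str, float]:
    """Sum unblended cost per usage type for a single product code."""
    totals: dict[str, float] = defaultdict(float)
    for row in csv.DictReader(io.StringIO(cur_csv)):
        if row["lineItem/ProductCode"] == product:
            totals[row["lineItem/UsageType"]] += float(row["lineItem/UnblendedCost"])
    return dict(totals)


def cross_az_share(cur_csv: str, product: str = "AmazonEC2") -> float:
    """Return the fraction of a product's cost that is cross-AZ transfer."""
    totals = cost_by_usage_type(cur_csv, product)
    total = sum(totals.values())
    cross_az = sum(
        cost for usage, cost in totals.items() if usage.endswith(CROSS_AZ_SUFFIX)
    )
    return cross_az / total if total else 0.0


if __name__ == "__main__":
    print(f"Cross-AZ share of EC2 spend: {cross_az_share(SAMPLE_CUR):.1%}")
```

On a real report the same grouping would more likely run as a SQL query over the report data than in application code; the logic is the same either way.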

What We Did

Rather than relying on automated optimization tools that only suggest instance resizing, we designed a set of architectural changes that addressed the structural causes identified in the audit:

1. Network Locality Routing

We configured ECS service discovery and target groups to route traffic locally within the same availability zone using node-local attributes. Cross-AZ traffic was restricted to failover conditions only. This eliminated 92% of their internal network transfer fees.
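The routing policy itself is simple to state: prefer healthy targets in the caller's own zone, and cross zones only when none are available. The sketch below expresses that rule as client-side selection logic. It illustrates the policy only, not the ECS service discovery or target group configuration used in the project; the `Endpoint` type, addresses, and function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    address: str
    zone: str
    healthy: bool = True


def pick_endpoints(endpoints: list[Endpoint], local_zone: str) -> list[Endpoint]:
    """Prefer healthy same-zone endpoints; fall back to any healthy endpoint."""
    healthy = [e for e in endpoints if e.healthy]
    local = [e for e in healthy if e.zone == local_zone]
    # Cross-AZ targets are returned only when the local zone has no healthy target.
    return local if local else healthy


if __name__ == "__main__":
    fleet = [
        Endpoint("10.0.1.10", "eu-west-1a"),
        Endpoint("10.0.2.10", "eu-west-1b"),
    ]
    print(pick_endpoints(fleet, "eu-west-1a"))
```

Keeping the fallback explicit preserves availability during a zone-level failure while ensuring the common path never pays for cross-AZ transfer.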

2. Aurora Postgres Scale-Down and Replica Placement

We replaced the single large instance with a cluster running on a smaller `db.r6g.2xlarge` writer instance. To handle read traffic, we added two read replicas with auto-scaling rules. Write spikes were absorbed through database connection pooling (PgBouncer) rather than expensive vertical scaling.

3. Terraform Consolidation

We wrote modular Terraform code to rebuild and synchronize their environments, enforcing lifecycle rules on snapshots and storage attachments.

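# Aurora is modeled as a cluster plus cluster instances; storage is managed
# by Aurora, so allocated_storage, storage_type, and iops do not apply.
resource "aws_rds_cluster" "aurora" {
  engine              = "aurora-postgresql"
  deletion_protection = true
  # Credentials, networking, and backup settings omitted for brevity.

  lifecycle {
    prevent_destroy = true
  }
}

resource "aws_rds_cluster_instance" "writer" {
  cluster_identifier = aws_rds_cluster.aurora.id
  engine             = aws_rds_cluster.aurora.engine
  instance_class     = "db.r6g.2xlarge"
}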
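Before lifecycle rules can be enforced, the existing orphaned storage has to be inventoried. The sketch below shows one way to flag unattached EBS volumes from records shaped like the entries in an EC2 DescribeVolumes response (`VolumeId`, `State`, `Size`, `Attachments`), where an unattached volume reports the state `available`. The sample records and helper names are hypothetical, and fetching the data from AWS is deliberately left out.

```python
def unattached_volumes(volumes: list[dict]) -> list[dict]:
    """Return volumes that are in the 'available' state with no attachments."""
    return [
        v for v in volumes
        if v.get("State") == "available" and not v.get("Attachments")
    ]


def total_gib(volumes: list[dict]) -> int:
    """Sum volume sizes (EC2 reports Size in GiB)."""
    return sum(v["Size"] for v in volumes)


# Hypothetical records for illustration only.
SAMPLE_VOLUMES = [
    {"VolumeId": "vol-aaa", "State": "in-use", "Size": 100,
     "Attachments": [{"InstanceId": "i-123"}]},
    {"VolumeId": "vol-bbb", "State": "available", "Size": 500, "Attachments": []},
    {"VolumeId": "vol-ccc", "State": "available", "Size": 250, "Attachments": []},
]

if __name__ == "__main__":
    orphans = unattached_volumes(SAMPLE_VOLUMES)
    print([v["VolumeId"] for v in orphans], total_gib(orphans), "GiB")
```

A similar age-based filter could be applied to manual snapshots, with the retention window chosen to match the client's recovery requirements.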

The Outcomes

Within six weeks of finalizing the deployment changes, the client's monthly cloud spend fell from $192,400 to $119,288, a 38% reduction. Transaction latencies remained steady, and average query processing time decreased as a result of the database connection optimization.
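For readers who want to check the headline figure, the arithmetic can be reproduced directly from the two monthly totals above. The variable names are illustrative.

```python
# Monthly spend figures from the case study, in USD.
spend_before = 192_400
spend_after = 119_288

monthly_savings = spend_before - spend_after
reduction_pct = monthly_savings / spend_before * 100

if __name__ == "__main__":
    print(f"Monthly savings: ${monthly_savings:,}")
    print(f"Reduction: {reduction_pct:.1f}%")
```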

Engineering Reflection

While the project delivered the targeted financial outcomes, the migration was not without friction. Moving read workloads to the new replica pool initially caused transient replication lag spikes during daily batch updates. Because of that lag, the billing engine's write-verification checks sometimes read stale data and saw writes out of order. We resolved this by reconfiguring the application's database pool settings to send verification logic exclusively to the primary writer instance, using replicas only for non-critical reads. If we were to run this project again, we would load-test replication lag under realistic batch profiles before downsizing the database.
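The fix amounts to a routing rule in the application's data access layer: anything that writes, or that must observe the latest committed write, goes to the writer, and everything else may use a replica. A minimal sketch of that rule follows, assuming the application can label each query with a consistency requirement. The `Consistency` enum, connection strings, and function name are hypothetical and do not reflect the client's actual pool configuration.

```python
from enum import Enum


class Consistency(Enum):
    STRONG = "strong"      # must observe the latest committed write
    EVENTUAL = "eventual"  # can tolerate replication lag


# Hypothetical connection strings for the writer and reader endpoints.
WRITER_DSN = "postgresql://writer.example.internal:5432/app"
READER_DSN = "postgresql://reader.example.internal:5432/app"


def choose_dsn(is_write: bool, consistency: Consistency) -> str:
    """Route writes and lag-sensitive reads to the writer; the rest to replicas."""
    if is_write or consistency is Consistency.STRONG:
        return WRITER_DSN
    return READER_DSN


if __name__ == "__main__":
    # A billing verification read must not see stale data.
    print(choose_dsn(is_write=False, consistency=Consistency.STRONG))
    # A dashboard query can tolerate lag.
    print(choose_dsn(is_write=False, consistency=Consistency.EVENTUAL))
```

Defaulting unlabeled queries to `STRONG` would be the safer choice in practice, since a misrouted read then costs writer capacity rather than correctness.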

Looking to scale efficiently?

We work with B2B SaaS and technical firms that demand high engineering standards. Let's discuss your cloud setup.

Start a conversation →

How a Series B fintech cut AWS spend by 38% in 6 weeks