Our client, a high-growth transaction-processing platform, saw transaction volume surge after their Series B funding round. Their monthly AWS infrastructure spend grew even faster, rising from $85k/month to over $190k/month in less than two quarters. With the next fiscal cycle approaching, the management team needed to realign gross margins without delaying the product development roadmap.
The primary issue was that engineering velocity had taken priority over structural efficiency. Their infrastructure consisted of dozens of AWS accounts, over-provisioned Amazon Aurora databases, and custom ECS tasks running at an average of 4% CPU utilization. They were also incurring large, unexplained cross-availability-zone (cross-AZ) network charges that accounted for nearly 30% of their total EC2 bill.
We began the project with a read-only audit of their entire AWS ecosystem. By tracing database connection pools, monitoring routing paths, and parsing AWS Cost and Usage Reports (CUR) using custom query models, we surfaced the following structural issues:
Rather than relying on automated optimization tools that only suggest instance resizing, we designed an architecture modification plan focused on structural alignment:
We configured ECS service discovery and target groups to route traffic locally within the same availability zone using node-local attributes. Cross-AZ traffic was restricted to failover conditions only. This eliminated 92% of their internal network transfer fees.
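One way to approximate this routing in Terraform is sketched below. It assumes an Application Load Balancer in front of an ECS service on the EC2 launch type; the resource names, port, task count, and input variables are illustrative and not taken from the client's codebase. Disabling cross-zone load balancing on the target group keeps each load balancer node forwarding to targets in its own zone, and spreading tasks on the built-in `ecs.availability-zone` attribute helps ensure every zone has local targets.

```hcl
# Hypothetical inputs; supply values from your own environment.
variable "vpc_id" { type = string }
variable "cluster_arn" { type = string }
variable "task_definition_arn" { type = string }

resource "aws_lb_target_group" "app" {
  name     = "app-az-local" # illustrative name
  port     = 8080           # illustrative port
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  # Each load balancer node forwards only to targets in its own AZ.
  load_balancing_cross_zone_enabled = "false"
}

resource "aws_ecs_service" "app" {
  name            = "app"
  cluster         = var.cluster_arn
  task_definition = var.task_definition_arn
  desired_count   = 6 # illustrative; a multiple of the AZ count spreads evenly

  # Spread tasks across AZs so every zone has local targets.
  ordered_placement_strategy {
    type  = "spread"
    field = "attribute:ecs.availability-zone"
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.app.arn
    container_name   = "app" # must match a container in the task definition
    container_port   = 8080
  }
}
```

This sketch covers only the steady-state path; the failover behavior described above depends on health-check and DNS settings that are not shown here.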
We replaced the single large RDS instance with a cluster built around a smaller `db.r6g.2xlarge` writer instance. To handle read traffic, we added two read replicas with auto-scaling rules. This configuration absorbed write spikes through database connection pooling (PgBouncer) rather than expensive vertical scaling.
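The replica auto-scaling rules could look roughly like the following, assuming the readers belong to an Aurora cluster managed through Application Auto Scaling. The variable name, capacity bounds, CPU target, and cooldowns are placeholder values, not the client's settings.

```hcl
# Hypothetical input: the identifier of the existing Aurora cluster.
variable "cluster_identifier" { type = string }

resource "aws_appautoscaling_target" "readers" {
  service_namespace  = "rds"
  scalable_dimension = "rds:cluster:ReadReplicaCount"
  resource_id        = "cluster:${var.cluster_identifier}"
  min_capacity       = 2 # the two replicas described above
  max_capacity       = 4 # assumed ceiling
}

resource "aws_appautoscaling_policy" "readers_cpu" {
  name               = "reader-cpu-target-tracking"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.readers.service_namespace
  scalable_dimension = aws_appautoscaling_target.readers.scalable_dimension
  resource_id        = aws_appautoscaling_target.readers.resource_id

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      # Average CPU across the cluster's reader instances.
      predefined_metric_type = "RDSReaderAverageCPUUtilization"
    }

    target_value       = 60  # assumed CPU target, in percent
    scale_in_cooldown  = 300 # seconds
    scale_out_cooldown = 300 # seconds
  }
}
```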
We wrote modular Terraform code to rebuild and synchronize their environments, enforcing lifecycle rules on snapshots and storage attachments.
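    resource "aws_db_instance" "aurora" {
      instance_class = "db.r6g.2xlarge"

      allocated_storage = 100
      storage_type      = "gp3"
      # iops is omitted: at 100 GiB, gp3 storage already includes a 3,000 IOPS
      # baseline, and RDS does not accept a custom value below its per-engine
      # storage size threshold.

      deletion_protection = true # blocks deletion through the RDS API

      lifecycle {
        prevent_destroy = true # Terraform rejects any plan that would destroy this resource
      }
    }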
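As an illustration of the snapshot lifecycle rules mentioned above, the sketch below uses Amazon Data Lifecycle Manager to snapshot tagged EBS volumes daily and retain 14 copies. It assumes the snapshots in question are EBS snapshots and that an IAM role for DLM already exists; the tag key, schedule, retention count, and variable name are hypothetical.

```hcl
# Hypothetical input: ARN of an existing IAM role that DLM can assume.
variable "dlm_role_arn" { type = string }

resource "aws_dlm_lifecycle_policy" "daily_snapshots" {
  description        = "Daily EBS snapshots with bounded retention"
  execution_role_arn = var.dlm_role_arn
  state              = "ENABLED"

  policy_details {
    resource_types = ["VOLUME"]

    # Only volumes carrying this tag are snapshotted (assumed tag key).
    target_tags = {
      Snapshot = "true"
    }

    schedule {
      name = "daily-14-day-retention"

      create_rule {
        interval      = 24
        interval_unit = "HOURS"
        times         = ["23:45"] # UTC
      }

      # Older snapshots are deleted once 14 exist.
      retain_rule {
        count = 14
      }

      copy_tags = true
    }
  }
}
```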
Within six weeks of finalizing the deployment changes, the client's monthly cloud spend fell from $192,400 to $119,288, a 38% reduction. Transaction latencies remained steady, and the average query processing time decreased as a result of the database connection optimization.
While the project delivered the targeted financial outcomes, the migration was not without friction. Moving read workloads to the new regional replica pool initially caused transient replication lag spikes during daily batch updates, which in turn caused out-of-order write checks inside the billing engine. We resolved this by changing the application's database pool settings to direct verification logic exclusively to the primary writer instance and to use replicas only for non-critical reads. If we were running this audit again, we would stage replication-lag load profiles before downsizing the database.
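The writer/reader split can also be made explicit in infrastructure code. The sketch below assumes an Aurora cluster and publishes its two endpoints as SSM parameters, so that the application can point verification logic at the writer and non-critical reads at the reader endpoint. The parameter paths and variable name are hypothetical, and the routing itself still lives in the application's pool configuration.

```hcl
# Hypothetical input: the identifier of the existing Aurora cluster.
variable "cluster_identifier" { type = string }

data "aws_rds_cluster" "billing" {
  cluster_identifier = var.cluster_identifier
}

# Writer endpoint: always resolves to the primary instance.
resource "aws_ssm_parameter" "db_writer_endpoint" {
  name  = "/billing/db/writer_endpoint" # assumed parameter path
  type  = "String"
  value = data.aws_rds_cluster.billing.endpoint
}

# Reader endpoint: load-balances across replicas, which may lag the writer.
resource "aws_ssm_parameter" "db_reader_endpoint" {
  name  = "/billing/db/reader_endpoint" # assumed parameter path
  type  = "String"
  value = data.aws_rds_cluster.billing.reader_endpoint
}
```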