I Saved a Client $5K/Year on AWS in 2 Hours



Jay Long

Software Engineer & Founder

Published March 31, 2026

The Backlog That Never Made Sense to Touch

I just saved one of my clients over $5,000 a year on their AWS bill. In two hours. And it's all stuff I've been thinking about doing for a long time but never had the confidence or the time to plan and execute on. I want to break down exactly how I did it, and why it took AI claws to make it possible.

Quick note on that term: Andrej Karpathy has been referring to these proactive, autonomous AI agents as "claws": agentic teams that go out and find things to improve, work with you to plan and build, and operate with initiative. That's how I think about my setup now.

I've had ideas for ways to optimize this client's AWS costs for a long time. Sometimes I'll make small incremental improvements, but the overall rising cloud costs end up making those unnoticeable. Like everything in the economy, cloud costs generally inflate over time. If you've got a client consistently in the thousands on their AWS bill and you save them under $100, when you factor in their data lake growing, their traffic growing, cloud pricing creeping up — you're not really justifying your cost for the time it took. Mathematically, you're not earning your keep.

Why Not Just Walk Away?

I've actually been tempted to bow out of this engagement altogether. But two things keep me on the fence. First, every once in a while there is a breakthrough moment. You're chipping away at the system, and eventually you find something that collectively earns your keep. Those dozens of dollars saved per month do add up to hundreds over a year. You can make the case that without you there, it would be a lot worse. But then you factor in your own cost, and it offsets the savings to some degree.

Second, there are response events. Spikes you have to address. Availability issues. The client also likes having me around because when they need a big migration, I'm there. A lot of what we've been doing is moving their front end out of ECS and into Vercel, which I've covered in detail before. If you're running a standard public-facing site, it usually is best to outsource your front-end infrastructure to Vercel. The only serious case for staying on AWS is at truly massive scale, like millions of daily users, where rolling your own edge caches and image optimization starts to pencil out. At thousands of users, Vercel's economies of scale still save you money. Even at hundreds of thousands, they can handle it. It's just that at some point their pricing strategically funnels you away, so they can better serve startups, bootstrapped founders, and cost-conscious teams. Companies with hundreds of thousands or millions of visitors should be more than capable of hiring an engineer to deploy a self-managed cloud stack.

The Redshift Problem: Paying 24/7 for a 5-Minute Job

One of the first things I did was look at their Redshift service. They have this custom data collection and analysis engine that ingests data, performs some analysis, and outputs something usable for the customer. I looked at the resource metrics on that thing, and you could set your watch by these little spikes. It was so obvious what was happening. A script was firing off on a schedule, once a day and then once a month. We had records going back years showing this thing has never taken more than five minutes to run. I found the EventBridge scheduler task that fires it off at a specific time. And we were paying for that cluster to be available 24/7.
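
Finding that trigger is a quick pass over the EventBridge API. Here's a minimal sketch in Python with boto3; the helper that filters for scheduled rules is pure, and nothing about the actual rule names is assumed:

```python
"""Sketch: find EventBridge rules that fire on a schedule, i.e. the
likely source of clockwork metric spikes. Requires boto3 and AWS
credentials only when fetch_rules() is actually called."""


def scheduled_rules(rules):
    """Filter a list of EventBridge rule dicts down to the scheduled ones.

    Rules triggered by events have no ScheduleExpression key, so its
    presence is the tell for a cron- or rate-driven job.
    """
    return [r for r in rules if r.get("ScheduleExpression")]


def fetch_rules():
    """Pull every rule in the default event bus, following pagination."""
    import boto3  # imported lazily so scheduled_rules stays testable offline

    client = boto3.client("events")
    rules, token = [], None
    while True:
        kwargs = {"NextToken": token} if token else {}
        page = client.list_rules(**kwargs)
        rules.extend(page["Rules"])
        token = page.get("NextToken")
        if not token:
            break
    return scheduled_rules(rules)


# Usage: for r in fetch_rules(): print(r["Name"], r["ScheduleExpression"])
```

One caveat: newer schedules may live in the separate EventBridge Scheduler service rather than classic rules, so this sketch only covers the classic `list_rules` API.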

I always thought it would be a huge cost savings to automate that thing to scale in and out. But first I had to coordinate with the team, who are notoriously hard to reach. Everything is poorly documented. The people I took over from did not leave good documentation on the infrastructure.

What Happens When the Original Builders Are Gone?

This is actually a really common situation, and it touches on a lot of things that are changing right now. The team who built this part of the system was an outside contractor who specializes in this kind of thing. They come in, get to know your team just enough to stand a viable thing up, your budget is limited, not everybody makes every meeting, and then they go off. It seems fine until it's not. Something goes wrong, and you've got to hire another contractor because those people have moved on. But you've got this critical piece of technology your organization depends on.

I'm the person who comes in after the fact, who happens to know a lot about cloud and has a background in software engineering. I've met other engineers who hit this same wall. Even with the most probing investigative mentality, you keep trying to find new people to get in touch with. And with a company like this, it's not an enterprise where most people are still around. A lot of these people have moved on permanently. They're not in contact. There may even be bad blood. You end up at dead ends.

So what do you do when you sense potential for cost optimization but it's like pulling teeth to get people to respond? When you can't be sure there isn't some internal person somewhere who uses this tool sometimes, in a very awkward time zone, and you might be sound asleep for hours before you get the notification?

CloudTrail, Logs, and Building Confidence

I want to be clear that none of this is groundbreaking innovation. I've always known about CloudTrail. I often use CloudTrail. This is where I think most of the wins are going to come from with AI. If they stopped improving the models right now, reasonably savvy IT professionals and engineers could still stitch together so many clever tools and systems around the current level of intelligence. The intelligence is increasing so fast that we can't even keep up with what's possible at the current level. And then by the time you've created tools around that level, the model gets smarter and that skill you just developed is actually baked into the model now. That's how fast things are moving.

It's a bunch of obvious things. Let's start capturing the resource metric data and storing it somewhere. Let's start capturing CloudTrail events. Let's start capturing request logs. Between resource metrics, request logs, CloudTrail, and then comparing that with git history, I can say with confidence that nothing is happening other than that three-to-five-minute job once a day. The cluster just sits there for 23 hours doing absolutely nothing and we're paying for it.
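
As a sketch of that confidence-building step, here's roughly how you could confirm from CloudWatch that a cluster's CPU only wakes up in one daily window. The 10% threshold and hourly granularity are assumptions to tune against your own idle baseline:

```python
"""Sketch: confirm that Redshift cluster activity is confined to one
short daily window, from CPU datapoints pulled out of CloudWatch."""
from datetime import datetime

BUSY_CPU_PERCENT = 10.0  # assumed: anything above idle noise for this cluster


def busy_hours(datapoints, threshold=BUSY_CPU_PERCENT):
    """Return the sorted UTC hours of day in which CPU ever exceeded threshold.

    datapoints: iterable of (datetime, cpu_percent) pairs. A result like
    [6] means everything interesting happens in the 06:00 UTC hour.
    """
    return sorted({ts.hour for ts, cpu in datapoints if cpu > threshold})


def fetch_cpu_datapoints(cluster_id, days=30):
    """Hourly average CPU for a Redshift cluster (needs boto3 + credentials)."""
    import boto3  # lazy import keeps busy_hours testable offline
    from datetime import timedelta, timezone

    cw = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)
    resp = cw.get_metric_statistics(
        Namespace="AWS/Redshift",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "ClusterIdentifier", "Value": cluster_id}],
        StartTime=end - timedelta(days=days),
        EndTime=end,
        Period=3600,
        Statistics=["Average"],
    )
    return [(dp["Timestamp"], dp["Average"]) for dp in resp["Datapoints"]]
```

The same shape of check works for the CloudTrail and request-log sides: aggregate a month of data, look for activity outside the known window, and treat anything unexpected as a reason to keep asking around.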

So I automated a script that shuts it off after the spike and fires it back up 30 minutes before the next one. That's hundreds of dollars in savings on their Redshift bill alone.
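
The script itself can be as small as a Lambda handler sitting behind two EventBridge schedules, one just after the daily spike and one 30 minutes before the next. This is a hypothetical sketch, not my client's actual code; the cluster identifier and the `{"action": ...}` event shape are made up for illustration:

```python
"""Sketch: pause a Redshift cluster after its daily job, resume it
shortly before the next run. Assumes each EventBridge schedule passes a
constant JSON input like {"action": "pause"} or {"action": "resume"}."""

CLUSTER_ID = "analytics-cluster"  # hypothetical cluster identifier


def choose_call(action):
    """Map the event's action to the boto3 Redshift method name."""
    calls = {"pause": "pause_cluster", "resume": "resume_cluster"}
    if action not in calls:
        raise ValueError(f"unknown action: {action!r}")
    return calls[action]


def handler(event, context=None):
    """Lambda entry point: pause or resume the cluster per the event."""
    import boto3  # lazy import keeps choose_call testable offline

    redshift = boto3.client("redshift")
    method = choose_call(event["action"])
    getattr(redshift, method)(ClusterIdentifier=CLUSTER_ID)
```

Note that `pause_cluster`/`resume_cluster` apply to provisioned clusters (Redshift Serverless bills differently in the first place), and a resumed cluster takes several minutes to become available, which is why the resume fires well ahead of the job.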

The Read Replica Nobody Was Reading

Same approach with the database. CloudTrail, query logs, request logs. There's a read replica that's just not being used at all. The queries probably should be split between reader and writer, but honestly all it is is a GraphQL backend running locally on the network. It's not something that needs to scale. It's serverless Aurora, so Amazon already owns the failover responsibility. The only purpose for a read replica in this situation would be if you're periodically performing heavy data read operations for reporting or data science, and anything like that is happening in Redshift anyway. So I shut the replica off.
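
The same kind of check, sketched for the replica: pull a month of the `DatabaseConnections` metric off CloudWatch and confirm it's flat zero. The instance identifier is hypothetical, and requiring literally zero connections is a deliberately strict idleness test:

```python
"""Sketch: decide whether an Aurora read replica is actually unused by
looking at its DatabaseConnections metric over a month."""


def replica_is_idle(connection_datapoints):
    """True if the replica never saw a single connection in the sample.

    connection_datapoints: iterable of (timestamp, max_connections) pairs.
    """
    return all(count == 0 for _ts, count in connection_datapoints)


def fetch_connection_datapoints(instance_id, days=30):
    """Hourly max DatabaseConnections for an RDS instance (needs boto3)."""
    import boto3  # lazy import keeps replica_is_idle testable offline
    from datetime import datetime, timedelta, timezone

    cw = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)
    resp = cw.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="DatabaseConnections",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        StartTime=end - timedelta(days=days),
        EndTime=end,
        Period=3600,
        Statistics=["Maximum"],
    )
    return [(dp["Timestamp"], dp["Maximum"]) for dp in resp["Datapoints"]]
```

Cross-referencing this against the database's own query logs is what turns "looks idle" into "safe to shut off."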

Between Redshift scheduling and the Aurora replica, in about two to three hours, I managed to save them over $5,000 a year.

Reinvesting the Savings Instead of Just Being Cheap

What I've decided to do is take that extra headroom and instead of just trying to be as cheap as possible, look at what they're doing and understand the data science portion that's probably been underserviced. The people who stood some of this stuff up have not been involved in the organization in years. What were they doing years ago that I can quickly and effectively learn enough to improve?

Between my rapid education track, which is AI-supplemented and guided, plugged into the community for authority signals and validated sources, plus building custom skills based on the exact system I'm working on, it's never been more possible for me to revive these projects that were taken to a viable state and then abandoned in terms of development and maintenance. I'm taking the money I'm saving and investing it back into their system.

The practical justification: we may actually be able to replace some of what they're doing with an AI-powered agentic team running in a simple ECS task. We already have an ECS service powering a lot of web services, so what's one more task? This data science pipeline was built before anyone was using ChatGPT. What if they're running heavy Python scripts on expensive AWS infrastructure that can actually be done better with modern AI models for almost no money?

Fresh Terraform, an Observability Dashboard, and What Comes Next

Their Terraform configs were horribly outdated, running an older version locked into a Docker build for long-term support. Smart at the time, but it made it real easy to never update anything. It was so far behind that it just made sense to start a new repository and build out fresh modules with the latest version of Terraform for everything I touched. Before, I'd put this work down for a month or two because I'm busy with other client work and don't want to be padding hours. Now I can point my agents at it and actually make progress.

I also built an observability dashboard so that all of the things we're tracking are in one convenient place. That was a non-starter before. How was I going to justify 10 or 20 hours? How often would it be used? And now, because you can just do things, I was collecting all this data and said, "Hey, while we're at it, why don't we present it in a graphical dashboard?" Five minutes later, done. Really good. Really useful. Super quick and simple and easy to update.

I should probably firm this whole approach up into a repeatable tool. In every service I touched, it's the same pattern: put a CloudTrail on it, make sure logging is good, give it a month, come back and look at resource metrics, request logs, CloudTrail events. Find out who's accessing what and when. This also has powerful implications for security. If you can identify when people are using what, you can use that to inform your IAM roles and limit access to the hours when they actually need it. Principle of least privilege. A lot of compliance requirements are built around gating resources to certain times of the day.

The point is, now startups can do this. Bootstrapped three-person teams can hire me and my agents to come in and clean up their AWS positioning. I can now do things that only the top fraction of cloud architects were capable of doing before. And it's just stringing simple principles together with confidence. Ideas just spring forth now. Anything you previously thought about doing but couldn't justify, it's time to reevaluate, because you might be able to utter it into existence with your voice. Which is crazy to say. But that's where we are.
