Top 11 Mistakes You're Making in AWS

AWS is complex.

Here are the 11 most common mistakes and pitfalls I've seen (and done!) in AWS.

Ignoring CPU Credits

If you're not aware of (or not paying attention to) your CPU credits, you might be in for a bad time.

AWS's most affordable servers are the T series (T2, T3, T4g, etc). The reason they're cheaper is that they work on a CPU credit system.

You think your t3.small has 2 vCPUs? You actually get 20% of that. Any usage over your baseline (20% for a t3.small) eats away at your CPU credits. Once your credits are gone, you're stuck at that baseline (20% of 2 vCPUs). You accrue more credits when you're at or below your baseline, up to a cap (576 credits for that t3.small - one credit buys one vCPU at 100% for one minute).

This is very important if you're cramming a lot of work onto your EC2 server, such as your code and a database.

You can monitor your credit usage within CloudWatch in your AWS account, or within the EC2 console using the Monitoring tab for a given server instance.
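
If you prefer a script to the dashboard, here's a rough boto3 sketch of pulling the same credit balance metric (the instance ID is a placeholder):

    import datetime

    import boto3  # assumes boto3 is installed and AWS credentials are configured

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    # EC2 publishes CPUCreditBalance for T-series instances.
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUCreditBalance",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
        StartTime=datetime.datetime.utcnow() - datetime.timedelta(hours=6),
        EndTime=datetime.datetime.utcnow(),
        Period=300,  # 5-minute data points
        Statistics=["Average"],
    )

    for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], round(point["Average"], 1))

If the balance trends toward zero, you're about to hit the baseline (or start paying for Unlimited Mode, covered next).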

T3 and T4g instances come with a feature called "Unlimited Mode", which is enabled by default. If you have no CPU credits remaining, your CPU is allowed to go above the baseline at additional cost. That cost can make it worth moving up to an M-type instance, which doesn't use the CPU credit system.
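
If you'd rather cap the spend than the CPU, you can flip an instance to "standard" mode so it can never bill for surplus credits. A minimal boto3 sketch (the instance ID is a placeholder):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # "standard" = never pay for bursting past the baseline;
    # "unlimited" = allow bursting at additional cost (the default for T3/T4g).
    ec2.modify_instance_credit_specification(
        InstanceCreditSpecifications=[
            {"InstanceId": "i-0123456789abcdef0", "CpuCredits": "standard"},
        ]
    )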

More information on the T-series CPU Credits here.

Ignoring IOPS

IOPS (I/O operations per second) are a bit sneakier than CPU credits. Performance issues caused by hitting disk volume (EBS) limits are not usually obvious.

Your EC2 servers have volumes attached to them. These EBS (Elastic Block Store) volumes are network-attached storage, and they have limits.

You're likely using either gp2 or gp3 EBS volumes. Despite the newer gp3 volumes being (generally) better, they are not yet the default volume type.
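
If you're still on gp2, migrating is a single API call and happens in place. A minimal boto3 sketch (the volume ID is a placeholder):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Convert a gp2 volume to gp3 in place; the volume stays attached and usable.
    # gp3 gives you a 3,000 IOPS / 125 MB/s baseline regardless of volume size.
    ec2.modify_volume(
        VolumeId="vol-0123456789abcdef0",
        VolumeType="gp3",
    )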

No matter which type you use, they'll have a maximum number of IOPS and a maximum throughput (MB/s) allowed.

The tl;dr is: Watch out for IOPS burst balance and Volume Queue Length within your CloudWatch metrics. There's a lot more detail on what you need to know about IOPS here.
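
Here's a rough boto3 sketch of checking both metrics for a volume (the volume ID is a placeholder; note that BurstBalance only exists for burstable volume types like gp2):

    import datetime

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    def latest(metric, volume_id):
        """Fetch the most recent datapoint for an EBS metric, if any."""
        points = cloudwatch.get_metric_statistics(
            Namespace="AWS/EBS",
            MetricName=metric,
            Dimensions=[{"Name": "VolumeId", "Value": volume_id}],
            StartTime=datetime.datetime.utcnow() - datetime.timedelta(hours=1),
            EndTime=datetime.datetime.utcnow(),
            Period=300,
            Statistics=["Average"],
        )["Datapoints"]
        if not points:
            return None
        return max(points, key=lambda p: p["Timestamp"])["Average"]

    vol = "vol-0123456789abcdef0"
    print("BurstBalance %:", latest("BurstBalance", vol))
    print("VolumeQueueLength:", latest("VolumeQueueLength", vol))

A BurstBalance heading toward 0% or a persistently high queue length means the volume, not your code, may be your bottleneck.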

Not Using Cheaper Servers

Have you ever wondered what the t3a servers are, and why they're cheaper? It's tempting to assume cheaper means "worse" (that they don't perform as well), but all it really means is that they consume less power to run.

Since they consume less power, they don't cost AWS as much, and they are therefore cheaper to us mere mortals. For most of us, using these cheaper flavors is pure savings.

AMD Servers

Take the "a" class of servers (e.g. t3a, m6a) - these are just AMD (instead of Intel) CPUs. Their performance is quite close to Intel's, but they're cheaper. For most web applications, I'd be shocked if you noticed a difference in performance.

You can find more information about AWS's server naming conventions to finally learn what all those letters mean - for example, what a C6gn is.

Graviton Servers

Graviton servers have a g in their name (e.g. t4g). These are ARM-based and are typically cheaper than both the Intel and AMD CPU-based servers.

They're also quite performant, making them my go-to server type. Note that your software needs to be compiled for the ARM64 CPU architecture. I've yet to find something I've needed that isn't readily available for this server type (much thanks to Apple using ARM CPUs for making that more prevalent).
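
If you want to see what's on offer, you can ask EC2 for every ARM64 instance type available in a region. A quick boto3 sketch:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # List every instance type in this region whose CPU is ARM64 (Graviton).
    paginator = ec2.get_paginator("describe_instance_types")
    pages = paginator.paginate(
        Filters=[
            {"Name": "processor-info.supported-architecture", "Values": ["arm64"]}
        ]
    )
    arm_types = sorted(
        itype["InstanceType"] for page in pages for itype in page["InstanceTypes"]
    )
    print("\n".join(arm_types))  # e.g. t4g.micro, m6g.large, c7g.xlarge, ...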

Which to use?

For both AMD and ARM instances, there are more benefits than the price tag. In practice, the Intel-based instance types are the most common. Moving workloads to AMD- or ARM-based servers means you'll be running on less-used physical hardware. This reduces cases of CPU steal and problems caused by older, overworked physical hosts.

I use Graviton instances in almost all cases. You should consider it too, although be aware of whether you need software that's not compatible (I'd be surprised if you did). Note that Graviton 3 instances (e.g. m7g) might be more expensive than Graviton 2 (e.g. m6g), in one of the rare cases where the newer generation of servers is more expensive. If you find that to be the case, stick with Graviton 2.

Spending Too Much on Disk Drives

EBS volumes are attached to your EC2 instances. It's tempting to make these large because "storage is cheap", but guess what? It's not.

At $0.08/GB-month, it's easy to rack up a large bill with EBS volumes you've forgotten to clean up or over-sized "just in case". One terabyte is $80 per month.

Here's what to do - simply check your EBS dashboard:

  1. If you attach extra volumes for storage, they may not get deleted when you delete an EC2 instance. They're just hanging around costing you money! (See the sketch after this list for one way to find them.)
  2. Don't over-size your EBS volumes. You can always resize them (larger) later - usually on the fly (although there's a small performance penalty while they resize).
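
Here's that promised sketch - a small boto3 script that lists unattached ("available") volumes and roughly estimates what they cost (the $0.08/GB-month rate is an assumption; actual pricing varies by region and volume type):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # "available" means the volume is not attached to any instance -
    # i.e. it's likely an orphan that's still billing you every month.
    volumes = ec2.describe_volumes(
        Filters=[{"Name": "status", "Values": ["available"]}]
    )["Volumes"]

    total_gb = 0
    for v in volumes:
        total_gb += v["Size"]
        print(v["VolumeId"], v["VolumeType"], f'{v["Size"]} GB', v["CreateTime"].date())

    print(f"~${total_gb * 0.08:.2f}/month in unattached volumes")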

Only Using the Default VPC

The default VPC exists to reduce friction when you're new to AWS and want to create some servers.

It's set up in a way that's free (that's good!) but may not be optimal. For example, you can't spin up private-network-only instances in the default VPC. Its subnets are all set up to assign public IP addresses to your servers, which costs you money (IPv4 usage) and is less secure.

You can use security groups to lock the servers down - that actually might be the cheaper option. But it's not the recommended approach.

Creating your own VPC lets you create private-network subnets (servers that can connect to the internet, but the internet cannot connect to them), giving you a much stronger security posture.

You can also plan out your VPC usage for your organization, which is an often-skipped step that can result in future issues if ignored.

The trade-off is that using private subnets involves creating one (or more) NAT gateways, which are not free. These are what let instances in your (IPv4) private subnets reach out to the outside internet (e.g. when you apt-get update).
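
To make the moving parts concrete, here's a stripped-down boto3 sketch of a custom VPC with a private subnet routed through a NAT gateway. CIDRs are examples, and a real setup also needs an internet gateway and a public route table - in practice you'd build all of this with the IaC tools discussed below:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # A custom VPC with one public and one private subnet.
    vpc_id = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]["VpcId"]
    public = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.0.0/24")["Subnet"]
    private = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.1.0/24")["Subnet"]

    # The NAT gateway lives in the public subnet and needs an Elastic IP (not free!).
    eip = ec2.allocate_address(Domain="vpc")
    nat = ec2.create_nat_gateway(
        SubnetId=public["SubnetId"], AllocationId=eip["AllocationId"]
    )["NatGateway"]
    ec2.get_waiter("nat_gateway_available").wait(NatGatewayIds=[nat["NatGatewayId"]])

    # Send the private subnet's outbound traffic through the NAT gateway.
    rt = ec2.create_route_table(VpcId=vpc_id)["RouteTable"]
    ec2.create_route(
        RouteTableId=rt["RouteTableId"],
        DestinationCidrBlock="0.0.0.0/0",
        NatGatewayId=nat["NatGatewayId"],
    )
    ec2.associate_route_table(
        RouteTableId=rt["RouteTableId"], SubnetId=private["SubnetId"]
    )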

Not Setting Billing Alarms

This may not prevent surprise bills, but it can stop them from spiraling to crazy places.

There are a few things you can do here:

  1. Billing alarms
  2. Cost anomaly detection
  3. Budgets

Billing Alarms

The most important thing is to create billing alarms (instructions in that link).

For dormant accounts, create an alarm for $1 as your first alarm, so you know if something in there starts creating charges.

For any type of account, go crazy making alarms at all sorts of different spend levels. Make sure to create a bunch of alarms that go higher than your usual spend!

Note that billing alarms are up to 24 hours behind your spend, as that's how often the CloudWatch Total Estimated Charge metric (EstimatedCharges) is updated. Your bill and estimated spend are not "live" numbers!
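
Here's a hedged boto3 sketch of one such alarm. The SNS topic ARN is a placeholder (subscribe your email to it first), and you'll need "Receive Billing Alerts" enabled in your billing preferences - the metric only exists in us-east-1:

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    cloudwatch.put_metric_alarm(
        AlarmName="billing-over-50-usd",
        Namespace="AWS/Billing",
        MetricName="EstimatedCharges",
        Dimensions=[{"Name": "Currency", "Value": "USD"}],
        Statistic="Maximum",
        Period=21600,  # the metric updates slowly; 6-hour periods are plenty
        EvaluationPeriods=1,
        Threshold=50.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],
    )

Repeat at a handful of thresholds ($10, $50, $100, ...) so an alarm fires at each step of a runaway bill.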

Budgets

You can create budgets from within the Billing and Cost Management section of AWS as well. They have handy templates, such as getting an alert when you spend more than $0.01, or a monthly cost budget.

There are more options there as well, such as monitoring your Savings Plans spending. Otherwise, budgets are fairly similar to billing alarms.
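
Budgets can be scripted too. A minimal sketch (the account ID, amount, and email are placeholders):

    import boto3

    budgets = boto3.client("budgets", region_name="us-east-1")

    # A monthly cost budget that emails when actual spend passes 80% of $100.
    budgets.create_budget(
        AccountId="123456789012",
        Budget={
            "BudgetName": "monthly-100-usd",
            "BudgetType": "COST",
            "TimeUnit": "MONTHLY",
            "BudgetLimit": {"Amount": "100", "Unit": "USD"},
        },
        NotificationsWithSubscribers=[
            {
                "Notification": {
                    "NotificationType": "ACTUAL",
                    "ComparisonOperator": "GREATER_THAN",
                    "Threshold": 80.0,
                    "ThresholdType": "PERCENTAGE",
                },
                "Subscribers": [
                    {"SubscriptionType": "EMAIL", "Address": "you@example.com"}
                ],
            }
        ],
    )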

Cost Anomaly Detection

You should go turn this on in all cases. Essentially you just create a "Cost Monitor" and let it monitor AWS Services (the one you should create first), a linked account, a cost category, or cost allocation tags.

It will then monitor your account and let you know (over time - it needs some usage data to be useful!) if any service (or whatever you're monitoring) is spending more than usual. You'll sometimes see very low-usage services come up, even if they just spent $0.17 when they usually spend $0.01. But you'll like this when it detects bigger shifts in spending!

Just like Billing Alarms, this data is not real time - there is a delay.
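
If you'd rather script it, here's a rough sketch using the Cost Explorer API (the email is a placeholder, and the simple Threshold field is an assumption on my part - newer setups may prefer a threshold expression):

    import boto3

    # Cost Explorer APIs are global; us-east-1 is the conventional endpoint.
    ce = boto3.client("ce", region_name="us-east-1")

    # A per-service monitor - the one the article suggests creating first.
    monitor_arn = ce.create_anomaly_monitor(
        AnomalyMonitor={
            "MonitorName": "all-services",
            "MonitorType": "DIMENSIONAL",
            "MonitorDimension": "SERVICE",
        }
    )["MonitorArn"]

    # Email a daily digest of anomalies with at least ~$10 of impact.
    ce.create_anomaly_subscription(
        AnomalySubscription={
            "SubscriptionName": "daily-anomaly-digest",
            "MonitorArnList": [monitor_arn],
            "Subscribers": [{"Type": "EMAIL", "Address": "you@example.com"}],
            "Frequency": "DAILY",
            "Threshold": 10.0,
        }
    )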

Losing Track of Resources

You should be using IaC (Infrastructure as Code) to create resources in your AWS accounts. There's too much going on in AWS to trust to memory. Jumping from service to service while creating resources and adjusting configurations is a recipe for disaster.

This isn't "just for professionals" - you should be doing this because it's so easy to create things you're not aware of! The web console hides a bunch of details from you in the name of "ease". Yes, it's more work (up front!), and yes it forces you to learn some details the web console may hide from you. It's still worth it.

Here's how to get started with Terraform. AWS has its own thing - CloudFormation - but there's also CDK, which lets you use a programming language to output configuration for things like CloudFormation and Terraform. Another popular choice is Pulumi.
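
To give you a taste, here's a hypothetical Pulumi program (in Python, like the other sketches here) declaring a single tagged Graviton server - the AMI ID is a placeholder, and you'd run it with pulumi up:

    import pulumi_aws as aws

    # One tagged t4g.micro web server. Pick an ARM64 AMI for your region.
    server = aws.ec2.Instance(
        "web-server",
        ami="ami-0123456789abcdef0",
        instance_type="t4g.micro",
        tags={
            "Environment": "staging",
            "ManagedBy": "Pulumi",
        },
    )

The point isn't the syntax - it's that this file is now the record of what exists, reviewable and repeatable.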

Your future self will thank you. I cannot emphasize that enough. Here's how to get started.

Tracking Resources

Whether or not you use IaC, another fun (and free!) service is Resource Explorer. You enable it in all (or selected) regions, and the service will scour your account to find all the resources that exist within them.

You create "indexes", and define with regions they relate to. Then you can create Views to act as saved filters.

This is a great way to find resources hiding in your account that you didn't know you created.
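
Once an index and a default view exist, you can search from code as well. A rough boto3 sketch (the query syntax is my reading of the docs, so treat it as an assumption):

    import boto3

    rex = boto3.client("resource-explorer-2", region_name="us-east-1")

    # Find EBS volumes the index knows about, across indexed regions.
    response = rex.search(QueryString="resourcetype:ec2:volume", MaxResults=50)
    for resource in response["Resources"]:
        print(resource["Arn"])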

Using the Root Account

The first thing you do when creating an AWS account is set up a solid password and enable MFA.

The second thing you do is create an IAM user, and use that user to log into the console going forward.

From then on, the root user should generally be considered a "break glass" user - used only in emergencies. This is especially true when using multiple accounts. It's less true when you run one AWS account and need the root user for billing (although you can give IAM users billing access).

Basically: Don't use the root account, and definitely don't ever grant it an access key.
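
For completeness, here's a boto3 sketch of bootstrapping that day-to-day user (the user name and password are placeholders; for anything long-lived, IAM Identity Center is the better option):

    import boto3

    iam = boto3.client("iam")

    # A day-to-day admin user so the root account can stay locked away.
    iam.create_user(UserName="admin-alice")
    iam.create_login_profile(
        UserName="admin-alice",
        Password="CHANGE-ME-immediately-1!",
        PasswordResetRequired=True,  # force a new password on first login
    )
    iam.attach_user_policy(
        UserName="admin-alice",
        PolicyArn="arn:aws:iam::aws:policy/AdministratorAccess",
    )

Then enable MFA for that user too, and scope its permissions down once you know what it actually needs.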

Spending on Forgotten Snapshots

Snapshots can pile up if you have automated processes that create AMIs, duplicate drives to move them across AZs, or any number of situations.

These are cheaper than EBS volumes per gigabyte, but they can still add up.

Here's what you do: Use the Data Lifecycle Manager. Essentially, you set tags on your resources (EBS volumes), usually on creation (say, when you create some EC2 instances), and then tell Lifecycle Manager to use those tags.

Based on those tags, Lifecycle Manager will find the EBS volumes to manage, automatically create backups (snapshots) of them, and prune older ones. You set how many to keep, or how long to retain them.

It's one of AWS's easier services to set up, although it requires a tagging strategy for your EBS volumes.
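
Here's a hedged boto3 sketch of such a policy (the account ID is a placeholder, and the role shown is the default one DLM can create for you):

    import boto3

    dlm = boto3.client("dlm", region_name="us-east-1")

    # Snapshot every volume tagged Backup=true once a day; keep the last 7.
    dlm.create_lifecycle_policy(
        ExecutionRoleArn="arn:aws:iam::123456789012:role/AWSDataLifecycleManagerDefaultRole",
        Description="Daily snapshots for tagged volumes",
        State="ENABLED",
        PolicyDetails={
            "ResourceTypes": ["VOLUME"],
            "TargetTags": [{"Key": "Backup", "Value": "true"}],
            "Schedules": [
                {
                    "Name": "daily",
                    "CreateRule": {
                        "Interval": 24,
                        "IntervalUnit": "HOURS",
                        "Times": ["03:00"],
                    },
                    "RetainRule": {"Count": 7},
                }
            ],
        },
    )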

Not Tagging Everything

This section was almost entitled "Not Using Cost Allocation Tags", but that's not good enough.

You should be tagging everything. Add a lot of tags, and over-describe things. You can tag almost anything, even individual S3 objects!

Here's a tag I'd be happy to see:

Description: Bob created this while debugging issue XYZ, safe to delete

They're great for context when clicking around the UI, but they're best for automating stuff - querying for resources, using Lifecycle Manager, cost reporting, service discovery, automating infrastructure, internal tooling, and much more.

Here's how to get started with tagging. Just set up a standard set of tags and tag everything you see! I use some of these:

  • Environment - staging, production, etc
  • Team - Marketing, Bob-Special-Team, Micro-Service-4b - anything that lets you know what team/app this resource pertains to
  • Organization - Finance, Dublin-IT, anything that relates to an organization within the company as a whole
  • ManagedBy - Terraform, My-Custom-Automation, Chris in Accounting - something that says who created and manages a resource
  • Role - What's this used for within the application? Web servers, queues, business intelligence - whatever this thing is for
  • Visibility - Public, private, not-for-joe. Something that denotes whether this is a private or public resource
  • Project - The name of the project/application this pertains to
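
Here's what applying a standard tag set looks like with boto3, using some of the tags above (the resource IDs and values are placeholders):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Apply the standard tag set to an instance and its volume in one call.
    ec2.create_tags(
        Resources=["i-0123456789abcdef0", "vol-0123456789abcdef0"],
        Tags=[
            {"Key": "Environment", "Value": "production"},
            {"Key": "Team", "Value": "Marketing"},
            {"Key": "ManagedBy", "Value": "Terraform"},
            {"Key": "Role", "Value": "web-server"},
            {"Key": "Project", "Value": "storefront"},
        ],
    )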

The best part of tags is that they can be set as Cost Allocation Tags within the AWS Billing section. Once you tell AWS which tags to incorporate as cost allocation tags (yeah, don't forget that step 😅), bills can then be separated by tag. This lets you tell which team/application/developer/environment/whatever is driving the most (or least!) cost within your AWS account(s).

Not Understanding IAM

IAM is a bit confusing at first. And then it gets even more confusing.

There are a ton of options, and even when you sort through them, creating sane IAM policies (to allow you to make the API calls you need) is opaque. It's hard to know which API actions you need to allow!

On top of that, there's more to IAM than allowing certain users to make some API calls (although you can get really far without having to deal with that).

In any case, having at least a basic understanding of IAM permissions is a must - it permeates every aspect of using AWS.
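
As a starting point, here's a sketch of creating a deliberately narrow policy with boto3 (the bucket name is a placeholder). Starting narrow and adding actions as API calls fail is a sane way to discover what you actually need to allow:

    import json

    import boto3

    iam = boto3.client("iam")

    # Read-only access to a single bucket - nothing else.
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    "arn:aws:s3:::my-example-bucket",
                    "arn:aws:s3:::my-example-bucket/*",
                ],
            }
        ],
    }

    iam.create_policy(
        PolicyName="read-only-my-example-bucket",
        PolicyDocument=json.dumps(policy),
    )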

This video from AWS is a good primer, and this video is my favorite IAM related video. I also have a mini course on IAM Basics which gets into some fancier use of Roles.
