Leveraging mispriced AWS spot instances (pauley.me)
184 points by ericpauley on Oct 21, 2022 | 74 comments



As someone who spends entirely too much time thinking about cloud infrastructure costs (I'm co-founder of https://www.vantage.sh/ which maintains https://ec2instances.info/ ) I just want to recognize the amount of effort that went into this blog post to collect the data and express an interesting perspective on a fairly complicated topic.

Kudos to the author on producing this.


You should know we use this tool _inside_ of AWS as well. Not in EC2 itself, but in many, many other places.


Thanks to you and the Vantage.sh team for freely hosting https://ec2instances.info . This site is a great resource and got me to follow what Vantage is up to more frequently.


Great work on vantage.sh and ec2instances.info!

Quick, small fix: This instance is shown as having 0 GBs of memory, but in fact it has 0.5 GB. https://instances.vantage.sh/aws/ec2/t4g.nano


It's community supported! We just pay the bills and maintain the hosting :)

Do you mind opening an issue on the repo here? https://github.com/vantage-sh/ec2instances.info

Thank you for the report!



Oh I've used ec2instances.info very _very_ often, so thank you for that. So useful!


I use ec2instances site every day, thank you for your service o7


We run a very large installation 100% on spot and have done for a few years. We serve our web traffic, do background work, etc. all on spot instances.

We see similar mismatched pricing all the time and take advantage of it. One additional area not called out here is the difference between c5.24xlarge and c5.metal instance pricing. These are pretty much identical hardware but metal instances are often cheaper.

As you go down this path, do expect to see a lot of weird things that you'll have to track down. For example, when we introduced metal instances we found that the default ubuntu AMI launched with a powersave cpu governor. Non-metal instances don't support CPU throttling so it never came up with c5.24xlarges. When we first launched metal instances the performance per instance was significantly worse and took a bit of work to track down.

Recently we've seen a lot more spot interruptions and it's pushing us to incorporate more 6th gen instances to get us more diversity. We've also temporarily switched to capacity optimized over price optimized and we've enabled capacity rebalancing.
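For anyone curious, the switch looks roughly like this with the AWS CLI (the ASG name is a placeholder and the launch template part of the mixed-instances policy is omitted):

  # Sketch: flip an existing ASG to capacity-optimized spot and turn on capacity rebalancing
  aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name my-spot-asg \
    --capacity-rebalance \
    --mixed-instances-policy '{
      "InstancesDistribution": {
        "OnDemandPercentageAboveBaseCapacity": 0,
        "SpotAllocationStrategy": "capacity-optimized"
      }
    }'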

It's absolutely a win for us from a pricing perspective. Our traffic is extremely variable each day and very seasonal throughout the year. RIs don't make sense given <12 hrs daily peak and 10x difference between July and September. However, just plan for some odd surprises along the way.


There is a tale - perhaps apocryphal - handed down between generations of AWS staff, of a customer that was all-in on spot instances, until one day the price and availability of their preferred instances took an unfortunate turn, which is to say, all their stuff went away, including most dramatically the customer data that was on the instance storages, and including the replicas that had been mistakenly presumed a backstop against instance loss, and sadly - but not surprisingly - this was pretty much terminal for their startup.

Caveat operator.

(I’m sure parent commenter is either not exposed to this scenario or has otherwise mitigated against it)


We've worked closely with our team at AWS to ensure we are following best practices. The consensus has been that 4+ AZs and 12 instance types is sufficient diversification.

We also have a second, on-demand, ASG ready to fire up at a moment's notice if something were to happen with capacity.

We also heavily leverage managed services for state.


But wouldn't the rds snapshots or whatever still be there? I don't understand why this caused data loss.


There is no RDS in this tale. All their data was on EC2 spot instance storage.


Absolute yikes.


Have you observed metal instances taking longer to boot? I did last time I checked, and the difference was big enough to affect pricing in a non-trivial way, given that performance is the same and that you start paying immediately.


This is a good point. They do take longer to boot, which might be part of the reason there's a discount, but it hasn't been so significant that we avoid them because diversification is important when running on spot.


Yes they take significantly longer to boot.


The article makes a HUGE assumption.

They spot an inconsistency between two prices, and decide that the fair market value must be the very highest part of the spread. Anything under this is therefore "under-priced".

Is it not possible instead that people are overpaying for the popular ones through sub-optimal bids, rather than these inflexible/least sophisticated bidding strategies representing the fair market value?

They actually go further. They assume that AWS could realize this value, and that encouraging more flexible bids through tooling etc. would move everyone to the top of the spread, instead of smoothing it out towards the average. And that what is essentially a price-increase can be achieved without hurting the overall value (price-performance vs. flexibility). Given the entire point of this is auctioning unused cycles at a discount, clearly any overall increase would decrease the overall demand.

Having said this it's a great article. I think the overall quality of the article made it so surprising to see this missed.


Author here. This is definitely a big assumption. I cut the price differences in half to account for market movement, but the price difference could definitely be more or less, especially as these pools are probably thinner markets.


As I said - it's a great article! This was just one thing I noticed which I pointed out as it made me think.

Keep up the good work


There's underlying capacity as well. Would you rather pay a bit more to get 100 r6g.4xl, or pay a bit less to have 90 r6g.4xl + 10 r6gd.4xl?

Many workloads do not have deployment configuration supporting a non-homogenous fleet of instances. Over time this will be addressed, but it could be a current major contributor to the discrepancies viewed.


It is completely incorrect to characterize these observations as "mispricing" - this is a quirk of automatically-determined prices across very different products. If the author actually tried to use these instances in any significant volume they would understand the driver - capacity pools are nowhere near equal, and not as interchangeable for AWS as the article implies they would be for a user. Prices reflect demand munged with available capacity - uncommon instance types are uncommon precisely because they aren't used as much, so there aren't the same signals to drive the price up and down automatically.

Instances with attached NVMe are available in much lower volumes than others, as are AMD instances. Obviously these pools cannot be used as a drop-in replacement for non-"d" instances or Intel families.


In financial markets, this quirk of automatically-determined prices across different products is frequently called "mispricing" when those products logically should have a relationship with each other.

Straightforwardly: All hosts with space for a c6gd spot instance have space for a c6g instance. If Amazon is willing to host a c6gd instance in that slot for $X, they should be willing to also host a c6g instance there for $X.

In financial markets, the way this gets handled is through arbitrage: someone will buy the equivalent of the c6gd instance, and sell the c6g part for the higher price (they may also sell the "d" part for even more money). This has the effect of "correcting" the price. The AWS spot market does not allow you to do arbitrage, and AWS doesn't appear to do the arbitrage for you.

AWS probably likes this inefficiency in their market: some instance types are more popular than others, and some customers make assumptions that require them to use a very specific instance type (ie a c6gd would not work as a substitute for their c6g instance). However, the vast majority of users probably could work just fine if their c6g instance were a c6gd, and don't look for the arbitrage opportunity. That means Amazon gets paid extra.


> If Amazon is willing to host a c6gd instance in that slot for $X, they should be willing to also host a c6g instance there for $X.

The reality is that direct c6gd demand might be an order of magnitude lower than c6g direct demand - if AWS can get some more flexible people to adopt c6gd by offering a lower price, c6g capacity is slightly stabilized for on-demand usage by people who don't value the flexibility.

Also note that c6g to c6gd has a non-zero switching cost - extra NVMe on the instance adds a new source of potential hardware failure, increasing the probability of termination very slightly. There might be other software-related costs depending on whether your application makes any ill-advised assumptions about attached storage during setup.

So overall, I would just be happier to read this article if it was framed as "PSA: having more features in an ec2 instance is sometimes cheaper! Don't rule yourself out of extra savings by making overly-constrained fleet requests." The extra commentary about foregone revenue makes too many assumptions and detracts from the core point.


The point is that Amazon doesn't have to fill that slot with a c6gd. They can also fill it with a c6g. They just choose not to.

The fact that you have to host a c6gd to get that price instead of a c6g is an inefficiency in the spot market that likely makes Amazon money, but is a little customer-hostile. I think the article is probably wrong that Amazon is foregoing revenue due to this. This is a form of price discrimination and it is likely making Amazon money, but in a scummy way.


Agreed that it's definitely difficult to know the true missed revenue here without internal data, and even then you'd be making some assumptions. I am confident there is some missed revenue here, as amazon routinely has spot capacity constraints under existing prices so could definitely sell some substitute instances without moving the original instance market (even one instance per pool substituted equates to >$1M per year). In either case, a savvy organization can definitely benefit from the price discrepancy even if Amazon couldn't.


I can agree that there is missed revenue - but realistically it would make much more sense to sell that capacity via Fargate (which is closer to undifferentiated generic compute and RAM) rather than monkeying with the spot pricing algorithm.


Great point on Fargate, I'd be very curious whether they select capacity for that from EC2 capacity or if there's a separate physical footprint for it.


Author here. The key here is that customers can leverage these pools in addition to their existing pools, improving capacity and price. AWS actually supports this out of the box (including substituting instances with drives) by specifying core and memory requirements directly instead of instance types.
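For reference, a rough sketch of this with an ASG (names, subnets, and numbers are placeholders; the InstanceRequirements block takes the place of an explicit instance type list):

  aws autoscaling create-auto-scaling-group \
    --auto-scaling-group-name my-flexible-asg \
    --min-size 0 --max-size 10 --desired-capacity 2 \
    --vpc-zone-identifier "subnet-aaa,subnet-bbb" \
    --mixed-instances-policy '{
      "LaunchTemplate": {
        "LaunchTemplateSpecification": {"LaunchTemplateName": "my-template", "Version": "$Latest"},
        "Overrides": [{
          "InstanceRequirements": {"VCpuCount": {"Min": 8, "Max": 16}, "MemoryMiB": {"Min": 32768}}
        }]
      },
      "InstancesDistribution": {"OnDemandPercentageAboveBaseCapacity": 0, "SpotAllocationStrategy": "lowest-price"}
    }'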


Totally agree with that; it is a pretty common approach. The only part I don't agree with is calling out the price differences as some kind of "gotcha" that AWS somehow missed, particularly given the speculative "lost revenue" data which have no basis in reality.


See the emphasis on transparent substitutes in the article. This analysis is limited strictly to sets of instances that are fully hardware compatible, meaning AWS could resell one instance as another. There are way more savings to be had as a customer by leveraging instances that aren't transparent substitutes.


I read it all, and don't agree with your interpretation of "transparent substitutes" in several of the cases.


Which instances are not transparent substitutes, in your opinion? Keep in mind the definition here is that Amazon could substitute the image transparently, e.g., by ignoring the additional resources in the hypervisor, not that the instances are by default indistinguishable.

That being said, the substitute instances considered could be trivially accepted by any task running on the original instance, so long as it doesn't misbehave when given too many resources. In the case of vCPU, you can even hide extra vCPU cores, so a c6g.2xlarge can be made effectively indistinguishable from an m6g.xlarge by disabling the extra vCPUs at the hypervisor level.
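As a sketch (the AMI ID is a placeholder, and CpuOptions support can vary by instance type):

  # Launch a c6g.2xlarge (8 vCPU / 16 GiB) but expose only 4 cores,
  # matching the m6g.xlarge shape (4 vCPU / 16 GiB)
  aws ec2 run-instances \
    --image-id ami-0123456789abcdef0 \
    --instance-type c6g.2xlarge \
    --cpu-options CoreCount=4,ThreadsPerCore=1 \
    --instance-market-options MarketType=spot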


> Across all AWS availability zones instances are mispriced by roughly $400/hr at any given time. This means that, with just a single instance of each type, Amazon is missing out on $200/hr or roughy $1.7 million each year. This is over roughly 15,000 pools of instances. Given Amazon controls roughly 100 million IPs, we can guess that each instance pool probably has on the order of 1000 instances (more for smaller instances, less for larger instances). Given this, the average mispriced pool might have hundreds of instances, meaning hundreds of millions each year in missed revenue due to mispriced spot instances. Because amazon keeps their number of instances a secret, it’s difficult to make a precise estimate from the outside, but the missed revenue probably falls somewhere in this range.

You are hypothesizing that the price differences produce "lost" revenues.

An alternative hypothesis can be that the price differences produce similar or higher level of revenues for AWS through price segmentation, with Amazon recognizing the lack of adoption of certain spot instance bidding features and auction markets reacting appropriately.

Unless you have the capacity and quantity demanded for each instance types, you can't prove your hypothesis. You are assuming scenario 3 (below) with no insights into price elasticity of the underlying customers.

Example:

  Baseline:
Instance types A and B are equivalent.

A is priced at $3, with capacity of 1,000 and quantity demanded of 800.
B is priced at $2, with capacity of 1,000 and quantity demanded of 200.
Total quantity demanded = 1,000.

Revenues from instance type A = $3 x 800 = $2,400
Revenues from instance type B = $2 x 200 = $400
Total revenues = $2,800

  Scenario 1: All customers purchase instance B instead due to better price discovery.
Revenues from instance type A = $3 x 0 = $0
Revenues from instance type B = $2 x 1,000 = $2,000
Total quantity demanded = 1,000.
Total revenues = $2,000

Amazon loses $800 in revenues; there are no "lost" revenues recovered.

  Scenario 2: Amazon changes instance type B price to $3. Total quantity demanded decreases to 900 due to price elasticity of instance type B customers.
Revenues from instance type A = $3 x 800 = $2,400
Revenues from instance type B = $3 x 100 = $300
Total revenues = $2,700

Amazon loses $100 in revenues; there are no "lost" revenues recovered.

  Scenario 3: Amazon changes instance type B price to $3. Total quantity demanded remains at 1,000.
Revenues from instance type A = $3 x 800 = $2,400
Revenues from instance type B = $3 x 200 = $600
Total revenues = $3,000

Amazon recovers $200 in "lost" revenues.


The missing component of your analysis is that Amazon has a 4th option: re-sell instances of B as instances of A when A is more expensive, and otherwise allow the market to adjust. The analysis is strictly limited to instances where Amazon could, in theory, do this (e.g., reselling c6gd as c6g).

Assuming the market is in equilibrium, the above scenarios aren't realistic, as demand at the market price would equal supply at the current price (roughly, of course).

Suppose there are 1000 c6g and 200 c6gd, with equilibrium prices of $3 and $2, respectively (i.e., all instances have demand). Amazon re-SKUs c6gd as c6g until there are 1100 c6g selling for $2.90 and 100 c6gd selling at $2.90. Total revenue is $3480 vs. $3400. Of course it's impossible to know the true numbers without hidden knowledge of the market, but this is more akin to what would occur. Amazon effectively has a risk-free arbitrage opportunity here, so it stands to reason that there is revenue to be made. Customers don't have this option (since you can't short spot instances), so the best you can do is diversify and save money.

Edit: Actually, the AWS spot market is often out of equilibrium in a way that makes this reselling even more effective. For instance, in the example in the article the c6gd instance is actually pegged at the minimum price, so some number of those instances could be resold as c6g without moving the c6gd price at all.


I think you're thinking about the revenue functions for spot instances in isolation from the larger supply base of all instances. Spot instances are already a result of revenue management of a fixed supply base that increases in discrete increments over time. Instance capacity overall usually leads instance demand; shortage costs are very high in data centers.

Spot instance capacities are a function of the all instance capacity for the same type and on-demand instance usage. Spot instance pricing can influence the quantity demanded of on-demand instances of the same type, and vice-versa.

Anyhow, there’s no way we can figure out whether you’re right or wrong with any reasonable level of certainty.


While it's tough to say with certainty how much revenue is lost, there is certainly lost revenue. Consider that many substitute instances are available at the minimum allowable price (i.e., won't go any lower, there is unused capacity). These could be resold without moving the substitute market.


The mispricing is likely good for Amazon. It indicates that most people aren't doing this arbitrage, so Amazon can milk them for extra money.


If you want to leverage cheap spot, use us-east-2 / Ohio region. The prices are typically half of what you see in us-east-1.

Also, it really helps to analyze at the AZ level. Certain AZs lack instances or have very low spot availability, and contrary to recommended best practice, reducing AZs can sometimes be beneficial (I am looking at you, eu-central-1a).

While lowest price sounds nice, it can be really messy in terms of spot interruption rate. It is much better to set a max price and choose capacity optimized with as many instance types as possible.


> eu-central-1a

FYI, AZ names are not universal. Your eu-central-1a might be someone else's eu-central-1b.


This actually depends on the region. Amazon stopped randomizing AZ names in new regions quite a while ago, while also offering the AZ ID as a guaranteed identifier in all regions.


I run a service with an API that can help get spot prices: https://ec2.shop/

Simply do:

  curl 'https://ec2.shop?region=us-west-2&filter=m5&json' | jq

You can pipe it into whatever store your system uses to get the real-time price without dealing with the AWS Price API.


> For example, you can make all these substitutions:

>c6g.2xlarge→c6g.4xlarge→m6g.4xlarge→r6g.4xlarge→r6gd.4xlarge

A long-standing ticket in my personal project backlog is comparing the performance of different instance types. I'm not sure this equivalence is without caveats.

Anyhow, the reason "misprices" exist is because:

- Many AWS products are elastic but only allow one to choose a single instance type. So you need to guess the best instance for a workload and stick with it.

- No AWS product exposes a "Just give me the cheapest VM with x CPU and Y memory" API


An autoscaling group with a mixed instance type spot strategy does that. You can even give weights to instance types, giving more performant/higher-capacity types a higher weight, and it can choose the cheapest one with the weights in mind.


The end of the article shows a request for just that, doesn't it? No clue about the API though...

Depending on your workload you might be able to actually substitute a single 8xlarge with two 4xlarge for example... A while back I was actually doing something like that to save some money :-)


Wow totally missed that. Cool stuff!


There's actually an AWS API that does that. That's create-fleet in 'instant' mode. Docs are here: https://docs.aws.amazon.com/cli/latest/reference/ec2/create-... In short, you create a request specifying your CPU and memory constraints (you can also specify other constraints like processor manufacturer, memory per vCPU, etc). Then select lowest-price allocation strategy, and fleet "--type instant". Fleet will make a synchronous request provisioning the currently cheapest instance(s) which satisfy the constraints you selected, and will respond with the instance IDs.
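A rough skeleton of that call (the launch template name and the requirement numbers are placeholders):

  aws ec2 create-fleet \
    --type instant \
    --target-capacity-specification TotalTargetCapacity=1,DefaultTargetCapacityType=spot \
    --spot-options AllocationStrategy=lowest-price \
    --launch-template-configs '[{
      "LaunchTemplateSpecification": {"LaunchTemplateName": "my-template", "Version": "$Latest"},
      "Overrides": [{
        "InstanceRequirements": {"VCpuCount": {"Min": 8}, "MemoryMiB": {"Min": 65536}}
      }]
    }]'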


AWS doesn't want you to have that API! That's a significant part of their margin.


> Compute-optimized (C) instances can substitute for a general-purpose (M) instances of half the size

They do have the same amount of memory (and twice the CPU). But if you run a workload that automatically scales to the number of available cores, starting twice the number of processes / threads might well run you out of memory.

The article is interesting, but blindly running your code on unexpected instance types may be more "exciting" than the author makes it sound.


Author here. If you design the workload accordingly, you can ignore the extra resources. You can actually hide these CPU cores from instances within the AWS API (see the instance CPU options settings), so it is truly transparent.


Interestingly GCP already offers over 75% discounts for n2d (AMD) spot instances that don't rely on any internal market, and the discounts for other families are fairly close.

We see individual spot instances go away every few days which works pretty well for GKE. The older preemptible class of instances restarted every 24 hours which was more of a pain (mitigated a bit with a preemptible killer to spread the restarts out).


One thing to note is volatility. Spot instances are great for workloads that can absorb spot instance interruptions, and those interruptions tend to happen more if everyone else is trying to get spot instances at that time. Stateless web workloads that can startup and shutdown fast are a good example.

Some workloads might not. You wouldn't want to run stateful workloads on spot, for instance. In our case, we have something that doesn't handle bootup under load very well, and until we can improve that, the overall reliability is not as good.

I also like GCP's way of pricing these: you say whether your workload is preemptible or not, and you get discounts. You automatically get discounts if you run the workload for a long time.


A few years ago, I noticed prices for the common EC2 instance types (c5.large) spiking more frequently than for their n and d variants (faster network and additional NVMe disks, respectively). This prompted me to learn how to use them.

There are differences, especially if you actually want to take advantage of the additional resources. Once I learned how to use the d variants (mounting the NVMe disks for maximum advantage, dealing with the lack of persistence, and so on), the d spot instances were a steal! A lot of upfront cost to use them properly, but performance was excellent and the price remained at the market floor almost 100% of the time. In the long run it was cheaper than the same machine without the fast disk attached! I assumed someone would figure it out and my advantage would disappear as the market corrected.

But I underestimated the sheer number of EC2 instance types to evaluate! While it's true that having a diversity of instance types will help you keep spot costs down and ensure you're never priced out of that one blessed instance type, there's an upfront cost to make that system work. Engineering teams just don't have the time to test on them all. At worst there are hidden differences that negatively impact performance; at best you might fail to take advantage of the resources and leave them idle (can your app even push 25 Gbps or max out an NVMe RAID?).
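For anyone trying the d variants, the one-time setup is roughly along these lines (device names vary per instance, so check lsblk first; the disk is ephemeral):

  lsblk                                  # find the instance-store NVMe device
  sudo mkfs.ext4 /dev/nvme1n1            # often nvme1n1, but verify with lsblk
  sudo mkdir -p /mnt/scratch
  sudo mount /dev/nvme1n1 /mnt/scratch   # scratch only: gone after stop/interruption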


I wrote a program to get AWS spot instance pricing. This program is similar to using "aws ec2 describe-spot-price-history" but is faster and has a few more options.

https://github.com/jftuga/spotprice


Fun article; the phenomenon is interesting to see in practice. I've seen it regularly with newer instance types, as it can take time for people to add them to their configurations.

We're heavy users of spot here at Intercom. I spot-checked our biggest workload, and this week we could have paid around 10% less if we were able to get the cheapest spot host possible in us-east-1 that is suitable for our workload (all 16xlarge Gravitons). However, that would be at the cost of fleet stability. I think that to run relatively large production services used in realtime on spot you need to prioritise fleet stability, so we choose the "Capacity Optimized" strategy. We've seen incessant fleet churn when trying out cost-optimised strategies.


Is there tooling to find the global minimum price for an instance with certain characteristics?

I found it easy enough to do that in one region, but I've got some compute workloads that just read/write from S3 and are not latency sensitive.

They do need 128 GB RAM and ephemeral disks.


Spot fleet requests allow you to set minimum specs for instances, and the fleet will be composed of any instances that meet the spec. If it's asynchronous work, you could pick lowest price allocation and not worry too much about interruptions. In fact, if your work is tolerant of interruptions (batch size <2min), you can actually save even more by being interrupted, as you don't get billed for partial hours: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/billing-...


Does it launch the fleet at the global minimum price? "Any instances that meet the spec" doesn't seem like it's a worldwide price minimum.


You can set this to emphasize diversity or price when creating the request.


Across all regions? Everything I've seen is only for one region.


Correct, though generally you’d have region constraints on your workload. We do see a lot of value in going cross-region for expensive tasks that aren’t latency or data transfer sensitive, such as running machine learning model training in cheaper regions.
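A quick way to eyeball the cross-region spread with the stock CLI is something like this (the instance type is just an example):

  for r in $(aws ec2 describe-regions --query 'Regions[].RegionName' --output text); do
    echo "== $r"
    aws ec2 describe-spot-price-history --region "$r" \
      --instance-types r6gd.4xlarge --product-descriptions "Linux/UNIX" \
      --start-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
      --query 'SpotPriceHistory[].[AvailabilityZone,SpotPrice]' --output text
  done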


> compute workloads that just read/write from S3

> need 128 GB RAM

Eh?


I took "just read/write from S3" to mean that they didn't interact with any other AWS services apart from S3. Such that they didn't care where in the world it ran.

Not that they didn't do anything memory intensive.


You got it. It's some drone image processing. Read in data from S3, do analysis, write results.


My biggest problem with this is that AWS does not seem to have any (easy) API driven way to do this. Even if you need one instance you need to literally use their Fleet API to be able to specify these conditions.

I just want to specify the allowed instances in my CreateInstance command. Create an instance in these subnets, with these allowed instances, preferably a spot instance, but if none exists I’ll be happy with a normal one.


I've spent a lot of time trying to capitalize on these mispricings - and often they're priced like that because the capacity in that region/configuration is much lower and you are exposed to many more preemptions than in higher-priced regions/configurations.


Quite surprised that there is this degree of mispricing. I would have thought it's a market that is big and diverse enough to iron that out, especially given that the participants in question would tend toward the analytical side of things.


I was thinking the same thing. I'm wondering if the price differences reflect a general demand for certain sizes? When I was maintaining AWS servers, I don't think it would have been easy for me to take advantage of spot prices that were outside of the sizes I was already using. I'd tuned things such that I knew the sizes I tended to need to have the redundancy I needed, and could then auto scale when necessary. Which means I would never have bid on spot instances that were bigger than what I needed, because it would have been way more complicated to analyze the state of the system as a whole and make sure scaling happened when it needed to. Which also introduces risk that probably was never worth the savings. So if you had a lot of people like me, you'd get m3.large (or whatever the current naming is) as the thing that gets bid up the most, because it hit an autoscaling sweet spot.


> it would have been way more complicated to analyze the state of the system as a whole and make sure scaling happened when it needed to

Yeah, that's probably what's going on here. Complexity, plus it's just a bit counterintuitive.


If you use EKS, we've had a great experience with Karpenter (karpenter.sh).

It'll look at your pods' CPU and memory requests and choose an appropriate instance type for you, and the cheapest spot instances where appropriate.


Just curious, what kind of workflow do you have to run to accept that your host can stop at any time?


Anything that you can easily checkpoint or finish quickly while needing a ton of computers to do: map-reduce-type jobs.

For example, let's say you need to process a few million images, each taking a few seconds to process. You can start a manager task that distributes images and a pool of interruptible workers; when a worker dies, you just reissue its images to another.
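Workers can also watch for the two-minute interruption notice and checkpoint before dying; roughly (IMDSv2 token handling omitted):

  # instance metadata returns 404 until an interruption is scheduled
  while true; do
    if curl -fs http://169.254.169.254/latest/meta-data/spot/instance-action >/dev/null; then
      echo "interruption scheduled: checkpoint and stop taking new images"
      break
    fi
    sleep 5
  done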


And lots of spot instance types can be automatically hibernated.


Literally anything stateless.



