We run a very large installation 100% on spot and have done for a few years. We serve our web traffic, do background work, etc. all on spot instances.
We see similar mismatched pricing all the time and take advantage of it. One additional area not called out here is the difference between c5.24xlarge and c5.metal instance pricing. These are pretty much identical hardware but metal instances are often cheaper.
As you go down this path, do expect to see a lot of weird things that you'll have to track down. For example, when we introduced metal instances we found that the default Ubuntu AMI launched with a powersave CPU governor. Non-metal instances don't support CPU throttling, so it never came up with c5.24xlarges. When we first launched metal instances, the performance per instance was significantly worse, and the cause took a bit of work to track down.
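For anyone who wants to check for the governor issue at boot, here is a minimal sketch, assuming a Linux guest that exposes the standard cpufreq sysfs files. The function names and the choice of "performance" as the expected governor are my own assumptions, not what we actually run.

```python
from collections import Counter
from pathlib import Path


def governor_counts(sysfs_root="/sys/devices/system/cpu"):
    """Count scaling governors across CPUs.

    Returns an empty Counter when the directory is missing or when the
    kernel exposes no cpufreq files for its CPUs.
    """
    counts = Counter()
    root = Path(sysfs_root)
    if not root.is_dir():
        return counts
    for path in root.glob("cpu[0-9]*/cpufreq/scaling_governor"):
        counts[path.read_text().strip()] += 1
    return counts


def needs_attention(counts, expected="performance"):
    """True if any CPU reports a governor other than the expected one."""
    return any(governor != expected for governor in counts)


if __name__ == "__main__":
    found = governor_counts()
    if not found:
        print("no cpufreq governors exposed")
    elif needs_attention(found):
        print("unexpected governors:", dict(found))
    else:
        print("all CPUs use the expected governor")
```

Running something like this from a boot-time health check would flag a powersave governor before the instance takes traffic.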
Recently we've seen a lot more spot interruptions, and it's pushing us to incorporate more 6th gen instances to get more diversity. We've also temporarily switched from price-optimized to capacity-optimized allocation, and we've enabled capacity rebalancing.
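As a rough illustration of those two settings, here is a sketch that builds the keyword arguments for boto3's Auto Scaling `create_auto_scaling_group` call. The helper name, group name, subnet IDs, launch template name, and the instance type list are all placeholders I made up; only the request field names and the "capacity-optimized" strategy value come from the API as I understand it.

```python
def spot_asg_request(name, subnet_ids, instance_types, launch_template_name, max_size):
    """Build kwargs for an all-spot, capacity-optimized Auto Scaling group.

    Hypothetical helper: it only assembles a dict and makes no AWS calls.
    """
    return {
        "AutoScalingGroupName": name,
        "MinSize": 0,
        "MaxSize": max_size,
        # Subnets in several AZs, passed as one comma-separated string.
        "VPCZoneIdentifier": ",".join(subnet_ids),
        # Launch replacements proactively when a spot instance is at elevated risk.
        "CapacityRebalance": True,
        "MixedInstancesPolicy": {
            "LaunchTemplate": {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateName": launch_template_name,
                    "Version": "$Latest",
                },
                # One override per instance type gives the group more spot pools.
                "Overrides": [{"InstanceType": t} for t in instance_types],
            },
            "InstancesDistribution": {
                "OnDemandBaseCapacity": 0,
                "OnDemandPercentageAboveBaseCapacity": 0,  # 100% spot
                "SpotAllocationStrategy": "capacity-optimized",
            },
        },
    }


if __name__ == "__main__":
    request = spot_asg_request(
        name="web-spot",  # placeholder
        subnet_ids=["subnet-aaa", "subnet-bbb"],  # placeholders
        instance_types=["c5.24xlarge", "c5.metal", "c6i.32xlarge"],
        launch_template_name="web-template",  # placeholder
        max_size=100,
    )
    print(request["MixedInstancesPolicy"]["InstancesDistribution"])
```

With credentials configured, the dict could be passed as `boto3.client("autoscaling").create_auto_scaling_group(**request)`; the sketch stops short of that so it runs without an AWS account.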
It's absolutely a win for us from a pricing perspective. Our traffic is extremely variable each day and very seasonal throughout the year. RIs don't make sense given <12 hrs daily peak and 10x difference between July and September. However, just plan for some odd surprises along the way.
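To make the RI reasoning concrete, here is a back-of-the-envelope sketch. The hourly rates are invented for illustration and are not real AWS prices; the 12-hour peak and the 10x seasonal swing are the only figures taken from our situation.

```python
def break_even_utilization(reserved_hourly, pay_per_use_hourly):
    """Fraction of hours an instance must be busy before reserving it is cheaper.

    A result above 1.0 means reserving never wins at these rates.
    """
    return reserved_hourly / pay_per_use_hourly


def utilization(busy_hours_per_day, seasonal_fraction=1.0):
    """Share of paid hours in use for an instance sized for the yearly peak."""
    return (busy_hours_per_day / 24.0) * seasonal_fraction


if __name__ == "__main__":
    # Hypothetical rates, in dollars per instance-hour.
    on_demand, reserved, spot = 1.00, 0.60, 0.35

    peak_season = utilization(12)        # needed 12 hours a day
    low_season = utilization(12, 0.1)    # fleet is 10x smaller off-season

    print("utilization, peak season:", peak_season)
    print("utilization, low season:", low_season)
    print("break-even vs on-demand:", break_even_utilization(reserved, on_demand))
    print("break-even vs spot:", break_even_utilization(reserved, spot))
```

Under these made-up rates, an instance reserved for the seasonal peak would need to be busy about 60% of the time to beat on-demand, and it never beats spot, while a 12-hour daily peak caps utilization at 50% even in the busiest month.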
There is a tale - perhaps apocryphal - handed down between generations of AWS staff, of a customer that was all-in on spot instances, until one day the price and availability of their preferred instances took an unfortunate turn. Which is to say, all their stuff went away, including most dramatically the customer data that was on instance storage, and including the replicas that had been mistakenly presumed a backstop against instance loss. Sadly - but not surprisingly - this was pretty much terminal for their startup.
Caveat operator.
(I’m sure the parent commenter is either not exposed to this scenario or has otherwise mitigated it.)
We've worked closely with our team at AWS to ensure we are following best practices. The consensus has been that 4+ AZs and 12 instance types is sufficient diversification.
We also have a second, on-demand ASG ready to fire up at a moment's notice if something were to happen with capacity.
We also heavily leverage managed services for state.
Have you observed metal instances taking longer to boot? I did last time I checked, and the difference was big enough to affect pricing in a non-trivial way, given that performance is the same and that you start paying immediately.
This is a good point. They do take longer to boot, which might be part of the reason there's a discount, but it hasn't been so significant that we avoid them because diversification is important when running on spot.
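The boot-time effect can be put into numbers with a small sketch. The price, boot times, and instance lifetime below are hypothetical, and the formula simply assumes you pay for the whole lifetime but only get useful work after boot completes.

```python
def effective_hourly_price(hourly_price, boot_minutes, lifetime_minutes):
    """Price per useful hour when billing starts at launch.

    Assumes the instance does no useful work while booting.
    """
    if boot_minutes >= lifetime_minutes:
        raise ValueError("instance never finishes booting within its lifetime")
    useful_minutes = lifetime_minutes - boot_minutes
    return hourly_price * lifetime_minutes / useful_minutes


if __name__ == "__main__":
    # Hypothetical: same list price, different boot times, 2-hour lifetime.
    virtualized = effective_hourly_price(1.00, boot_minutes=1, lifetime_minutes=120)
    metal = effective_hourly_price(1.00, boot_minutes=8, lifetime_minutes=120)
    print(f"virtualized: {virtualized:.4f} per useful hour")
    print(f"metal:       {metal:.4f} per useful hour")
```

The penalty shrinks as instances live longer, so it matters most when spot interruptions or scale-in events keep lifetimes short.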