GCP releases Spot VMs, the next generation of Pre-emptible VMs (cloud.google.com)
198 points by drimphgol on Oct 13, 2021 | 124 comments



It's weird how the 3 major clouds have taken different paths to what must be an almost identical resource allocation problem.

AWS has had this kind of spot instance for years, but with a 2 minute grace period rather than the 30 seconds GCP is offering. Azure and GCP both originally went with the 24-hour cutoff (which can easily be replicated on a regular spot instance if needed), but now GCP are backing off on that restriction.


I don't remember Azure having the 24h cutoff. My spot/low-priority VMs used to run for weeks.

I used low-priority/spot VMs both in scale sets and as the more recent single VMs.

Also, I wouldn't call the cutoff and grace period "paths". There were much more substantial differences between the clouds.


I seem to remember the Azure 24 hour limit being in place in 2018, though it wasn't highlighted particularly well and came as a surprise to me. Could well have been removed since.


I suspect the 24h limit is to prevent angry calls from customers who buy spot instances because they're cheaper, then wonder why their VM randomly went down and blame it on Azure.

(I work for Azure but don't know anything about the policies)


They probably thought the 24-hour limitation could score some design wins; their customers may have proved them wrong.


There are different wins for both approaches. I train models on spot VMs, and training can take more than 24 hours. I set my maximum price higher than the reserved rate, so there is only a very slim chance that training gets stopped, and I get 80% savings on average. I don't want to spend time writing the complex logic to resume training after a spot instance dies.

For services, though, GCP preemptible instances are a perfect combo with Kubernetes.


I don’t see this in the article, but is there any reason why they’d introduce a new VM type rather than just removing the 24-hour limit of preemptible VMs?


It's probably a niche use case, but there is some utility to having a guaranteed daily shutoff. For example, you might spin up an instance as an on-demand remote dev environment, and the 24 hour cutoff ensures it doesn't accidentally get left on (over a weekend, for example).

This would be easy to work around, but nonetheless could lead to unexpectedly high charges if you were relying on this behavior only to have it silently change.


In AWS we just `sudo shutdown +1440`. Then we can cancel the shutdown later if we need to.


There's some bug we keep hitting that causes some EC2 instances to reboot on shutdown instead of halt and terminate. AWS engineers promised a fix months ago, no news yet.


In AWS I set up a few lambdas to shut down my instances automatically because I've had some cases where everything crashes so thoroughly that the shutdown doesn't go off.
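
For what it's worth, a minimal sketch of such a shutdown Lambda (not the commenter's actual code; the `auto-stop` tag name is just an illustration) could look like this, triggered by a scheduled EventBridge rule:

    # Hypothetical scheduled Lambda: stop any running instance tagged auto-stop=true.
    import boto3

    ec2 = boto3.client("ec2")

    def handler(event, context):
        resp = ec2.describe_instances(
            Filters=[
                {"Name": "tag:auto-stop", "Values": ["true"]},
                {"Name": "instance-state-name", "Values": ["running"]},
            ]
        )
        ids = [
            inst["InstanceId"]
            for reservation in resp["Reservations"]
            for inst in reservation["Instances"]
        ]
        if ids:
            ec2.stop_instances(InstanceIds=ids)  # or terminate_instances, depending on the setup
        return {"stopped": ids}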


That sounds painful!


That sounds like a horrific dev environment if it can be randomly shut down with 30 seconds' notice.


"Horrific" is massive hyperbole, at least for my use case. (I actually use a GCP preemptible instance in this manner for a personal project, and spot instances will have the same shutoff risk.)

I edit application code on my (relatively underpowered) laptop and automatically mirror it to the instance, where the running service picks up any changes and recompiles and relaunches as needed. It's a fairly chunky app code-wise, so moving the CPU usage off my laptop is very helpful.

When and if the instance shuts off early, I just relaunch it and reconnect. This amounts to one click of the mouse in the GCP UI, one locally run terminal command to connect, and one remote terminal command to (re)start my app. No work lost, and the cost savings are worth the minor inconvenience. Usually the instances last the full 24 hours anyway, and I usually shut it down when I'm done working, so interruptions are very rare.

I can understand that in the context of a larger company with more resources, a dev would be put off by this. But it works very well for my uses.


But what if you were in the middle of stepping through with the debugger?


One thing this helps with is keeping development environments up to date. When there's a large spread between instance lifetimes, it becomes more difficult to deploy updates and changes to developer infrastructure. This can grow into an org problem where shipping dev infra changes blocks on disrupting dev workflows.

You can approach this by continuously deploying upgrades to the dev fleet, but it's simpler to just set a lifetime bound (with an opt-out for special circumstances).


I can imagine it being useful if you needed to test something out for an hour or two and were worried you might forget to shut it down. Your charges are capped at 24h no matter how badly you screw up.


If a test workload goes wrong in an interesting way, someone would very likely want to ssh there to poke around. If the poking extends to ~24 hours later, it is bound to get brutally interrupted. The worst part: there is no easy way to avoid it. (Well, one could shut it down and start it again to reset the 24-hour timer, but that is cumbersome compared to removing a `shutdown` line from the crontab of a spot instance.)


It wasn't a 24h guarantee, it was UP TO 24h.

It did guard against keeping things on by mistake, but there are way more downsides to this kind of restriction, especially since you can simulate the daily shutoff yourself, as other comments here have said.


I run a small CI system which starts relatively beefy preemptible instances for jobs. Normally those instances are terminated by the job scheduler as soon as there's nothing more to do. But it's great peace of mind knowing they can't run for more than 24h if my job scheduler screws something up for whatever reason. I see it as a feature rather than a limitation for my use case.


Yes, but they could just provide a flag `maxLifetime` which defaults to `24h` but can be set to 0. (It could even accept only those two values.)

Instead they created a whole new API for basically the same feature.


Creating a new api has greater "impact".


This would be superior, of course. Ideally you'd be able to limit maximum runtime down to one hour.


Preemptible VMs are GA, Spot VMs are not; if there are changes that might be made to Spot before GA (such as other limits, etc.), this gives them the freedom to do that, without impacting those using Preemptible.


Branding, probably. Being in line with Azure and AWS is helpful when onboarding new customers.


It smells like some person/team trying to get a promotion for launching a big new feature, rather than their perf packet saying "added option to not terminate after 24h".

Functionally, I see no difference.


From https://cloud.google.com/compute/docs/instances/preemptible: "Spot VMs (Preview) are the latest version of preemptible VM instances. Preemptible VMs continue to be supported for new and existing VMs, and preemptible VMs now use the same pricing model as Spot VMs. However, Spot VMs provide new features that are not supported for preemptible VMs. For example, preemptible VMs can only run for up to 24 hours at a time, but Spot VMs do not have a maximum runtime."

So preëmptible VMs are still supported but may now cost as much as 40% of non-preëmptible VMs, "to reflect the supply-demand dynamics of Compute Engine’s excess capacity." That's quite a big change from the current pricing structure.


Looking through our history, it looks like us-east-1 preëmptible VMs were a flat 30% the cost of non-preëmptible VMs. So in the worst-case scenario, we'll be paying 10% more for the same product, but also potentially saving up to 19% in periods where GCP has excess capacity.


You'll be paying over 30% more in the worst case, just fyi.


What kind of instance was that? For A100 servers we got something like 5x price difference.

The doc says: > Preemptible VM instances are available at much lower price—a 60-91% discount—compared to the price of standard VMs.

Maybe you're using huge disks, those don't get discounts on preemptible instances from what I remember.


I have trouble thinking of a use case that fits this VM type. Is it for batch processing tasks that don't have tight deadlines? The lack of any real guarantees makes it hard to price, as you have no idea what you're actually paying for. Would it make sense to have a model that includes deadlines in its pricing (like give this process 2.5 hours of CPU and 4GB of RAM to complete this task by...next Tuesday)?


That's correct: spot instances are ideal for batch workloads with relatively little persistent state. If your deadline for completing a unit of work isn't extremely short, and you're already distributing jobs across a pool of worker nodes, it's relatively straightforward to dynamically start spot instances if they're available, and on-demand instances otherwise.

Technically, you're gambling a bit because whenever a spot instance is terminated, you probably need to spend a bit extra to redo whatever work it was in the middle of. But spot instances have such a huge discount that it's almost always worth taking that risk, as long as you don't mind the extra management complexity.
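
As a rough sketch of that spot-first/on-demand-fallback pattern on AWS (the AMI ID and instance type below are placeholders, not values from this thread):

    # Sketch: try to launch a spot instance, fall back to on-demand if the
    # request fails (e.g. insufficient spot capacity).
    import boto3
    from botocore.exceptions import ClientError

    ec2 = boto3.client("ec2")

    def launch_worker(ami="ami-0123456789abcdef0", itype="c5.xlarge"):
        try:
            resp = ec2.run_instances(
                ImageId=ami, InstanceType=itype, MinCount=1, MaxCount=1,
                InstanceMarketOptions={"MarketType": "spot"},
            )
        except ClientError:
            # No spot capacity (or request rejected): pay on-demand instead.
            resp = ec2.run_instances(
                ImageId=ami, InstanceType=itype, MinCount=1, MaxCount=1,
            )
        return resp["Instances"][0]["InstanceId"]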


> Is it for batch processing tasks that don't have tight deadlines?

I used to work for a company (since shut down) which provided big data processing systems in the cloud. This was one of the typical use cases for our customers. Big data systems like Hadoop, Spark, etc. are built to handle this kind of disruption, where you lose a few nodes once in a while, and we had built further optimizations to handle it even better. This fact, combined with the much cheaper price of Spot Instances (up to 90% less than on-demand), makes them a compelling alternative to on-demand instances.

In practice - at least on AWS - spot loss used to be quite rare. When it happened, it happened by the truckload, but we used to have spot instances run for several days without termination.


I've had AWS spot instances run for months without termination.


You might be misunderstanding how spot works. The VM either runs at full speed or not at all, so you know what you're getting.


Nevertheless, doesn't that force you to divide your work up into units somehow and manage external storage to track what/how much is done?


Yes, but this is common for "cloud native" applications. Web servers are ideally stateless "cattle", not "pets". Code is pulled down via CI/CD pipelines or via container images. State is in databases or blobs. Logs stream to some sort of analytics system.

Similarly, Kubernetes works well with spot VMs. You can have Spot and Regular node pools in one cluster, and then your workloads can happily spread out to the extra capacity at a huge discount.


It's really easy to "beat" the chaos monkey with this kind of design. You can destroy any VMs in our infra with reckless abandon and everything will be fine. If you destroy all the VMs of a particular service it will obviously affect availability, but it will still recover by itself in about 60 seconds.


You have to do that anyway for serious batch processing.


The reality at scale is that all cloud VMs are preemptible and unreliable. You can pay more to increase the probability of uptime but inevitably have to recover from failure sometimes.


Reality itself is preemptible and unreliable; you could be terminated without notice.


This is absolutely the correct way of thinking about this problem, and exactly how most people I've seen successfully use preemptible instances treat them.


There are tools that enable this (e.g. look at temporal.io).

As for external storage: yes, you do generate more metadata, but that's often a tiny fraction of the cost saved by moving to spot machines.

In general, these tools are adopted more to increase the reliability of existing systems, but I predict they would be a neat fit for running on spot machines.


+1 for Temporal. Using it in prod, it's great.


(I work at Temporal.) Nice! Glad you're happy with it. Would love to learn more if you have any compliments/complaints, let's hear it :)


That is probably the primary use case for spot instances and certainly how I tend to use them. I have a few thousand files that need processing, and each file takes 30 minutes. So I spin up 100 spot instances and use a small script to copy the file from a control machine to the next available remote machine, start the processing script, and then copy the result back when done.

I've also got another script that keeps track of how many spot instances are currently running and spins up some new ones (possibly in a different region) if they fall below a certain level.
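
A toy version of that dispatch loop might look something like this (hostnames, paths, and the processing script are made up; retry handling for preempted workers is omitted):

    # Toy dispatcher: hand one input file at a time to the next free worker,
    # run the processing script remotely, and copy the result back.
    import queue
    import subprocess
    from concurrent.futures import ThreadPoolExecutor
    from pathlib import Path

    hosts = queue.Queue()
    for i in range(100):                 # one entry per spot instance
        hosts.put(f"worker-{i}")         # placeholder hostnames

    def process(path):
        host = hosts.get()               # block until a machine is free
        try:
            subprocess.run(["scp", str(path), f"{host}:/tmp/in.dat"], check=True)
            subprocess.run(["ssh", host, "./process.sh /tmp/in.dat /tmp/out.dat"], check=True)
            subprocess.run(["scp", f"{host}:/tmp/out.dat", f"outputs/{path.name}"], check=True)
        finally:
            hosts.put(host)              # machine is available again

    with ThreadPoolExecutor(max_workers=100) as pool:
        for f in sorted(Path("inputs").glob("*.dat")):
            pool.submit(process, f)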


You gotta do that anyway. It really sucks when a multi-day job crashes halfway through because of some silly bug and then you gotta restart from the beginning. The ability to resume from crashes is required regardless of where you're running the code.


All our pernos.co infrastructure uses AWS spot VMs. Batch processing runs on spot and fails over to regular VMs as necessary. CI jobs run on spot. Our front-end server running the Web site and our application also runs on spot and automatically fails over to a regular VM if it gets terminated. Termination is rare.


How do you coordinate the failover? Custom scripts or?


We use custom logic. There are various ways to do it. One way is to poll for the termination signal on the running instance (as part of your app, say) and, if you see termination coming, trigger a Lambda that supervises the bring-up of an on-demand instance. For critical stuff you probably want a backup system that checks instance health and brings up a new instance if the old one is not available ... but you probably want that even if you always run on-demand.
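
For reference, on AWS the pending interruption shows up in instance metadata; a minimal polling loop (assuming IMDSv1 is enabled, and with a placeholder failover hook) might look like:

    # Sketch: poll EC2 instance metadata for a scheduled spot interruption.
    # The endpoint returns 404 until an interruption is scheduled.
    import time
    import urllib.request
    from urllib.error import HTTPError

    URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

    def trigger_failover():
        # placeholder: e.g. invoke the Lambda that brings up an on-demand instance
        print("termination notice received, starting failover...")

    while True:
        try:
            with urllib.request.urlopen(URL, timeout=2) as resp:
                if resp.status == 200:
                    trigger_failover()
                    break
        except HTTPError:        # 404: no interruption scheduled yet
            pass
        time.sleep(5)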


If your code can survive chaos monkey, it can trivially weather this. Netflix, as one example, could run their entire infrastructure on these VMs.


Well, except that at peak hours their infrastructure would disappear :)

You always need some traditional instances to serve things that need to be available at all times, like your website. Now, you could run things on spot instances and switch to traditional instances during peak hours, but that requires confidence that enough capacity is going to be available when you need it (which is going to be at the same time as everyone else, so it's always a bet :))


Apparently some just gamble that, by having some instances of every type, they won't all go down:

"However, your cloud provider can pull these virtual machines out from under you at any time without notice (because someone else is willing to pay more). So how do you lower the probability of losing all of your instances to nearly zero? Cloud providers generally need to have a large buffer of capacity and they have many different virtual machine configurations with different CPU and memory sizes. So, don’t just spin up virtual machines of a single configuration type, spin up lots of them from many different configuration types!"

https://blog.comma.ai/scaling-for-10x-user-growth/

I'm not sure if they still have a fallback of traditional instances.


Yes, absolutely. On AWS you can use spot instance fleets to implement this strategy: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/how-spot...


Except that (presumably) there is no guarantee that any spot instances are available. You could wind up in a state where you have no instances running. No user-facing production service can tolerate that.


I believe on AWS you can create spot auto-scaling groups which include a minimum baseline of traditional instances plus a floating level of spot instances. This seems like a nice compromise.
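
That maps to a mixed-instances Auto Scaling group. A hedged sketch with boto3 (the launch template name, subnets, and sizes are illustrative):

    # Sketch: ASG with a fixed on-demand baseline and spot capacity above it.
    import boto3

    asg = boto3.client("autoscaling")

    asg.create_auto_scaling_group(
        AutoScalingGroupName="web-mixed",
        MinSize=2, MaxSize=20, DesiredCapacity=6,
        VPCZoneIdentifier="subnet-aaa,subnet-bbb",
        MixedInstancesPolicy={
            "LaunchTemplate": {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateName": "web-template",
                    "Version": "$Latest",
                },
            },
            "InstancesDistribution": {
                "OnDemandBaseCapacity": 2,                 # always-on baseline
                "OnDemandPercentageAboveBaseCapacity": 0,  # everything above baseline is spot
                "SpotAllocationStrategy": "capacity-optimized",
            },
        },
    )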


Just spin up regular VMs if you can't get spot instances when you need them.


A lack of spot VMs means that all spare capacity has been taken up by regular VMs, meaning that if spot instances are gone, so are regular ones.


Having worked on infrastructure deployed this way, I can say this oversimplifies a bit. There's virtually never no spot capacity at all; rather, there's no capacity at a given bid price, for a given instance type, in a given region. You can start to scale up on-demand infrastructure using spot price as a signal, before there's literally none left, and you can also run on clusters of heterogeneous instances and change your instance type mix when one instance type's price goes up. In the worst case you can fail out of a particular region whose spot prices have gotten too high. Lots of options.
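
If it helps, the "spot price as a signal" part can be as simple as checking the price history API across your instance mix; a rough sketch (the threshold and instance types are made up):

    # Sketch: look up the latest spot price for a few instance types and
    # decide whether to start adding on-demand capacity.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    THRESHOLD = 0.12  # $/hour, illustrative only

    def spot_too_expensive(instance_type):
        hist = ec2.describe_spot_price_history(
            InstanceTypes=[instance_type],
            ProductDescriptions=["Linux/UNIX"],
        )
        latest = hist["SpotPriceHistory"][0]  # newest entries come first
        return float(latest["SpotPrice"]) > THRESHOLD

    if all(spot_too_expensive(t) for t in ["m5.large", "m5a.large", "m4.large"]):
        print("spot is pricey across the whole mix: start adding on-demand nodes")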


No shade, but this is a very AWS specific view of cloud. Every cloud has a different resource allocation philosophy, not all the assumptions are the same.


Yeah, GCP spot instances don't vary in price like AWS ("The price adjustments will occur at most once a month").


There is no longer bidding for spot prices on AWS. They moved to a model where you pay the fixed spot price, which isn't based on capacity.

https://aws.amazon.com/blogs/compute/new-amazon-ec2-spot-pri...


Interesting! I think it still updates more than once a month, but that is a significant change that I wasn't aware of, thank you.


Maybe that's true but we haven't seen this happen in practice. Maybe we're just incredibly lucky but we're able to get on-demand instances to recover after spot termination.


No guarantee but still very likely there are instances available.


Spot is excess capacity. If there's zero excess capacity, that means there's (probably) also no on-demand capacity as well.


If you are not being given any excess capacity, that doesn't mean there is no excess capacity. You are just not one of the lucky few, whose numbers are shrinking.


The 30-second shutdown warning makes this a bit tricky for interactive work, but if your interactive requests all finish within 20 seconds or so (or can be handed off), leaving 5 seconds to get out of load balancing, this could yield cost savings for whatever fits: some base number of nodes plus a number of spot nodes as demand requires, and if no spot nodes are available, serve from somewhat overloaded base nodes plus maybe some regular-priced nodes (but fewer extra nodes than you'd run at spot prices).

Or, more likely, for batch processing that's mostly time-insensitive: things that can be resumed without losing much work, or lots of small short jobs. Re-transcoding media with new settings comes to mind. If you've got deadlines, you'd probably need a mix of on-demand and spot nodes.

This is the corner of cloud pricing where if you can optimize use of this, you might be able to do better than traditional hosting. But only if your needs are variable, or you get a lot of high discount spot pricing.


With Spark you can indeed run batch jobs on multiple VMs. We are currently working on dynamic allocation and preemption on Kubernetes. This is essential for our on-premise cluster, as we want fair sharing. But for our GCP cloud jobs we could indeed save money by having a pool of spot instances. If deadlines are important, you can replace them with normal instances.


They are also great if you are paying out of your own pocket and not your employer's. For side projects, academic work, etc., that need GPU instances or large memory instances, the regular pricing is quite out of whack. Spot pricing makes it tolerable.


1. You only pay for compute time you actually get. 2. Any stateless process, such as a web server, is ideal for spot. 3. k8s nodes are also very good to run on spot, since pods are "natively" crash-resistant.


My team runs fairly large GKE clusters on preemptible machines, with a dedicated footprint that can scale if needed for business-critical workloads. We get a lot of cost savings this way: https://cloud.google.com/kubernetes-engine/docs/concepts/spo...


So, probably a very naive question here, but how does the "continuity" across preemptions work?

Say I am running a long batch process on one of these spot VMs and I get pre-empted ...

Does my job automatically restart where it stopped, with everything transparent save for how long the job takes to complete, or do I have to do checkpointing myself and deal with the fact that my jobs may be killed at any time?

Also, if the restarts are indeed transparent to my job, what of stateful network connections?

Finally, what kind of guarantee do I have that my job will ever complete if pre-emption can happen at an arbitrary frequency for arbitrarily long?


Restarting the job is on you, the instance will be destroyed.

So you'll need to adapt your process to be resumable (or partially resumable through checkpoints) and/or idempotent, so nothing goes wrong if you run the job (or parts of the job) twice.
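
As a simplified illustration of "resumable through checkpoints", assuming the job walks through numbered work items and can persist a small state file (in practice you'd keep the checkpoint on durable storage like GCS/S3, not the local disk):

    # Sketch: resumable batch loop. If the VM is preempted, rerunning the
    # script picks up from the last completed item.
    import json
    from pathlib import Path

    CHECKPOINT = Path("checkpoint.json")

    def process_item(i):
        ...  # the actual (ideally idempotent) unit of work

    def run(total_items):
        start = 0
        if CHECKPOINT.exists():
            start = json.loads(CHECKPOINT.read_text())["next_item"]
        for i in range(start, total_items):
            process_item(i)
            CHECKPOINT.write_text(json.dumps({"next_item": i + 1}))

    if __name__ == "__main__":
        run(10_000)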


OK, thanks, that answers my question: there is no continuity whatsoever; you have to take care of everything yourself.


Being preempted on a spot instance is the equivalent of stopping an instance while it's running. You get a signal (30 seconds prior for Google, 2 minutes for AWS) and you are free to try to hand off the work to a different machine in your cluster.

Basically this is good for running short jobs, not long-running ones. If you have a service that processes chunks of data from a queue, it's probably the ideal scenario.

In practice, with AWS, your instances don't get killed that often. Some in the comments are claiming months, but I was using mine with AWS Batch and they usually lived 1-2 weeks.
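
On GCP the 30-second notice can be read from the metadata server; a small watcher (the hand-off function is a placeholder) might be:

    # Sketch: watch the GCE metadata server for preemption. The "preempted"
    # value flips to TRUE when the notice period begins.
    import time
    import urllib.request

    URL = ("http://metadata.google.internal/computeMetadata/v1/"
           "instance/preempted")

    def hand_off():
        # placeholder: stop taking new work, drain queues, flush state
        print("preemption notice: draining...")

    while True:
        req = urllib.request.Request(URL, headers={"Metadata-Flavor": "Google"})
        with urllib.request.urlopen(req, timeout=2) as resp:
            if resp.read().decode().strip() == "TRUE":
                hand_off()
                break
        time.sleep(5)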


Is this a sign that they have surplus capacity in the infrastructure meant for non-preemptible VMs? And is that worrying, if you wonder whether they have enough customers to sustain the model?

(Google could run it at a loss forever; the point is, some VP won't want to, and the KPI will morph into "kill it".)


No, all clouds must overprovision their datacenters to handle sudden peak demand. Everybody doubles or triples their clusters around the holiday season or shopping festivals, for example. For the other 350 days of the year, this capacity often lies unused, so it's auctioned off to the highest bidder (but usually at spot rates much less than the full rent). They're preemptible because if you need the servers for something serious/stable/continuous and make an on-demand request, the spot instances are pulled off their workloads and given to you at the full rate.


The overprovisioning makes this kind of product attractive. A small downside is a surge in complaints from customers who experience virtually no preemption and come to believe they're simply getting cheap machines... until the machine stops.

Somebody once told me that a policy of randomly switching instances off helps.


That’s probably why Google added the 24h cap, so you don’t get used to the machines being around all the time. A bit like a forced chaos monkey.


Don't worry. Preemptible VMs were used by enough people, and this pricing model is 100% profit for the clouds.

This is just aligning the offering with Azure and AWS


These cannot be created from the Google Cloud Console web interface; you need to do "stuff" programmatically. EC2 has it right in their web console.


Nice, one step closer to AWS. Next up I'd like to see an increase to the 30 second termination notice. 120 seconds would be good enough :)


Just clone AWS down to the API endpoints?


They already replicated the S3 API as an alternative to the Google Cloud Storage API: https://cloud.google.com/storage/docs/migrating#migration-si...


I get the impression that's pretty common, Cloudflare's new product is S3 compatible too.


Yeah, S3 is very common. Google Cloud Storage interestingly did not use the S3 API, but then added a limited compatibility layer specifically for "Migrating from Amazon S3".


Any word on pricing?


Changes over time and ranges from 60-91% discount, according to the docs: https://cloud.google.com/compute/docs/instances/spot#pricing


If it changes at most once in 30 days, that seems different from AWS.


Among other things, the price cuts are a nice change for ultra-cheap hobby projects that have extremely low resource demands. So far the lowest Spot VM price I've found is e2-micro in us-west4, which costs a whopping $0.69/month for 2GB of RAM and 0.25 of a shared core. That's only 25% the cost of Amazon's comparable t3a.small.


Keep in mind that, as far as I can tell, there's a $1.44 fee for an external IP address (that's if it's in use by a preemptible VM; if it's not in use, it's even higher).


It isn't written anywhere in the documentation, but the first 744 IP address-hours per month are charged at $0.00.

Link to pricing page: https://cloud.google.com/skus/?currency=USD&filter=C054-7F72...

That means if you just need one VM, you won't pay for IP charges.

Obviously that might change anytime without notice.


Nice, thanks!


Is there a way to avoid paying that fee, if you don't need a public IP address? E.g. can you get NAT for free?


Also, what is the granularity? Right now on AWS you have to pay for the full hour even if you use it for 5 minutes.

Per-minute granularity could make Google's offering more enticing for a lot of users.


On AWS you only pay for full hours if you're using RHEL or SUSE. Any other Linux distro, or Windows, pays by the second:

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-int...

On GCP it seems like you pay per second, but if you use a premium OS and GCP stops your instance, you would pay the premium OS pricing anyway:

https://cloud.google.com/compute/docs/instances/spot#spot-wi...


Thank you, this is great news for me. I always assumed it was by the hour; or maybe it was, and they changed it and I missed the announcement. Regardless, great news!


>Also what is the granularity?

Don't have an authoritative answer, sorry, but granularity is an area where Google has historically been better than the competition.


It appears that GCP is rather late to the party here.

I believe AWS has had spot instances for a very long time (more than 5 years at least).

Is my understanding correct? And if so, any insights on why it took so long?


GCP has had "preemptible" VMs for years; the only news is that they renamed them to "spot" to match AWS. I wonder if they were losing checkboxes over the naming.


Not a rename. See my comment.


These changes are so minor that they don't count as a new product IMO.


GCP has had Preemptible VMs, which WILL terminate after 24 hours if they last that long. Spot VMs remove this 24h limit.


They also changed the pricing model. Preemptibles had a completely predictable discount, and Spot VMs have a variable discount.


“requiredDuringSchedulingIgnoredDuringExecution” .. wow, what an epic length for a configuration option!


Let me introduce you to CNLabelContactRelationYoungerCousinMothersSiblingsDaughterOrFathersSistersDaughter

https://developer.apple.com/documentation/contacts/cnlabelco...


I doubt the person who released this did it with a straight face.


I guess you haven't worked with Hadoop. Here's an example of what configuration option names in it can look like:

    dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction


Is length really a problem? I definitely prefer this over, eg.

     dfs.datanode.asvcp.bspf


Naming things is hard. There are a lot of users of Transmission Control Protocol/Internet Protocol and users of American Standard Code for Information Interchange who disagree with you. It's more of a social endeavor than anything else.


Nice.

I think this gives a hint, though: when something is very frequently used, people need a short name so that it doesn't get in the way of the discussion; and due to that frequency, the fact that it's an acronym doesn't matter. I bet more people know what ASCII is than know what the acronym stands for.

For configuration, though? asvcp.bspf is never going to be frequently used by anybody. That's generally the point of configuration: in the vast majority of cases, you touch it at most once. So I argue you want long names.


And that's a good thing. k8s configurations are rarely written. It's better to make them as clear as possible, even if it makes them verbose and long-to-type.

The same thing can be said about build scripts. Those that strive for terseness (hey, sbt) become an unreadable mess a year after writing, when you need to read them again.


>“requiredDuringSchedulingIgnoredDuringExecution” .. wow, what an epic length for a configuration option

Looks like Java culture is alive, well and spreading.



>Seems Kubernetes-y:

Two guesses as to where the folks who implemented Kubernetes hail from (as in: which programming language)


> Two guesses as to where the folks who implemented Kubernetes hail from (as in: which programming language)

Go, obviously, though from their rationale article they “love” C/C++, Java, and Python, too.


Great, now fix your terrible UI.


I find GCP's UI easier to use than AWS' by orders of magnitude.


I've found GCP's UI to be a racy mess. If you ask it to do something trivial like removing an instance from an Instance Group, it can show that it has done it when, in fact, it hasn't. Want to switch projects? Better remember to hit F5.


How do you switch projects on AWS? In GCP it's a dropdown; in AWS it's a completely different account (email, IAM, or SAML-backed) + captcha + MFA every time if not SAML.


AWS SSO [1] negates a lot of that pain, but still won't let you have two accounts open in the same browser window. You can use email/password logins or a SAML source like Active Directory.

[1]: https://aws.amazon.com/single-sign-on/


Yes, I've seen this myself. It's overall still miles better than AWS UI.


You must be a Chrome user.

It's basically unusable in Firefox or Safari.




