This is a great announcement. Preemptible VMs are a great strategy if:
* you don't care too much about when a workload finishes
* it can be chunked into small units of interruptible work
* the cost of resuming interrupted work is zero or small
Examples of workloads that might look like this:
* geocoding large batches of addresses
* map-reducing large event streams
* content scraping/crawling
Notice that you don't need to run all of your cluster as preemptible if you're using bdutil [0], too -- some of the workload can be preemptible, and some not. So you can guarantee a minimum processing throughput, but get extra throughput for very cheap.
I think that's a great way to do it, though I wish it were generalized for all kinds of clusters, not just Hadoop ones.
Yeah, I added it to bdutil since it was clearly useful, and it was well understood that it would do the right thing. I'd love to see more cluster and machine management tools deal with "machines can come and go".
The Go team is using it for their buildbots (http://build.golang.org), which were already prepared for the box to die.
Any developer who was burned by App Engine pricing in the past is likely to take the lure of a "70% off" discount with a grain of salt because of the much higher cost of incorrectly relying on Google's pricing pledges.
In case anyone forgets: the pricing changes on Google App Engine caused many developers to abandon apps they had developed on the platform because of price increases of several hundred percent.
I can immediately think, off the top of my head, of no fewer than four examples of useful services Google has discontinued or EOL'ed: Google Reader, Google Code, Google Wave, and XMPP support for Google Voice.
So as well as potential future price hikes, you should be prepared for the possibility that this service might not be around for the long haul. Therefore you should skip using any Google-specific functionality, and instead implement a design that allows you to easily migrate to another VM vendor.
I never got the "google shuts everything down" mentality that has become so prevalent.
I could easily beat those 4 with a multitude of examples from both Apple and Microsoft; that doesn't mean that any of them are untrustworthy, just that they evolve and continue to grow.
At least when google shuts down a service, they give a good amount of "heads up" to those using it, provide examples of trustworthy equivalent services from competitors, and ALWAYS provide an export function if it makes sense to have one.
Apple and Microsoft may discontinue support/sales but the hardware/software they sold you continues to function. That is inherently not true of "cloud" services.
> I could easily beat those 4 with a multitude of examples from both Apple and Microsoft that doesn't mean that any of them are untrustworthy
No, that's pretty much what it means. To varying degrees when you build on top of Apple, Microsoft or Google offerings, you're sharecropping rather than farming. Sometimes, they'll just find that it's in their interest to no longer lease out the farm to you -- or to change the rates/terms.
"Untrustworthy" might be an over strong term, if for no other reason than that invoking "trust" as a relevant concept in this context is probably itself incorrect.
The thing that jumps out at me from that example data set is that they were all free services. That's not to say there aren't fee-based services that Google has discontinued, or that your point isn't valid, just that your provided data points don't seem entirely relevant to me, given the subject.
Amazon has discontinued Flexible Payments Service, Amazon Webstore, and others. Presumably your warning not to rely on proprietary implementations is vendor-agnostic?
Or at the very least: never rely on one vendor's proprietary implementation(s). Always have at least two in active use (though you can do something like using one vendor 95% of the time), and if/when one goes down make it your top priority to find another.
You're right, though I wouldn't compare App Engine to things like Google Reader.
When Google EOLs a highly-visible free service like Reader, it generates a lot of outrage, but consumers can find a replacement like Feedly at reasonably low cost.
The risk of writing apps on a platform like App Engine is far greater because people tend to spend years building software and a business tightly interwoven with the platform, only to be shut down by unpredictable, massive rate hikes; getting off the platform basically requires a rewrite, not just of the app itself but of all your ops tooling too.
While it is possible to dig yourself into that hole using lots of Amazon services, Amazon doesn't require you to use platform-specific APIs, many of their services like EBS are very easy to replace, and Amazon's services have not been subjected to the same project-ending rate hikes. In practice, people move on and off EC2 all the time. So if you are going to trust a vendor, it's more reasonable to trust Amazon.
The new pricing made App Engine a sustainable business, because the old pricing model didn't actually account for costs and encouraged some very wasteful practices. The only alternative to charging enough to cover costs would have been to shut the service down. I'm sure more people would have been more upset about that.
App Engine has always been a unique PaaS because its APIs were designed to push developers toward better distributed-app architectures. The non-relational DB with entity groups and limited queries, the 30s request limit, originally not offering long-running instances, the task queue, and memcache as a service all made for scalable apps.
But the costs unfortunately weren't designed the same way. Charging by CPU time while keeping datastore access free was especially bad for a market where apps typically use very little CPU but access data a lot.
A lot of blog posts came shortly after the price change saying that after making the recommended changes to datastore calls and enabling multithreading, their bills went down and performance went up significantly. Many apps were doing absolutely no caching, or were logging every request to the datastore, because the datastore was ridiculously cheap for small records. Some were doing per-request what should have been batch work, because they didn't use task queues.
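A cache-aside read path is the standard fix for the "every request hits the datastore" pattern. Here is a generic sketch in Go; plain maps stand in for memcache and the datastore, and none of these names are the App Engine SDK API:

```go
package main

import "fmt"

// Plain maps stand in for memcache and the datastore; storeReads
// counts billable datastore operations. (Generic cache-aside sketch,
// not the App Engine SDK API.)
var (
	cache      = map[string]string{}
	store      = map[string]string{"greeting": "hello"}
	storeReads int
)

// get checks the cache first and only falls back to the datastore on
// a miss, populating the cache so the next request is served for free.
func get(key string) (string, bool) {
	if v, ok := cache[key]; ok {
		return v, true // served from cache: no datastore op
	}
	v, ok := store[key]
	storeReads++ // this is the per-request cost people were paying
	if ok {
		cache[key] = v
	}
	return v, ok
}

func main() {
	get("greeting")
	get("greeting")
	fmt.Println("datastore reads for two requests:", storeReads) // 1, not 2
}
```

Under the old pricing the second, uncached read barely registered on the bill; under instance-hour pricing the latency of every avoidable datastore round trip keeps an instance occupied longer.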
Since 2011, I think all price changes with Google Cloud have been price drops, some pretty big. Last year, App Engine prices were dropped 30%.
It was Google, not the users, which set the old pricing model which people optimized for when they wrote their apps. Many apps which had to be shut down were already doing "right" things like caching, using task queues, and economizing on datastore accesses. It was Google, not the users, which designed App Engine so that porting away would essentially require a rewrite. It was not the users but Google which neglected certain popular language runtimes so that they were no longer cost-effective after the price hikes. And it was Google, not the users, which kept the issue opaque so there was no warning of how big the price hikes were going to be. If certain users found that they could sustain a double or triple bill, that doesn't mean it is the fault of the other users that they were forced to mothball their apps.
The users didn't decide any of these things, and are not to blame for them. What kind of service blames its customers for its own mistakes?
Google earned this distrust fair and square by suddenly forcing huge numbers of customers to rewrite all their existing code. Little price cuts don't matter: other services are already more cost-effective and are cutting their prices all the time, but even if there were price parity the risk attached to the lock-in just isn't worth it.
I don't think they are blaming the customers, and if you read posts from back then, I think they take a similar tone: they didn't get the pricing model right, and in order to make App Engine a real sustainable product with an SLA, they needed to update the pricing.
As for the unique API: of course some of it's unique, because there weren't any standard APIs for the unique parts. Datastore is based on Google's internal NoSQL store, and there were no standard APIs for NoSQL; same with task queues.
The "proprietary API" bit is overblown though, IMO. The Python runtime shipped with a somewhat tweaked Django, and has grown to support WSGI and standard Django. The Java version uses Servlets with some constraints. Memcache is pretty standard. You can pretty easily abstract away the proprietary bits to run on a standard stack, or use an open source implementation of the APIs.
They published the pricing in May. Nobody cared until the calculator came out in September. You're selling it like Google gave everyone a week.
The price increase also came with an SLA. Just to be clear, you're saying that businesses were entirely built atop a product with no SLA, and that's not the bigger problem?
> build entire businesses on an infrastructure that they later had to abandon.
Frankly, if your business is built entirely around a single 3rd party provider, and you are totally incapable of pivoting cloud providers at short notice... well then you are "doing it wrong".
Netflix doesn't rely on a single provider for hardware, data centers, or bandwidth. Sure, they use EC2 for some things, but they also have a great deal of hardware in data centers throughout the country, as well as custom hardware in ISP data centers, etc. I'd be shocked if they didn't use some of the other cloud providers' offerings too.
Netflix has no single point of failure.
Building your business around a single cloud provider creates a single point of failure.
I run a $MM enterprise business almost entirely on GAE/python with a staff of ~40, public-facing site, etc and the monthly bill is under $1,500/month. Sure, I'd prefer lower $ and faster performance but truthfully, I'm no longer complaining: GAE/python saves me $$$ in IT staffing costs including security upgrades on dozens of packages that are either pre-integrated or I don't need at all (SQL & NoSQL databases incl multi-DC failover, memcache, reverse proxy, email hosting, auto-scaling, etc. etc.)
This doesn't surprise me at all. It is trivial to run a site on app engine that can handle hundreds of requests a second continuously without leaving the free tier.
Sure, when the change from billing CPU time to instance hours came in, some apps' bills sky-rocketed. But that was because they were poorly coded, such that instances were blocking and unable to serve incoming requests.
With a thread safe application and the proper configuration there is absolutely no reason why instance-hours pricing shouldn't be competitive.
> This doesn't surprise me at all. It is trivial to run a site on app engine that can handle hundreds of requests a second continuously without leaving the free tier.
Really? The free tier comes with 28 instance hours per day. That'd mean your app would have to serve hundreds of requests per second, meaning each request must take substantially less than 10 ms, on a 600 MHz, 128 MB RAM machine.
If your requests do any work at all, I doubt you can handle them in <10 ms on a 600 MHz CPU.
>meaning each request must take substantially less than 10 ms
Correction, each request would need to have less than 10ms of CPU time - the instances support concurrency.
My web frontend, by design, does very little: any CPU-heavy operations are done by other systems via the task queue. Writing it in Go has helped as well; I wouldn't get that performance from Python.
Write a simple Hello world example in golang and get it to do some mathematical calculations to simulate "work", I think you'll be surprised at how many requests a second you can squeeze out of a single instance.
But still, 10 ms really is not a lot on a 600 MHz machine. How long does your front end take to serve one request? How many qps do you serve from a single instance?
I have some Go code with a trivial, completely unoptimized blog, rendering a couple of articles. Poking at appstats suggests I spend a bit more than 10 ms of CPU time; App Engine reports ~30 ms of CPU time.
Before the pricing change app engine did not support concurrent requests to a single instance for python. As a result, applications used many more instances than they needed.
When using the calculator this would make the price increase seem enormous to many customers. When concurrent request support was added (it was available to trusted testers at this time) all a user would have to do was add "threadsafe:true" to the app.yaml file to enable it (assuming their code wasn't doing anything silly).
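For reference, the change amounted to one line in app.yaml; a rough fragment (the app id and handler script are placeholders, and concurrent requests also required the python27 runtime):

```yaml
application: my-app   # placeholder app id
version: 1
runtime: python27     # concurrent requests require the 2.7 runtime
api_version: 1
threadsafe: true      # let one instance serve requests concurrently

handlers:
- url: /.*
  script: main.app    # placeholder WSGI entry point
```

With `threadsafe: true`, one instance serves overlapping requests instead of queueing them, which is exactly what made instance-hour billing competitive for I/O-bound apps.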
Deploying to App Engine caused us big pain in the long run. I just had to spend two days migrating a legacy app to the new HRD (High Replication Datastore). The migration was painful, and App Engine "billing bugs" during the upgrade made the process really hard.
Also, when the bugs occurred, I realised there was no way to get quick support.
Thankfully we moved our main app away to ec2 around when they increased the pricing a couple of years ago.
or just set the preemptible bool to true via the API (it's under scheduling). When you want to test how your system behaves on preemption, you can simply stop the instance, which will give you the same 30 second timeout as when you're preempted. Most OSes have a fairly standard set of things they do on shutdown that will at least send all your running processes a signal (via kill), but if you need to add your own steps you can inject them via the new shutdown script support (https://cloud.google.com/compute/docs/shutdownscript).
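A hedged sketch of both pieces using gcloud (the instance name, zone, and script filename are made up; see the linked docs for the authoritative invocations):

```shell
# Create a preemptible instance: the --preemptible flag sets
# scheduling.preemptible=true via the API, and the metadata key
# registers a custom shutdown script.
gcloud compute instances create my-worker \
    --zone us-central1-a \
    --preemptible \
    --metadata-from-file shutdown-script=cleanup.sh

# Simulate a preemption: stopping the instance runs the shutdown
# path (including your script) under roughly the same 30 second
# deadline as a real preemption.
gcloud compute instances stop my-worker --zone us-central1-a
```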
Nice! This ties in with the discussion yesterday on HN about data center utilization. Good for saving money and the end result must be lower global energy use if this also catches on with other providers (AWS already offers something similar).
When I worked at Google in 2013 part of my job was running very large calculations that were often pre-empted. I usually ran at the lowest priority and was in effect using spare capacity. No hassles, assuming that getting runs completed was not too time sensitive.
I might be missing something important here, but according to their pricing page[1], a preemptible VM with 30GB of memory costs $86.40 a month.
Why would someone go for this when cheap dedicated-host providers like Hetzner offer powerful dedicated servers with 64GB of memory and multi-core server-grade CPUs? The comparison only gets worse once you take into account that Google's offering is preemptible and can be shut down and brought back up as Google wishes.
It's like asking why do people use Serviced Offices which are usually 10x the price of a monthly rented office. Answers: because they only need the office / VM for a few hours. Because they can immediately get an office / VM of any size they need. Because they can walk away from the office / VM when they don't need it any longer. It's certainly not for everyone, but Regus seem to be doing ok.
And for those who truly only need it now and again, it's great.
But time and time again I see infrastructure where people pay for these services for large numbers of instances that run continuously, blindly assuming that it's cheap because it's cloud. There's a bizarre level of price-blindness among a certain subset of customers of Google Cloud and AWS that I've never seen anywhere else.
It's about scaling. Google's offering is a cloud service and Hetzner's is not. You're always paying the same amount per month with Hetzner, whereas you'd only pay that much with Google if you have consistent full utilization (which seems rather unlikely; most workloads do not look like this). Secondly, if you create something that becomes popular, you can scale it up on GCE a lot more easily than you can with a dedicated hosting provider.
If I want to run a processing job that takes a 1000 core-hours and have the results today, I can do that on their VMs for a couple dollars per run. Being able to do that on owned hardware or dedicated hosts would be orders of magnitude more expensive.
Similarly, if you're running a hosted app that doesn't overload a single server most hours, fills three hosts at peak, but takes fifty hosts for a day during a large advertising event or an accidental viral link before going back to normal, then you don't want to run it on VMs you have to pay for by the month.
That's true, but those are extreme niche use cases, and in most cases people have a substantial base load that they can run on dedicated servers for anything from 1/2 to 1/3 of the cost of AWS/Azure/Google (even with reserved instances and factoring out retainers for someone to handle ops issues).
Nothing stops you from mixing and matching: dedicated servers for the base load, cloud servers for batch jobs and peaks. In fact, most data centre providers can offer the full range from unfurnished colo space to cloud offerings out of the same data centre these days, either directly or via partners hosted in their buildings. At least that's my experience.
Using the cloud as a rented datacenter is never a good idea. It will certainly cost you more.
If you're going to use the cloud you have to do it right, and that means auto-scaling and variable resource usage. Then this option will save you money.
Google's cloud pricing is really terrible. Other competitors are intentionally driving prices down to dominate the market, so Google's prices kind of come across as a joke.
@boulos, since it looks like you're involved with this: The link to the product pricing [1] goes to a generic landing page for Google services, which is a little confusing. I think you should link directly to the instance pricing [2].
Yeah, as part of our larger pricing announcement we're making sure people go to /pricing so they get a bigger picture before diving straight to the table ;)
I wonder how big the chance is that the machine is turned off in the first hour.
I use Google Compute for some personal projects, and the typical run time for a VM is about 10-20 minutes. If the VM has a 90% chance of surviving the first hour, it could be worth the trouble to make my process more fault tolerant.
The probability that Compute Engine will terminate a preemptible instance for a system event is generally low, but may vary from day to day and from zone to zone depending on current conditions.
Give it a shot in us-central1-a and let us know how it goes!
I couldn't figure this out from the docs: can one restart a preempted preemptible instance, or does one need to start a different one? Would it be possible to restart in non-preemptible mode, so the job completes but at a higher price? We still want to complete our workloads, one way or another :)
Note of course that if you got preempted because we needed the capacity back for regular VMs (as opposed to say a maintenance event) you may not be able to start a new Preemptible VM in that zone.
The fundamental difference seems to be that Google's prices are fixed, whereas AWS uses a "market" model, which frequently sees crazy prices (well over the on-demand price) especially in us-east-1.
I say "market" because no-one really knows how the spot market place actually works. We've had machines run for weeks, and other times the prices fluctuate in bizarre ways and we can't get out preferred instance types for hours or (in the worst case) days. There's an interesting analysis here: http://santtu.iki.fi/2014/03/20/ec2-spot-market/
We had the same problem with getting priced out of c3.8xlarge in Virginia. We fixed it by changing our allocation algorithm to find alternate instance types and zones. For example, instead of 1 c3.8xlarge, it might pick 2 c3.4xlarge instances or a cc2.8xlarge. Seems to work so far.
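The fallback idea can be sketched in a few lines of Go. `instanceType`, the preference list, and `plan` are all hypothetical names for illustration, not anyone's real allocator:

```go
package main

import "fmt"

// instanceType is a hypothetical description of a spot-eligible shape.
type instanceType struct {
	name  string
	cores int
}

// plan picks substitute instances when the preferred shape is priced
// out: it walks the preference list and returns how many of the first
// available shape are needed to cover the core demand.
func plan(coresNeeded int, prefs []instanceType, available map[string]bool) (string, int) {
	for _, t := range prefs {
		if available[t.name] {
			n := (coresNeeded + t.cores - 1) / t.cores // round up
			return t.name, n
		}
	}
	return "", 0 // nothing available in this zone; try another zone
}

func main() {
	prefs := []instanceType{
		{"c3.8xlarge", 32},
		{"c3.4xlarge", 16},
		{"cc2.8xlarge", 32},
	}
	// Suppose c3.8xlarge is priced out in this zone:
	avail := map[string]bool{"c3.4xlarge": true, "cc2.8xlarge": true}
	name, n := plan(32, prefs, avail)
	fmt.Printf("use %d x %s\n", n, name) // use 2 x c3.4xlarge
}
```

A real allocator would also weigh the current spot price of each substitute, but the shape-substitution step is the part that got us out of the us-east-1 squeeze.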
I looked through the pricing table and played with the calculator; it seems something equivalent to our needs would cost around a third more on Google, but each CPU would have twice as much RAM. Not worth it for us.
For people coming from AWS: we currently don't have an instance shape that lines up with the c3/c4 ratio (pushing you either to our n1-standard or n1-highcpu shapes). Can you get by with less memory?
Note: we're very aware of this pain point, and maybe you'll see something soon ;).
We need maybe 1GB for each job, but we need some leeway to avoid the OOM killer when a group of work units is larger. Swapping takes so long that it's not cost effective.
But to be honest about the situation, the cost would have to be much lower to make it worthwhile for me to rewrite our scheduler on Google's API.
I don't really see a fundamental difference except that Google's offering is more opaque and less flexible. Instead of "your instance got killed, but you can restart it if you're willing to pay $x/hour", you get "your instance got killed, tough luck."
EC2's spot market fluctuates based on supply and demand, and there's no reason to think the same forces won't apply to GCE.
It's not really that different: If your preemptible GCE instance gets killed, you can try restarting it as a 'normal' (non-preemptible) instance. But in Google's case, both prices will be predictable.
Well, spot prices vary with demand. Using prices as of right now, for example: currently in California, c3.8xlarge instances are quite popular, running $1.68 an hour on an instance that's normally $1.912 an hour, roughly a 13% discount. 70% strikes me as better than 13%, especially when it's predictable (and 70% below an already substantially lower hourly price...).
The long-run price for a c3.8xlarge in California is somewhere between $0.40 and $0.50, though you're correct that right now they're $1.68.
California tends to be more expensive than other AWS regions, though (I can't remember the reason - perhaps just availability?). If you're in us-east-1 the price sticks around $0.20 per hour, and eu-west-1 is rock-solid at $0.32 per hour.
If you have work that can be done on spots it can be a good idea to make it region-agnostic so that you can take advantage of better prices for different instances in different regions.
The idea with Google is that you don't need to worry about which instances you run, at which price, and in which region.
For example, with Google you don't need to get a specialized instance type to use its monster-fast Local SSD; you just use the same instance types. This alone should simplify the use of Preemptible VMs.
You can find a list of our prices here: https://cloud.google.com/compute/#pricing . Preemptible VMs have flat per-machine type pricing compared to EC2's bidding-based pricing. For example, a Preemptible n1-standard-1 is 1.5 cents per hour in the US and a preemptible n1-standard-4 is 6.0 cents per hour.
Any reason you couldn't use these for conventional web stuff? 30 seconds could easily be long enough to bring up another instance and sync data if needed.
No reason at all. If you have enough preemptible VMs, and they spin up fast enough, then when one goes down, just spin up another. If for some reason GCE's overall preemptible spare capacity drops and you can't get a new one and it's important to you, then spin up a regular VM.
By "conventional web stuff", do you mean serving a website or similar? Yeah, I don't think it makes sense to host a website on a service that is preemptible. You need your webserver available to respond to requests 24/7. It can't just randomly go down. Pre-emptible instances are more suited for batch processing and big data computation runs.
If you have a website that can be distributed among arbitrarily many frontend servers, it could make sense to put them on preemptible instances, with the database on a regular instance. (In the unlikely case no preemptible instances are available, you could always automatically switch the frontend servers to regular ones.) However, I think this could only possibly be efficient pricing-wise if your traffic was extremely bursty.
That could be an interesting model if your web app is a single-page / static app served from Google Cloud Storage, with Javascript making AJAX calls to your backend and configured to gracefully handle transient errors as servers come and go.
You'd have to make sure that if a node gets terminated while processing a request, that request can be re-routed to a live VM. Unless you don't care about users randomly seeing errors of course.
So... this is essentially first come, first served based on their description. And they say the preemptibles come from a smaller pool of resources. Seems odd to me they wouldn't have a number somewhere of how many are available and how many are in use given that it's a finite resource.
Tough to build a business model around a resource you can't even determine the availability of.
reliability != availability. I can build a business model around an unreliable, but available resource. I can't build a business model around an unavailable resource. They're two very different things, and the distinction is important.
This is basically for batch processing. That means the lack of availability isn't such an issue, because this lets you trade speed of processing for cost.
Say you have a few hundred terabytes of images to process. You can prioritize images by pushing them to the head of the queue, but you don't really care how long the complete batch takes.
If you are happy to wait for your job to complete you pay less. Otherwise, pay more and guarantee completion.
You're right that it's first come first served, but the 24 hour time limit combined with natural churn means you aren't likely to be "locked out" for long periods of time.
You're not buying a dedicated instance for a month, you're renting it as long as you need it. If you only need a core for an hour, your bill will be $0.01.
If you're running on a Preemptible VM and find that preemption events are "like getting ejected from a hotel at 3AM" then you've either selected the wrong VM type or made a poor choice in designing your app.
Preemptible VMs definitely aren't for everything. We just lowered our pricing on regular VMs (https://news.ycombinator.com/item?id=9564136) as well, so if your workload isn't fault tolerant I'm sure our full price hotel is just fine ;).
Think about these for things like resumable task queue workers: if you're doing something like processing images, slow text analysis, crawling web pages, etc. this is a great way to save money as long as your overall management system has a way to restart tasks which were in-progress when the VM was shut down.
[0] https://github.com/GoogleCloudPlatform/bdutil