As an ex-Googler who worked in a customer-facing role in Cloud: you did very well to get a $72k bill written off! It's definitely possible, but it requires a lot of approvals and pulling in a few favours. I went through the process to write off a ~$50k bill for one of my customers and it required action every day for 3 months of my life.
Whoever helped you inside Google will have gone to a LOT of trouble, opened a bunch of tickets and attended many, many meetings to make this happen.
I know there's no reason for Google or AWS to do this, but man do I wish there was a way to put down a spending limit and simply disable anything that goes over that limit.
It's a little bit nuts that there are no guardrails to prevent you from incurring such huge bills (especially as a solo developer that might just be trying out their services).
In my opinion, and maybe I'm an absolutist about this, the fact that these guardrails don't exist is opportunistic and predatory. Agile, iterative design and testing will inevitably lead to failures; that's the whole point. Marketing a cloud service to developers who need scalable, changing access to computing during that process should take that into consideration.
I do not think the intention here is to be opportunistic and predatory; it's an inability to empathize with small developers. A large customer will very likely just pay off a few hundred thousand dollars of extra expenses. It is only individual developers who are at risk here, and cloud operators do not have much interest in them.
I don't know about that. Large customers can almost always find the capital to run their own infrastructure and save on cost. That's not to say there aren't big customers of these types of services, or that certain business models don't make using them more attractive than maintaining infrastructure yourself, but I would guess that revenue from these sorts of services is largely built on appealing to smaller customers, so their needs would be taken into consideration. Not taking potential cost overruns into consideration seems a bit deliberate to me.
To me it looks very similar to the personal checking overdraft schemes banks were using up until a few years ago.
When I was a young and naive student I thought I couldn't charge my debit card below $0. I got down to -$3 and had to pay a $40-something fee when I was already out of money.
The downside of disabling active resources is huge. It would mean a catastrophic interruption to the customer's application exactly when it's the most popular/active. And there's no practical way to determine whether the customer is “trying it out” or running a key part of their business on any particular resource.
On the other hand, retroactively forgiving the cost of unexpected/unintentional usage doesn't impact the customer's users. And with billing alerts the customer is able to make the choice of whether the cost is worth it as it happens.
Note: Principal at AWS. I have worked to issue many credits/bill amendments, but I don't work in that area, nor do I speak for AWS.
> And there's no practical way to determine whether the customer is “trying it out” or running a key part of their business on any particular resource.
What? Why wouldn't this just be an opt-in thing? It could even be tied to the account being used. It's not like AWS accounts are expensive or hard to set up.
If a user opts in to the "kill if bill goes too high" behaviour and they kill a critical portion of their business, then that's on them. It's similar to a user choosing spot instances and having their instances terminated. You've already got that "I can kill your stuff if you opt into it" capability.
> On the other hand retroactively forgiving the cost of unexpected/unintentional usage doesn't impact the customer's users.
Yeah, and what happens if someone isn't big enough to justify AWS's forgiveness? What if they get a rep that blows off their request or is having a bad day? You are at the mercy of your cloud provider to forgive a debt, which is a real shitty place to be for anyone.
> And with billing alerts the customer is able to make the choice of whether the cost is worth it as it happens.
And what do they do if they miss the alert? You can rack up a huge bill in very little time with the right AWS services.
The point of the kill-switch cap is to guard against risk. The fact is that while $72k isn't too big for some companies, it means bankruptcy for others. It's why you might want to give your devs a training account to play with AWS services to gain expertise, but you don't want them to blow $1 million screwing around with Amazon satellite services.
> What? Why wouldn't this just be an opt in thing?
"Oh cool, I'll set a $1k cap, never gonna spend that on this little side proj." Fast forward a year, the side proj has turned in to a critical piece of the business but the person who set it up has left and no one remembered to turn of the spending cap. Busy christmas shopping period comes along, AWS shuts down the whole account because they go over the spending cap, 6hr outage during peak hours, $20k sales down the pan.
Of course it is technically the customers fault but it's a shit experience. Accidentally spending $72k is also technically the customers fault and also a shit experience. I don't think there is an easy solution to this problem.
"Oh cool, I'll use spot instances, never gonna need reliability for this little side proj."
"Oh cool, I'll only scale to 1 server, never gonna see high load for this little side proj."
"Oh cool, I'll deploy only to US West 1, outages are never going to matter for this little side proj."
There are a million ways to run out of money as a company. Why should this be any different? Why is this the one particular instance where it is simply intolerable to accept that users can screw things up?
There are lots of things that are "shit experiences" that are the consumer's fault.
There is an easy solution. Give consumers the option and let them deal with the consequences. There are enough valid reasons to want hard caps on spending that it's crazy to not make it available because "Someone MIGHT accidentally set the limit too low which will cause them an outage in production that MIGHT mean they lose money"
There totally is a solution. It's just user-hostile enough that it might actually get adopted.
$cloud_vendor just has to (and probably will) constantly nudge people to loosen the limit.
Have a red banner that says, "You've already spent 3% of your monthly budget, think about increasing it."
Also routinely send out emails to remind people: "Black Friday is coming up, think about increasing your quota", even when your service has nothing to do with e-commerce.
> The downside of disabling active resources is huge. It would mean a catastrophic interruption to the customer's application exactly when it's the most popular/active. And there's no practical way to determine whether the customer is “trying it out” or running a key part of their business on any particular resource.
This is simply wrong.
Depending on your use case, disabling active resources is the right, reasonable solution with fewer downsides.
E.g. most (smaller) companies would prefer their miscellaneous (i.e. non-core-product) website/app/service to be temporarily unavailable rather than face a massive unexpected cost they might not be able to afford, one that might literally force them to fire people because they can't pay them...
I mean, think about it: what is it worth that my app doesn't go temporarily unavailable during its free trial phase if it means I'm going bankrupt overnight and can't benefit from it at all?
Sure, huge companies can always throw more money at it and will likely prefer uninterrupted service. But for every huge company there are hundreds of smaller companies with different priorities.
In the end it should be the user's choice, a configuration setting you can set (preferably per project).
And sure, limits should probably be resource limits (like accumulated compute time) rather than billing limits, as prices might be in flux or dependent on your total resource usage, so computing the cost in real time is non-trivial or even impossible.
I often have the feeling that huge companies like Amazon or Google get so detached from how things work for literally everyone else (who is not a huge company) that they don't realize solutions appropriate for huge companies can be not just sub-optimal but cripplingly bad for medium and small companies.
The upside for the noob trying out/learning is huge.
I'm no longer that person, but I think GCP/AWS are just being lazy about this - perhaps because they earn a lot of money from engineer mistakes. Of course it's possible to create an actual limit. There'll be some engineering cost, like 0.5%-1% extra per service?
Edit: Being European I think legislation might be the fix, since both Amazon and Google have demonstrated an unwillingness to fix this issue, for a very long time.
"The downside of disabling active resources is huge. It would mean a catastrophic interruption to the customers application exactly when its the most popular/active."
Lol what ... this is exactly what happens any time you hit a rate limit on any AWS service. The customers application is "catastrophically interrupted" during its most popular/active period.
The only difference is in that case, it suits AWS to do that whereas in the case of respecting a billing limit, it doesn't.
If you hit a rate limit, the marginal portion of requests exceeding that limit is dropped: if you plot the requests, the graph gets clipped. Bad, but not catastrophic.
If you hit a billing limit, everything beyond that point is dropped, and the graph of requests plunges to zero. You're effectively hard down in prod.
And for some companies/individuals, if you keep charging then THEY will plunge to a large negative debt. It's not even zero, it's a lot worse than that.
I was building a side project and had already incurred around $100 in fees. I imagine if I'd introduced some looping/recursion bug I could easily have incurred a cost of $10,000 or, frankly, an unbounded cost. How easy would it have been for me to get that pardoned? And at the very moment I discovered I had just lost $100,000, would I know in advance that they were definitely going to forgive it? I'd be in full panic mode. It was very scary for me to use the cloud in this case.
Why not alert thresholds, configurable by the user?
Email me when we cross $X amount in one day, Text when we cross $Y, and Call when we cross $Z. Additionally, allow the user to configure a hard cut-off limit if they desire.
Just provide the mechanisms and allow users to make the call. Google et al would have a much stronger leg to stand on when enforcing delinquent account collections if they provided these mechanisms and the user chose to ignore them.
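Conceptually that's all that's being asked for; something like this sketch, where the thresholds and the notify/hard_stop hooks are entirely hypothetical (no provider exposes exactly this API):

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class SpendPolicy:
    email_at: float                    # e.g. email me past $X/day
    text_at: float                     # text me past $Y/day
    call_at: float                     # call me past $Z/day
    hard_cap: Optional[float] = None   # opt-in cut-off; None keeps today's behaviour

def enforce(policy: SpendPolicy, daily_spend: float,
            notify: Callable[[str, float], None],
            hard_stop: Callable[[], None]) -> None:
    """Escalate notifications as spend crosses thresholds; optionally cut off."""
    if daily_spend >= policy.email_at:
        notify("email", daily_spend)
    if daily_spend >= policy.text_at:
        notify("sms", daily_spend)
    if daily_spend >= policy.call_at:
        notify("call", daily_spend)
    if policy.hard_cap is not None and daily_spend >= policy.hard_cap:
        hard_stop()   # only users who explicitly set a cap ever hit this
```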
Additionally, Google et al should protect _themselves_ by tracking usage patterns and reaching out to customers that grossly surpass their average billable amount - just like OP with their near-$100k bill in one day. There's zero vetting to give even a reasonable guarantee that the individual or company is capable of paying such a large bill.
And then what? Sue a company that doesn't have $100k for $100k? This makes zero sense.
Google has alert thresholds (you set it up under your Budget). But practically speaking, an alert is not enough - what if you are unavailable to get the alert, it comes in the middle of the night, etc?
A better solution would have been 'limits' which they used to have (at least for Google App Engine) but which has been deprecated.
We had to spend some time researching to see if there was a workaround because, just like the author of the article, we were quite worried about suddenly consuming a huge amount of resources, getting a spike in our bill, and having our accounts cut off/suspended because we hadn't paid it. We've documented our solution here
Doesn't look like there's any cutoff mechanism there, and it's a separate, optional step instead of part of the setup flow with a mandatory opt-out warning.
Nor does that address the other complaint - Google (and possibly others) seem to be willing to extend an unlimited credit line to all customers without any prior vetting for ability to pay. That's crazy.
> The downside of disabling active resources is huge. It would mean a catastrophic interruption to the customer's application exactly when it's the most popular/active. And there's no practical way to determine whether the customer is “trying it out” or running a key part of their business on any particular resource.
Well, this is true, but this is also true of a lot of limits, like limits.conf. Sometimes you really want to spawn loads of processes or open many files, but a lot of the time you don't, so a barrier to limit the damage makes sense.
There is no one solution that will fit everyone: people should be able to choose: "scale to the max", "spend at most $100", etc. If my average bill is $100, then a limit of $500 would probably make sense, just as a proverbial seat belt. This should never be reached and prevents things going out of control (which is also the reason for limits.conf).
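The programmatic version of that seat belt already exists at the OS level; a tiny Python illustration of the same idea using the stdlib resource module (Unix-only, numbers arbitrary):

```python
import resource

# Check the current per-process open-file limits (the limits.conf-style "seat belt").
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open files: soft={soft}, hard={hard}")

# Tighten the soft limit for this process and its children to 256 files.
# Most workloads never notice; a fd leak hits the wall early instead of
# silently taking the whole machine down.
resource.setrlimit(resource.RLIMIT_NOFILE, (256, hard))
```

Nobody argues this limit shouldn't exist just because someone might set it too low in production.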
> It would mean a catastrophic interruption to the customer's application exactly when it's the most popular/active. And there's no practical way to determine whether the customer is “trying it out” or running a key part of their business on any particular resource.
This could be ameliorated by using namespacing techniques to separate prod from dev resources. For example, GCP uses projects to namespace your resources, and you can delete everything in a project in one operation that can't fail, simply by shutting down the project (no "you can't delete x, because y references it" messages).
Aggressive billing alerts and events, that delete services when thresholds are met, could be used only in the development namespace. That way, fun little projects can be shut down and prod traffic can be free to use a bit more billing when it needs to.
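GCP even documents a pattern along these lines: send budget notifications to Pub/Sub and have a Cloud Function detach the billing account from the (dev-only) project once the threshold is crossed. A rough sketch following that documented approach; DEV_PROJECT_ID is something you'd supply yourself, and this is illustrative rather than production-ready:

```python
import base64
import json
import os

from googleapiclient import discovery

DEV_PROJECT_ID = os.environ["DEV_PROJECT_ID"]  # only ever point this at a dev project

def stop_dev_billing(event, context):
    """Pub/Sub-triggered Cloud Function: detach billing when a budget is exceeded."""
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    if payload["costAmount"] <= payload["budgetAmount"]:
        return  # still under budget, nothing to do

    billing = discovery.build("cloudbilling", "v1", cache_discovery=False)
    # An empty billingAccountName detaches the project from billing, which stops
    # further charges but also takes the project's resources down.
    billing.projects().updateBillingInfo(
        name=f"projects/{DEV_PROJECT_ID}",
        body={"billingAccountName": ""},
    ).execute()
```

The caveat quoted further down the thread still applies: budget notifications can lag actual usage, so this limits the damage rather than guaranteeing a hard ceiling.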
> "And theres no practical way to determine whether the customer is “trying it out” or running a key part of their business on any particular resource."
Well there's a very easy way, adding a checkbox and an input:
[ ] I am just trying things out, don't charge me more than [ ] USD
There are ways it could be done relatively benignly, such as defaulting to paranoid and explicitly opting out.
And for those that are heading into that financial barrier it should be a straightforward problem to look at trending to anticipate the shutdown and send out an alert.
This. App Engine used to offer hard spending limits, and they were removed precisely because so many users set them up to shoot themselves in the foot at the worst possible moment.
^^ this. Hard spending limits seem great until your app/service gets super popular and you have to explain to the CEO why you were down during the exact window you needed to be serving the demand.
> Note: There is a delay of up to a few days between incurring costs and receiving budget notifications. Due to usage latency from the time that a resource is used to the time that the activity is billed, you might incur additional costs for usage that hasn't arrived at the time that all services are stopped. Following the steps in this capping example is not a guarantee that you will not spend more than your budget. Recommendation: If you have a hard funds limit, set your maximum budget below your available funds to account for billing delays.
(Source link in parent post, emphasis mine).
In this case the additional cost due to that delay was $72k. Which, let's be honest, means this feature is kinda useless for anything but the relatively harmless cases.
Only by combining this with resource limits in load balancers, instance and concurrency limits, and the like can the worst-case cost be bounded. But tbh this partially cripples auto-scaling functionality, and it's really hard to find a good setting which doesn't allow too much "over" cost and at the same time doesn't hinder the intended auto-scaling use case.
> it's really hard to find a good setting which doesn't allow too much "over" cost and at the same time doesn't hinder the intended auto-scaling use case
> I created a new GCP project ANC-AI Dev, set up $7 Cloud Billing budget, kept Firebase Project on the Free (Spark) plan.
There's a lot of middle ground between $7 and $72k. Your quote explains it perfectly though. They flat out can't because the accounting and reporting is badly designed and incapable of providing (near) real-time data.
IMHO the easiest solution to this is government regulation. If you set a budget for a pay-what-you-use online service there should be legislation forbidding companies from charging you more than that.
I also find it (sort of) hilarious they can magically lock the whole thing down once payment fails, but not before the CC is (presumably) maxed out. Lol. Talk about a good deal for Google.
There's something uncanny about understanding the situation enough to turn on the budget alerts, while at the same time not realizing it's not going to help in time if your system runs amok.
I'm not sure if you meant it this way, but your tone makes it seem like the parent just needs to "read the docs".
Unfortunately for all of us, your solution doesn't work, per the huge disclaimer on the page that says those alerts can be days late. You can rack up an almost unlimited $ bill in hours.
That's not the best thing you can do. The best thing you can do is put excessive time into quotas.
For starters, AWS has way better quotas than GCP, sadly.
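For anyone who hasn't used them: on AWS these live behind the Service Quotas API, so you can at least script an audit of what your account is allowed to spin up. A minimal boto3 sketch (the service and quota codes here are just examples):

```python
import boto3

# Service Quotas lets you inspect current limits and request increases.
client = boto3.client("service-quotas", region_name="us-east-1")

# Print a few EC2 quotas and their current values.
for quota in client.list_service_quotas(ServiceCode="ec2")["Quotas"][:5]:
    print(quota["QuotaName"], quota["Value"])

# Raising a limit is an explicit, reviewed request, which is the point:
# nothing scales past the quota until someone asks for more.
# client.request_service_quota_increase(
#     ServiceCode="ec2", QuotaCode="L-1216C47A", DesiredValue=64
# )
```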
They are broken, unreliable guard rails that are hard to set up correctly.
I mean, as the article mentioned, they could have set the instance and concurrency settings to lower values, which in this case would have worked.
But finding the right settings to balance intentional auto-scaling against limiting how fast unexpected costs can rise is hard and easy to get wrong.
Let's be honest, in the end it's a very flawed workaround which might help (if you know about it and did it right).
There are already such features, but a lot of indie developers are too lazy to configure their infra properly. A low default limit doesn't make sense either, as it would piss off large customers.
I run so many websites on Google Cloud Run that sometimes I feel I might be abusing them, but I have ensured each of my sites has a max limit of 2 instances.
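For Cloud Run that cap is just a per-service setting (`--max-instances` on `gcloud run deploy`, or via the Admin API). A rough sketch with the google-cloud-run (run_v2) Python client, with field names from memory, so double-check before relying on it; project/region/service names are placeholders:

```python
from google.cloud import run_v2

def cap_service(project: str, region: str, service: str, max_instances: int = 2) -> None:
    """Bound how far a Cloud Run service can scale out (and thus how fast it can bill)."""
    client = run_v2.ServicesClient()
    name = f"projects/{project}/locations/{region}/services/{service}"

    svc = client.get_service(name=name)
    svc.template.scaling.max_instance_count = max_instances  # hard ceiling on instances
    # Optionally also bound per-instance parallelism:
    # svc.template.max_instance_request_concurrency = 80

    client.update_service(service=svc).result()  # wait for the rollout to finish

# cap_service("my-project", "us-central1", "my-side-project", max_instances=2)
```

With a known instance ceiling and per-instance concurrency, the worst-case spend rate is at least bounded, which is exactly the trade-off discussed above.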
I have no idea what they did internally, but something like this was my guess. I only communicated through the customer support channel, replied to emails, and shared my doc (which cited all the loopholes) with them.
It took them 10-15 days to get back and make a one-time goodwill contribution. The contribution didn't cover the logging cost, so we did pay a few hundred dollars.
I went through this very scary experience recently as well (although in our case it was $17K, not $72K). One of our devs accidentally caused an infinite loop with one of the Google Maps paid APIs on our dev account and within hours both our prod and dev accounts were suspended (pro tip: don't link your prod account to the billing account of your dev account). The worst part was that after removing the suspension, our cloud functions were broken and had to be investigated and fixed by Google engineers themselves resulting in our prod app being down for 24 hours... be very careful.
Luckily we were able to get $11K refunded on our card and received $6K credits after spending all night with Google support.
By contrast I hear stories of AWS doing this quite often for one-off mistakes (crediting thousands of dollars). It doesn't make much sense to me not to consider well-intentioned requests for this sort of thing.
Especially if you consider the dollar value of all those approvals and the business you might lose to some other platform and/or hesitance people will have to use those platforms for such things in the future.