As an ex-Googler who worked in a customer-facing role in Cloud: you did very well to get a $72k bill written off! It's definitely possible, but it requires a lot of approvals and pulling in a few favours. I went through the process to write off a ~$50k bill for one of my customers and it required action every day for 3 months of my life.
Whoever helped you inside Google will have gone to a LOT of trouble, opened a bunch of tickets and attended many, many meetings to make this happen.
I know there's no reason for Google or AWS to do this, but man do I wish there was a way to put down a spending limit and simply disable anything that goes over that limit.
It's a little bit nuts that there are no guardrails to prevent you from incurring such huge bills (especially as a solo developer that might just be trying out their services).
In my opinion, and maybe I'm an absolutist about this, the fact that these guardrails don't exist is opportunistic and predatory. Agile, iterative design and testing will inevitably lead to failures; that's the whole point. Marketing a cloud service to developers who need scalable, changing access to computing during that process should take that into consideration.
I do not think the intention here is to be opportunistic and predatory; it's an inability to empathize with small developers. A large customer will very likely just pay off a few hundred thousand dollars of extra expenses. It is only individual developers who are at risk here, and cloud operators do not have much interest in them.
I don't know about that. Large customers can almost always find the capital to run their own infrastructure and save on cost. That's not to say there aren't big customers of these types of services, or that certain business models don't make using them more attractive than maintaining infrastructure yourself, but I would guess that revenue from these sorts of services is largely built on appealing to smaller customers, so their needs would be taken into consideration. Not taking potential cost overruns into consideration seems a bit deliberate to me.
To me it looks very similar to the personal checking overdraft schemes banks were using up until a few years ago.
When I was a young and naive student I thought I couldn't charge my debit card below $0. I got down to -$3 and had to pay a $40-something fee when I was already out of money.
The downside of disabling active resources is huge. It would mean a catastrophic interruption to the customer's application exactly when it's the most popular/active. And there's no practical way to determine whether the customer is “trying it out” or running a key part of their business on any particular resource.
On the other hand, retroactively forgiving the cost of unexpected/unintentional usage doesn't impact the customer's users. And with billing alerts the customer is able to make the choice of whether the cost is worth it as it happens.
Note: Principal at AWS. I have worked to issue many credits/bill amendments, but I don't work in that area, nor do I speak for AWS.
> And there's no practical way to determine whether the customer is “trying it out” or running a key part of their business on any particular resource.
What? Why wouldn't this just be an opt-in thing? It could even be tied to the account being used. It's not like AWS accounts are expensive or hard to set up.
If a user opts in to the "kill if bill goes too high" behaviour and they kill a critical portion of their business, then that's on them. It's similar to a user choosing spot instances and having their instances terminated. You've already got that "I can kill your stuff if you opt into it" capability.
> On the other hand retroactively forgiving the cost of unexpected/unintentional usage doesn't impact the customer's users.
Yeah, and what happens if someone isn't big enough to justify AWS's forgiveness? What if they get a rep that blows off their request or is having a bad day? You are at the mercy of your cloud provider to forgive a debt, which is a real shitty place to be for anyone.
> And with billing alerts the customer is able to make the choice of whether the cost is worth it as it happens.
And what do they do if they miss the alert? You can rack up a huge bill in very little time with the right AWS services.
The point of the kill-switch cap is to guard against risk. The fact is that while $72k isn't too big for some companies, it means bankruptcy for others. It's why you might want to give your devs a training account to play with AWS services to gain expertise, but you don't want them to blow $1 million screwing around with Amazon satellite services.
> What? Why wouldn't this just be an opt in thing?
"Oh cool, I'll set a $1k cap, never gonna spend that on this little side proj." Fast forward a year, the side proj has turned in to a critical piece of the business but the person who set it up has left and no one remembered to turn of the spending cap. Busy christmas shopping period comes along, AWS shuts down the whole account because they go over the spending cap, 6hr outage during peak hours, $20k sales down the pan.
Of course it is technically the customers fault but it's a shit experience. Accidentally spending $72k is also technically the customers fault and also a shit experience. I don't think there is an easy solution to this problem.
"Oh cool, I'll use spot instances, never gonna need reliability for this little side proj."
"Oh cool, I'll only scale to 1 server, never gonna see high load for this little side proj."
"Oh cool, I'll deploy only to US West 1, outages are never going to matter for this little side proj."
There are a million ways to run out of money as a company. Why should this be any different? Why is this the one particular instance where it is simply intolerable to accept that users can screw things up?
There are lots of things that are "shit experiences" that are the consumer's fault.
There is an easy solution. Give consumers the option and let them deal with the consequences. There are enough valid reasons to want hard caps on spending that it's crazy to not make it available because "Someone MIGHT accidentally set the limit too low which will cause them an outage in production that MIGHT mean they lose money"
There totally is a solution. It's just user-hostile enough that it might actually get adopted.
$cloud_vendor just has to (and probably will) constantly nudge people to loosen the limit.
Have a red banner that says, "You've already spent 3% of your monthly budget, think about increasing it."
Also routinely send out emails to remind people: "Black Friday is coming up, think about increasing your quota", even when your service has nothing to do with e-commerce.
> The downside of disabling active resources is huge. It would mean a catastrophic interruption to the customer's application exactly when it's the most popular/active. And there's no practical way to determine whether the customer is “trying it out” or running a key part of their business on any particular resource.
This is simply wrong.
Depending on your use case, disabling active resources is the right, reasonable solution with fewer downsides.
E.g. most (smaller) companies would prefer their miscellaneous (i.e. non-core-product) website/app/service to be temporarily unavailable rather than face a massive unexpected cost they might not be able to afford, one that might literally force them to fire people because they can't pay them...
I mean, think about it: what is it worth that my app doesn't go temporarily unavailable during its free trial phase if it means I'm going bankrupt overnight and can't benefit from it at all?
Sure, huge companies can always throw more money at it and will likely prefer uninterrupted service. But for every huge company there are hundreds of smaller companies with different priorities.
In the end it should be the user's choice, a configuration setting you can set (preferably per project).
And sure, limits should probably be resource limits (like accumulated compute time) rather than billing limits, as prices might be in flux or dependent on your total resource usage, so computing the cost in real time is non-trivial or even impossible.
I often have the feeling that huge companies like Amazon or Google get so detached from how things work for literally everyone else (who is not a huge company) that they don't realize solutions appropriate for huge companies can be not just sub-optimal but cripplingly bad for medium and small companies.
The upside for the noob trying out/learning is huge.
I'm no longer that person, but I think GCP/AWS are just being lazy about this - perhaps because they earn a lot of money from engineer mistakes. Of course it's possible to create an actual limit. There'll be some engineering cost, like 0.5%-1% extra per service?
Edit: Being European I think legislation might be the fix, since both Amazon and Google have demonstrated an unwillingness to fix this issue, for a very long time.
"The downside of disabling active resources is huge. It would mean a catastrophic interruption to the customers application exactly when its the most popular/active."
Lol what ... this is exactly what happens any time you hit a rate limit on any AWS service. The customers application is "catastrophically interrupted" during its most popular/active period.
The only difference is in that case, it suits AWS to do that whereas in the case of respecting a billing limit, it doesn't.
If you hit a rate limit, the marginal portion of requests exceeding that limit is dropped: if you plot the requests, the graph gets clipped. Bad, but not catastrophic.
If you hit a billing limit, everything beyond that point is dropped, and the graph of requests plunges to zero. You're effectively hard down in prod.
And for some companies/individuals, if you keep charging then THEY will plunge to a large negative debt. It's not even zero, it's a lot worse than that.
I was building a side project and had already incurred around $100 in fees. I imagine if I'd introduced some looping/recursion bug I could easily have incurred a cost of $10,000 or, frankly, an unbounded cost. How easy would it have been for me to get that pardoned? And at the very moment I discovered I had just lost $100,000, would I know in advance that they were definitely going to forgive it? I'd be in full panic mode. It was very scary for me to use the cloud in this case.
Why not alert thresholds, configurable by the user?
Email me when we cross $X amount in one day, Text when we cross $Y, and Call when we cross $Z. Additionally, allow the user to configure a hard cut-off limit if they desire.
Just provide the mechanisms and allow users to make the call. Google et al would have a much stronger leg to stand on when enforcing delinquent account collections if they provided these mechanisms and the user chose to ignore them.
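Conceptually that's all that's being asked for; something like this sketch, where the thresholds and the notify/hard_stop hooks are entirely hypothetical (no provider exposes exactly this API):

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class SpendPolicy:
    email_at: float                    # e.g. email me past $X/day
    text_at: float                     # text me past $Y/day
    call_at: float                     # call me past $Z/day
    hard_cap: Optional[float] = None   # opt-in cut-off; None keeps today's behaviour

def enforce(policy: SpendPolicy, daily_spend: float,
            notify: Callable[[str, float], None],
            hard_stop: Callable[[], None]) -> None:
    """Escalate notifications as spend crosses thresholds; optionally cut off."""
    if daily_spend >= policy.email_at:
        notify("email", daily_spend)
    if daily_spend >= policy.text_at:
        notify("sms", daily_spend)
    if daily_spend >= policy.call_at:
        notify("call", daily_spend)
    if policy.hard_cap is not None and daily_spend >= policy.hard_cap:
        hard_stop()   # only users who explicitly set a cap ever hit this
```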
Additionally, Google et al should protect _themselves_ by tracking usage patterns and reaching out to customers that grossly surpass their average billable amount - just like OP with their near-$100k bill in one day. There's zero vetting to give even a reasonable guarantee that the individual or company is capable of paying such a large bill.
And then what? Sue a company that doesn't have $100k for $100k? This makes zero sense.
Google has alert thresholds (you set it up under your Budget). But practically speaking, an alert is not enough - what if you are unavailable to get the alert, it comes in the middle of the night, etc?
A better solution would have been 'limits' which they used to have (at least for Google App Engine) but which has been deprecated.
We had to spend some time researching to see if there was a workaround because, just like the author of the article, we were quite worried about suddenly consuming a huge amount of resources, getting a spike in our bill, and having our accounts cut off/suspended because we hadn't paid it. We've documented our solution here
Doesn't look like there's any cutoff mechanism there, and it's a separate, optional step instead of part of the setup flow with a mandatory opt-out warning.
Nor does that address the other complaint - Google (and possibly others) seem to be willing to extend an unlimited credit line to all customers without any prior vetting for ability to pay. That's crazy.
> The downside of disabling active resources is huge. It would mean a catastrophic interruption to the customer's application exactly when it's the most popular/active. And there's no practical way to determine whether the customer is “trying it out” or running a key part of their business on any particular resource.
Well, this is true, but this is also true of a lot of limits, like limits.conf. Sometimes you really want to spawn loads of processes or open many files, but a lot of the time you don't, so a barrier to limit the damage makes sense.
There is no one solution that will fit everyone: people should be able to choose: "scale to the max", "spend at most $100", etc. If my average bill is $100, then a limit of $500 would probably make sense, just as a proverbial seat belt. This should never be reached and prevents things going out of control (which is also the reason for limits.conf).
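The programmatic version of that seat belt already exists at the OS level; a tiny Python illustration of the same idea using the stdlib resource module (Unix-only, numbers arbitrary):

```python
import resource

# Check the current per-process open-file limits (the limits.conf-style "seat belt").
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open files: soft={soft}, hard={hard}")

# Tighten the soft limit for this process and its children to 256 files.
# Most workloads never notice; a fd leak hits the wall early instead of
# silently taking the whole machine down.
resource.setrlimit(resource.RLIMIT_NOFILE, (256, hard))
```

Nobody argues this limit shouldn't exist just because someone might set it too low in production.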
> It would mean a catastrophic interruption to the customer's application exactly when it's the most popular/active. And there's no practical way to determine whether the customer is “trying it out” or running a key part of their business on any particular resource.
This could be ameliorated by using namespacing techniques to separate prod from dev resources. For example, GCP uses projects to namespace your resources, and you can delete everything in a project in one operation that can't fail, simply by shutting down the project (no "you can't delete x, because y references it" messages).
Aggressive billing alerts and events, that delete services when thresholds are met, could be used only in the development namespace. That way, fun little projects can be shut down and prod traffic can be free to use a bit more billing when it needs to.
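GCP even documents a pattern along these lines: send budget notifications to Pub/Sub and have a Cloud Function detach the billing account from the (dev-only) project once the threshold is crossed. A rough sketch following that documented approach; DEV_PROJECT_ID is something you'd supply yourself, and this is illustrative rather than production-ready:

```python
import base64
import json
import os

from googleapiclient import discovery

DEV_PROJECT_ID = os.environ["DEV_PROJECT_ID"]  # only ever point this at a dev project

def stop_dev_billing(event, context):
    """Pub/Sub-triggered Cloud Function: detach billing when a budget is exceeded."""
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    if payload["costAmount"] <= payload["budgetAmount"]:
        return  # still under budget, nothing to do

    billing = discovery.build("cloudbilling", "v1", cache_discovery=False)
    # An empty billingAccountName detaches the project from billing, which stops
    # further charges but also takes the project's resources down.
    billing.projects().updateBillingInfo(
        name=f"projects/{DEV_PROJECT_ID}",
        body={"billingAccountName": ""},
    ).execute()
```

The caveat quoted further down the thread still applies: budget notifications can lag actual usage, so this limits the damage rather than guaranteeing a hard ceiling.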
> "And theres no practical way to determine whether the customer is “trying it out” or running a key part of their business on any particular resource."
Well there's a very easy way, adding a checkbox and an input:
[ ] I am just trying things out, don't charge me more than [ ] USD
There are ways it could be done relatively benignly, such as defaulting to paranoid and explicitly opting out.
And for those that are heading into that financial barrier it should be a straightforward problem to look at trending to anticipate the shutdown and send out an alert.
This. App Engine used to offer hard spending limits, and they were removed precisely because so many users set them up to shoot themselves in the foot at the worst possible moment.
^^ this. Hard spending limits seem great until your app/service gets super popular and you have to explain to the CEO why you were down during the exact window you needed to be serving the demand.
> Note: There is a delay of up to a few days between incurring costs and receiving budget notifications. Due to usage latency from the time that a resource is used to the time that the activity is billed, you might incur additional costs for usage that hasn't arrived at the time that all services are stopped. Following the steps in this capping example is not a guarantee that you will not spend more than your budget. Recommendation: If you have a hard funds limit, set your maximum budget below your available funds to account for billing delays.
(Source link in parent post, emphasis mine).
In this case the additional cost due to that delay was $72k. Which, let's be honest, means this feature is kinda useless for anything but the relatively harmless cases.
Only by combining this with resource limits in load balancers, instance and concurrency limits, and the like can the worst-case cost be bounded. But tbh this partially cripples auto-scaling functionality, and it's really hard to find a good setting which doesn't allow too much "over" cost and at the same time doesn't hinder the intended auto-scaling use case.
> it's really hard to find a good setting which doesn't allow too much "over" cost and at the same time doesn't hinder the intended auto-scaling use case
> I created a new GCP project ANC-AI Dev, set up $7 Cloud Billing budget, kept Firebase Project on the Free (Spark) plan.
There's a lot of middle ground between $7 and $72k. Your quote explains it perfectly though. They flat out can't because the accounting and reporting is badly designed and incapable of providing (near) real-time data.
IMHO the easiest solution to this is government regulation. If you set a budget for a pay-what-you-use online service there should be legislation forbidding companies from charging you more than that.
I also find it (sort of) hilarious they can magically lock the whole thing down once payment fails, but not before the CC is (presumably) maxed out. Lol. Talk about a good deal for Google.
There's something uncanny about understanding the situation enough to turn on the budget alerts, while at the same time not realizing it's not going to help in time if your system runs amok.
I'm not sure if you meant it this way, but your tone makes it seem like the parent just needs to "read the docs".
Unfortunately for all of us, your solution doesn't work, per the huge disclaimer on the page that says those alerts can be days late. You can rack up an almost unlimited $ bill in hours.
That's not the best thing you can do. The best thing you can do is put excessive time into quotas.
For starters, AWS has way better quotas than GCP, sadly.
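For anyone who hasn't used them: on AWS these live behind the Service Quotas API, so you can at least script an audit of what your account is allowed to spin up. A minimal boto3 sketch (the service and quota codes here are just examples):

```python
import boto3

# Service Quotas lets you inspect current limits and request increases.
client = boto3.client("service-quotas", region_name="us-east-1")

# Print a few EC2 quotas and their current values.
for quota in client.list_service_quotas(ServiceCode="ec2")["Quotas"][:5]:
    print(quota["QuotaName"], quota["Value"])

# Raising a limit is an explicit, reviewed request, which is the point:
# nothing scales past the quota until someone asks for more.
# client.request_service_quota_increase(
#     ServiceCode="ec2", QuotaCode="L-1216C47A", DesiredValue=64
# )
```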
They are broken, unreliable guard rails that are hard to set up correctly.
I mean, as the article mentioned, they could have set the instance and concurrency settings to lower values, which in this case would have worked.
But finding the right settings to balance intentional auto-scaling against limiting how fast unexpected costs can rise is hard and easy to get wrong.
Let's be honest, in the end it's a very flawed workaround which might help (if you know about it and did it right).
There are already such features, but a lot of indie developers are too lazy to configure their infra properly. A low default limit doesn't make sense either, as it would piss off large customers.
I run so many websites on Google Cloud Run that sometimes I feel I might be abusing them, but I have ensured each of my sites has a max limit of 2 instances.
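For Cloud Run that cap is just a per-service setting (`--max-instances` on `gcloud run deploy`, or via the Admin API). A rough sketch with the google-cloud-run (run_v2) Python client, with field names from memory, so double-check before relying on it; project/region/service names are placeholders:

```python
from google.cloud import run_v2

def cap_service(project: str, region: str, service: str, max_instances: int = 2) -> None:
    """Bound how far a Cloud Run service can scale out (and thus how fast it can bill)."""
    client = run_v2.ServicesClient()
    name = f"projects/{project}/locations/{region}/services/{service}"

    svc = client.get_service(name=name)
    svc.template.scaling.max_instance_count = max_instances  # hard ceiling on instances
    # Optionally also bound per-instance parallelism:
    # svc.template.max_instance_request_concurrency = 80

    client.update_service(service=svc).result()  # wait for the rollout to finish

# cap_service("my-project", "us-central1", "my-side-project", max_instances=2)
```

With a known instance ceiling and per-instance concurrency, the worst-case spend rate is at least bounded, which is exactly the trade-off discussed above.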
I have no idea what they did internally, but something like this was my guess. I only communicated through the customer support channel, replied to emails, and shared my doc (which cited all the loopholes) with them.
It took them 10-15 days to get back and make a one-time goodwill contribution. The contribution didn't cover the logging cost, so we did pay a few hundred dollars.
I went through this very scary experience recently as well (although in our case it was $17K, not $72K). One of our devs accidentally caused an infinite loop with one of the Google Maps paid APIs on our dev account and within hours both our prod and dev accounts were suspended (pro tip: don't link your prod account to the billing account of your dev account). The worst part was that after removing the suspension, our cloud functions were broken and had to be investigated and fixed by Google engineers themselves resulting in our prod app being down for 24 hours... be very careful.
Luckily we were able to get $11K refunded on our card and received $6K credits after spending all night with Google support.
By contrast I hear stories of AWS doing this quite often for one-off mistakes (crediting thousands of dollars). It doesn't make much sense to me not to consider well-intentioned requests for this sort of thing.
Especially if you consider the dollar value of all those approvals and the business you might lose to some other platform and/or hesitance people will have to use those platforms for such things in the future.