The fact that cloud providers don't have a simple "This is how much I can afford, don't ever bill me more than that!" box on their platforms makes development a lot scarier than it really needs to be.
This is my worst nightmare. Lol. I guess now is a great time to give Azure a shoutout for sitting on their hands for 8 years without so much as a response to the community for half a decade [1].
At least AWS allows using a prepaid credit card so they’ll need to call me if things go haywire. I bet if that $72k charge went through it would have been much harder to get out of. “Sorry, we don’t have the money” is a much better negotiating position than “can we please have our money back?”
But then consider the following hypothetical but possible scenario:
Sorry, until you pay, no more Amazon services for your company.
Now you must move to a new cloud provider (or start a new company).
Oh wait, they now exchange (bad) customer information to better detect fraud, and you just got flagged as "owing a lot to Amazon", so no cloud for you anymore at any provider.
Now you want to buy your own hardware. So you need a loan from the bank, but dang, you owe too much to a big company, and now the bank won't extend credit to your company either.
While part of the above scenario is luckily not how reality currently works, who knows when (part of) such a horror scenario becomes reality.
In the end, relying on forcibly not paying back money you contractually owe is just not a very viable strategy in my view.
I see no reason why such an arrangement couldn't be optional. Different projects, teams and people have different needs; cloud computing services are marketed specifically on this point. It makes no sense that there isn't even an option in Firebase or AWS to immediately stop services over a certain amount. The current situation is ripe for lawsuits IMO.
> “Sorry, we don’t have the money” is a much better negotiating position than “can we please have our money back?”
I agree but why would you like to be in either position anyway? The so-called cloud services are terribly overpriced when compared to traditional servers.
Not really. Computing done correctly is about avoiding the pitfalls and finding ways to get zero-cost benefits, free computation out of necessary redundancy, etc. Selling cloud computing is about creating options around every pitfall and finding ways to charge for every mitigation that will be necessary, and to charge for redundancy in the mitigation strategy for the mitigation strategy.
Even if you pay for all the redundant managed blah they offer so that no single point of technical failure in their network loses your business, their billing and IAM are still your single points of failure. If you diversify to multiple clouds, all the guarantees either cloud offers become pointless redundancy, so you are paying 10X pricing for an inadequate redundancy layer.
If you look at Google's own model for computing, they didn't fall for this themselves: the computers they used were intentionally unreliable, so as not to recursively pay for reliability and redundancy at any layer that can't provide the needed guarantee.
You can basically go all in with one of these clouds and become a franchise add-on with roughly the same rights as your average McDonald's store owner, or you end up managing a strategy that, because of the complexity of these offerings, is far more complex than just using bare metal and free software.
They're very useful if you're testing a concept, need agile scaling of computational power, or are just starting a service and don't want to / can't invest the capital in dedicated hardware. I agree with you on your last point though, making your service entirely dependent on these services makes you little more than a franchise and is a potential vulnerability if you ever compete with any important existing service. It probably isn't a good idea for a mature or rapidly maturing business to rely heavily on these services.
It is baffling why cloud providers don't have that option.
I might want to have an app because I don't mind spending 50 dollars on my pet project as a hobby, but I don't ever want to spend more than that. Not if I write a bad query that suddenly becomes very expensive, not when I get attacked, and not even when I have legit users.
By the way, the same goes for some companies too; just the threshold would be different.
It's not complicated to add configurable hard limits for these companies but they don't allow it because the current situation is more interesting for them.
They want to suck the maximum money from consumers before they realize.
For every one person who complains loudly and gets a goodwill gesture, there are hundreds of other companies that will not notice or will just pay without recourse.
> They want to suck the maximum money from consumers before they realize.
This is a naive understanding of how corporations like Google and Amazon work. Bad will and using gym membership tactics aren't how they scale or make money. Getting you to confidently try things knowing you won't get charged (the reason they have those free tiers) so you'll get your company, your start-up, your next side project on it is much better for business.
It's a miss that things like this aren't implemented and widespread, not by design.
> It's not complicated to add configurable hard limits for these companies but they don't allow it because the current situation is more interesting for them.
I'm not in this space, but from my observations:
- Each service has a different billing model and metering model. Most likely this data is held by the service. I'm familiar with AWS so I'll use them as an example. I'd wager only DynamoDB or only Lambda (the service owners) know how much of those services you've consumed
- Billing is most likely reconciled asynchronously after collecting all data from all services by an entirely different department with knowledge of payments and accounting
- GCP, AWS, Azure launch 50+ services a year
- Each large customer most likely has a special rate. I bet Samsung or Snap pay an entirely different set of rates than the normal customer. There are thousands of these exceptions
- Cutting your service off when you're over the limit is an incredibly complex set of edge conditions. Is your long-running instance hosting your critical service shut off because of experimenting on a new ML workflow?
Even with only the above, I can see the difficulty in enforcing a global spending limit at an accurate level. I know both AWS and GCP have features for this, and they try.
It's easy to stand on the sidelines and handwave away technical complexity at scale, but I'd encourage you to give all of these providers a more charitable view, at least on this topic.
>Bad will and using gym membership tactics aren't how they scale or make money.
Except that is exactly what they do through their actions.
>Cutting your service off when you're over the limit is an incredibly complex set of edge conditions.
Sure! But if they cared about customers as you claim, they'd let users set hard limits; when one of these mishaps happened, they'd stop the services once their system eventually knows the quota has been exceeded, and charge the user at most the hard limit. If this kept happening, warn the user that their account will be terminated... and that's that. But they'll never do that.
Most of their clients pay for these mistakes because they don't have the reach or skills to turn it into a viral social media post, get people's attention, and thereby get the costs forgiven.
I'm sure they know how much they make in revenue because of these mistakes and they deliberately don't do anything about it.
I work in this space and you're absolutely correct. Your last paragraph hits the nail on the head for pretty much every complaint people have about the public clouds.
Right, so let's say Congress passes a bill that requires cloud providers to enable hard spending limits by start of February 2021, and eat any extra usage costs that exceeded a set limit.
What is your educated guess for when this feature would be essentially correctly implemented in AWS and GCP (essentially = negligible costs to the providers from either false negatives (bills they eat) or false positives (PR fallout when SomeSite gets shut down despite not being over the limit))?
The fact that the dashboards and alerts have a delay sounds like there might be difficult consistency stuff going on. Many nodes need to coordinate their usage and billing. It may be a difficult problem, but solving billing problems might not really motivate anyone at the company. It's not a "cool" problem for engineers and not profitable for product.
>> The fact that the dashboards and alerts have a delay sounds like there might be difficult consistency stuff going on.
I think that's true. It's easier to measure usage and aggregate that data after the fact than to meter it in real time and stop at a limit. Those are very different things. What happens if you hit the cap while running multiple processes spread across a cloud?
One improvement might be to throttle things as the cap approaches, but that doesn't really change the problem at all. Doing that and having the provider eat any overages should solve it from the user's point of view.
There's an easy solution: you set a limit, and every time a service needs to spend some money it allocates a small portion of the budget, and after some threshold it puts the unused money back into the budget. The only downside is that your spending limit will be treated as reached somewhat early (while chunks are still reserved), but I prefer that to paying thousands more than I wanted to. Knowing how the system works, maybe a lower and a higher threshold for the budget could be set.
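To make that concrete, here is a rough sketch of that allocation scheme (entirely hypothetical: the pool, chunk sizes and costs are made up for illustration):

```python
import threading

class BudgetPool:
    """Hypothetical central budget from which services reserve small chunks
    before spending, returning whatever they didn't use afterwards. The pool
    can refuse work before the limit is exceeded, at the cost of treating the
    limit as reached a little early while chunks are still outstanding."""

    def __init__(self, limit_usd: float):
        self._remaining = limit_usd
        self._lock = threading.Lock()

    def reserve(self, chunk_usd: float) -> bool:
        """Set aside chunk_usd; return False if the budget can't cover it."""
        with self._lock:
            if self._remaining < chunk_usd:
                return False
            self._remaining -= chunk_usd
            return True

    def release(self, unused_usd: float) -> None:
        """Give back the unspent part of an earlier reservation."""
        with self._lock:
            self._remaining += unused_usd

pool = BudgetPool(limit_usd=100.0)
if pool.reserve(0.50):               # reserve 50 cents before a billable operation
    actual_cost = 0.12               # measured after the operation completes
    pool.release(0.50 - actual_cost) # return what wasn't spent
else:
    raise RuntimeError("budget exhausted - refuse the request")
```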
Every time a GCloud rep would ask us what we need, we would say: fix the billing interface. As far as I know, it never got fixed. The feelings I would get when looking at cloud billing interfaces can be summed up as: obfuscated, like a pawnshop, and caveat emptor. I kind of came to the conclusion that if the cloud giants are not fixing their billing interfaces, then, just like Amazon not emailing you the details of the items you ordered (thus pushing you into the app to deal with your primenesia), there is a 'business' reason why the billing interfaces are generally incomprehensible.
> They want to suck the maximum money from consumers before they realize.
I have very little money so I just don't use their services because a mistake would be disastrous. They might be losing out on me making a unicorn app on their platform. It's unlikely, but while the possibility of catastrophe exists I'll stick to not using them. That extends to not recommending anyone uses them either in case the worst happens.
Then the harsh reality is: companies don't care. Yeah, your app might turn out to be a unicorn, but the overwhelming odds are that it won't. And no one cares that you'll tell your other broke friends to avoid the service.
We'd all like to think it to be different, that a company might care about appeasing my broke ass. But as already pointed out, they want the whales. I also wonder, despite the number of years "cloud services" have been around, if companies aren't still trying to figure out a gazillion other things and limiting customer spend might be a bit low on the priority list.
The highly price sensitive customer will force you to compete only on price. That's just forcing yourself into a commodity market. It's bad business. I would never try to cater to that market. Very dangerous. Competition will drive margins down to near zero.
For hobby projects you probably don't need auto-scaling, and should use a provider that charges a fixed monthly rate. You'll "waste" a little bit of money on unused uptime, but for a hobby project it will be a minuscule amount.
> It is baffling why cloud providers don't have that option.
...is it? If a lazy dev leaves their corporate account open and you can bill it for their negligence, protected by the contract you already signed, you earn a lot of money. From a purely business perspective, it is stupid(!) to provide a safeguard against that.
Edit: to be clear I am not advocating one way or the other. But it is surprising that people are "baffled" by this obvious profit optimization.
Google is around a trillion dollar company, your $75,000 is a completely immaterial amount to them. Not to mention it would be a one time payment that would drive away customers and lead to bad PR like this post.
As a former victim to the same issue as OP, I am furious every time I see a Googler promote that as a solution.
In our case, we racked up a $10000 bill on BigQuery in ~6 hours, when a job was failing and auto-retrying.
We had set up every alert correctly and our reaction time was about 5 minutes (about $100 of usage, no big deal). So how did we get a $5000 bill? Google's alert was 6 hours late (according to them, this was root-caused to us, because we were submitting jobs continuously). They pointed to their TOS and said they don't guarantee on-time delivery of the alert.
I had to write up a blog post with fancy graphs and prepare it for social media before they finally agreed to eat the bill.
You misunderstand the intent of this - you basically set this up, and even if it fails (because messages are delayed), Google will refund.
This has happened to us before - they do a refund, since you had set the limits correctly. In general, they are not super assholes. I actually don't know of a case where they have refused to refund.
AWS is better here, since GCP doesn't have a support dashboard, so the "chasing them" experience is much worse.
> There is a delay of up to a few days between incurring costs and receiving budget notifications. Due to usage latency from the time that a resource is used to the time that the activity is billed, you might incur additional costs for usage that hasn't arrived at the time that all services are stopped.
> Following the steps in this capping example is not a guarantee that you will not spend more than your budget.
This looks like it has the same problems as the post, because it also relies on those budget alerts that can happen a long while after you've exceeded them.
Very late to the post, but this seems like "eventually consistent billing". Distributed systems seem to rely on "eventual consistency", but "eventual consistency" is not what most people want in billing threshold scenarios...
"Following the steps in this capping example is not a guarantee that you will not spend more than your budget."
"Resources [...] might be irretrievably deleted."
Also it's not automatic, you have to manually write code to do it, and test it, and make sure not to break it.
A reasonable implementation of this feature would be built into the console, guarantee a maximum spend, not require writing your own fallible code, and provide an option to preserve storage (at normal cost) so that all your data isn't deleted when your compute/API stuff is shut down.
As a former App Engine PM who spent a lot of time with billing/quotas (though, not the one who deprecated this feature), it's likely due to some combination of:
- hard limits caused downtime more often than they prevented these blog posts
- hard limits were inconsistently enforced, even within GAE
- platform wide quota notifications were implemented (reached "GA"), leaving the question of "how a developer wants to handle this" to the developer, not the platform
- maintenance burden
The "I bankrupted my startup by running tests in an infinite loop" blog posts happen ~once a year, while the number of customers (including internal teams!) who inadvertently went down because of this quota was staggering. I feel like I used to see one a week, at least. Most often someone on the team was like "oh I'm going to turn this down to zero because we don't want to spend any money during development", never told anyone, and then they go live and they forgot to turn the knob back up (or didn't properly estimate traffic/costs and set it too low).
I can tell you it hurts a lot more (both in terms of revenue and customer credibility) when a large customer goes down for 15 minutes due to quota issues and their usage drops to zero than when a tiny developer accidentally blows through $10k in a month and we refund it (since, obviously, the provider's cost is a lot less than that).
Personally, I don't think this is a good enough reason. Worst case, if I experience an unplanned shut down, I will increase my spending limit. Removing the feature entirely because of this just doesn't make sense.
The fact that Google also requires a credit card for almost every single transaction, even free ones, gives the impression that this is for financial purposes (i.e., a way to get more out of developers, or out of those who might be freeloading on App Engine's free tier).
I gotta say that seems like a bad reason to remove the feature. If someone intentionally set a hard spend limit - hit it - and their service went down because of it that's not Google's fault. The simple solution for that customer is to just turn off or increase the limit.
This is a reasonable way of achieving the balance needed. My company would freak out if we had even a short outage that affected all our customers because we set a billing quota too low. And I'd feel a lot more comfortable experimenting with serverless on my own projects if I knew Google would have my back if I made one of those once-in-a-year mistakes.
OP claims that the budgets are not real-time; they are eventually accurate, but if you spend too fast you may end up with a sum larger than your budget before anything triggers.
It's surprisingly complex to do that. Let's take a simple example and say your cloud account is doing 2 things - compute & storage.
Compute is an active resource, when you exceed your budget it can be automatically shutdown.
Storage is a passive resource, when you exceed your budget it can be automatically....deleted? That's almost always the wrong action.
Providing fine-grained cost limits helps some, as passive resources usually don't have massive cost spikes while active resources do, so you can better "protect" your passive resources by setting more aggressive cost limits on the active resources.
This quickly gets more complicated. Another example is most monitoring services are a combination of active (actual metric monitoring) and passive (metric history) resources. A cost limit on that monitoring service likely won't provide sub-service granularity, mostly depending on whether the service even has different charges for monitoring vs history.
Oh, also, even for a passive resource like storage, you also have active resource charges whenever you upload/download your data.
Ugh, what a mess. The best thing to do is pay attention to your spending, just like you do with your personal & corporate budget.
S3 costs money to keep your files in, even if you're not touching them, so just preventing further uploads wouldn't do much to prevent your AWS bill from increasing.
It would let you set an upper limit on the price you pay though. Better than accidentally misconfiguring a logging service and writing gigabytes of unneeded data.
>>> But we've had disk quotas before that mostly worked?
AWS has quotas on everything, including quotas on EBS storage per region.
You will realize that after you spin up some instances with disks and it fails because you've hit 10 TB of EBS storage. You have to raise a ticket to raise the limit.
> Storage is a passive resource, when you exceed your budget it can be automatically....deleted? That's almost always the wrong action.
A better option would be to automatically reduce the budget by the amount it would cost to keep the storage forever. If doing that would reduce the budget to zero, do not allow increasing the amount of storage. That is: assume the storage will not be deleted, and budget according to that.
How does this actually work? It clearly can't be forever, since any non-zero dollar amount * infinity months is infinity dollars, which is going to reduce the budget below zero since any non-infinite number minus infinity is less than zero... thus locking it immediately.
Even if we say "you get N months of storage before we delete it" and subtract N * current storage cost/month, what happens after you're locked out of all actions because you added an extra GB? Storage APIs cost money to use, so you would get locked out of those too (note that if you're not, people would set arbitrarily low limits and get storage access for free) and couldn't retrieve anything. The only remaining actions are delete (which is free) or raise the quota and do the whole rodeo over again.
Abuse is impossible to ignore at public cloud scale, so "free storage forever" (or even, storage at a one time fixed price) as the fallback isn't a viable option.
Lastly, from an optics perspective, which blog post would you rather see on the front page of HN: "I did something dumb and spent too much money on Cloud" or "Google is holding our data hostage" (or "Google deleted all my data")?
Source: I launched Firebase Storage, which has a GCS bucket that has a hard limit.
For it to work, obviously the budget has to be per month (for instance, $100/month), instead of an absolute limit. Most of the time, that's what you'd want: if you calculated that what you use will cost $50 each month, setting a budget of $100 per month would give some room for growth while preventing billing disasters (and you can always increase it a bit if necessary).
Off the top of my head, I'd say that if you're budgeting for storage, one way to calculate it is the maximum you can afford for the time period you'd need to recover data in the event of a budget overrun, taking into consideration the notification delay. And that sounds like something that is reasonable to put on the customer to calculate.
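As a hedged back-of-the-envelope (the data size, storage rate and windows below are invented, not real provider prices), that reserve might be computed like this:

```python
stored_gb = 500                      # assumed amount of data kept in object storage
price_per_gb_month = 0.02            # assumed storage rate in USD
recovery_window_months = 1.0         # how long you'd need to notice and pull the data out
notification_delay_months = 3 / 30   # e.g. up to ~3 days of billing/alert lag

# Portion of the budget to hold back so storage survives a budget overrun.
storage_reserve = stored_gb * price_per_gb_month * (
    recovery_window_months + notification_delay_months
)
print(f"Keep at least ${storage_reserve:.2f} of the budget aside for storage")
```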
You've explained why it's hard for Google to not give me resources I can't pay for, but that's not what I care about, or what I'm asking for. What I'm asking for is a feature where I set a hard limit of $100 and that's the most I get billed - if my account accidentally uses $5000 of resources before Google reconciles the usage with my budget, then Google automatically waives the additional $4900 and then limits my account in some way until the problem is rectified.
Practically every time these blog posts come up they end with the provider refunding the costs. I just want that refund to be a feature.
So...you're saying that Google should give away $4900 of usage?
Yes. But they should also develop mechanisms to warn users that they've made a mistake before it happens, and improve the speed they can detect mistakes to lower the cost, and invent some way to detect someone intentionally abusing the feature.
But mostly they should make the fact they do give away $4900 when a mistake happens explicit. That isn't actually a change. They just need to make it clear that's what happens.
It's not really that complex. All compute should shut down. All API calls should fail. Storage should be (optionally) preserved at normal cost.
Your examples are simple given this framework. Uploading/downloading data to storage is an API call. Monitoring is compute. Metric history storage is storage.
Storage costs are predictable and slow to accumulate. They are rarely the problem people are trying to address when they set a budget. As I said, storage would optionally continue to be charged at the normal rate, the other option being immediate deletion if you really need a super hard budget cap.
Once you get the alert that your budget is tripped you can go and see what's in storage via the console and delete it, only paying for a few hours of storage for things you don't want.
Moreover, once API calls are locked, what next? You can't delete files, and even if you can delete them, you aren't able to retrieve them before deletion... If a platform allows you to do those actions, then it's ripe for abuse, and at public cloud scale that ends up being a far, far bigger problem than the occasional blog post that ends up as a refund (because the other blog post is "I got free storage forever with this one weird trick").
It's really not a simple problem because the next action depends on the choice the developer wants to make: do they increase the budget or decrease usage, and no cloud provider wants to make this choice because no matter what the choice is it will be viewed as wrong. The best they can do is provide developers the best insight and tooling to make this choice themselves.
Once API calls are locked you can open the console, disable all the things that caused you to hit your budget, and then raise the budget a bit to get access to the storage APIs again and manage your storage. Or, the console's storage browser should let you browse and delete files as well. And again, there should be an option to delete all storage immediately for a hard cap on your budget if you really want that.
If the answer is "you have a dollar limit set of GCS GETs, GCS PUTs, etc." I guess I could see this working, but hot damn that'll be a horrific interface.
The other issue is that many large customers pay different prices, so billing and quota aren't really tied to each other, and it wouldn't be easy to reconcile this.
As for the button... having been on the product side of building this button, there is no right answer: people will say they never got the email (or it went to the wrong inbox, or their dog ate their phone...) or that they never checked the box to "shut down the site" ("I didn't think it would do X that made my app not work").
I'd probably want it grouped by category with a drill down interface for the specifics.
Probably arranged so you can type in a figure at the bottom for monthly expenditure and it would balance out the requirements based on typical use cases.
So enter $50 in the monthly cap figure and it allocates, say, $20 to compute, $20 to transfer operations and API calls, $10 to storage
which you could then fiddle with of course.
I can't offer much on the second point other than to say that unexpected bills annoy me much more than services that stop working.
I've also never worked anywhere with unlimited budgets. (alas)
I can see that there are probably cases where uptime is more important so they would be more annoyed the other way around.
Not only development, but also running in production. You can configure alerts but you can't configure a hard limit. That's just insane. It makes working with GCP like playing with fire.
Probably because it's not so simple on the backend.
I'm guessing there's a good chance a lot of systems are only eventually consistent, which could explain why billing takes a long time to update.
Aggregation of service usage for billing could also be an expensive operation, so it's only updated irregularly instead of being near real-time.
It would be a great feature, but I can imagine it being very complex. It's also probably cheaper for them to just wave away excess usage like this instead of building out a solution.
This is a billing question, not a technical question, and looked at through that lens it's easy to put a hard limit on a monthly bill: just don't ever issue bills greater than that amount.
If I say I only want to pay a maximum of $1000 a month, and I hit that limit but it takes a bit for the provider to shut everything down so really $1100 of resources were consumed, then the provider eats the $100 overrun and I get a bill for $1000.
With an actual hard limit you create a financial incentive for the provider to minimize this overrun. Yes it might be difficult to fix but I assure you, if hard limits existed, the technical issues would be solved soon enough because now there's a reason to invest in a solution.
It's also a mostly solved problem because advertisers have budgets and it's common to implement globally distributed budget servers to avoid showing more ads than the advertiser paid for, despite tens of thousands of individual web servers needing to know which ads in their inventory have budget left.
It's a fun exercise similar to global rate-limiting/load-balancing.
I think the simplest is a tree of servers (which can be sharded by user if necessary for load balancing). The root has the total budget and offers short-term small leases of ad views to child nodes, who may also have child nodes doing the same thing with even smaller leases.
Web servers check with the leaf nodes for every ad they want to show. If that leaf has a budget greater than zero it decrements its own budget and returns success. If the web server gets a success it shows the ad, if not it checks with another budget server or two. Web servers frequently log how many ads were served per client.
Whenever leases are up the intermediate nodes inform the parents of how much was spent and get a new lease. If nodes crash or otherwise don't return their lease then their parents have to assume the whole budget was spent, but leases are kept small to avoid big discrepancies.
If the root crashes then there are problems so the root can be a slow ACID replicated database as long as its immediate children are mostly reliable and take large enough leases to minimize load on the root.
Periodically web server logs are aggregated to adjust the root budgets to account for crashed intermediate nodes and web servers.
The tree approach allows global low-latency operation, guaranteeing no overspending and minimizing underserving. Nodes are provisioned from the leaves on up to handle the necessary amount of traffic and to ask for leases large enough for 99.X% of child requests to succeed.
Any cloud provider could use the same technology on individual hosts to grab leases of CPU, RAM, disk, etc. by the minute per user and terminate services with no budget. Leases could be a lot longer because most budgets are monthly to cover all service needs and not pathological ad campaigns with low budget, high bid, and huge audience.
It's up to cloud (or ad server) providers to decide whether to stop services if the budget system is broken. In most cases it makes sense to fail open and keep serving and eat the loss because shutting everything down will incur even bigger losses.
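As a toy illustration of the lease mechanics (the class, numbers and lease sizes here are made up, and the crash handling, replication and log reconciliation described above are left out):

```python
class BudgetNode:
    """Node in the lease tree sketched above. The root holds the full budget;
    children hold only a small leased allowance, so leaves can answer spend
    requests locally without a round-trip to the root."""

    def __init__(self, total_budget=0.0, parent=None, lease_size=0.0):
        self.parent = parent
        self.lease_size = lease_size
        self.local = total_budget   # root starts with everything, children with nothing

    def grant_lease(self, amount):
        """Parent side: hand out up to `amount` from this node's own allowance."""
        granted = min(amount, self.local)
        self.local -= granted
        return granted

    def renew_lease(self):
        """Child side: ask the parent for another small allowance."""
        if self.parent is not None:
            self.local += self.parent.grant_lease(self.lease_size)

    def try_spend(self, cost):
        """Leaf side: serve one billable unit if the local lease still covers it."""
        if self.local >= cost:
            self.local -= cost
            return True
        return False

# Root with a $1,000 budget; one leaf that leases $10 at a time.
root = BudgetNode(total_budget=1000.0)
leaf = BudgetNode(parent=root, lease_size=10.0)
leaf.renew_lease()

served = 0
for _ in range(20_000):             # 20k ad impressions at $0.002 each
    if not leaf.try_spend(0.002):
        leaf.renew_lease()          # local lease exhausted: refill from the root
        if not leaf.try_spend(0.002):
            break                   # the root itself is out of budget
    served += 1
```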
I think that's not really an issue though is it? If you say "never charge me more than $100" they can a) ensure they never charge you more than $100 and b) work to optimize their own systems so that they cut you off as close to $100 as humanly possible. In the beginning they might eat some costs since it takes them a day to catch it, but they could work over time to bring that down. And it's not like it's costing GCP/AWS/Azure "sticker price" to provide their services.
CloudFront is a CDN. What the poster you've replied to is talking about is a competitor setting up a server that repeatedly downloads your content to rack up a huge CDN bill. OVH is not a suitable replacement for a CDN so you can't migrate from Cloudfront to an OVH server because a server is not a viable replacement for a CDN.
There is an easy explanation: It's hard to build this feature, there is no pressing demand from upper management, it's easier to get promoted doing other simpler projects. Think about what a real time snapshot means: you need to know how much of all the services are being used, project that in the future and compute the costs.
Really, it is a bit disappointing to see a bunch of engineers in this thread talking like this is some monumental, borderline unsolvable problem. The solution is pretty easy to figure out, even taking into consideration the different needs of different customers. The implementation might not be trivial, and legal liability questions might have to be considered beforehand, but the problem is not that hard.
There are some cloud services where it's not quite this simple.
S3 -- you can't just delete customer data because they hit a billing limit
RDS -- not going to drop databases on the 27th of the month
Anything with persistent data is going to have to stay alive and accumulate costs. Admittedly these services aren't where the crazy bills come from, but it does make a simple kill switch a bit more complex.
You don't have to immediately delete customer data.
Most services that have a hard cap have a "grace period" of a couple of days during which the service does not work but the data is not deleted. That gives you some time to get notified of the issue and fix the problem/increase the limit.
This is a solved problem for every other service out there. You don't just delete the data, you give the customer a few days, weeks, or a month to pay their bill and if they don't, then you delete their data.
The problem with this though is it opens a vector for exploitation: users could just use the grace period to store data for free for a period of time. This can quickly become a heavy financial burden if enough people do it.
You could factor that into the price, but then you're potentially making the price point even more unattractive to users than it already is, and users that are responsible with their budgets would be subsidizing those that aren't. Not a very workable solution.
I'd say a good solution is giving customers the option to stop accruing more storage capacity, and to have a max deadline accounted for in their budget to store data (basically each customer decides whether or not to pay for a grace period).
I've accidentally let my OVH subscription go unpaid, and they gave me a 7-day window to pay my invoice before they deleted my data. That seems pretty fair to me, and they seem to have wide enough margins to eat the cost and still offer some of the cheapest prices out there right now.
I wouldn't be too scared.
For AWS you get about $0.20 per 1 million requests on Lambda.
You can do quite a lot with a single Lambda function.
And a million of anything is a lot for a dev. Put an HTTP API Gateway in front of that with a CDN and you're hitting ~ a few dollars.
If you skip one coffee, or put a 20 dollar note in a book one month, then you're fine. And if you have to use EC2, just use a t2.micro or a Raspberry Pi on your desk.
But really the first lesson you should learn in any cloud setup is Billing Alarms :)
If you're doing ML or CV work then it's probably cheaper to build on the desktop and port to cloud once you understand what the workloads are.
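As a rough back-of-the-envelope using the ballpark rates above (the API Gateway and GB-second rates are assumptions, free tiers and CDN/egress are ignored, and real prices vary by region and over time):

```python
requests_per_month = 1_000_000

lambda_per_million = 0.20                       # request charge quoted above, USD
gb_seconds = requests_per_month * 0.1 * 0.128   # ~100 ms at 128 MB per invocation
lambda_compute = gb_seconds * 0.0000166667      # assumed GB-second rate, USD

api_gateway_per_million = 1.00                  # assumed HTTP API rate, USD

total = (requests_per_month / 1e6) * (lambda_per_million + api_gateway_per_million) + lambda_compute
print(f"~${total:.2f}/month before CDN and data transfer")  # roughly $1.41
```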
> For AWS you get about $0.20 per 1 million requests on Lambda.
If you get it right, great. If you get it wrong then you end up doing billions of operations by mistake, which could cost a huge amount. That's what happened to the author of the article.
> But really the first lesson you should learn in any cloud setup is Billing Alarms
Alarms only tell you that something is going wrong. They don't stop it. If your mistake is costing $1000/minute and you're an hour away from a computer you have a very expensive problem.
That's not a bad idea. You could set it up to delete all Lambdas (assuming you've got a CI/CD system capable of redeploying them quickly later) if the billing goes over. Of course, this may hurt you more because of the outage it would cause. Up to you really.
So you're taking code that you haven't validated locally to see what resources it uses, you're putting this up on the cloud to test it, then you are immediately going to the middle of nowhere without your laptop/phone/etc, and you can't arrange for a coworker or friend to pull the plug for you if something goes wrong?
> and you can't arrange for a coworker or friend to pull the plug for you if something goes wrong?
This is HN, many of us are solo founders with no coworkers or employees. Also how could a "friend" pull the plug? If it was a physical server running in your house maybe, otherwise you can't really give them access to your AWS account with all your private clients data in there.
If I'm the only developer on a project and I really need to get to market I might do just that. I sometimes do day hikes on weeknights so this is actually a likely scenario for me.
Do you go hiking alone without your phone? That seems dangerous.
And why would you start a test if you won't be there to see the results of the test? Seems more sensible to either leave after you've run the test or wait to do so until you get back.
Just to expand on this. You can have a hard limit.
For AWS, create a role/user that has essentially root-like access. Make a Lambda function that's triggered by a billing alert at your threshold to just turn off things from most expensive to least. So turn off the DB servers, so the apps error out and the users go away.
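A minimal sketch of that Lambda in Python/boto3, assuming it's subscribed to an SNS topic that a CloudWatch billing alarm publishes to and that its role may stop EC2 and RDS; the region is a placeholder, and a real version would cover far more services:

```python
import boto3

# Placeholders: adjust the region and extend to whatever services you actually run.
ec2 = boto3.client("ec2", region_name="us-east-1")
rds = boto3.client("rds", region_name="us-east-1")

def handler(event, context):
    # Stop the most expensive things first: databases, then running instances.
    for db in rds.describe_db_instances()["DBInstances"]:
        if db["DBInstanceStatus"] == "available":
            rds.stop_db_instance(DBInstanceIdentifier=db["DBInstanceIdentifier"])

    running = ec2.describe_instances(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )
    instance_ids = [
        i["InstanceId"]
        for r in running["Reservations"]
        for i in r["Instances"]
    ]
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
```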
As an ex-Googler who worked in a customer-facing role in Cloud, I can say you did very well to get a $72k bill written off! It's definitely possible but requires a lot of approvals and pulling in a few favours. I went through the process to write off a ~$50k bill for one of my customers and it required action every day for 3 months of my life.
Whoever helped you inside Google will have gone to a LOT of trouble, opened a bunch of tickets and attended many, many meetings to make this happen.
I know there's no reason for Google or AWS to do this, but man do I wish there was a way to put down a spending limit and simply disable anything that goes over that limit.
It's a little bit nuts that there are no guardrails to prevent you from incurring such huge bills (especially as a solo developer that might just be trying out their services).
In my opinion, and maybe I'm an absolutist about this, the fact that there aren't these guardrails is opportunistic and predatory. Agile, iterative design and testing will inevitably lead to failures, that's the whole point. Marketing a cloud service to developers who need scalable and changing access to computing during that process should take that into consideration.
I do not think the intention here is to be opportunistic and predatory, but rather an inability to empathize with small developers. A large customer will very likely just pay off a few hundred thousand dollars of extra expenses. It is only individual developers who are at risk here, and cloud operators do not have much interest in them.
I don't know about that. Large customers can almost always find the capital to run their own infrastructure and save on cost. That's not to say that there aren't big customers of these types of services or that certain business models make using them more attractive than maintaining infrastructure yourself, but I would guess that revenue from these sorts of services are largely built on appealing to smaller customers, so their needs would be taken into consideration. Not taking potential cost overruns into consideration to me seems a bit deliberate.
To me it looks very similar to the personal checking overdraft schemes banks were using up until a few years ago.
When I was a young and naive student, I thought my debit card could not be charged below $0. I got down to -$3 and had to pay a $40-something fee when I was already out of money.
The downside of disabling active resources is huge. It would mean a catastrophic interruption to the customer's application exactly when it's the most popular/active. And there's no practical way to determine whether the customer is “trying it out” or running a key part of their business on any particular resource.
On the other hand, retroactively forgiving the cost of unexpected/unintentional usage doesn't impact the customer's users. And with billing alerts the customer is able to make the choice of whether the cost is worth it as it happens.
Note: Principal at AWS. Have worked to issue many credits/bill amendments, but don't work in that area nor do I speak for AWS.
> And there's no practical way to determine whether the customer is “trying it out” or running a key part of their business on any particular resource.
What? Why wouldn't this just be an opt-in thing? It could even be tied to the account being used. It's not like AWS accounts are expensive or hard to set up.
If a user opts in to "kill if the bill goes too high" and they kill a critical portion of their business, then that's on them. Similar to how a user who chooses spot instances accepts that their instances may be destroyed. You've already got that "I can kill your stuff if you opt into it" capability.
> On the other hand, retroactively forgiving the cost of unexpected/unintentional usage doesn't impact the customer's users.
Yeah, and what happens if someone isn't big enough to justify AWS's forgiveness? What if they get a rep that blows off their request or is having a bad day? You are at the mercy of your cloud provider to forgive a debt, which is a real shitty place to be for anyone.
> And with billing alerts the customer is able to make the choice of whether the cost is worth it as it happens.
And what do they do if they miss the alert? You can rack up a huge bill in very little time with the right AWS services.
The point of the kill-switch cap is to guard against risk. The fact is that while $72k isn't too big for some companies, it means bankruptcy for others. It's that you might want to give your devs a training account to play with AWS services to gain expertise, but you don't want them to blow $1 million screwing around with Amazon satellite services.
> What? Why wouldn't this just be an opt in thing?
"Oh cool, I'll set a $1k cap, never gonna spend that on this little side proj." Fast forward a year, the side proj has turned in to a critical piece of the business but the person who set it up has left and no one remembered to turn of the spending cap. Busy christmas shopping period comes along, AWS shuts down the whole account because they go over the spending cap, 6hr outage during peak hours, $20k sales down the pan.
Of course it is technically the customers fault but it's a shit experience. Accidentally spending $72k is also technically the customers fault and also a shit experience. I don't think there is an easy solution to this problem.
"Oh cool, I'll use spot instances, never gonna need reliability for this little side proj."
"Oh cool, I'll only scale to 1 server, never gonna see high load for this little side proj."
"Oh cool, I'll deploy only to US West 1, outages are never going to matter for this little side proj."
There are a million ways to be out of money as a company. Why should this be any different? Why is this the one particular instance where it is simply intolerable to accept that users can screw things up?
There are lots of things that are "shit experiences" that are the consumer's fault.
There is an easy solution. Give consumers the option and let them deal with the consequences. There are enough valid reasons to want hard caps on spending that it's crazy not to make them available because "someone MIGHT accidentally set the limit too low, which will cause them an outage in production that MIGHT mean they lose money".
There totally exists a solution. It is also user-hostile enough that it might actually get adopted.
$cloud_vendor just has to (and probably will) constantly nudge people to loosen the limit.
Have a red banner that says, "you have already spent 3% of your monthly budget, think about increasing it".
Also routinely send out emails to remind people: "Black Friday is coming up, think about increasing your quota", even when your service has nothing to do with e-commerce.
> The downside of disabling active resources is huge. It would mean a catastrophic interruption to the customer's application exactly when it's the most popular/active. And there's no practical way to determine whether the customer is “trying it out” or running a key part of their business on any particular resource.
This is simply wrong.
Depending on your use case, disabling active resources is the reasonable solution with fewer downsides.
E.g. most (smaller) companies would prefer their miscellaneous (i.e. non-core-product) website/app/service to be temporarily unavailable rather than face a massive unexpected cost they might not be able to afford, which might literally force them to fire people because they can't pay them...
I mean, think about it: what is it worth that my app doesn't go temporarily unavailable during its free trial phase, if it means I go bankrupt overnight and in turn can't benefit from it at all?
Sure, huge companies can always throw more money at it and will likely prefer uninterrupted service. But for every huge company there are hundreds of smaller companies with different priorities.
In the end it should be the user's choice, a configuration setting you can set (preferably per project).
And sure, limits should probably be resource limits (like accumulated compute time) rather than billing limits, as prices might be in flux or dependent on your total resource usage, so computing the cost in real time is non-trivial or even impossible.
I often have the feeling that huge companies like Amazon or Google get so detached from how things work for literally everyone else (who is not a huge company) that they don't realize that solutions appropriate for huge companies might not only be sub-optimal but literally cripplingly bad for medium and small companies.
The upside for the noob trying out/learning is huge.
I'm no longer that person, but I think GCP/AWS are just being lazy about this - perhaps because they earn a lot of money from engineer mistakes. Of course it's possible to create an actual limit. There'll be some engineering cost, like 0.5%-1% extra per service?
Edit: Being European I think legislation might be the fix, since both Amazon and Google have demonstrated an unwillingness to fix this issue, for a very long time.
"The downside of disabling active resources is huge. It would mean a catastrophic interruption to the customers application exactly when its the most popular/active."
Lol what ... this is exactly what happens any time you hit a rate limit on any AWS service. The customers application is "catastrophically interrupted" during its most popular/active period.
The only difference is in that case, it suits AWS to do that whereas in the case of respecting a billing limit, it doesn't.
If you hit a rate limit, the marginal portion of requests exceeding that limit is dropped: if you plot the requests, the graph gets clipped. Bad, but not catastrophic.
If you hit a billing limit, everything beyond that point is dropped, and the graph of requests plunges to zero. You're effectively hard down in prod.
And for some companies/individuals, if you keep charging then THEY will plunge to a large negative debt. It's not even zero, it's a lot worse than that.
I was creating a side project and had already incurred around $100 in fees. I imagine that with a looping/recursion bug I could've easily incurred a cost of $10,000 or, frankly, an unbounded cost. How easy would it have been for me to get this pardoned? And at the moment I discovered I had just lost $100,000, would I know in advance that they were definitely going to forgive it? Because I'd be in full panic mode. It was very scary for me to use the cloud in this case.
Why not alert thresholds, configurable by the user?
Email me when we cross $X amount in one day, Text when we cross $Y, and Call when we cross $Z. Additionally, allow the user to configure a hard cut-off limit if they desire.
Just provide the mechanisms and allow users to make the call. Google et al would have a much stronger leg to stand on when enforcing delinquent account collections if they provided these mechanisms and the user chose to ignore them.
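On AWS at least, the alerting half of this can already be self-served; here is a hedged boto3 sketch (billing metrics live in us-east-1; the thresholds, email address and phone number are placeholders, and a real phone-call tier would need something beyond plain SNS):

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
sns = boto3.client("sns", region_name="us-east-1")

# One topic fans out to email and SMS subscribers (both endpoints are placeholders).
topic = sns.create_topic(Name="billing-alerts")["TopicArn"]
sns.subscribe(TopicArn=topic, Protocol="email", Endpoint="alerts@example.com")
sns.subscribe(TopicArn=topic, Protocol="sms", Endpoint="+15555550123")

# Tiered alarms on the estimated-charges metric: X, Y, Z in dollars.
for threshold in (50, 200, 500):
    cloudwatch.put_metric_alarm(
        AlarmName=f"estimated-charges-over-{threshold}",
        Namespace="AWS/Billing",
        MetricName="EstimatedCharges",
        Dimensions=[{"Name": "Currency", "Value": "USD"}],
        Statistic="Maximum",
        Period=21600,              # billing metrics update slowly; check every 6 hours
        EvaluationPeriods=1,
        Threshold=float(threshold),
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[topic],
    )
```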
Additionally, Google et al should protect _themselves_ by tracking usage patterns and reaching out to customers that grossly surpass their average billable amount - just like OP with their near-$100k bill in one day. There is zero vetting to provide even a reasonable guarantee that the individual or company is capable of paying such a large bill.
And then what? Sue a company that doesn't have $100k for $100k? This makes zero sense.
Google has alert thresholds (you set it up under your Budget). But practically speaking, an alert is not enough - what if you are unavailable to get the alert, it comes in the middle of the night, etc?
A better solution would have been 'limits' which they used to have (at least for Google App Engine) but which has been deprecated.
We had to spend some time researching whether there was a workaround because, just like the author of the article, we were quite worried about suddenly consuming a huge amount of resources, getting a spike in our bill, and our accounts being cut off/suspended because we hadn't paid it. We've documented our solution here
Doesn't look like there's any cutoff mechanism there, and it's a separate, optional step instead of part of the setup flow with a mandatory opt-out warning.
Nor does that address the other complaint - Google (and possibly others) seem to be willing to extend an unlimited credit line to all customers without any prior vetting for ability to pay. That's crazy.
> The downside of disabling active resources is huge. It would mean a catastrophic interruption to the customer's application exactly when it's the most popular/active. And there's no practical way to determine whether the customer is “trying it out” or running a key part of their business on any particular resource.
Well, this is true, but this is also true of a lot of limits, like limits.conf. Sometimes you really want to spawn loads of processes or open many files, but a lot of the time you don't, so a barrier to limit the damage makes sense.
There is no one solution that will fit everyone: people should be able to choose: "scale to the max", "spend at most $100", etc. If my average bill is $100, then a limit of $500 would probably make sense, just as a proverbial seat belt. This should never be reached and prevents things going out of control (which is also the reason for limits.conf).
> It would mean a catastrophic interruption to the customer's application exactly when it's the most popular/active. And there's no practical way to determine whether the customer is “trying it out” or running a key part of their business on any particular resource.
This could be ameliorated by using namespacing techniques to separate prod from dev resources. For example, GCP uses projects to namespace your resources. And you can delete everything in a project in one operation that is impossible to fail by just shutting down the project (no "you can't delete x, because y references it" messages).
Aggressive billing alerts and events, that delete services when thresholds are met, could be used only in the development namespace. That way, fun little projects can be shut down and prod traffic can be free to use a bit more billing when it needs to.
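GCP documents roughly this pattern for non-critical projects: a budget publishes notifications to Pub/Sub and a Cloud Function detaches billing from the project, which stops (and can eventually delete) its resources. A hedged sketch, with "my-dev-project" as a placeholder and assuming the function's service account has the required billing permissions:

```python
import base64
import json

from googleapiclient import discovery

# Only point this at a dev/sandbox project you are happy to lose:
# detaching billing can shut down and eventually delete its resources.
PROJECT_NAME = "projects/my-dev-project"

def stop_billing(event, context):
    """Pub/Sub-triggered function: disable billing once the budget is exceeded."""
    data = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    if data["costAmount"] <= data["budgetAmount"]:
        return  # still under budget, nothing to do

    billing = discovery.build("cloudbilling", "v1", cache_discovery=False)
    projects = billing.projects()
    info = projects.getBillingInfo(name=PROJECT_NAME).execute()
    if info.get("billingEnabled"):
        # Detach the billing account so the dev project stops accruing charges.
        projects.updateBillingInfo(
            name=PROJECT_NAME, body={"billingAccountName": ""}
        ).execute()
```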
> "And theres no practical way to determine whether the customer is “trying it out” or running a key part of their business on any particular resource."
Well there's a very easy way, adding a checkbox and an input:
[ ] I am just trying things out, don't charge me more than [ ] USD
There are ways it could be done relatively benignly, such as defaulting to paranoid and explicitly opting out.
And for those that are heading into that financial barrier it should be a straightforward problem to look at trending to anticipate the shutdown and send out an alert.
This. App Engine used to offer hard spending limits, and they were removed precisely because so many users set them up to shoot themselves in the foot at the worst possible moment.
^^ this. Hard spending limits seem great until your app/service gets super popular and you have to explain to the CEO why you were down during the exact window you needed to be serving the demand.
> Note: There is a delay of up to a few days between incurring costs and receiving budget notifications. Due to usage latency from the time that a resource is used to the time that the activity is billed, you might incur additional costs for usage that hasn't arrived at the time that all services are stopped. Following the steps in this capping example is not a guarantee that you will not spend more than your budget. Recommendation: If you have a hard funds limit, set your maximum budget below your available funds to account for billing delays.
(Source link in parent post, emphasis mine).
In this case they had an additional cost, due to the delay, of $72k. Which, let's be honest, makes this feature kind of useless for anything but the relatively harmless cases.
Only by combining this with resource limits in load balancers, instance and concurrency limits, and the like can the maximal worst-case cost be bounded. But tbh this partially cripples auto-scaling functionality, and it's really hard to find a good setting which doesn't allow too much "over" cost and at the same time doesn't hinder the intended auto-scaling use case.
> it's really hard to find a good setting which doesn't allow too much "over" cost and at the same time doesn't hinder the intended auto-scaling use case
> I created a new GCP project ANC-AI Dev, set up $7 Cloud Billing budget, kept Firebase Project on the Free (Spark) plan.
There's a lot of middle ground between $7 and $72k. Your quote explains it perfectly though. They flat out can't because the accounting and reporting is badly designed and incapable of providing (near) real-time data.
IMHO the easiest solution to this is government regulation. If you set a budget for a pay-what-you-use online service there should be legislation forbidding companies from charging you more than that.
I also find it (sort of) hilarious they can magically lock the whole thing down once payment fails, but not before the CC is (presumably) maxed out. Lol. Talk about a good deal for Google.
There's something uncanny about understanding the situation enough to turn on the budget alerts, while at the same time not realizing it's not going to help in time if your system runs amok.
I'm not sure if you meant it this way, but your tone makes it seem like the parent just needs to "read the docs".
Unfortunately for all of us, your solution doesn't work, per the huge disclaimer on the page that says those alerts can be days late. You can rack up an almost unlimited $ bill in hours.
That's not the best thing you can do. The best thing you can do is put excessive time into quotas.
AWS has way better quotas for starters than GCP has, sadly.
They are broken, unreliable guard rails that are hard to set up correctly.
I mean, like the article mentioned, they could have set the instance and concurrency settings to lower values, which in this case would have worked.
But finding the right settings to balance intentional auto-scaling against limiting how fast unexpected costs can rise is hard and easy to get wrong.
Let's be honest: in the end it's a very flawed workaround which might help (if you know about it and did it right).
There are already such features, but a lot of indie developers are too lazy to configure their infra properly. A default low limit does not make sense, as it would piss off large customers.
I run so many websites on Google Cloud Run that sometimes I feel I might be abusing them, but I have ensured each of my sites has a max limit of 2 hosts.
I have no idea what they did internally, but something like this was my guess. I only communicated through customer support channel and replied to emails, and shared my doc (which cited all the loopholes) with them.
It took them 10-15 days to get back and make a one-time goodwill contribution. The contribution didn't cover the logging cost, so we did pay a few hundred dollars.
I went through this very scary experience recently as well (although in our case it was $17K, not $72K). One of our devs accidentally caused an infinite loop with one of the Google Maps paid APIs on our dev account and within hours both our prod and dev accounts were suspended (pro tip: don't link your prod account to the billing account of your dev account). The worst part was that after removing the suspension, our cloud functions were broken and had to be investigated and fixed by Google engineers themselves resulting in our prod app being down for 24 hours... be very careful.
Luckily we were able to get $11K refunded on our card and received $6K credits after spending all night with Google support.
By contrast I hear stories of AWS doing this quite often for one-off mistakes (crediting thousands of dollars). It doesn't make much sense to me not to consider well-intentioned requests for this sort of thing.
Especially if you consider the dollar value of all those approvals and the business you might lose to some other platform and/or hesitance people will have to use those platforms for such things in the future.
> If you owe the bank $100 that's your problem. If you owe the bank $100 million, that's the bank's problem.
Crappy situation for OP and his startup, but I find the part about reading up on bankruptcy to be a bit premature.
Perhaps not the most ethical choice, but what stops OP from just not paying the bill, and finding a different cloud provider? Obviously they'll want to not repeat the "experiment", but seriously... there's no mechanism at Google to stop a new client from running up a near-$100k bill in a single day?
That's absurd, and should be a learning lesson for Google more than this startup. Some malicious actor could apparently consume hundreds of thousands of dollars of Google resources and "get away" with it.
Wait and see what happens, then deal with it; that would be sane advice.
The bankruptcy fear was real at the time. Google has at least a few thousand lawyers on payroll, and they probably also have a process for handling delinquencies and sending out notices. A quick look at the lawyer fees just to manage the case, let alone fight it, is enough for a bootstrapped company to throw up its hands.
+1 to the bad-actor possibility. I shared this with the Google team; I'm not sure what they have done since.
We are out of that situation, and I wrote the post so that others who are relatively new to the cloud don't make the same mistakes.
However, Google's army of lawyers costs them real money, whereas your bill is largely made-up numbers.
Perhaps the true cost is still enough to warrant siccing their lawyers on your company.
Even in that situation, a wait-and-see approach is still pretty advisable. The worst case scenario was already known to you - bankruptcy.
Nothing Google or their lawyers do would change that worst-case outcome, and if Google were aware that you literally don't have $72k and might just declare bankruptcy and walk away, they'd be much more eager to negotiate a more reasonable bill and settle your account. It's exactly as J. Paul Getty said...
Very glad it's being worked out and you will not have to go down that path.
> Even in that situation, a wait-and-see approach is still pretty advisable. The worst case scenario was already known to you - bankruptcy.
You could even go scorched earth, represent yourself, and drag it out as long as possible. "Your honor, I'm a free man on the land and all I was doing was travelling the information super highway. I'm not bound by your laws!" Haha.
This is why you create a shell company to use cloud services with while your real company leases the servers from that company. As soon as you run up a bill you can’t pay you shut down the whole shell company and reopen a new one.
One of my favorite quotes of all time. J. Paul Getty was quite the weirdo. His Wikipedia article is worth a look, especially the section on his frugality.
Lol. I love it. I moved to a state I'd never considered because it had the largest, cheapest building in the US.
It's 220,000 square feet, but I've lived in a tent out back for the last 6 months because I can't get an occupancy permit, it's not zoned residential, and I refuse to pay rent on an apartment.
It's the old headquarters of Varco Pruden. They manufactured steel buildings, and there are long, wide manufacturing bays with overhead cranes. You can see much more on my YouTube channel. I've got a few videos of different areas.
As an interesting coincidence, a large part of the Google Cloud organization resides in a building that was formerly the headquarters of Getty Images, a company founded by Mark Getty, a grandson of J. Paul Getty.
> That sort of crap is the reason we host all our stuff on root servers.
Having just started my own journey into building products for myself, pretty much the first thing I realised with my tech was I need to get dedicated servers instead of cloud just because it costs 100x less.
> Just grab a dedicated server for a few bucks and put a bunch of docker containers on those.
Exactly. If you really want Kubernetes coolness to act cloud-like, install Kubernetes yourself; it's free and easy to set up.
And with the cost savings you can literally buy multiple spare servers; Kubernetes can use them all while keeping utilization low, letting you scale onto new nodes if needed.
AWS pricing is not obscure, it's just not for you. So in that sense, you are correct to not see a reason to move to the cloud, but your advice does not apply to everyone.
And I don't believe they make "more money" that way at all. AWS margins are either very low or very high, and the higher margins and prices tend to be the "simpler" ones: packaged, managed products such as Redshift that are billed on fewer tiers and flatter prices.
When you design your application with AWS, pricing has to enter your design considerations. For example, if you are designing something that will interact a lot with S3, you want to minimize PUTs. You want to minimize RAM usage on Lambda by streaming rather than buffering. Etc.
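For instance, a minimal sketch of the "stream rather than buffer" idea with boto3 (the bucket, key, and output path are placeholders, not from anyone's actual setup):

    import boto3

    s3 = boto3.client("s3")

    def copy_object_streaming(bucket, key, out_path):
        """Pull a large S3 object down in chunks, keeping Lambda memory usage flat."""
        body = s3.get_object(Bucket=bucket, Key=key)["Body"]  # botocore StreamingBody
        with open(out_path, "wb") as f:
            # iter_chunks() yields pieces of the response as they arrive,
            # instead of body.read(), which would buffer the whole object in RAM.
            for chunk in body.iter_chunks(chunk_size=1024 * 1024):
                f.write(chunk)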
AWS is not a suitable product for playground stuff. The only reason it gets used as such is that it's easier if you're already using AWS for other things (or you're already very familiar with it).
> There is a massive secondary consulting market because of AWS's price obscurities.
While that's true, there is a consulting market for most things that are complicated. That doesn't mean they are shady. It's simply not for you. You are welcome either to dive in or to get a consultant.
I promise you, though, that AWS pricing isn't difficult once you understand a few concepts and know your way around the Cost Explorer. With proper tagging, it's easy to drill down into which resource is consuming how much. I don't believe there is a way to have simple billing for complicated products.
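As a rough illustration of that tag-based drill-down (the "project" tag key and the dates are placeholders, and this assumes cost allocation tags have been activated on the account), the Cost Explorer API can group spend by tag:

    import boto3

    ce = boto3.client("ce")  # Cost Explorer

    response = ce.get_cost_and_usage(
        TimePeriod={"Start": "2020-11-01", "End": "2020-12-01"},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        # Group spend by a cost-allocation tag, e.g. "project"
        GroupBy=[{"Type": "TAG", "Key": "project"}],
    )

    for group in response["ResultsByTime"][0]["Groups"]:
        print(group["Keys"][0], group["Metrics"]["UnblendedCost"]["Amount"])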
Obscure and complex are different concepts. I'm part of that "secondary consulting market" FWIW, so I'd like to think I know a thing or two about it.
Does AWS have high-margin prices? In aggregate, somewhat, but this is mostly driven by the big ticket managed enterprise items: Aurora, Redshift, Quicksight, probably Fargate, etc. A lot of their more popular stuff (S3, Lambda, …) offer incredible value for very little money. EC2 is the exception I believe, because I understand it to be high margin for how popular it is. But EC2 pricing is one of their simplest ones.
Could AWS simplify some of their pricing? Yes, probably. There's always room for optimization. Personally for example I'd like to see their pricing be global rather than different by region (with understandable exceptions for govcloud and china).
Is AWS making its pricing complicated for nefarious purposes? No, there is no evidence to support that.
AWS pricing absolutely is not simple. It's a part of the AWS stack. You need to study AWS's events/signals system to be able to write apps that make the best use of AWS's interconnected stack. You need to study their APIs / SDKs to really understand what you're able to implement. And you need to study their billing systems to understand how to implement apps that run cheaply, and be able to predict potential runaway costs.
It has to be a part of the design. That's why you may want to hire consultants for it: People who understand it better than you do, and will be able to assist you in reducing your costs.
It's just another kind of optimization. Maybe some software engineers don't like it because it hits them where it hurts (the wallet) when they don't do it right, rather than be able to brush it off as they usually do.
It's much easier to ignore the waste produced by, say, the 3,000 JavaScript dependencies shipped with the fat, unoptimized Electron app they push to their users' desktops, doing a ton of unnecessary, expensive computing, because all that crap is client-side and it's the downstream user's electricity bill and CPU time being burned.
> There is a massive secondary consulting market because of AWS’s price obscurities.
There is a massive secondary consulting market because the enterprise market is addicted to secondary consulting. This secondary consulting market includes AWS pricing because it includes pretty much any IT service the target market might be interested in.
A rational need for decomplexification isn’t necessary to explain the existence or coverage of enterprise secondary consulting, IT or otherwise.
The margin is absolutely not the same across all products.
> There is a massive secondary consulting market because of AWS's price obscurities.
Its. Not. For. You.
AWS pricing is a part of your design. With some exceptions (that you aren't talking about), they charge you more for using more resources. You are forced to design systems that use less resources if you want to optimize your bill.
That consulting market is an optimization market. It's economics at its best.
If you are too small to have to take these things into account regardless, AWS is not for you. You're welcome to use it, but don't be surprised if you end up having to deal with these kinds of things which simply don't exist in the world of flat-price underprovisioned droplets.
>AWS pricing is a part of your design. With some exceptions (that you aren't talking about), they charge you more for using more resources. You are forced to design systems that use less resources if you want to optimize your bill.
This is marketing.
It's like saying you want to build a house and the quote you got ends up blowing up 100x overnight.
A great example is the $100k credit for startups. You can repeat "it's not for you" all you want, but their business is predicated on pricing ignorance and vendor lock-in.
The $100K credit (which I've been granted multiple times) is there because if Amazon can get you to invest serious work into their infra, they'll make up for it in the long run. It's not "lock in", it's sales. The only amazon "lock in" really is their bandwidth-out pricing, which is a sleazy tactic for sure but I'm not hesitant to call it out when it's the case.
You can get the $100/$300/$1000 tier if you are in "just checking it out" solo mode. $5k and up requires either connections, partnerships, or a serious application.
Anyway I don't know what your point is, I'm not even sure if you have one. They're not "marketing" their pricing, nor the fact that you are "forced to design systems that use less resources".
> Anyway I don't know what your point is, I'm not even sure if you have one. They're not "marketing" their pricing, nor the fact that you are "forced to design systems that use less resources".
I think they are referring to this statement:
> > AWS pricing is a part of your design. With some exceptions (that you aren't talking about), they charge you more for using more resources. You are forced to design systems that use less resources if you want to optimize your bill.
It is a defense that I've heard in many AWS talks in the past.
Where it turns into a 'marketing' blurb for me is my real-world experience with these AWS talks at the places I've worked. As a real-world example, we had a product that required -some- architectural work, but was otherwise solid, and could run on 3 live EC2 instances (2 web LB, 1 live backend) and 1 spare (spare backend).
The Consultant that AWS partnered us with? Suggested a very overdone architectural revamp, moving everything possible into AWS Specific technologies.
It's marketing in that in many of our experiences, we know there is often at least one person on a team who does -not- have the discipline and/or experience to -keep- a system using less resources as the field goes from green to brown.
Overengineering is easy and happens not just with AWS but with just about anything in software engineering.
I'm having trouble seeing how this changes what I'm saying: That with the way AWS pricing is structured, you are supposed to take it into account when designing your product.
When you reach a certain size / complexity and you have to design infrastructure, you should be making schematics, predictions on the usage peaks and troughs, how various parts of the infra will be affected, how active/idle they will be.
When you are dealing with AWS, pricing becomes extremely predictable because it can be derived from those plans. And it is far better to be dealing with that kind of model than to deal with "unlimited with a million asterisks" or something. AWS is predictable, reliable, and most notoriously has never ever increased their prices, so whatever you calculated will not go up because of Amazon's decisions.
> Suggested a very overdone architectural revamp, moving everything possible into AWS Specific technologies.
To be honest, depending on the technology, the savings could be worth it... for example, did you know you get a discount if your traffic is served over CloudFront? Even if your distribution is set to not cache any resources, you can front your APIs with CloudFront and save on networking.
How do you take pricing into your design considerations? Does it come with experience from using an AWS service in production and understanding how it's priced, combined with the usage numbers the new system might get? I'm trying to learn more about how engineers currently do this.
It's not that complicated, it's just not something engineers are usually used to doing. If you use an AWS service, you look at its pricing.
Take s3 for example: whenever you use it, you'll pay for outgoing bandwidth, PUTs, GETs, and storage.
So you seek to minimize all of these:
1. Bandwidth: use cache layers. This also minimizes GETs.
2. PUTs: design your app in a way that doesn't do unnecessary inserts into s3. Consider alternatives such as redis, postgres or filesystem depending on the need.
3. Storage: compress your objects if they compress well. If they aren't often accessed, use storage classes and auto lifecycle management.
Pricing in AWS generally reflects some kind of engineering limitations you will face at scale in the first place, so it makes sense to go through this whole exercise either way.
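A rough sketch of point 3 (compress before upload, then age objects into a colder storage class) with boto3; the bucket name, prefix, and transition window are made up for illustration:

    import gzip
    import json
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "example-archive-bucket"  # placeholder

    def put_compressed(key, record):
        """Compress a JSON record before uploading; storage and GET bandwidth shrink accordingly."""
        body = gzip.compress(json.dumps(record).encode("utf-8"))
        s3.put_object(Bucket=BUCKET, Key=key, Body=body, ContentEncoding="gzip")

    def add_archive_lifecycle():
        """Move rarely accessed objects under archive/ to a cheaper storage class after 30 days."""
        s3.put_bucket_lifecycle_configuration(
            Bucket=BUCKET,
            LifecycleConfiguration={
                "Rules": [{
                    "ID": "archive-after-30-days",
                    "Filter": {"Prefix": "archive/"},
                    "Status": "Enabled",
                    "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
                }]
            },
        )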
The shift from 'Developer/Programmer' to engineer has indeed been part of a push away from creativity towards cookie-cutter work.
An interesting analogue would be the automotive industry: as time progressed, companies focused more and more on 'engineering' versus art/tradition/etc. But as the industry evolved, "flashy" vehicles that took risks became either halo products for a brand or were relegated to luxury/boutique makers.
And, of course, there was a dark side to this shift. A good example from the 70s: the level of 'engineering' driving the design of the vehicle and its assembly didn't take the actual line worker into consideration; in Ohio the workers wound up getting overworked and burned out, and in some cases actively sabotaged the product, because they were being treated like automated machines.
Incorrect guess, and it still doesn't really change anything. You're just playing with words. It's no more useful than a full thread arguing about a misspelling; Just pure noise.
Software engineering could learn a lot from, say, civil engineering. It could also learn a lot from interface design and I'm sure even microbiologists and astronauts could teach us a lot. Engineering is not special.
This is exactly right. I host stuff in buckets/CloudFront and use a bit of Lambda/Route 53. I end up paying $4 a month.
Now, that will be very different if 10 million people suddenly decide to visit my site, but if that happens money probably won't be a problem after all.
> Even trying to read the amazon pricing for their instances, hours and what not, drives me insane.
I get your sentiment, but the pricing is that way because they want to charge you for exactly what you use, not for reserving stuff.
For example, if you deploy an EC2 instance that comes out to $15/mo total, and you deploy it on, say, the 10th of the month, do you want to be charged the whole $15? No, you want it pro-rated. But what if you only need that instance for 6 days? Then what? Are you going to do the math yourself to figure out what it would have cost, or just read the per-hour billing?
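Roughly, in Python, with illustrative numbers only (730 hours/month is the usual approximation, not a quoted rate):

    # A "$15/mo" instance billed per hour.
    hours_per_month = 730                  # approximation of hours in a month
    hourly_rate = 15 / hours_per_month     # ~$0.0205/hour

    cost_6_days = hourly_rate * 6 * 24     # you only ran it for 6 days
    print(f"~${cost_6_days:.2f} instead of the full $15.00")   # ~$2.96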
> Most apps don't need scaling anyway and if you do
Man, exactly right. Many of the guys here would love crypto once they stop asking why and start asking how.
The most lucrative projects these days are almost entirely frontend UI; they don't even need their own backends, as they just read state from the nearest node when the client connects their wallet.
Some people forgot that the scalability game was to convert traffic into money with O(log n) overhead costs. So ditch that, and remember you are in the money game.
Deeply dissatisfying to read. Ex-Googler uses connections to get his (understandable!) cloud mistake refunded.
Every time I read one of these stories, I get more and more convinced I will just simply never use scalable cloud tech for my side projects. I'm not going to risk my family's retirement savings on the all-too-possible chance that a small deep-implication error will cause runaway charges.
Your assumption is incorrect. I haven't been in touch with anyone in Google, and used 0 internal connections. Happy to make another post with my conversation + documentation to support this.
I reached out to GCP through their regular channels. This is not a paid post, and we are not sponsored by Google in any way.
You might want to take another look at your paragraph:
> Having been a Googler for ~6.5 years and written dozens of project documents, incident reports, and what not, I knew how to put the case for Google team when they would come back to work in 2 days.
That certainly reads as an advantage that most non-Googlers would not have.
FWIW, I didn't intend "deeply dissatisfying" as criticism against your writing, although I phrased that poorly so I can understand it coming across that way. If anything I feel for you when that unfair surprise hit you. It just sucks that it's possible, and that the odds feel against us when we're seeking a refund.
Yeah, that would be a very good idea. The way it is stated gives that impression, and helping people understand how to resolve an issue like this would be priceless IMO.
You only wrote half the story and no technical details... why even write a post at all? It's just clickbait; you give no information about what went wrong, how it got fixed, or what the exact technical problem was...
Lol to me it looked like part 2 wasn't written yet, but I clicked it anyway just to check and the page loaded, so I read it. No real downside in getting a 404.
I'm absolutely astounded that cloud providers allow stuff to get this far and don't even go, "ope, this looks out of the ordinary, we should look into it." Nor do they offer the ability to straight up kill all services if they exceed a certain price set by the customer.
Cloud is still good though, I believe it's the future. I just don't believe in deceiving your customers to hopefully rack up a high bill with them.
I view not having the ability to say "shut everything down if I go over $100/mo" the same as the pre-checked hidden cross-sells that MasterCard/Visa cracked down on in the adult industry a few years ago. Just money grabs.
I will definitely be putting such measures in my cloud platform.
I'm with you. I have yet to use any of these cloud computing providers for building or testing anything, and it is partly due to this (and partly due to privacy and confidentiality considerations).
"Yay! NBD! Google is the best! All I had to do was work there for a few years, rub elbows, make connections and ask for favors!" Really, google would be the best if there was no way to accidentally go over your stated budget cap by 86 million percent, or at the very least have a policy to refund people who can demonstrate that this is what happened.
I think I'll treat this as the latest in a long line of warnings about not going all-in on these cloud services until you seriously know what you're doing.
So much of it is unnecessary to begin with. You can do so much with a cheap VPS or two without thinking about lambdas or cloud functions or Kubernetes or who knows what. But these days you'd be forgiven for thinking it's dark magic.
You're not going to run up a 5 digit bill in a day by starting up on a few $10 VPSs. And you'll probably have an architecture that fits in your head to boot.
Also: The article title should really be "Saved 72k and avoided bankruptcy by being an ex-Googler."
> Had we chosen max-instances to be “2”, our costs would’ve been 500 times less. $72,000 bill would’ve been: $144. Had we chosen concurrency of “1” request, we probably wouldn’t have even noticed the bill.
> If you count the number of pages in GCP documentation, it’s probably more than pages in few novels. Understanding Pricing, Usage, is not only time consuming, but requires a deep understanding of how Cloud services work. No wonder there are full time jobs for just this purpose!
Great write-up - thanks for sharing @bharatsb! As you say, cloud pricing has become too complex for developers to understand quickly (they want to ship features, not calculate costs). Infra-as-code is great, but it has made it even harder to understand which code/config option costs what. `terraform apply` is like a checkout screen without prices.
We're trying to solve this problem with infracost.io, initially looking at Terraform. It would be interesting to get your feedback on whether such an approach might have helped you? Probably not as it doesn't look like you were using Terraform?
(Cloud Run PM here)
I am sorry for the experience described in the blog post, we could definitely be better at bill management. I am glad that it worked out in the end and the customer was not required to pay for the bill.
Based on this experience, we decided to lower the default value of "max instances" to 100 for future deployments. We believe 100 is a better trade off between allowing customers to scale out and preventing big billing surprises. Of course, customers can always decrease it or increase it up to 1,000, or even above with a simple quota increase request.
Well, the real question for all cloud providers, for which I expect crickets as an answer, is:
Why don't cloud providers allow setting a budget which cannot be exceeded? A simple, 1-click way to say: this account should never go over $500 a month. Just stop creating resources or responding to requests if it does.
This is an outage waiting to happen for every customer:
- Early dev sets a limit.
- Product launches.
- Slowly grows.
- One day suddenly the entire business grinds to a halt. Globally. Across the carefully isolated shards. Everyone scrambles to figure out why! Tens of thousands of dollars are lost because of going $10 over a budget. End-users are lost. Trust is burned. If it's providing a critical system, maybe even people are hurt.
- Google then has to explain why they built in instant, global failure mode.
They can put it behind a clear warning, do stuff like AWS does for bucket deletion (the bucket has to be empty, you have to check a box and manually type the full name of the bucket).
There are ways to design this, they can send notifications at 60% of the threshold, 80%, 90%, 95%. They can give you a grace period, put up prominent warnings in the console and for command line tools, etc. There are ways to do it, it's far from an intractable problem.
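Google actually documents a pattern along these lines: a budget that publishes to a Pub/Sub topic, plus a small function that detaches billing once the threshold is crossed. A rough sketch (the project ID is a placeholder, the function needs billing-admin rights, and detaching billing takes the project down, which is the whole point):

    import base64
    import json

    from googleapiclient import discovery

    PROJECT_ID = "my-project-id"  # placeholder
    PROJECT_NAME = f"projects/{PROJECT_ID}"

    def stop_billing(event, context):
        """Pub/Sub-triggered Cloud Function: cut billing once the budget is exceeded."""
        data = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
        if data["costAmount"] <= data["budgetAmount"]:
            return  # still under budget, do nothing

        billing = discovery.build("cloudbilling", "v1", cache_discovery=False)
        # An empty billingAccountName detaches the project from its billing account,
        # which stops billable services.
        billing.projects().updateBillingInfo(
            name=PROJECT_NAME, body={"billingAccountName": ""}
        ).execute()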
I'm not saying that it can't happen but do you want to bet that a certain percentage of their business, for all cloud providers, is from carelessness and resources still running when they shouldn't or using more than they expected? Especially for bigger companies where it's easy to miss something. Just like gym subscriptions or other kinds of subscriptions where they're banking on you not noticing for a long time ;-)
1) But they supported this before on GAE. GAE had 'spending limits'.
2) Also if they are able to figure out when you've hit your daily free quota and cut you off almost immediately, how are they not able to figure this out?
If I recall correctly, GAE is an example of something they made specifically to be a cloud product. Products like Compute Engine, GCS, Bigtable, and Pub/Sub are things developed internally and then sold publicly once they realized others might find them useful. Perhaps the products developed first for internal use weren't developed with features like measuring billing usage in real time in mind.
> Based on this experience, we decided to lower the default value of "max instances" to 100 for future deployments. We believe 100 is a better trade off between allowing customers to scale out and preventing big billing surprises.
This is good to hear. I use Cloud Run a lot for personal projects and I always set concurrency to 80, max instances to 1, memory to 128Mi (unless it's something beefy that needs the memory), and CPU to 1. If I need to scale it up, or I decide to open it up to actual usage, I'll do it when I recognize the need.
I don't understand why developers use the cloud for bootstrapping or side projects. Digital Ocean is all you need: a $5 droplet + $15 Postgres, or even better a $7 dyno on Heroku.
If I knew something was going to take more than a couple dozen milliseconds to run, it was built on the DO droplet.
Why would I pay by the CPU second for something that is taking a lot of CPU seconds? That billing model doesn't make sense.
For my super quick REST endpoints, yeah, all on Firebase, the convenience of writing + deploying makes it an obvious win. (Unless something goes wrong, debugging Firebase functions is not fun...)
> To overcome the timeout limitation, I suggested using POST requests (with URL as data) to send jobs to an instance, and use multiple instances in parallel instead of using one instance serially. Because each instance in Cloud Run would only be scraping one page, it would never time out, process all pages in parallel (scale), and also be highly optimized because Cloud Run usage is accurate to milliseconds.
> If you look closely, the flow is missing few important pieces.
> Exponential Recursion without Break: The instances wouldn’t know when to break, as there was no break statement.
> The POST requests could be of the same URLs. If there’s a back link to the previous page, the Cloud Run service will be stuck in infinite recursion, but what’s worst is, that this recursion is multiplying exponentially (our max instances were set to 1000!)
Did you not consider how to stop this blowing up before implementing it? Having one cloud function trigger another like this, with no way to control how many functions run at the same time, no simple and quickly met termination condition, and uncapped billing, is playing with fire. It's not going to be optimal either, since most of the time each function is just waiting for the URL data to download.
You need to be using something like a work queue, or just keep life simple and keep it on a single server if you can.
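For comparison, a minimal single-machine version of "work queue plus visited set" (the library choices and page cap are mine, not from the article):

    from collections import deque
    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    MAX_PAGES = 1000  # hard stop: the "break" the original flow was missing

    def crawl(start_url):
        """Breadth-first crawl with a visited set and a page cap, on one machine."""
        seen = {start_url}
        queue = deque([start_url])
        while queue and len(seen) < MAX_PAGES:
            url = queue.popleft()
            try:
                html = requests.get(url, timeout=10).text
            except requests.RequestException:
                continue
            for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
                link = urljoin(url, a["href"])
                # Stay on the same site and never enqueue a URL twice -- this is what
                # prevents the back-link loops that multiplied instances in the article.
                if urlparse(link).netloc == urlparse(start_url).netloc and link not in seen:
                    seen.add(link)
                    queue.append(link)
        return seen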
We've all had a program crash from a stack overflow. The problem seems to be that instead of the "serverless panacea" they were promised, the code they built can now only run on one of many Google servers, none of which are theirs. No way to kick the tires at all.
It honestly reminds me of debugging a Jenkins pipeline. Something designed to be a super-generic runtime, yet the tooling can inexplicably only live on computers that are not your local development machine, and all of it is maximally painful to stub, test, or debug, seducing you into "just running it live".
It's like the opposite of the "small agile team" thing they were talking about. If your program requires 7 API keys and some cloud environment to do a test run, I want no part of it.
> We've all had a program crash from a stack overflow.
Launching a cloud function that recursively triggers the same cloud function, that doesn't have a simple safeguard for it looping or blowing up, and where billing scales with the number of cloud functions ticks the "very high risk" and "very high impact" boxes for me. A program running on a single server isn't similar here (you could accidentally create a DoS attack though).
Typical cloud function use is some event gets triggered like a user sign up, the function executes, then it halts. The above isn't a standard use case and is so incredibly risky this approach shouldn't be attempted in my opinion.
I'm just a student, but I've spent about 10 hours trying to figure out why Azure has been charging me >$5/day for their "Basic" database at 5 DTUs and 2 GB max storage. This morning I was so exasperated I sent a letter threatening to report them for fraud if nobody could tell me why I was being charged 30x the listed rate, which so far no one has. This is an extremely cathartic post to see that I'm not alone; thanks for sharing.
Could it be listed "hourly" and you're charged "daily"? Add in VAT (equal to 25% in some countries) and you match the 30 times higher than expected charge.
Basic tier, 5 DTUs, 2 GB is listed as ~$4.8971/month or $0.0068/hour on this page. Extra storage would cost more but is not available for the basic tier.
Do you have geo-replication turned on? More regions will be an additional $5/month (plus bandwidth between regions) if you replicate. You can serve everything out of a single region but it is pretty easy to add others if you're not paying attention during initial setup.
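For what it's worth, the hourly-vs-daily misreading alone lands in roughly the range the parent is seeing (numbers from the comments above):

    listed_hourly = 0.0068               # Basic tier, $/hour, as listed
    listed_per_day = listed_hourly * 24  # ~$0.16/day
    observed_per_day = 5.0               # what the parent reports being charged
    print(observed_per_day / listed_per_day)  # ~30x the listed rate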
> Google let go of our bill as a one time gesture!
We've seen this happen with similar stories on AWS. Neither platform supports prepayment with a hard limit on costs, and this seems unlikely to change.
Yeah, a friend of mine wanted a real cert, not Let's Encrypt (I don't understand how that is more real, but OK). Being a bit of a noob, he clicked around on the AWS website and some days later had a bill of 1500 EUR. They also nulled it. Still, this scares the hell out of me.
I can sympathise with some of these stories, like the ones where an overnight DDOS attack racks up a huge unexpected bill, but this one in particular is just a story of gross incompetence and negligence. The guy hacked together some code in a few days and deployed it to a service with unlimited billing without any kind of sanity checks and without even understanding what he was paying for. He’s an ex-Googler, it’s not like he hasn’t heard stories like this before. And the takeaway? “Oops don’t deploy buggy code” and “I shouldn’t have used the default settings”. OK, sure, let me know how that works out for you.
I'm not sure I want to know how much Azure and AWS revenue comes from people spinning up test VMs or a kubernetes cluster to work through a training, and then forgetting to turn it off.
I've spent thousands extra this year because people stood up 4 MB SQL databases and let them default to charging by vCores instead of DTUs.
Much less than the amount from deals with strategic partners. The long tail of $5 a month from forgotten VMs is likely orders of magnitude less than the handshake deals you can publicly read about.
This is most developers' worst nightmare when it comes to a completely new environment generally, and Cloud solutions specifically.
It's easy and pointless to say they should have done things differently. Worse than pointless. Obviously they should have, and kudos to them about being open about the compounded mistakes.
Still, this strikes at the fears that lie in the heart of any reasonable, honest developer doing something completely new.
New developers should be cautious about cloud platforms, but they were! Not cautious enough, obviously, but they did set limits they thought would be honored.
Platforms should have hard monetary limits at the account level, clearly, as well as an option to turn them off. Shame on all of them which don't.
Cloud Run PM here: I'm sorry for the bad experience the customer shared in this article, we could certainly do better with bill management.
We picked 1,000 as the default value for "maxScale"; this can be considered high for some users, but low for users who expect infinite scaling from the service and start with a load test to evaluate it.
> We pick 1,000 as a default value for "maxScale", this can be considered high for some users, but low for users who expect infinite scaling from the service and start with a load test to evaluate it.
That seems absurd to me.
I think it makes much more sense to put the onus on the sophisticated customer to increase their maxScale to an unusual value. Users who "expect infinite scaling...and start with a load test" are sophisticated users.
E.g. set maxScale low, like 2 or 4. The sophisticated customer would recognize their oversight quickly. Click-click, fixed, restart test.
Effectively 100% of less-sophisticated customers will not need enormous scale on day 1. Customers with whom you do not have an existing billing relationship in the 10s of thousands of dollars per cycle will almost certainly not want it.
I'd consider that level of overspecification to be a strong anti-pattern.
Given that this is a common problem, and one that can bankrupt individuals or their businesses, when is AWS going to implement spending caps that are easy to set up for new developers or business owners?
> Google let go of our bill as a one time gesture!
Thank goodness.
And it looks like it had to do with not understanding the API / system on the first order, IMO.
This hit me hard a few months ago with CloudFront invalidations on AWS. I checked billing and the thing was at 30 USD in a single day, against a norm of <1 USD per month, so it was showing something like a 13,000% increase (this is for documentation of open source projects). I wrote their support and was at their mercy, so to speak; technically I did run up the bill. I ended up paying, but I secretly hoped I'd get some AWS credits for the projects, heh.
Aside: Amazon has some nice features for rule-based alarms on accounts so when you spend more than X dollars, you get an email.
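Something like this, sketched with boto3 (the threshold and email address are placeholders; billing metrics only live in us-east-1, and "Receive Billing Alerts" has to be enabled on the account first). Note it only emails you; it doesn't stop anything:

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # billing metrics live here
    sns = boto3.client("sns", region_name="us-east-1")

    topic_arn = sns.create_topic(Name="billing-alerts")["TopicArn"]
    sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint="you@example.com")

    cloudwatch.put_metric_alarm(
        AlarmName="estimated-charges-over-100-usd",
        Namespace="AWS/Billing",
        MetricName="EstimatedCharges",
        Dimensions=[{"Name": "Currency", "Value": "USD"}],
        Statistic="Maximum",
        Period=6 * 60 * 60,          # the metric is only published a few times a day
        EvaluationPeriods=1,
        Threshold=100.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[topic_arn],
    )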
Some platforms do that:
- Heroku costs are pretty predictable, and you can easily set a maximum scalability threshold for their auto-scalable dynos, so that they will never cost you more than a predefined amount of money;
- BunnyCDN requires me to top-up their prepaid account, so that I'll never spend more than what I have on that account.
To put it into perspective: You give me $72K and I'll set you up a 1PB replicated storage infra with a total of 100+ available CPU cores and half a TB RAM.
I saw people burning through cash in the cloud, which makes you wonder whether money is any concern at all.
Learning/Administering AWS/GCP/Azure costs time and therefore money too. Maybe less money, maybe more money than doing things yourself, depending on what you're doing. But you shouldn't disregard such costs.
I've seen enough buddies spending enormous amounts of time doing AWS devops on top of paying the AWS premium when they could have gotten away easily with a less than a handful of VPS (+ optionally $100/month worth of cloudflare as a CDN).
Once an organization reaches a certain size it will need one, who ideally should be a person that can wear the dual hats of linux/bsd sysadmin and also network engineer.
If the person is already on payroll doing a number of other duties, the time/effort to set up such an environment as described in the post could be as short as a couple of days work.
I'm just trying to explain that server cost is more than just the hardware.
In most cases cloud computing is actually still a very cost effective solution to infrastructure. But with infinite scalability also comes responsibility.
In the case of the OP, had they had their own hardware they would have noticed that they had written bad code (it would have crashed or at least become very slow), but the cloud just scaled up and processed their code.
I'm not trying to defend Google in this case. Billing 72k when a 100 USD limit is set sounds like a scam.
Why full-time? It's possible to outsource IT administration to a local IT company and pay only for set-up and maintenance that is needed. Way less than 72k a year for many use cases.
Also, companies that employ a bunch of developers can find a developer who has IT administration expertise and allocate some of their time to this. Still cheaper than 72k a year or employing someone full-time, if IT requirements don't call for a full-time job.
Exactly. Or just buy a managed dedicated server - it's more expensive, but still it's a fraction of the full time sysadmin cost, and much cheaper than AWS.
No it's not. In France companies pay ~1.5-2x gross salary in total: the gross goes to the employee (with some of it deducted by the state), and the rest goes to health insurance, taxes, etc.
If you check the actual numbers here - https://entreprise.pole-emploi.fr/cout-salarie/ you'll see that for e.g. 30k€ gross yearly, a company shells out 41k, which is definitely not 1.5-2x total and still very far from the 75k mentioned above
And for half the use cases you still need one. Unless you go full SaaS (which may or may not be an option depending what you are doing, and what's on offer in that field), you still get stuff which needs to be administered, updated, patched, etc.
Maybe not the OS layer (or maybe even that), maybe not the DB (but then it might cost more), but you're not getting away from that.
You only really start saving at some scale (get a small core of cloud-literate admins, and now you can have them run thousands of systems for effectively no incremental cost)
Which is moot if you're just going to burn the 72k by shooting yourself in the foot.
And the way these cloud services go, the 72k was only detected because it was a one-off event. Turn that into a base-level inefficiency that costs the same over a year, and what do you have then?
I wonder what a full time cloud engineer costs. IMO it’s trading a simple system for a complex system, so now the maintainers cost even more than sysadmins used to.
Because electricity is free. And internet is also free. And the rooms to put the servers are also free. A/C is free. And backup generators are free. And diesel is free.
You can rent a full 42U rack in a colocation center for ~$1500/mo easily. They'll handle all of that stuff, including redundant power and redundant internet.
Of course self-hosting on real hardware is not quite as simple or cheap as GP made it out to be. But everything in your post can be solved with simple fixed pricing, which is still the main point: there are no dangers of wildly variable pricing or accidental massive bills as there are with cloud hosting providers.
only a half TB of RAM? somebody recently gave me a free 4U server with 256GB of RAM in it. for zero dollars.
If you need a number or xen or kvm VMs with a lot of RAM assigned to each one for testing something, you can fairly easily set up an older Dell R910 (quad-socket system) with 512GB of RAM for under $2000.
Well, when someone lets you off a 72k bill you generally think nicely of them. But consider that Google had no way of collecting it, that asking for it would have resulted in the loss of other business (this way they'll keep hosting there and keep paying), and that it isn't as if Google lost 72k, or as if 72k even matters to them. So it's just good PR, and good business to keep the money coming on the back end, and faster.
I have to wonder, even if they had tried to get the money, whether they would legally have been able to fight for it. From my experience with judges in Europe, they would most likely have looked at the budget being ignored, and at someone being upgraded from a free to a paid plan without consent, and told Google it was their own fault and the services weren't ordered or authorised.
I'm sure that was an important factor, but wouldn't this in any case just be a bill to a startup with no money on hand? It's hard to make companies (without money) economically responsible for anything I guess, it even seems hard to make companies with money responsible sometimes.
I can’t set a limit like that on my CC, so the first charge for $5k would have cleared meaning it would have run way longer and racked up way more usage. I’d bet you my computer I would have been out at least the $5k that cleared.
I normally receive only my "part" of my corporate credit card statement, but earlier this year I was sent more of it.
That's when I found out the card has a credit limit of over €50,000.
Reading this, I wonder if we should contact the bank and ask for another card with a lower limit, to use with various cloud services. We are 99% on-premises, but have about €200/month in various GCS/AWS usage.
Ridiculous... it's designed to charge you, upgrade you, and make you spend as much money as possible. And 24 hours after it happened they show it to you in the dashboard.
Great.
Just use dedicated servers for the start! It will hurt way less and you can easily upgrade to the cloud later, IF necessary.
Every time I see another post like this I always wonder how many people would be willing to buy "cloud insurance" where a premium would cover overages due to your mistakes in dev, outages, etc.
I don't have an exact billing model worked out, and I assume the insurance provider would mandate certain practices (e.g. setting up billing alerts that go to them, allowing them to view/manage infra), but curious if people here would be willing to pay for such a thing.
(My assumption is that most people who fall into this are too small to be willing to pay a reasonable % of their infra spend, or to change their infra practices to prevent this, but I'd be curious whether CIOs of companies who are thinking of moving to the cloud but are leery due to cost concerns would pay for it.)
You might be able to get an insurance company to sign on to that sort of thing; have you shopped for quotes?
They'd probably want to look at a representative sample of "cloud overrun" events and see how much they cost, how likely they are, what reasonable measures could help prevent them, etc. But they aren't strangers to taking unusual bets.
Apparently a lot of big moonshot / X-prize type competitions fund their prize money through insurance. You present your research stating why the problem is hard, the insurance company works out how likely it is for someone to manage the challenge, and they come up with a quote for footing the bill in the unlikely event that someone wins.
You can ask for a good faith billing adjustment. GCE or AWS is well aware that things happen, and collecting something is better than collecting nothing.
"After going through our lengthy doc on this incident sharing our side of the story, various consults, talks, and internal discussions Google let go of our bill as a one time gesture!"
For anyone who wonders, the summary of the mistake: created a web crawler without any stop condition and without adding any checks for not visiting the same page twice. The crawler just kept running for a day, making around 9 million requests per minute.
I don't understand why people keep doing this to themselves. OP is an ex-Googler and we can safely assume he knows his way around tech. Why would you, as a startup, go down the route of using the convoluted mess of tech that every cloud provider is, instead of buying a cheap dedicated server, setting up your own k8s or whatever, and just using that platform for your project?
Cloud solutions are immensely and, in most cases, unnecessarily complicated. On top of that, they have the ability to kill any dream of yours with an invoice far exceeding your darkest forecasts in less than an hour.
The time OP spent solving this issue could be better spent learning about self-hosting and avoiding this trouble all together.
I know it's not cool these days, but I strongly prefer (and advise) fully-managed cloud services like Heroku. I can fix my database size and scale/resources (dynos) easily. It's simple, and controlled.
Or just test on a good old VM, which can be had for just a few cents per hour and doesn't even allow for storage or network traffic going out of hand.
The first mistake is to deploy tests on completely opaque hyper-scalers. Pretty much any software infrastructure - from (SQL/No-SQL/In-Memory/etc.-) databases to entire web-frameworks can be found as ready-to-go VM images and containers these days.
Sure, it's a bit more work to find and set up, but in the end you gain an understanding of what the system is actually doing and how it might behave, plus the ability to deploy into any environment, from local workstations to bare metal to (a fleet of) VMs to high-level hyper-scaler services.
If there's ever been a case against using GCP or AWS this is it. You better understand those systems or you will be in for the shock of your life when you get hit with this sort of crazy billing. I got $3,000 AWS credit and was terrified of running experiments in case I made a mistake. It should not be hard to set hard billing limits, these companies know how to set hard CPU limits, why can't they do that with billing?
This is really scary. It's so unpredictable what one actually has to pay; especially for a small business, moving to the cloud is much more challenging than it should be.
When creating resources it's really unclear what one might be charged, then there are saving plans and pre-commitment options and so forth.
Might be a good startup idea, basically just sell cloud resources via a simple, predictable payment model.
Something very similar happened to me. I was using Cloud Run to fetch some subreddit posts and ended up with a "recursive" setup; because of that, billions of invocations were made... Luckily, I was in front of the computer and stopped it early, but the bill was around $4,000! I contacted Google support and explained everything, and they "forgot" my debt because of the bug.
So basically no one's testing their code anymore and just throws it into a paid service?
Great that Google was so lenient and all, but I really don't get the appeal of using a hyper-scaler when a VM with docker support can be setup in literally seconds and on-demand pricing of less than 10 cents/hour for most quick-and-dirty tasks.
As a kind of reverse of this, I once had a Google App Engine site with a reasonable $300/day limit on it. We ended up getting a bunch of international news and usage spiked massively (hooray!) and I tried to update the limit manually only to find out I had to wait 24 hours before the new limit applied.
Meh. I see this all the time with developers who want to abstract everything away and not worry about the impact their poorly performing code is having on the infrastructure or on the $$bottom line. Time out? No problem, we will just spin up more instances. I have heard that so many times. Maybe your code is just bad.
If this happens you can usually reach out to Google to see if they will refund the charge. They don't really benefit from making $72k off a solo developer's buggy code. I've done it once and their team was very helpful and reversed the charge.
The algorithm this ex-Googler came up with is hilarious; hard to believe this guy came out of Google. I thought they were big on algorithms in their interview process? Maybe I should consider putting in an application. If an applicant I was interviewing put down the marker after producing that in a whiteboard session, they would have quite a bit of work to do to get to the next interview phase. Not only did the ex-Googler not catch those mistakes, his team of 7 actually sat down and coded it without spotting the problem.
And another day that I'm again totally surprised by how some people do "engineering". Basic problems like unbounded recursion don't occur to them, they use a technology they don't understand, they are not careful with the number of instances, and then they burn ~16,000 hours (close to 2 years) of CPU time.
Why not test the algorithm locally, realize the problems and fix them? Why test on 1000 instances, not 1 or 2?
I get the "move fast, fail fast" attitude and why it is deemed beneficial by some, but this was essentially "goto fail".
writing as somebody who runs a big collection of bare-metal hypervisors for ISP infrastructure purposes... this post quite honestly just makes me smirk.
I have truly lost track of the numerous instances, and number of people who would be better served by buying a $1200 test/development 1U dual socket server with a few fast SSDs in it, and putting it in colocation somewhere for a few hundred dollars a month. The costs would be absolutely fixed and known.
On a tight budget? You don't even need to go as far as $1200, I see totally fine test/development environment suitable, Dell 1U servers on eBay right now for under $500 with 128GB of RAM.
Or that would be better off purchasing a fixed-configuration virtual machine (typically running on xen or kvm underneath) that has a certain specific amount of CPU, RAM and storage resources allocated to it which cannot balloon. For a fixed bill per month like $65 or $85.
You want to deploy your weird app on some cloud platform? sure, go for it, once you've got the possible scaling-up cost issues and possible bugs worked out on your own platform.
Please don't be a jerk on HN, especially in response to someone else's misfortune, even if they brought it on themselves. Maybe you don't need to treat these people better (though why not?) but you owe the community better if you're posting here. If you wouldn't mind reviewing the site guidelines and taking the intended spirit to heart, we'd be grateful. Note these ones: "Be kind" and "Please don't sneer"
p.s. I skimmed through your recent commenting history and it looks great—just the kind of thing we want here. Sharing some of what you know is exactly what we want users to do. But please don't be supercilious about it, as in this comment and https://news.ycombinator.com/item?id=25372847. Ignorance doesn't deserve humiliation, and that ingredient poisons the ecosystem (and eventually starts a degrading spiral, e.g. https://news.ycombinator.com/item?id=25373520). The rest is good.
Thanks for the feedback. I almost certainly shouldn't have included the part about the smirk, and I can definitely see how that could appear to be making fun of somebody else's misfortune. And the rest of it could have been phrased in a more diplomatic way.
For what it's worth it wasn't intended personally at the person who almost incurred the $72k bill, but more at the general concept of test/beta software gone rampant and out of control in an environment where billing has no limits. I think we've all tested some sort of software in development environments that caused havoc - but up until very recently it's been hard for that to immediately begin causing real world financial consequences...
> Maybe you don't need to treat these people better (though why not?)
IMHO the best argument for 'why not' would be that it's generally unethical to deploy software without first taking the time to read the manual and understand how your dependencies work. In this case the system wasn't live and the costs of this fuckup were solely externalized onto Google, which is fine because it was in large part their fault anyway. But when dealing with production deployments, this same behavior often results in users having all their private information leaked or deleted.
I think cautionary tales are important, but it's also possible, as I likely did above, to come down on people too harshly. Not everything has consequences as severe as a Therac-25.
The crazy part to me is using the cloud for testing. It’s crazy. I have a 5 year old dual CPU Xeon with 128GB of RAM and a couple NVME disks that I’ve spent about $1000 CAD total to build ($700 USD). Something in that range on Azure is about $1 / hour if you reserve a year. ~$9000 per year.
All the people running workloads that don’t require the redundancy given, like CI, blow my mind. The costs are astronomical vs buying a cheap or used server. Sure, use the cloud for you production builds, but why not augment it with something that doesn’t cost as much?
Until the server breaks and you have to drive over in the middle of the night and try to replace it but the only available server right now is a shitty one and oh shit only half the backups work cause the onsite backups are fried too etc etc etc.
There's many good arguments against high-level BaaS such as Firebase but I'm not sure that "colo is cheaper" is one.
It absolutely is (cheaper, and a good argument). As an example: we're in the process of switching a project from Digital Ocean to Hetzner, which will increase infrastructure performance (roughly memory/CPU/storage) by 4x and decrease costs by 4x. And no driving to the colo center is necessary, since it's their dedicated server, so their on-site engineers do the hardware replacement.
Also, if you are not okay with your site being down for a few hours, you can always buy two, as you would with a sensible cloud setup. It'd still come out way cheaper (and you get more perf if you can do load balancing for your usage).
Also, I don't look at it from "colo is cheaper" point of view. To me, it's "I can have several times more performance and hire a full time sysadmin to worry about it, for the same price".
It's anecdotal, but I'm convinced I'm not alone here: we've had more Amazon-related failures/outages in 3 years with AWS than we had in 4 years of colo before heading to the cloud because of the exact fear you described.
Even a cloud setup needs good management and contingency planning, and in absence of such it can fail just as hard as a colo setup.
That implies you are going to be running prod in the cloud. Unless you're developing against purely synthetic data, the data transfer costs are potentially astronomical.
Just because you use "the cloud" doesn't mean you don't need backups. "The cloud" also has downtime and other failures. When deploying to the cloud, you also have to factor in the cost of moving to another provider if/when that becomes necessary.
Dell 1U servers on eBay right now for under $500 with 128GB of RAM
In part 2 the author says "Had we chosen max-instances to be “2”, our costs would’ve been 500 times less. $72,000 bill would’ve been: $144". In other words, that $500 server is several times more expensive than it would have been if Firebase and GCP had saner defaults.
That $144 would have been for a single two-day test.
Anyway, getting caught up in specific remediations that could have prevented this is beside the point. For development you want a safe testing environment because mistakes, gaps, misunderstandings, bugs are a fundamental part of it. The entire point of tests and testing environments is to discover the problems you know exist but need to test to find.
From my point of view after doing this for 20 years, it's like seeing the past 12 years of the "put everything in the cloud" era, of new different people repeating exactly the same mistake over and over again.
It's like if you lived near a public park with particularly aggressive geese that return every year, and watched new ignorant groups of people get chased by the geese every spring.
It's not callous - it's the perspective of the people who are responsible for the hypervisors that run underneath the VMs and services that cause some of these massive billing outrages.
I quite agree with everything you've said in this and your other post.
My development environment? : My own dual-booting Windows/Linux PC with 32G RAM and a few TB of SSD. Not to mention the Nvidia RTX graphics card for gaming...
I either spin up a VM to test stuff, or spin up a Python virtualenv. PostgreSQL is also running on this machine. Whatever's needed. Need to emulate Stuff Happening From Different Servers? Just spin up another few VMs, assign them the minimum resources required to do what they need to do, set up your VM network, and so on. Any decently specced desktop machine can do that, never mind a noisy rack system, considering today's machines are way better and vastly more powerful than the PCs we had 2-5 years before, which themselves were vastly more powerful than the ones before them, and so on...
Result? Can develop at home to my heart's content, then when it comes to deployment spin up a remote VM on e.g. DigitalOcean and take it from there.
At the end of the day, "sErVeRlEsS" (I just don't like that term, for some reason it rubs me up the wrong way, perhaps because of...) just means "running stuff on someone else's kit" - the same as "tHe ClOuD", so if I'm going to be developing some system & software, I'd rather be doing it locally, setting up whatever's needed to get it running, and once satisfied, deploying it.
Like you, I see either the same people, or new people, simply Not Learning From The Past. There are many good reasons why things were done like they were - developing on a system you own, for example, rather than spinning up all sorts of Cloudy Things or "serverlessy things" right from the start.
Hardware is cheap - you don't need a supercomputer to run the beginnings of your latest Supah Scalable System[tm], you just develop and run it on a reasonably up to date box, and, sure, when you get to the stage where you need more space/bandwidth/whatever, that's the point where you deploy to some Cloudy Thing or SeRvErLeSs Thing.
My personal home office development environment at the moment, done on an ultra low budget, is a dell precision t5600 mid tower workstation PC (dual xeon, e5-2630) that I got for $350 with 64GB of RAM in it, upgraded it to 128GB, and put a $150 Samsung SATA3 SSD in. It's small and relatively quiet and sits under my desk tucked in a back corner with just a power cable and a few ethernet cables plugged into it.
Maybe some time in the near future I'll add a 2TB HDD that I have sitting around into it so that I can create VMs that have a 'fast' boot/root disk, and also give them some lvm partitions on a big slow disk.
It's running debian stable amd64 and is set up as a xen dom0 hypervisor, with 768MB of RAM assigned to the dom0 and the rest available for VMs.
The amount of capacity that's available there to create random PV or HVM VMs with as much RAM as I could want, is more than sufficient for my personal needs. If I need anything bigger I'll make it a more formal process and put it on a machine at work.
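To give a flavour of how little ceremony creating one of those VMs involves, here's a sketch of booting a PV guest with the xl toolstack - it assumes an LVM volume group vg0 and a bridge xenbr0 already exist on the dom0, so treat the disk/vif lines as placeholders for whatever your setup actually has:

    import subprocess
    import textwrap

    def create_pv_guest(name: str, memory_mb: int = 2048, vcpus: int = 2) -> None:
        """Write a minimal Xen PV guest config and boot it with the xl toolstack."""
        cfg = textwrap.dedent(f"""\
            name = "{name}"
            memory = {memory_mb}
            vcpus = {vcpus}
            bootloader = "pygrub"
            disk = ['phy:/dev/vg0/{name},xvda,w']
            vif = ['bridge=xenbr0']
        """)
        path = f"/etc/xen/{name}.cfg"
        with open(path, "w") as f:
            f.write(cfg)
        subprocess.run(["xl", "create", path], check=True)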
FaaS is what they call serverless, I guess. Anyway, it seems like a step backward to me - like going from FastCGI back to CGI - and somehow they market it as "progress".
At least they could be using OpenFaaS or something, or free-software alternatives to Firebase such as kuzzle.io or Mozilla Kinto.
> It's like if you lived near a public park with particularly aggressive geese that return every year, and watched new ignorant groups of people get chased by the geese every spring.
You're not really helping your argument here. Particularly if people have been attacked for over a decade and no one has put up an "aggressive geese" sign.
Quite literally, in my specific area the former is true, and the city government has in fact put up a number of signs around the nesting area. It still happens.
First, that $1200 server costs fixed money upfront, and then you pay per month for colo and for internet, which usually comes with a bandwidth cap or limits, with bursts you pay extra for. So no, it's not fixed.
Second, a server you have to maintain, hardware- and software-wise, is much more complex and takes much more time than a managed service. You want a database? Install it yourself, maintain it yourself, back it up yourself, monitor it yourself. And the same with everything else.
Third, there's zero redundancy in your "setup". If you want it with the most basic redundancy, you triple the costs (second server, extra networking equipment, etc.).
Fourth, geo redundancy/distributedness? Please. Good luck if you have someone far away who wants to visit your site.
Fifth, let's say you need to scale. Like, you get 10 more users today than you did yesterday, or you get featured on HN or Reddit or local news or whatever. F. You're looking at months and a lot of cash, upfront.
"A big collection of bare-metal hypervisors" makes sense in some cases, but don't pretend it doesn't come with a non-negligible time spent maintaning it and requires significant upfront capital and man hours to do the same you get easily on a public cloud platform (databases, message brokers, object storage, etc. etc . etc. etc. etc. etc.).
yes, I am serious, because as described in the original post this was somebody's test/prototype environment. Which is the ideal use case for a DIY scenario, until you're ready to send things into production.
I have seen people spend thousands of dollars on a cloud hosting platform to develop and test something when it could have been done equally well on a 4-year-old desktop PC sitting on somebody's desk. If they had only thought to bother installing the same (debian, centos, whatever) environment + packages + custom configuration on it.
But one big benefit of cloud providers is that you can spin up those servers for testing and when it doesn't work out as expected, you can just magically make them go away. If you're putting out the capital to buy servers in a rack, you have to use them all the time for the cost benefit to work out. A regular test environment that is used all the time? Yes, that could possibly be cheaper if you purchase the hardware but you also need to amortize the cost of purchasing new servers every 3 to 5 years to get the equivalent of the equipment provided by the cloud provider.
> writing as somebody who runs a big collection of bare-metal hypervisors for ISP infrastructure purposes.
I run a cloud SaaS company (3 employees). If I had the skillset that it sounds like you do, I might be inclined to host on bare metal. But I don't. I don't know what a 1U dual socket server is.
It would take me some time to build these skills, and to match the agility that the cloud offers. I don't think it's worth my time, and probably not the author's time, either.
absolutely an understandable concern. One way of abstracting away the need to own or maintain physical servers while still achieving a definitive fixed monthly cost is to do as this other commenter has done, renting dedicated servers from a company that specializes in such:
And this comment makes you look incredibly naive and narrow-minded.
Running some code on a CPU != running a startup. Great you can buy a Dell server on eBay, or you can build a powerful desktop, or rent a VM or get a droplet or scrape on lowendbox. These are not a secret and there is a great reason no one does this other than hobbyists and neckbeards.
You do your testing and it works, then what? You have to deliver scalable reliable systems in production that require identity management, security, backups, resiliency, reliability, various networking services and a million other supporting services and all the systems that come with it. Never mind actually scaling the application, monitoring it, and all the tools, systems and processes needed to run reliable systems in production.
The eBay servers provide you exactly zero of that and you've just wasted time setting up an environment that is a snowflake and doesn't represent reality. Testing on the cloud on exactly the same platform you would use for production has a lot of benefits when you look at value as limited developer time delivering value to customers and the business.
Whilst the $1200 server on eBay might be cheap today, you are entirely missing the hidden cost of lost time when your team of developers, costing $M/year, is wasting time testing in an environment that doesn't help them find and solve production issues. You don't need many hours of wasted time or downtime to lose all of your so-called cost gains.
Optimising for absolute minimum cost is a fool's errand that only slows down delivering production systems that provide value to your customers.
Please spend some time thinking bigger about the opportunity cost and value delivery of technology beyond the immediate dollars and cents - it might surprise you.
Really, "a million other supporting services and all the systems that come with it" ?
You get a server with SSH, then you need something to expose your container stacks over HTTPS like Traefik (which auto-configures), and something for alerting such as Netdata (which auto-configures too!), both of which are just a single binary to configure and set up, and it probably won't take long before you have scripts to automate that like we do[0].
Not only do you get amazing prices[1] but ...
You get to be part of an amazing community running the world on Free Software
But yeah, maybe I'm "missing the bigger picture" by "not locking myself in proprietary frameworks"
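For what it's worth, the "scripts to automate that" part really can be tiny. A minimal sketch using the Docker SDK for Python - the Traefik tag, container name and flags here are illustrative, not a drop-in for the linked setup:

    import docker  # Docker SDK for Python: pip install docker

    def bootstrap_edge() -> None:
        """Start a Traefik container that discovers other containers via the Docker socket."""
        client = docker.from_env()
        client.containers.run(
            "traefik:v2.10",
            name="edge",
            detach=True,
            restart_policy={"Name": "unless-stopped"},
            command=[
                "--providers.docker=true",        # watch the local Docker daemon for services
                "--entrypoints.web.address=:80",  # plain HTTP entrypoint; add :443 + ACME for TLS
            ],
            ports={"80/tcp": 80},
            volumes={"/var/run/docker.sock": {"bind": "/var/run/docker.sock", "mode": "ro"}},
        )

Point it at your containers via labels and you have the HTTPS front door; Netdata is a similar one-liner.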
I'm a huge fan of open source and open standards, in fact I always push for and expect portability and avoid proprietary systems where possible. Abstractions like Kubernetes are a fantastic middle ground to provide portability across platforms whilst taking advantage of cloud provider services where they exist. The same for apps and frameworks built on open standards like Kubeflow and Apache Beam.
The supporting services and systems come when you run services that require strong guarantees for reliability and resiliency, and meeting the needs of different lines of business.
If I think of a mid-size company that wants to run these kind of workloads and demands minimal downtime, resilience against local disaster and minimal data loss:
- Customer facing applications in a reasonably scalable manner to meet peaks and troughs of demand without needing to size for peak demand
- CRM/ERP systems to manage customer data, payments, sales processes and inventory that CANNOT have corrupted or lost data
- Data platforms for running reasonable scale analytics and analysis on reasonable size volumes of data (say a few Petabytes online accessible, analyse 10s of Terabytes per query)
- Capability for mid-level machine learning and access to modern acceleration hardware, up to date GPUs and maybe some NVIDIA Ampere type equipment
- Tools and platforms for operations and security that can capture, store and analyse all the logs produced by all those systems, plus some half decent cyber services - network level netflow analysis, maybe IDS if you are feeling fancy, endpoint scanning and analysis, threat intelligence capabilities to correlate against all of that data
- Tooling and platforms for developers - source control, artifact repositories, container registries, CI platforms like Jenkins ideally with automated security scanning integrated, CD for deployments like Spinnaker to canary and deploy your releases safely
- Networking for all of that equipment, ideally private backbones, leased Ethernet or MPLS - and all of that needs to be resilient, redundant and duplicated
- Storage for all of the above that meets performance and cost needs, replicated, and backed up offline
Yes, you can do all of that yourself! But let's be clear, buying a server on eBay is not even a fraction of 1% of the reality of running real infrastructure for real systems for real businesses. There ARE reasons why you might do this but that is increasingly the exception due to either extreme scale, regulatory and privacy requirements typically from data sovereignty or very unique hardware requirements.
I'm not talking about /buying/ a server, but renting one as a service.
99.9% uptime is plenty enough for 99.9% of projects and that's easy to achieve with one server, k8s is not necessary here. You're not concerned with MPLS or whatnot when you rent a server.
I can tell because I'm actually running governmental websites on this kind of server, with over a thousand admins managing thousands of user requests. I've been deploying my code on servers like that for the last 15 years and it has been great really; I've also got fintech/legaltech projects in production and much more.
I guess the project you're describing falls more in the 0.1% of projects than 99.9%.
The issue for me is that 99.9% uptime isn't a useful or meaningful metric. End users only care about the experience, and if the application isn't performant, reliable and durable it doesn't matter if the lights are flashing - they will tell you that it's not working as intended. And when you rely on SLAs from third party providers the liability is not equally shared; they might credit you some % of your bill if it's offline, but your reputational impact and opportunity cost is likely orders of magnitude greater. You also can't control how that 99.9% will happen, and more often than not it's going to happen at the worst possible time (payroll dates, reports due, the board needs statistics, etc.)
Mitigating these failures will always lead you down the path of replication, load balancing, high availability or at the very least frequent backups and restore strategies. And all of that is going to need to be done across multiple physical locations because I am never going to stake my reputation on a single physical site not losing power, connectivity or cooling. Now you are in the realms of worrying about network reliability, bandwidth availability for those replication and backup services in a way that doesn't impact user applications. And monitoring all of that, managing failures etc. etc.
As someone who helps organisations with their IT strategy and overall budget allocation process the focus is always on delivering reliable applications to customers and business users. Using a cloud provider helps us to ignore all the complexity behind the scenes that require significant investment in people and resources to manage once you hit a non-trivial scale. Paying a premium to do that is absolutely worthwhile compared to the downside of it going wrong, and the opportunity cost of wasting time on minute details that do not add value.
[Edit] And for context, I DID buy servers on eBay for testing and development, and then migrated to bare metal colo, and all the while thought I was winning and it was cheaper. But over the years I've experienced enough issues and worked with enough companies to understand this was a false economy; I now see the errors of my decisions and try to help others avoid them.
Have you seen link[0] in my comment? Automated backups are of course a big part of the plan, but replication is not an alternative to backups in my book anyway.
I'm not talking about buying servers and colo, but about renting servers as a service[1], where you get the benefits of dedicated hardware without the inconvenience.
The added security that we get by "not sharing our hardware"[2] also deserves to be mentioned here.
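For anyone wondering what "automated backups" plus an offsite copy can look like in practice, a minimal sketch - assuming Postgres, pg_dump on the PATH, and credentials for some S3-compatible bucket via boto3; paths and names are placeholders:

    import datetime
    import subprocess

    import boto3  # any S3-compatible offsite bucket works

    def backup_postgres(db_name: str, bucket: str) -> None:
        """Dump a Postgres database, compress it, and copy it offsite."""
        stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
        dump_path = f"/var/backups/{db_name}-{stamp}.sql.gz"

        # pg_dump | gzip > dump_path; relies on local auth (e.g. a .pgpass file)
        with open(dump_path, "wb") as out:
            dump = subprocess.Popen(["pg_dump", db_name], stdout=subprocess.PIPE)
            subprocess.run(["gzip", "-c"], stdin=dump.stdout, stdout=out, check=True)
            if dump.wait() != 0:
                raise RuntimeError("pg_dump failed")

        # the offsite copy is what turns "a backup" into an actual disaster-recovery plan
        boto3.client("s3").upload_file(dump_path, bucket, f"{db_name}/{stamp}.sql.gz")

Stick that in a cron job and the "single physical site" objection mostly goes away.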
Yes I read your blog and it seems like you've written plenty of shell scripts and utils to try and abstract and automate away infrastructure, but it feels a lot like you are trying to re-invent the wheel when all of this and much more is available in major cloud providers as a service.
One item that stands out for me is that your backup is a couple of shell scripts and you mention that you would dump your database to a different RAID array. That means you are now on the hook to procure/rent, manage, update and monitor that RAID array. And you even call out that you don't include offsite backups so you are at risk of total loss because you are using a single physical site for your prod data AND backup data.
You mentioned above that you are "not locking myself in proprietary frameworks" - but in the process you have built custom one off scripted systems that are bespoke. If you leave or your consulting engagement ends it will be very hard for someone to take over and manage and maintain your systems - because your design, configuration and implementation is effectively lock in to YOU as a person and your consulting company.
Personally I would rather trust a cloud provider to offer something like backup as a service where they can handle geographic replication, snapshots, restores for me as a service and deal with all the hardware, disk replacements, hardware monitoring and network fun that comes with it. The human cost of moving to another cloud provider is not that large and I can easily hire a person or consultancy that has knowledge of Cloud Provider A and Cloud Provider B to make that transition because their services and systems are well documented, conform to a contract and there are training and certifications for how they work.
I still hold my opinion that taking advantage of services offered by cloud providers is value add in the context of running a business.
Also I would much rather trust a cloud provider with a big team of security experts to run my infrastructure than a random company renting me some servers. If you are getting them as a service then there is still a shared admin control plane, likely management type networks and infrastructure around it that are managed for you by a third party. Trusting their team, processes and security capabilities is a very high bar to meet.
Would you please stop posting unsubstantive comments to HN and stop breaking the site guidelines? You've been doing it a lot and we ban that sort of account. I don't want to ban you because your good comments are good, but the bad comments are like mercury: they build up in the system and poison things.
The rules apply regardless of how bad or wrong another comment is, or you feel it is.
It seems like you could easily build a fintech app that gives you a virtual pre-paid credit card to use with these services and only gets funded up to your budget amount. That might be a safer way to work with services like this. You could have a separate card wherever you expect a problem might occur. That would also keep one runaway service from taking down everything else you still need to pay for.
Card and billing management? Probably a legit need tbh
interesting - how did the spend break down between Cloud Run and Firebase?
did you have any limit on how many req/s you made to an individual site? It seems this would be difficult to implement with this architecture.
how did you deal with following links in circles / avoiding scraping the same page multiple times?
I had built something similar at a previous job, recursively scraping ecommerce sites. The first thing I noticed was that some of the sites we were scraping couldn't handle more than a couple of requests a second (in particular as we scraped uncached pages on sites running PHP). Other sites were quick to IP-ban.
I kept things simple, a few dozen micro instances on aws (think they were like $3 a day) running puppeteer. A single server acting as a controller, keeping a per site queue and allowing us to set per site request limits if necessary. All the state of which links were already seen just kept in memory. Of course everything was also persisted to a db, and if the controller process needed to be restarted, it could restore the queue/ seen state and resume.
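In case it helps anyone picture it, a stripped-down sketch of that controller idea - per-site queues, a shared seen-set and a per-host delay. The class name and the 1 request/second default are just illustrative, and the persist/restore part is omitted:

    import time
    from collections import deque
    from urllib.parse import urlsplit

    class CrawlController:
        """Per-site FIFO queues, a shared seen-set, and a minimum delay per host."""

        def __init__(self, min_delay_per_host: float = 1.0):
            self.queues = {}      # host -> deque of URLs waiting to be fetched
            self.last_hit = {}    # host -> monotonic timestamp of the last request
            self.seen = set()     # every URL ever enqueued (avoids circular crawls)
            self.min_delay = min_delay_per_host

        def enqueue(self, url: str) -> None:
            if url in self.seen:
                return
            self.seen.add(url)
            host = urlsplit(url).netloc
            self.queues.setdefault(host, deque()).append(url)

        def next_url(self):
            """Return a URL whose host hasn't been hit within min_delay, else None."""
            now = time.monotonic()
            for host, queue in self.queues.items():
                if queue and now - self.last_hit.get(host, 0.0) >= self.min_delay:
                    self.last_hit[host] = now
                    return queue.popleft()
            return None

Workers just call next_url() in a loop and enqueue() whatever links they discover; the seen-set handles the "links in circles" problem and the per-host delay keeps you from hammering the slow PHP sites.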
> I jumped out of the bed, logged into Google Cloud Billing, and saw a bill for ~$5,000. Super stressed, and not sure what happened, I clicked around, trying to figure out what was happening. I also started thinking of what may have happened, and how we could “possibly” pay the $5K bill.
> The problem was, every minute the bill kept going up.
> After 5 minutes, the bill read $15,000, in 20 mins, it said $25,000. I wasn’t sure where it would stop. Perhaps it won’t stop?
> After two hours, it settled at a little short of $72,000.
> By this time, my team and I were on a call, I was in a state of complete shock and had absolutely no clue about what we would do next. We disabled billing, closed all services.
1) Why wouldn't you shut off the service as soon as you saw the $5000 bill? Really doesn't sound like a "hop on a call with the team for a few hours" kind of decision.
2) Why was the person taking a nap the only person who could get a usage limit alert? One of the great benefits of a team is that you can have multiple eyes looking out for problems. Someone could have raised a flag as soon as the first unexpected alert came in.
3) If going over the free tier limit was your chief concern, why not check the usage after a quick run before letting it go overnight and unsupervised?
That the problem could get this bad is a UX failure, but the problem itself is easily foreseeable and avoidable.
Thank you for sharing. I was actually thinking of using Firebase for my project. They make it so easy to sign up for the free tier. Waiting to see what happens in part 2.
I recently put the (soft) kibosh on a project in my stable trying to switch to FireBase at the last minute.
It looks attractive but the business aspects are frankly frightening, and I'm not even talking about the risk of a large bill. Getting your metrics 24h late sounds like a deal killer for me. So much for observability!
Minor nit: many non-billing metrics are near-real time, e.g. DB concurrents, cloud functions CPU/RAM usage; any metrics that require aggregation (storage, billing) are going to be batched less frequently. This is going to be true across all platforms of non-trivial scale (eventual consistency + batch jobs).
Second note: the number of people who actually do this is very low (a few a year, of hundreds of thousands of developers). The blog posts are scary, but in my ~five years at Firebase, I'm pretty sure we refunded every one. As my boss (James Tamplin, CEO of Firebase) used to say, "There are lots of bad systems, but rarely are there bad people."
This drives me mad. There needs to be a max bill per month setting. If you're near the limit, bells and whistles go off. If you exceed it, new things aren't spun up: GCP refuses to write to storage or run compute. Or they absorb the cost.
If SpaceX can send up a massive Starship and bellyflop it, surely a company with 40,000 engineers can figure this out.
This just says Google won't focus on a thing that gives users a shitty experience but helps Google make money.
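To be fair, GCP does document a self-serve kill switch of sorts: wire a Budget's Pub/Sub notifications to a Cloud Function that detaches billing from the project. A rough sketch of that pattern - the project name and topic wiring are placeholders, the budget data itself can lag, and detaching billing can stop or even delete resources, so it's a blunt instrument rather than the hard cap people are asking for:

    import base64
    import json

    from googleapiclient import discovery  # pip install google-api-python-client

    PROJECT_NAME = "projects/my-project-id"  # placeholder

    def stop_billing(event, context):
        """Pub/Sub-triggered function: detach billing once actual cost exceeds the budget."""
        msg = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
        if msg["costAmount"] <= msg["budgetAmount"]:
            return  # still under budget, do nothing

        billing = discovery.build("cloudbilling", "v1", cache_discovery=False)
        billing.projects().updateBillingInfo(
            name=PROJECT_NAME,
            body={"billingAccountName": ""},  # empty string detaches the billing account
        ).execute()

The fact that this is something you have to build and babysit yourself, rather than a checkbox, is exactly the complaint.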
At the end of page 2, there is a good ass licking bullshit sentence:
<< It’s also a great company to collaborate with. The tools provided by Google are very developer friendly, have a great documentation (for the most part), and are consistently expanding.>>
He said that as an ex-googler and as the beneficiary of a gesture, but it contradicts the whole story he just told us.
If the docs and tools were so great, why did he fall into this situation?
Let's be realistic.
You are a team of a few persons, using a 'starting' free plan.
How could great tools and UX leave you with no clue what they are doing, and let you incur such a high cost by surprise?
Also, for example, in which world is it nice to have your service "auto upgraded" from "trial" mode to 72k "full billing" mode without your consent first?
fwiw I've had cloud vendors be relatively willing to forgive bills when something went wrong with SLAs, bad bugs, or their internal dashboards misrepresented usage.
they know their systems aren't perfect, and if you velvet hammer them long enough, they'll do the right thing.
Disclaimer I work for another cloud (not AWS), opinions are entirely my own. I try to avoid posting in a negative fashion about clouds, but holy crap this blog post...
AWS has this principle of Customer Obsession that enters in to lots of discussions, design decisions etc. "What is the customer experience of $foo?". Along with asking the positive, you ask the negative too, and explore the customer impact of shit going wrong. What does the worst experience look like, what is the impact for the customer, how might you mitigate that or make it so you can at least make it up to customers quickly, if you really can't avoid it.
I find it hard to fathom Sudeep's attitude here. So much of this article is ringing large alarm bells. These are not the things I'd want to see from a cloud provider as a customer.
Is this Stockholm Syndrome? Too much drinking of Kool-Aid as an ex-googler? Unfamiliarity with how other cloud providers operate?
(from part 1)
> Automatic Upgrade of Firebase Account to Paid Account
This is what I mean when I say look at negative vs positive use cases. I'm guessing some combination of customers having a lousy experience running in to Free Tier limits, and staff spending too long having to bump up accounts. So they implemented an automatic upgrade (What, then, is the point of a free tier? No room to experiment, no room to try it and see)
This is precisely the sort of thing that customer obsession principle is supposed to aid in. Automated upgrade certainly solves the staffing time spent bumping up accounts, and it helps customers that used to have to request limits being increased, but it massively fails in the negative customer experience side of the equation here. Someone, somewhere, should have asked the question "What if the customer has made a mistake".
Instead, make it easy and quick for anyone to click a button and get their account changed from Free to Paid, without staff engagement. Give customers easy agency to control their experience.
> Billing “Limits” don’t exist. Budgets are at least a day late.
That's insane. Clouds are about speed and dynamic scalability. Mistakes can ramp up the bill a crazy amount in a short period of time, as Sudeep found out.
How is a 24 hour delay in billing sync and budget warnings even remotely acceptable to them / Sudeep / customers?
Sure it's probably fine for the 90% cases, but that's crippling for the 10% and even if you decide you really don't give a crap about your customers, you don't want the bad press that 10% will likely give you.
Picture what financial damage someone could do if they compromised some of your credentials. You screwed up, credentials got leaked, and you won't necessarily know for a day that something has gone wrong, nor will your restrictions kick in?!
Billing is the single highest TPS service in any cloud, with Identity often a close second (billing gets requests for every transaction, and internal requests related to ongoing charges). You need to handle a high rate of requests, with low latency both in request/response and processing data received. It's a hard engineering problem, and cloud platforms try to get some of the smartest engineers working on it. An organisation of Google's caliber has more than enough smart engineers to be working on these kinds of hard problems, even by temporary secondment.
Quota / Limits in a fast changing cloud environment need to be dynamic and responsive.
>I knew how to put the case for Google team when they would come back to work in 2 days.
How is 2 days even remotely acceptable? Maybe it's just how it's written, but it reads like this is just accepted as the way things are. Why would you even have to carefully work to present your case?
Where are the 24x7 response people with the ability to forgive bills? $72k is chump change for a cloud provider, and especially for a company of the scale of Google. Give your support agents the tools and authority they need to make reasonable decisions, with some appropriate kind of oversight process, and stick in feedback mechanisms so product managers know what problems customers are having.
It's not like that would actually have cost them $72k in direct running costs either. That should have been a near-instant no-brainer. Forgive, move on, and reap the benefits of customer goodwill. That goodwill will earn you way more profit than forgiving it would have cost. You're investing in their continuing business. Sometimes those investments will fail, but most of the time they'll succeed.
>In our case, it differed by 86,585,365.85 %, or 86 million percentage points. Even when the bill was notified to us, Firebase Console dashboard still said 42,000 read+writes for the month (below the daily limit).
So it's just fake observability? What's the point? 24 hours delay here is nuts, almost to the point of being useless. It can be hard to calculate these figures out yourselves. A fast feedback cycle is critical. As Sudeep here found out, 24 hours is a great way to have zero clue what's going on until it's too late. Is there really no other way to get this information more up-to-date?
Moving on to part 2:
>I had a team of ~7 engineers/interns at this time, and it would take Google about 10 days to get back to us on this incident.
Why is a 10 day response time from Google considered even remotely acceptable for a cloud provider? Your entire platform is down, you're working out ways around this situation, stressing about potential bankruptcy, and it's just cool with you that it took 10 days for them to make a business/life changing decision over what amounts to chump change?
These kinds of mistakes happen with clouds; AWS is famous for waiving these shock bills from mistakes, and it never takes 10 days to get it done.
Billing should be the easiest and most obvious thing. If your cloud provider is creating complicated billing structures, that's a problem the cloud provider should be solving, not expecting customers to unravel the mysteries.
Companies being spun up to help people navigate your billing should be an alarm call, not something to celebrate or for customers to consider normal.
> Fail fast, learn fast with Cloud is a bad idea
It shouldn't be. With near immediate feedback you'd have known straight away that shit was bad, and cut the experiment out before it cost you an arm and a leg.
> While creating a Cloud Run service, we chose default values in the service. The max-instances is preset to 1000, and concurrency set to 80 ...... Same goes with Cloud Run! With Concurrency == 60, max_containers == 1000 and each Request taking 400ms, number of requests Cloud Run can handle 9 million requests per minute!
Why are the default values that high on a service? That seems like you're asking customers to shoot themselves in the foot. Where was the look at the negative customer experience side of the equation? Make it easy for customers to do the right thing.
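For what it's worth, the arithmetic in the quoted bit checks out. A quick sanity check with the numbers straight from the post:

    max_containers = 1000   # Cloud Run's default max-instances at the time
    concurrency = 60        # concurrent requests per container
    request_seconds = 0.4   # 400 ms per request

    # each concurrency slot turns over 60 / 0.4 = 150 requests per minute
    per_minute = max_containers * concurrency * (60 / request_seconds)
    print(f"{per_minute:,.0f} requests/minute")  # 9,000,000

Nine million requests a minute is not a "default" any small team is going to anticipate.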
Then the bit that really bugs the crap out of me:
> Thank you Google!
He's thanking Google for having had an absolutely shitty experience on their platform: 10 days of stress from needless delays in forgiving a trivially small bill, dealings with multiple lawyers, investigating bankruptcy, risk of missing product launch date, working around the clock to dig themselves out of hell...
Do not use hosted cloud services where the implementation creates publicly accessible API keys and each HTTP request results in a charge to your account. A few specific examples are Firebase, Algolia, and AWS Lambda.
All it takes is one programming mistake or one bad actor and you can find yourself in an equally precarious situation.
Yep, you should wrap them in something like Apigee or an API server that can throttle requests and keep traffic to reasonable numbers for your service.
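The throttling half of that doesn't have to be a whole product, either. A minimal token-bucket sketch you could put in front of any pay-per-request backend - the rate/capacity numbers and the threading model are up to you, and a real gateway would also need auth, logging, etc.:

    import threading
    import time

    class TokenBucket:
        """Allow roughly `rate` requests/second, with bursts up to `capacity`."""

        def __init__(self, rate: float, capacity: float):
            self.rate = rate
            self.capacity = capacity
            self.tokens = capacity
            self.updated = time.monotonic()
            self.lock = threading.Lock()

        def allow(self) -> bool:
            with self.lock:
                now = time.monotonic()
                self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
                self.updated = now
                if self.tokens >= 1.0:
                    self.tokens -= 1.0
                    return True
                return False  # caller should reject the request (e.g. HTTP 429)

With something like bucket = TokenBucket(rate=50, capacity=100), any request for which bucket.allow() returns False gets a 429 instead of ever reaching the metered API - which is exactly the cap these platforms won't give you.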