
Is there seriously no way in AWS/Azure/GCP to specify "Here's my budget, shut everything down if I exceed $X"? I don't use those platforms much but was always surprised I couldn't find anything like that right off the bat. I'll build cloud stuff if it makes sense at work, but if I'm footing the bill I'll stick to something that can provide an actual upper limit.



I think the big problem is that usage collection is a few days out of date, at least for GCP. Autoscaling can react in seconds to increased load, but it takes about 3 days before that shows up on your cost reports. You can burn through a lot of cloud resources in 3 days.

GCP at least has some provision to get very detailed information about usage (but not cost) that updates in less than an hour. That, to me, is the tool for building something like "shut down our account if usage is too high". It is annoying that you have to code this yourself, but ultimately, it kind of makes sense to me. Cloud providers exist to rent you some hardware (often with "value-add" software); it's the developer and operator's responsibility to account for every machine they request, every byte they send to the network, every query they make to the database, etc. and to have a good reason for that. To some extent, if you don't know where you're sending bytes, or what queries you're making, how do you know if your product is working? How do you know that you're not hacked? Reliability and cost go hand in hand here -- if you're monitoring what you need to assure reliability, costs probably aren't confusingly accumulating out of control.
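
To make the "code it yourself" part concrete, here's a rough sketch of the kind of thing I mean (my own illustration, not an official GCP pattern): poll a single usage metric through the Cloud Monitoring client library and flag when it crosses a self-imposed cap. The metric, the cap, and the reaction are all assumptions for illustration.

    # Sketch: sum the last hour of VM egress and compare it against our own cap.
    import time
    from google.cloud import monitoring_v3

    PROJECT = "projects/my-project-id"   # assumption: your GCP project
    HOURLY_EGRESS_CAP = 50 * 1024**3     # assumption: 50 GiB/hour "budget"

    client = monitoring_v3.MetricServiceClient()
    now = int(time.time())
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
    )
    series = client.list_time_series(
        request={
            "name": PROJECT,
            "filter": 'metric.type = "compute.googleapis.com/instance/network/sent_bytes_count"',
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )
    total_bytes = sum(p.value.int64_value for ts in series for p in ts.points)
    if total_bytes > HOURLY_EGRESS_CAP:
        # The reaction is up to you: page someone, stop instances, detach billing...
        print("over the hourly egress cap")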


Are you being sarcastic?

> I think the big problem is that usage collection is a few days out of date, at least for GCP. Autoscaling can react in seconds to increased load, but it takes about 3 days before that shows up on your cost reports.

That does not sound like a good reason, but more like a crappy implementation of usage collection.

I don’t see why a bunch of Google engineers can’t implement real-time billing properly, and see no reason to defend their inability to do their job.


IIRC the excuse is that billing is a separate department and they count all the usage dollars way after you're done using it, not in real time. You would still be able to go over your limit, and then who should foot the bill?

Real-time counting would be too difficult to figure out; our brightest minds are busy figuring out engagement metrics.


This is my personal opinion, though I do work at AWS:

It's not that real-time counting is difficult; it is that the amount of compute resources and electricity that would be needed to power real-time billing at AWS scale would be astronomical. There is a reason why banks and financial institutions generally do batch processing in the off-peak hours, when electricity is cheaper and there is less demand for the compute resources. Now imagine AWS billing, which is arguably far more difficult in scale and complexity.


I also work at AWS (nowhere near billing), so the usual disclaimers apply, but:

I actually have no idea if billing is real-time or not? I think it's mostly batch, but the records aren't necessarily, though they may be aggregated a bit.

The general point in this thread certainly holds: our systems provide service first, bill second, and that by throwing a record over the wall to some other internal system. It's not unthinkable they could tally up your costs as you go, but the expense has fundamentally already happened, and that's the disconnect.

It would be hard to react ahead of time. Small, cheap things like "a single DynamoDB query" or "a byte of bandwidth" are often also high-volume, and you don't want to introduce billing checks to critical paths for reliability / latency / cost reasons. Expensive big-bang on/off things, probably simpler, though I can think of a few sticking points.

It would be hard to react after the fact, too. Where does a bill come from? My own team is deeply internal, several layers removed from anything you'd recognize on an invoice, but we're the ones generating and measuring costs. Precise attribution would be a problem in and of itself- cutting off service means traversing those layers in reverse, then figuring out what "cut off" means in our specific context. That's new systems and new code all around, repeated for all of AWS- there's a huge organizational problem here on top of the technical one.

I could see some individual teams doing something like this for just their services, but AWS-wide would be a big undertaking.

I wish we had it- I'd sleep a little better at night, myself- but from my limited perspective, it sure looks like we're fundamentally not designed for this.


“Small, cheap things like "a single DynamoDB query" or "a byte of bandwidth" are often also high-volume, and you don't want to introduce billing checks to critical paths for reliability / latency / cost reasons”

That’s hardly necessary. Let’s suppose you have some service that costs 1 cent every 1,000 queries. If you’re billing it then you need to be keeping track of it and incrementing some counter somewhere. If the new count mod x is less than the old count mod x (i.e. the counter just crossed a multiple of x), then do some check; that’s very cheap on average and doesn’t add latency if done after the fact.
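
Something like this toy sketch (my own illustration, not anything AWS actually does): the hot path pays for one increment and one comparison, and the real budget check only fires roughly once every x requests, off the critical path.

    X = 1000  # assumption: run the budget check once per 1,000 billable requests

    def record_request(counter):
        old = counter
        counter += 1
        if counter % X < old % X:     # the counter just crossed a multiple of X
            check_budget_async()      # hypothetical: enqueue an out-of-band budget check
        return counter

    def check_budget_async():
        pass  # placeholder: report "another 1,000 requests happened" to billing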

PS: Phone companies can pull this stuff off successfully for millions of cellphones. If you’re arguing keeping up with AT&T is too hard, you have serious organizational problems.


That counter may well not exist outside of billing for longer than it takes to batch some records together. It will need to be shared and synchronized with the rest of the fleet, the other fleets in the availability zone, the other zones in the region, the other regions, and every other service in AWS. There are strict rules about crossing these boundaries for sake of failure isolation.

As an amusing extra sticking point, your service has no idea how much it actually costs, because that's calculated by billing- the rates may vary from request to request or from customer to customer.

Without spending way too long thinking about it, the complexity in figuring out exactly when to stop is significant enough that it probably cannot practically be done in the critical path of these kinds of high-volume systems, hence the reactive approach being more plausible to me.

I don't know what kinds of problems AT&T has, but at the risk of saying dumb things about an industry I know next to nothing about, your phone is only attached to one tower at a time, and that probably helps a bit. And I'm not sure when it wouldn't be simpler and just as good for them to also react after the fact, anyway.


First, arguing based on existing infrastructure ignores the fact that you're changing the system, so any new system is a viable option. All that changes in the existing system is how much things cost. Anyway, for independent distributed systems you can use probability rather than fixed numbers.

That said, you’re losing the forest for the trees; the accuracy isn’t that important. You can still bill for actual usage. A 15-minute granularity is vastly better than a 30-day one. As long as you can kill processes you don’t need to check in the middle of every action. Things being asynchronous is just the cost of business at scale.


I'm hardly saying it's impossible; I'm saying that it's not easy, and may even be hard. Doing it well would likely require a wide-reaching effort the likes of which would eventually reach my ears, and the fact that I haven't heard of such a thing implies to me that it's probably not an AWS priority.

Why that would be, I leave to you.


>your phone is only attached to one tower at a time

Not when you're being simswapped


> PS: Phone companies can pull this stuff off successfully for millions of cellphones. If you’re arguing keeping up with AT&T is too hard, you have serious organizational problems.

To be fair, AT&T in particular does prepaid shutoffs on a pretty coarse granularity, I think it's like 15-minute intervals.

I know this because for a while I had to use a prepaid LTE modem as my primary internet connection. You can use as much bandwidth as you want for the remainder of the 15-minute interval in which you exceed what you've paid for -- then they shut you off.

I once managed (by accident) to get 3GB out of a 2GB plan purchase because of this.

Of course that free 1GB was only free because I consumed all of it in the 14.9 minute time period preceding NO CARRIER.


There’s a lot of middle ground between credit limit checks within every database transaction and the current state.


> There’s a lot of middle ground between credit limit checks within every database transaction and the current state.

But there isn’t a lot of middle ground between distributed, strongly-consistent credit limit checks on every API call and billing increment (which is, IIRC, instance-seconds or smaller for some time-based services) and a hard billing cap that is actually usable on a system structured like AWS. Partial solutions reduce the risk but don’t eliminate the problem, and at AWS scale reducing the risk means you still have significant numbers of people reliant on the “call customer service” mitigation -- and how much spending and system compromise is it worth to narrow this billing issue if you are still in that position?


> the amount of compute resources and electricity that would be needed to power real time billing at AWS scale would be astronomical

You don't have to bill in real time.

You just have to provision funding for every resource except network bandwidth.

Customer sets a monthly spend limit. Every time they start up an instance, create a volume, allocate an IP, or do anything else that costs money, you subtract the monthly cost of that new thing from their spend limit. If the spend limit would go negative, you refuse to create the new resource.

If the spend limit is still positive, the remaining amount is divided by the number of seconds remaining in the month times the bandwidth cost. The result becomes that customer's network throughput limit. Update that setting in your routers (qdisc in Linux) as part of the API call that allocated a resource. If you claim your routers don't have a limit like this, I call shenanigans.
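
As a rough sketch of that arithmetic (made-up numbers, not real AWS pricing):

    SECONDS_LEFT_IN_MONTH = 20 * 24 * 3600   # assumption: 20 days remain
    EGRESS_COST_PER_GB = 0.09                # assumption: $/GB bandwidth price

    def throughput_limit_gb_per_s(spend_limit, committed_monthly_cost):
        # Refuse resource creation if the fixed reservations alone bust the budget.
        headroom = spend_limit - committed_monthly_cost
        if headroom <= 0:
            raise ValueError("refuse the API call: spend limit would go negative")
        # Whatever is left becomes a bandwidth allowance, spread over the month.
        return (headroom / EGRESS_COST_PER_GB) / SECONDS_LEFT_IN_MONTH

    # e.g. a $100/month limit with $40 already committed to instances/volumes/IPs:
    print(throughput_limit_gb_per_s(100, 40))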

This should work perfectly for one region.

There's probably a way to generalize it to multiple regions, but I'm sure most small/medium customers would be happy enough to have a budget for each region. They'd probably set most regions' budget to zero and just worry about one or two.

The web UI probably would need to be updated to show the customer "here is what your bandwidth limit for the rest of the month will be if you proceed; are you sure?". JSON APIs can return this value when invoked in dry-run mode.


> Customer sets a monthly spend limit. Every time they start up an instance, create a volume, allocate an IP, or do anything else that costs money, you subtract the monthly cost of that new thing from their spend limit. If the spend limit would go negative, you refuse to create the new resource.

AWS systems are highly distributed; this kind of sharp billing cap would necessarily introduce a new strong consistency requirement across multiple services, many of which aren’t even strongly consistent considered one at a time (and that’s often true even if you limit to a single region.)

> Every time they start up an instance, create a volume, allocate an IP, or do anything else that costs money, you subtract the monthly cost of that new thing from their spend limit

For the motivating use case (avoiding a bill on the scale of even $200—possibly even $1—from a free-tier-eligible account), using monthly chunks doesn’t work; you suddenly couldn’t spin up a second simultaneous EC2 instance of any kind after an initial t3.micro instance, which would cut off many normal free tier usage patterns.

I mean, that’s a good way of capping if you are using AWS as a super overpriced steady-state VPS, but that’s not really the usage pattern that causes the risks that the cap idea is intended to protect against.

This is a particularly poor solution to completely the wrong problem.


> AWS systems are highly distributed

Hogwash, I tried to spin up 100 of your F1 instances in us-east-1 a week or two after they first became available, and found out about this thing called "limits".

Wherever you're enforcing the limit on max number of instances per region is already a synchronization point of exactly the sort needed here.

I'm sorry, this just doesn't pass the bullshit test. Resource allocation API calls are not even remotely close to lightning-quick. There is no fundamental immutable constraint here.

> For the motivating use case (avoiding a bill on the scale of even $200—possibly even $1—from a free-tier-eligible account),

Avoiding a $1 bill is definitely not the motivating use case.

A lot of people would be happy to have a mechanism that could prevent them from being billed 5x their expected expenditure (i.e. they set their budget limit to 5x what they intend to spend). It doesn't matter that that isn't perfect. It is massively better than what you're offering right now.


> Hogwash, I tried to spin up 100 of your F1 instances

I don’t have any F1 instances. Have you mistaken me for an AWS employee rather than a user?

> in us-east-1 a week or two after they first became available, and found out about this thing called "limits".

Yes, individual services, especially in individual regions, and especially a single type of resource within a service within a region (like, say, instances in EC2), are often at least centralized enough to impose hard limits reasonably well.

Billing accounts (and individual real projects which—and this is one disadvantage AWS has vs, say, GCP—AWS has only the weakest concept of) tend to span multiple resource types in each of multiple services, and sometimes multiple regions.

> Resource allocation API calls are not even remotely close to lightning-quick.

Resource allocation API calls that have high latency aren’t the only API calls that cost money and would need coordination. Heck, API calls aren’t the only thing that costs money.


> Update that setting in your routers (qdisc in Linux) as part of the API call that allocated a resource. If you claim your routers don't have a limit like this, I call shenanigans.

Eh. AWS's edge network is highly distributed. Unless you want an even split of your rate limit across every possible way out of the network, you'd be much better off settling for an even split across your EC2 instances, and there's no room for bursting in this model. Enforcing per-instance limits (on any dimension) sounds pretty feasible, though.

This wouldn't generalize straightforwardly to services that don't have obvious choke points that can impose this sort of throttling, such as, I think, DynamoDB.


> You would still be able to go over your limit and then who should foot the bill?

The provider. What they do wouldn't be accepted in any other industry. Imagine hiring an appliance repair shop that sends a repair person who can fix your stuff immediately but can't tell you what it's going to cost until 3 days after the work is done.

Then you get a huge bill because you wanted "appliance repair" (one of them), but ended up with "appliance maintenance" (all of them).


A lot of people have thoughtfully responded with reasons why this doesn't or can't exist: real-time billing is far too expensive to implement, better to get a huge bill than to lose data or shut down critical systems, etc. I guess it makes sense—ideally you are monitoring your stuff, whether you're using your own tools or built-in ones—and you know ahead of time when your usage is creeping up. Also, I suppose only the customer can really know which systems can be shut down/deallocated to save money and which ones would kill the company if shut down. It sounds like if you're a small startup strapped for resources, you can avoid these bills either by self-hosting or by being careful about how you build your cloud infrastructure. I.e. maybe you could host your app on your own box OR on an equivalent VM in Azure that's just going to fail if it runs out of disk/CPU/outgoing bandwidth instead of autoscaling to outrageous levels.


That’s extremely hard to design, at least with the current state of what AWS bills and does not bill.

Example: let’s assume you’ve set the cut-off budget too strictly, spun up another shard for your stateful service (a DB, for example), and it received and stored some data during the short window before some other service brought the whole account over budget (i.e. paid egress crossed the threshold).

To bring VM and EBS charges to 0 (to implement your idea of ‘shut down everything’) AWS will have to delete production data.

While it may be OK when you’re experimenting, it’s really hard to tell in an automated way.

So, to fully comply w/o risking losing customer data, AWS would have to stop charging for non-running instances and inactive EBS volumes, which would most definitely invite many kinds of abuse.

There may be some other way to do this, maybe marking some services as expendable or not, so they are stopped in the event of an overspend.


A complicated solution is not what people are really asking for though.

What I and I expect most people want is a cap which then spins down services when they reach that cap. Nobody is going to care if the cap is set to $1000 and the final bill is $1,020. The problem being solved for is not wanting to have to ever worry about missing an alert and waking up to a bill that is a factor or two beyond expectations. I can afford my bill being 10% or even 40% above my expectation. I can't afford my bill being 500% off.


I do understand that, but there are services that are still billed for when ‘spun down’. To stop getting the bills they have to be terminated.

The solution seems to be to implement an ‘emergency stop’ where the whole account is put on pause but no state is purged. And then you’ll probably have a week or two to react and decide whether you want the larger bill, to salvage the data, or to just terminate it all.


> So, to fully comply w/o risking losing customer data, AWS would have to stop charging for non-running instances and inactive EBS volumes, which would most definitely invite many kinds of abuse.

Another option would be to "reserve" them in the budget. That is, when for instance you create a 10 GB EBS volume, count it in the budget as if it will be kept for the whole month (and all months after that). Yes, that would mean an EBS volume costing $X per day would count in the budget as 31*$X, but it would prevent both going over the limit and having to lose data.
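
Roughly (illustrative numbers only, and I'm only sketching the accounting, not claiming this is how AWS would implement it):

    DAYS_IN_MONTH = 31

    def try_reserve(budget_remaining, daily_cost):
        # Charge the budget as if the resource will live for the whole month.
        reservation = daily_cost * DAYS_IN_MONTH
        if reservation > budget_remaining:
            raise RuntimeError("creation refused: would exceed the monthly cap")
        return budget_remaining - reservation

    # A 10 GB EBS volume at (say) $0.03/day reserves ~$0.93 of budget up front,
    # even if the actual bill ends up smaller because it's deleted early.
    remaining = try_reserve(budget_remaining=20.0, daily_cost=0.03)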


This creates a surprising situation.

We have a batch process that uses S3 as a way to temporarily store some big datasets while another process streams and processes them and, once complete, removes them. This takes like an hour.

So our S3 bill could be, let’s say, $10/mo. If you went and looked at your S3 bill and saw that and thought setting a $20 cap would give you reasonable headroom you’d be sorely surprised when S3 stopped accepting writes or other services started getting shut down or failing next time the batch job triggered.

Under this system, the budget and actual bill need to be off by a factor of more than 10 ($240). And this also doesn’t stop your bill being off by a factor of 10 from what you “wanted” to set your limit to. More than the $200 under discussion.


I think there's a good argument that they could do better. But there's also probably an argument that harder circuit breakers would result in posts like "AWS destroyed my business right when I was featured in the New York Times"--including things like deleting production data. I'm sort of sympathetic to the idea that AWS should have an experimental "rip it all down" option and a "kill all stateless services" option, but that adds another layer of complexity and introduces more edge cases.


Merely being on free tier is a signal that you do not want a $27,000 bill. There is no excuse for what AWS does here, and it is clear they use this clusterfuck as a revenue stream.


They can simply add a nuclear option. Ordinary businesses probably won't enable that option, but individuals can and should set it up.


It’s a lot easier for them to refund the individuals who make mistakes than deal with the fallout and bad press from the small businesses they’ve destroyed when they incorrectly enable that option.


Yes, a useful last resort safeguard would need to be more granular than just "turn everything off", at least if we're talking about protecting production systems rather than just people learning and wanting to avoid inadvertently leaving the free tier or something like that.

Still, it's not hard to imagine some simple refinements that might be practical. For example, an option to preserve data in DB or other storage services but stop all other activity might be a useful safety net for a lot of AWS customers. It wouldn't cut costs to 0, but that's not necessarily the optimal outcome for customers running in production anyway. It would probably mitigate the greatest risks of runaway costs, though.


You can get alerts. And there are some limits for young accounts. But they really need a "beginner" mode. Or a budget mode: put $x on the account and it can't spend more. But I guess they are making a lot of money with "mistakes", so there may not be much incentive.


On top of the other reasons for complexity and delay, this just creates another potential mistake where people delete their entire accounts or stop production services.

It's far easier to negotiate dollar amounts than lost data or service uptime.


Azure will alert you when you exceed a budget, but it won't disable anything.

Azure MongoDB billing was insanity. I was up to $800 to host a couple of GB that wasn't doing anything.

I'm still not sure what happened, even their support kept saying "it's priced by request units" and I kept saying "How does a handful of queries a day translate to $40 in request units?"

A year later, I think that I had a lot of collections and they seem to charge per-collection, but I'm still not even sure. Thank goodness we moved off of it after only a month or two.


I work for MongoDB. The writer is referring to CosmosDB, the Microsoft clone of MongoDB. You can run a real MongoDB cluster on Azure with MongoDB Atlas. The pricing model for Atlas is per node size + disk + memory + data transfer. It's generally easier to predict your costs using this model. Users tend to over-provision with this model, so we recommend using auto-scaling. This will allow your cluster to scale up or down based on load (price will adjust as well).


We switched to MongoDB Atlas which was awesome but I didn't want to sound like a stealth corporate shill. Atlas rocks!


Love to hear that.


> Is there seriously no way in AWS/Azure/GCP to specify "Here's my budget, shut everything down if I exceed $X"?

All of them have billing APIs which should, in principle, allow you to build “shut down everything at some indefinite time interval (and additional spend) after I exceed $X”, though you’ll need to do nontrivial custom coding to tie together the various pieces to do it, and actually stopping all spend is likely going to mean nuking storage, not just halting “active” services.

None of them provide any way to more than probabilistically do “shut everything down before I exceed, and in a way which prevents exceeding, $X.”
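
For the first flavor, the core of the custom coding might look something like this sketch, polling the Cost Explorer API with boto3 and reacting once spend crosses a threshold. The budget figure and the reaction are assumptions; a real version has to decide what to stop or delete for every service in use, and the cost data itself lags, so this bounds the overrun rather than preventing it.

    import datetime
    import boto3

    BUDGET_USD = 100.0   # assumption: your monthly cap

    ce = boto3.client("ce")        # Cost Explorer
    ec2 = boto3.client("ec2")

    today = datetime.date.today()  # assumes we're past the 1st of the month
    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": today.replace(day=1).isoformat(), "End": today.isoformat()},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
    )
    spent = float(resp["ResultsByTime"][0]["Total"]["UnblendedCost"]["Amount"])

    if spent > BUDGET_USD:
        # Reaction: stop every running instance. Storage, databases, etc. keep billing.
        ids = [i["InstanceId"]
               for r in ec2.describe_instances()["Reservations"]
               for i in r["Instances"]
               if i["State"]["Name"] == "running"]
        if ids:
            ec2.stop_instances(InstanceIds=ids)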


>though you’ll need to do nontrivial custom coding to tie together the various pieces to do it,

Let me guess...hosted on the cloud we're trying to shut down?


> Let me guess...hosted on the cloud we're trying to shut down?

It needs to consume APIs from that cloud, but there is no reason it would need to be hosted there.


A spending kill switch can be set up on AWS using AWS Budgets alerts and Lambda, but it’s a DIY project, not a built-in feature.
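
For example, something in the spirit of this sketch: a Budgets notification publishes to an SNS topic, which invokes a Lambda like the one below. Illustrative only; it stops running EC2 instances and nothing else, and it can only fire after the threshold has already been crossed.

    import boto3

    def lambda_handler(event, context):
        # Triggered via SNS by an AWS Budgets alert; stop whatever is running.
        ec2 = boto3.client("ec2")
        resp = ec2.describe_instances(
            Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
        )
        ids = [i["InstanceId"]
               for r in resp["Reservations"]
               for i in r["Instances"]]
        if ids:
            ec2.stop_instances(InstanceIds=ids)
        return {"stopped": ids}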


The whole point of a "spending kill switch" is as a backstop when you make a mistake; but if you do it as a "DIY project", what prevents you from making a mistake on it? It has to be a built-in feature.


So what services do you kill? Everything? Including databases and S3?


Ideally tunable, but sure. At least provide that option.


If it can be done DIY, it can be done first party.


If it’s done first party, someone (realistically, at AWS’ scale, a substantial number of someones) will mess up using it and nuke their account.

Customer service can correct “we screwed up and have a giant bill” easier than “we screwed up and lost all our data”.

So it’s not going to happen first party.

(It’s also not really possible DIY as a hard certain cutoff at or before a limit, only at an indefinite interval in time and money beyond a limit. So you still have potentially unbounded expense.)


You can set price alerts to email you on the day when you cross a monthly spending limit, but it is not the simplest thing to figure out.

I run a small server and have a few odds and ends and get emails at $5, $15, $50, $75, and $100. Haven't broken the $75 limit yet...
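
For what it's worth, the alerts themselves can be scripted with the boto3 Budgets API, roughly like this (example values; add one notification block per threshold you want an email for):

    import boto3

    budgets = boto3.client("budgets")
    budgets.create_budget(
        AccountId="123456789012",   # assumption: your AWS account ID
        Budget={
            "BudgetName": "monthly-cap",
            "BudgetLimit": {"Amount": "75", "Unit": "USD"},
            "TimeUnit": "MONTHLY",
            "BudgetType": "COST",
        },
        NotificationsWithSubscribers=[{
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 100,             # percent of the budget limit
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "you@example.com"}],
        }],
    )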


The problem is that a billing alert at $10 won't prevent you from accidentally starting up something that will bring you a $1000 bill before you can react to that email - there needs to be a process that cuts off service at some hard limit instead of just sending an email and continuing to spend money.


Completely agree. Even though it creates a new class of problem, there should be a "cut me off immediately at $XXXX per month" option.


It's not real-time billing, so there is no real way to disable service until costs have been determined.

I agree that it seems like a solvable problem, but it doesn't make them money so why should they care?


Of course there's a way. But it's opt-in, not opt-out.



