
This is my personal opinion, though I do work at AWS:

It's not that real-time counting is difficult; it's that the amount of compute and electricity needed to power real-time billing at AWS scale would be astronomical. There is a reason why banks and financial institutions generally do batch processing in the off-peak hours, when electricity is cheaper and there is less demand for compute resources. Now imagine AWS billing, which is arguably far greater in scale and complexity.




I also work at AWS (nowhere near billing), so the usual disclaimers apply, but:

I actually have no idea if billing is real-time or not? I think the billing itself is mostly batch, but the underlying usage records aren't necessarily, though they may be aggregated a bit.

The general point in this thread certainly holds: our systems provide service first, bill second, and that by throwing a record over the wall to some other internal system. It's not unthinkable they could tally up your costs as you go, but the expense has fundamentally already happened, and that's the disconnect.

It would be hard to react ahead of time. Small, cheap things like "a single DynamoDB query" or "a byte of bandwidth" are often also high-volume, and you don't want to introduce billing checks to critical paths for reliability / latency / cost reasons. Expensive big-bang on/off things, probably simpler, though I can think of a few sticking points.

It would be hard to react after the fact, too. Where does a bill come from? My own team is deeply internal, several layers removed from anything you'd recognize on an invoice, but we're the ones generating and measuring costs. Precise attribution would be a problem in and of itself- cutting off service means traversing those layers in reverse, then figuring out what "cut off" means in our specific context. That's new systems and new code all around, repeat for all of AWS- there's a huge organizational problem here on top of the technical one.

I could see some individual teams doing something like this for just their services, but AWS-wide would be a big undertaking.

I wish we had it- I'd sleep a little better at night, myself- but from my limited perspective, it sure looks like we're fundamentally not designed for this.


“Small, cheap things like "a single DynamoDB query" or "a byte of bandwidth" are often also high-volume, and you don't want to introduce billing checks to critical paths for reliability / latency / cost reasons”

That’s hardly necessary. Let’s suppose you have some service that costs 1 cent every 1,000 queries. If you’re billing it then you need to be keeping track of it and incrementing some counter somewhere. If new count mod x < old count mod x (i.e. the counter just crossed a multiple of x), then do some check; that’s very cheap on average and doesn’t add latency if done after the fact.
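
Something like this, as a rough sketch (the counter and check_budget here are made up, nothing AWS-specific):

    CHECK_EVERY = 1000  # e.g. 1 cent per 1,000 queries

    class Counter:
        def __init__(self):
            self.value = 0

    def record_request(counter, check_budget):
        # Increment first; the budget check happens after the fact,
        # so it adds nothing to the latency of the request itself.
        old = counter.value
        counter.value += 1
        # Fires only when the count crosses a multiple of CHECK_EVERY,
        # i.e. roughly once per 1,000 requests.
        if counter.value // CHECK_EVERY > old // CHECK_EVERY:
            check_budget()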

PS: Phone companies can pull this stuff off successfully for millions of cellphones. If you’re arguing keeping up with AT&T is too hard, you have serious organizational problems.


That counter may well not exist outside of billing for longer than it takes to batch some records together. It would need to be shared and synchronized with the rest of the fleet, the other fleets in the availability zone, the other zones in the region, the other regions, and every other service in AWS. There are strict rules about crossing these boundaries for the sake of failure isolation.

As an amusing extra sticking point, your service has no idea how much it actually costs, because that's calculated by billing- the rates may vary from request to request or from customer to customer.

Without spending way too long thinking about it, the complexity in figuring out exactly when to stop is significant enough that it probably cannot practically be done in the critical path of these kinds of high-volume systems, hence the reactive approach being more plausible to me.

I don't know what kinds of problems AT&T has, but at the risk of saying dumb things about an industry I know next to nothing about, your phone is only attached to one tower at a time, and that probably helps a bit. And I'm not sure when it wouldn't be simpler and just as good for them to also react after the fact, anyway.


First, arguing based on existing infrastructure ignores the fact that you’re changing the system, so any new system is a viable option. All the existing system changes is how much things cost. Anyway, for independent distributed systems you can use probability rather than fixed numbers.

That said, you’re losing the forest for the trees; the accuracy isn’t that important. You can still bill for actual usage. A 15-minute granularity is vastly better than a 30-day one. As long as you can kill processes you don’t need to check in the middle of every action. Things being asynchronous is just the cost of business at scale.
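
For illustration, something shaped like this would be plenty (all names and numbers are hypothetical; this is just to show the shape of the loop, not a claim about how AWS works internally):

    import time

    BUDGET_CAP = 500.00       # dollars, set by the customer (hypothetical)
    CHECK_INTERVAL = 15 * 60  # seconds; coarse 15-minute granularity

    def month_to_date_cost(account_id):
        """Placeholder: sum whatever usage records billing already collects."""
        raise NotImplementedError

    def stop_billable_resources(account_id):
        """Placeholder: stop instances, throttle APIs, etc. for the account."""
        raise NotImplementedError

    def enforcement_loop(account_id):
        # Entirely asynchronous and outside any request's critical path:
        # usage keeps getting recorded as usual, we just look at the
        # running total every 15 minutes and react after the fact.
        while True:
            if month_to_date_cost(account_id) >= BUDGET_CAP:
                stop_billable_resources(account_id)
                break
            time.sleep(CHECK_INTERVAL)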


I'm hardly saying it's impossible; I'm saying that it's not easy, and may even be hard. Doing it well would likely require a wide-reaching effort the likes of which would eventually reach my ears, and the fact that I haven't heard of such a thing implies to me that it's probably not an AWS priority.

Why that would be, I leave to you.


>your phone is only attached to one tower at a time

Not when you're being simswapped


> PS: Phone companies can pull this stuff off successfully for millions of cellphones. If you’re arguing keeping up with AT&T is too hard, you have serious organizational problems.

To be fair, AT&T in particular does prepaid shutoffs on a pretty coarse granularity, I think it's like 15-minute intervals.

I know this because for a while I had to use a prepaid LTE modem as my primary internet connection. You can use as much bandwidth as you want for the remainder of the 15-minute interval in which you exceed what you've paid for -- then they shut you off.

I once managed (by accident) to get 3GB out of a 2GB plan purchase because of this.

Of course that free 1GB was only free because I consumed all of it in the 14.9 minute time period preceding NO CARRIER.


There’s a lot of middle ground between credit limit checks within every database transaction and the current state.


> There’s a lot of middle ground between credit limit checks within every database transaction and the current state.

But there isn’t a lot of middle ground between distributed, strongly-consistent credit limit checks on every API call and billing increment (which is, IIRC, instance-seconds or smaller for some time-based services) and a hard billing cap that is actually usable on a system structured like AWS. Partial solutions reduce the risk but don’t eliminate the problem, and at AWS scale reducing the risk means you still have significant numbers of people reliant on the “call customer service” mitigation. How much spending and system compromise to narrow this billing issue is worthwhile if you are still in that position?


> the amount of compute resources and electricity that would be needed to power real time billing at AWS scale would be astronomical

You don't have to bill in real time.

You just have to provision funding for every resource except network bandwidth.

Customer sets a monthly spend limit. Every time they start up an instance, create a volume, allocate an IP, or do anything else that costs money, you subtract the monthly cost of that new thing from their spend limit. If the spend limit would go negative, you refuse to create the new resource.

If the spend limit is still positive, the remaining amount is divided by the number of seconds remaining in the month times the per-byte bandwidth cost. The result becomes that customer's network throughput limit. Update that setting in your routers (qdisc in Linux) as part of the API call that allocated a resource. If you claim your routers don't have a limit like this I call shenanigans.
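
Back-of-the-envelope sketch of what I mean (all prices and names are made up; this is the shape of the bookkeeping, not anyone's actual implementation):

    import calendar
    from datetime import datetime, timezone

    def seconds_left_in_month(now):
        days = calendar.monthrange(now.year, now.month)[1]
        end = datetime(now.year, now.month, days, 23, 59, 59, tzinfo=timezone.utc)
        return max(int((end - now).total_seconds()), 1)

    class Budget:
        def __init__(self, monthly_limit_usd):
            self.remaining = monthly_limit_usd

        def try_provision(self, monthly_cost_usd):
            # Subtract the new resource's full monthly cost up front;
            # refuse the call if the remaining budget would go negative.
            if self.remaining - monthly_cost_usd < 0:
                return False
            self.remaining -= monthly_cost_usd
            return True

        def bandwidth_limit_bytes_per_sec(self, now, cost_per_byte_usd):
            # Whatever budget is left gets spread across the rest of the
            # month as a network throughput cap, pushed to the routers.
            return self.remaining / (seconds_left_in_month(now) * cost_per_byte_usd)

    budget = Budget(100.00)             # $100/month cap (hypothetical)
    if budget.try_provision(30.00):     # e.g. an instance priced at $30/month
        cap = budget.bandwidth_limit_bytes_per_sec(
            datetime.now(timezone.utc), cost_per_byte_usd=0.09 / 1e9)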

This should work perfectly for one region.

There's probably a way to generalize it to multiple regions, but I'm sure most small/medium customers would be happy enough to have a budget for each region. They'd probably set most regions' budget to zero and just worry about one or two.

The web UI probably would need to be updated to show the customer "here is what your bandwidth limit for the rest of the month will be if you proceed; are you sure?". JSON APIs can return this value when invoked in dry-run mode.


> Customer sets a monthly spend limit. Every time they start up an instance, create a volume, allocate an IP, or do anything else that costs money, you subtract the monthly cost of that new thing from their spend limit. If the spend limit would go negative, you refuse to create the new resource.

AWS systems are highly distributed; this kind of sharp billing cap would necessarily introduce a new strong consistency requirement across multiple services, many of which aren’t even strongly consistent considered one at a time (and that’s often true even if you limit to a single region.)

> Every time they start up an instance, create a volume, allocate an IP, or do anything else that costs money, you subtract the monthly cost of that new thing from their spend limit

For the motivating use case (avoiding a bill on the scale of even $200—possibly even $1—from a free-tier-eligible account), using monthly chunks doesn’t work; you suddenly couldn’t spin up a second simultaneous EC2 instance of any kind after an initial t3.micro instance, which would cut off many normal free tier usage patterns.

I mean, that’s a good way of capping if you are using AWS as a super overpriced steady-state VPS, but that’s not really the usage pattern that causes the risks that the cap idea is intended to protect against.

This is a particularly poor solution to completely the wrong problem.


> AWS systems are highly distributed

Hogwash, I tried to spin up 100 of your F1 instances in us-east-1 a week or two after they first became available, and found out about this thing called "limits".

Wherever you're enforcing the limit on max number of instances per region is already a synchronization point of exactly the sort needed here.

I'm sorry, this just doesn't pass the bullshit test. Resource allocation API calls are not even remotely close to lightning-quick. There is no fundamental immutable constraint here.

> For the motivating use case (avoiding a bill on the scale of even $200—possibly even $1—from a free-tier-eligible account),

Avoiding a $1 bill is definitely not the motivating use case.

A lot of people would be happy to have a mechanism that could prevent them from being billed 5x their expected expenditure (i.e. they set their budget limit to 5x what they intend to spend). It doesn't matter that that isn't perfect. It is massively better than what you're offering right now.


> Hogwash, I tried to spin up 100 of your F1 instances

I don’t have any F1 instances. Have you mistaken me for an AWS employee rather than a user?

> in us-east-1 a week or two after they first became available, and found out about this thing called "limits".

Yes, individual services, especially in individual regions, and especially a single type of resource within a service within a region (like, say, instances in EC2), are often at least centralized enough to impose hard limits reasonably well.

Billing accounts (and individual real projects, of which AWS has only the weakest concept; that’s one disadvantage AWS has vs, say, GCP) tend to span multiple resource types in each of multiple services, and sometimes multiple regions.

> Resource allocation API calls are not even remotely close to lightning-quick.

Resource allocation API calls that have high latency aren’t the only API calls that cost money and would need coordination. Heck, API calls aren’t the only thing that costs money.


> Update that setting in your routers (qdisc in Linux) as part of the API call that allocated a resource. If you claim your routers don't have a limit like this I call shenanigans.

Eh. AWS's edge network is highly distributed. Unless you want an even split of your rate limit across every possible way out of the network, you'd be much better off settling for an even split across your EC2 instances, and there's no room for bursting in this model. Enforcing per-instance limits (on any dimension) sounds pretty feasible, though.
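
As a toy example of what the even split looks like (hypothetical numbers, just to show why bursting doesn't fit):

    def per_instance_limit_bytes_per_sec(account_cap_bps, instance_count):
        # Every instance gets the same fixed slice of the account-wide cap;
        # an idle instance's share can't be borrowed by a busy one without
        # cross-instance coordination, hence no bursting in this model.
        return account_cap_bps / max(instance_count, 1)

    # e.g. a 100 MB/s account cap split ten ways is 10 MB/s per instance,
    # even if nine of the ten are idle.
    per_instance_limit_bytes_per_sec(100e6, 10)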

This wouldn't generalize straightforwardly to services that don't have obvious choke points that can impose this sort of throttling, such as, I think, DynamoDB.



