
I also work at AWS (nowhere near billing), so the usual disclaimers apply, but:

I actually have no idea if billing is real-time or not? I think it's mostly batch, but the records aren't necessarily, though they may be aggregated a bit.

The general point in this thread certainly holds: our systems provide service first and bill second, and they do that by throwing a record over the wall to some other internal system. It's not unthinkable they could tally up your costs as you go, but the expense has fundamentally already happened, and that's the disconnect.
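
Roughly the shape of that, as a sketch (Python, with made-up names- nothing here is an actual AWS internal):

    import json, queue, time

    metering_queue = queue.Queue()   # stand-in for the async pipe to a billing system

    def handle_request(account_id, operation):
        result = do_the_actual_work(operation)    # serve the request first
        metering_queue.put(json.dumps({           # then throw a record over the wall
            "account": account_id,
            "operation": operation,
            "units": 1,
            "ts": time.time(),
        }))
        return result

    def do_the_actual_work(operation):
        return f"ok: {operation}"                 # placeholder for the real service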

It would be hard to react ahead of time. Small, cheap things like "a single DynamoDB query" or "a byte of bandwidth" are often also high-volume, and you don't want to introduce billing checks to critical paths for reliability / latency / cost reasons. Expensive big-bang on/off things would probably be simpler, though I can think of a few sticking points.

It would be hard to react after the fact, too. Where does a bill come from? My own team is deeply internal, several layers removed from anything you'd recognize on an invoice, but we're the ones generating and measuring costs. Precise attribution would be a problem in and of itself- cutting off service means traversing those layers in reverse, then figuring out what "cut off" means in our specific context. That's new systems and new code all around, repeated for all of AWS- there's a huge organizational problem here on top of the technical one.

I could see some individual teams doing something like this for just their services, but AWS-wide would be a big undertaking.

I wish we had it- I'd sleep a little better at night, myself- but from my limited perspective, it sure looks like we're fundamentally not designed for this.




“Small, cheap things like "a single DynamoDB query" or "a byte of bandwidth" are often also high-volume, and you don't want to introduce billing checks to critical paths for reliability / latency / cost reasons”

That’s hardly necessary. Let’s suppose you have some service that costs 1 cent per 1,000 queries. If you’re billing it then you need to be keeping track of it and incrementing some counter somewhere. If the new count mod x is less than the old count mod x (i.e., the counter just crossed a multiple of x), do some check; that’s very cheap on average and doesn’t add latency if done after the fact.
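
Something like this (a rough Python sketch; the counter and the check hook are stand-ins, not any real billing API):

    CHECK_EVERY = 1_000  # e.g. one budget check per 1,000 queries

    class UsageCounter:
        def __init__(self):
            self.count = 0

        def record(self, units=1):
            old = self.count
            self.count += units
            # Crossing a multiple of CHECK_EVERY makes the modulus wrap downward.
            if self.count % CHECK_EVERY < old % CHECK_EVERY or units >= CHECK_EVERY:
                enqueue_budget_check(self.count)  # async, off the request path

    def enqueue_budget_check(count):
        # Stand-in: hand the count to whatever budget/billing system exists,
        # asynchronously, so the hot path never waits on it.
        print(f"budget check requested at count={count}")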

PS: Phone companies can pull this stuff off successfully for millions of cellphones. If you’re arguing that keeping up with AT&T is too hard, you have serious organizational problems.


That counter may well not exist outside of billing for longer than it takes to batch some records together. It will need to be shared and synchronized with the rest of the fleet, the other fleets in the availability zone, the other zones in the region, the other regions, and every other service in AWS. There are strict rules about crossing these boundaries for sake of failure isolation.

As an amusing extra sticking point, your service has no idea how much it actually costs, because that's calculated by billing- the rates may vary from request to request or from customer to customer.

Without spending way too long thinking about it, the complexity in figuring out exactly when to stop is significant enough that it probably cannot practically be done in the critical path of these kinds of high-volume systems, hence the reactive approach being more plausible to me.

I don't know what kinds of problems AT&T has, but at the risk of saying dumb things about an industry I know next to nothing about, your phone is only attached to one tower at a time, and that probably helps a bit. And I'm not sure when it wouldn't be simpler and just as good for them to also react after the fact, anyway.


First, arguing from the existing infrastructure ignores the fact that you’re changing the system, so any new system is a viable option. All that changes in the existing system is how much things cost. Anyway, for independent distributed systems you can use probability rather than fixed numbers.
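
For instance, a probabilistic version (my own rough sketch- each node independently samples, so no shared counter or cross-zone coordination is needed):

    import random

    CHECK_PROBABILITY = 1 / 10_000   # tune to the desired overall check rate

    def record_usage(units=1):
        emit_billing_record(units)               # existing behavior: bill after the fact
        if random.random() < CHECK_PROBABILITY:
            enqueue_budget_check()               # rare, asynchronous, off the hot path

    def emit_billing_record(units):
        pass  # stand-in for the existing metering pipeline

    def enqueue_budget_check():
        pass  # stand-in for an async "is this account over budget?" probe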

That said, you’re missing the forest for the trees: the accuracy isn’t that important. You can still bill for actual usage. A 15-minute granularity is vastly better than a 30-day one. As long as you can kill processes, you don’t need to check in the middle of every action. Things being asynchronous is just the cost of doing business at scale.


I'm hardly saying it's impossible; I'm saying that it's not easy, and may even be hard. Doing it well would likely require a wide-reaching effort the likes of which would eventually reach my ears, and the fact that I haven't heard of such a thing implies to me that it's probably not an AWS priority.

Why that would be, I leave to you.


>your phone is only attached to one tower at a time

Not when you're being SIM-swapped


> PS: Phone companies can pull this stuff off successfully for millions of cellphones. If you’re arguing that keeping up with AT&T is too hard, you have serious organizational problems.

To be fair, AT&T in particular does prepaid shutoffs at a pretty coarse granularity; I think it's something like 15-minute intervals.

I know this because for a while I had to use a prepaid LTE modem as my primary internet connection. You can use as much bandwidth as you want for the remainder of the 15-minute interval in which you exceed what you've paid for -- then they shut you off.

I once managed (by accident) to get 3GB out of a 2GB plan purchase because of this.

Of course that free 1GB was only free because I consumed all of it in the 14.9 minute time period preceding NO CARRIER.
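
The model, as far as I could tell from the outside (a guess sketched in Python, not anything AT&T actually publishes):

    import time

    INTERVAL_SECONDS = 15 * 60    # assumed 15-minute enforcement granularity
    PLAN_BYTES = 2 * 1024**3      # the 2GB plan

    used_bytes = 0
    cut_off = False

    def record_transfer(n_bytes):
        global used_bytes
        if not cut_off:
            used_bytes += n_bytes        # usage is tallied continuously

    def enforcement_loop():
        global cut_off
        while not cut_off:
            time.sleep(INTERVAL_SECONDS)     # but only enforced at interval boundaries
            if used_bytes > PLAN_BYTES:
                cut_off = True               # NO CARRIER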


There’s a lot of middle ground between credit limit checks within every database transaction and the current state.


> There’s a lot of middle ground between credit limit checks within every database transaction and the current state.

But there isn’t a lot of middle ground between distributed, strongly-consistent credit limit checks on every API call and billing increment (which is, IIRC, instance-seconds or smaller for some time-based services) and a hard billing cap that is actually usable on a system structured like AWS. Partial solutions reduce the risk but don’t eliminate the problem, and at AWS scale reducing the risk still leaves a significant number of people reliant on the “call customer service” mitigation; how much spending and system compromise is worthwhile to narrow this billing issue if you are still in that position?



