
I also work at AWS (nowhere near billing), so the usual disclaimers apply, but:

I actually have no idea if billing is real-time or not? I think it's mostly batch, but the records aren't necessarily, though they may be aggregated a bit.

The general point in this thread certainly holds: our systems provide service first and bill second, and they do that by throwing a record over the wall to some other internal system. It's not unthinkable they could tally up your costs as you go, but the expense has fundamentally already happened, and that's the disconnect.
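
Roughly the shape of that, as a sketch (Python, with made-up names- nothing here is an actual AWS internal):

    import json, queue, time

    metering_queue = queue.Queue()   # stand-in for the async pipe to a billing system

    def handle_request(account_id, operation):
        result = do_the_actual_work(operation)    # serve the request first
        metering_queue.put(json.dumps({           # then throw a record over the wall
            "account": account_id,
            "operation": operation,
            "units": 1,
            "ts": time.time(),
        }))
        return result

    def do_the_actual_work(operation):
        return f"ok: {operation}"                 # placeholder for the real service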

It would be hard to react ahead of time. Small, cheap things like "a single DynamoDB query" or "a byte of bandwidth" are often also high-volume, and you don't want to introduce billing checks to critical paths for reliability / latency / cost reasons. Expensive big-bang on/off things would probably be simpler, though I can think of a few sticking points.

It would be hard to react after the fact, too. Where does a bill come from? My own team is deeply internal, several layers removed from anything you'd recognize on an invoice, but we're the ones generating and measuring costs. Precise attribution would be a problem in and of itself- cutting off service means traversing those layers in reverse, then figuring out what "cut off" means in our specific context. That's new systems and new code all around, repeated for all of AWS- there's a huge organizational problem here on top of the technical one.

I could see some individual teams doing something like this for just their services, but AWS-wide would be a big undertaking.

I wish we had it- I'd sleep a little better at night, myself- but from my limited perspective, it sure looks like we're fundamentally not designed for this.




“Small, cheap things like "a single DynamoDB query" or "a byte of bandwidth" are often also high-volume, and you don't want to introduce billing checks to critical paths for reliability / latency / cost reasons”

That’s hardly necessary. Let’s suppose you have some service that costs 1 cent per 1,000 queries. If you’re billing it then you need to be keeping track of it and incrementing some counter somewhere. If the new count mod x is less than the old count mod x (i.e., the counter just crossed a multiple of x), do some check; that’s very cheap on average and doesn’t add latency if done after the fact.
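
Something like this (a rough Python sketch; the counter and the check hook are stand-ins, not any real billing API):

    CHECK_EVERY = 1_000  # e.g. one budget check per 1,000 queries

    class UsageCounter:
        def __init__(self):
            self.count = 0

        def record(self, units=1):
            old = self.count
            self.count += units
            # Crossing a multiple of CHECK_EVERY makes the modulus wrap downward.
            if self.count % CHECK_EVERY < old % CHECK_EVERY or units >= CHECK_EVERY:
                enqueue_budget_check(self.count)  # async, off the request path

    def enqueue_budget_check(count):
        # Stand-in: hand the count to whatever budget/billing system exists,
        # asynchronously, so the hot path never waits on it.
        print(f"budget check requested at count={count}")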

PS: Phone companies can pull this stuff off successfully for millions of cellphones. If you’re arguing that keeping up with AT&T is too hard, you have serious organizational problems.


That counter may well not exist outside of billing for longer than it takes to batch some records together. It will need to be shared and synchronized with the rest of the fleet, the other fleets in the availability zone, the other zones in the region, the other regions, and every other service in AWS. There are strict rules about crossing these boundaries for sake of failure isolation.

As an amusing extra sticking point, your service has no idea how much it actually costs, because that's calculated by billing- the rates may vary from request to request or from customer to customer.

Without spending way too long thinking about it, the complexity in figuring out exactly when to stop is significant enough that it probably cannot practically be done in the critical path of these kinds of high-volume systems, hence the reactive approach being more plausible to me.

I don't know what kinds of problems AT&T has, but at the risk of saying dumb things about an industry I know next to nothing about, your phone is only attached to one tower at a time, and that probably helps a bit. And I'm not sure when it wouldn't be simpler and just as good for them to also react after the fact, anyway.


First, arguing from the existing infrastructure ignores the fact that you’re changing the system, so any new system is a viable option. All that changes in the existing system is how much things cost. Anyway, for independent distributed systems you can use probability rather than fixed numbers.
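
For instance, a probabilistic version (my own rough sketch- each node independently samples, so no shared counter or cross-zone coordination is needed):

    import random

    CHECK_PROBABILITY = 1 / 10_000   # tune to the desired overall check rate

    def record_usage(units=1):
        emit_billing_record(units)               # existing behavior: bill after the fact
        if random.random() < CHECK_PROBABILITY:
            enqueue_budget_check()               # rare, asynchronous, off the hot path

    def emit_billing_record(units):
        pass  # stand-in for the existing metering pipeline

    def enqueue_budget_check():
        pass  # stand-in for an async "is this account over budget?" probe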

That said, you’re missing the forest for the trees: the accuracy isn’t that important. You can still bill for actual usage. A 15-minute granularity is vastly better than a 30-day one. As long as you can kill processes, you don’t need to check in the middle of every action. Things being asynchronous is just the cost of doing business at scale.


I'm hardly saying it's impossible; I'm saying that it's not easy, and may even be hard. Doing it well would likely require a wide-reaching effort the likes of which would eventually reach my ears, and the fact that I haven't heard of such a thing implies to me that it's probably not an AWS priority.

Why that would be, I leave to you.


>your phone is only attached to one tower at a time

Not when you're being SIM-swapped


> PS: Phone companies can pull this stuff off successfully for millions of cellphones. If you’re arguing that keeping up with AT&T is too hard, you have serious organizational problems.

To be fair, AT&T in particular does prepaid shutoffs at a pretty coarse granularity; I think it's something like 15-minute intervals.

I know this because for a while I had to use a prepaid LTE modem as my primary internet connection. You can use as much bandwidth as you want for the remainder of the 15-minute interval in which you exceed what you've paid for -- then they shut you off.

I once managed (by accident) to get 3GB out of a 2GB plan purchase because of this.

Of course that free 1GB was only free because I consumed all of it in the 14.9 minute time period preceding NO CARRIER.
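
The model, as far as I could tell from the outside (a guess sketched in Python, not anything AT&T actually publishes):

    import time

    INTERVAL_SECONDS = 15 * 60    # assumed 15-minute enforcement granularity
    PLAN_BYTES = 2 * 1024**3      # the 2GB plan

    used_bytes = 0
    cut_off = False

    def record_transfer(n_bytes):
        global used_bytes
        if not cut_off:
            used_bytes += n_bytes        # usage is tallied continuously

    def enforcement_loop():
        global cut_off
        while not cut_off:
            time.sleep(INTERVAL_SECONDS)     # but only enforced at interval boundaries
            if used_bytes > PLAN_BYTES:
                cut_off = True               # NO CARRIER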


There’s a lot of middle ground between credit limit checks within every database transaction and the current state.


> There’s a lot of middle ground between credit limit checks within every database transaction and the current state.

But there isn’t a lot of middle ground between distributed, strongly-consistent credit limit checks on every API call and billing increment (which is, IIRC, instance-seconds or smaller for some time-based services) and a hard billing cap that is actually usable on a system structured like AWS. Partial solutions reduce the risk but don’t eliminate the problem, and at AWS scale reducing the risk still leaves a significant number of people reliant on the “call customer service” mitigation; how much spending and system compromise is worthwhile to narrow this billing issue if you are still in that position?



