Again, Big-O is only meaningful with respect to a specific cost model.
Hashtable analyses don't care how you get the hash function - it can take exactly 72 hours to arrive, but as long as the oracle satisfies some probability distribution guarantee, it's good. That is an O(1) operation, a single hash.
Like, there's no going "outside" for big O, there's no "standard". It's always backed by some notion of what you are computing. There are pointer machines, Turing machines, automata, circuits, each giving you a different notion of complexity. Pointer machines can't directly address at offsets; Word-RAM can. Different tradeoffs for different analyses - pointer machines often model GC'd languages and pointer-heavy data structures, while Word-RAM suits more compact data structures.
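As a toy illustration (mine, not from the thread), here's the same "give me the i-th element" query under the two cost models - pointer chasing vs. offset arithmetic. The names and counters are made up just for this sketch:

```python
# Toy sketch: the same "access the i-th element" query under two cost models.
# (Illustrative only; names and counters are hypothetical.)

class Node:
    def __init__(self, value, nxt=None):
        self.value = value
        self.next = nxt

def ith_pointer_machine(head, i):
    """Pointer-machine style: no address arithmetic, only pointer chasing.
    Cost model: count pointer dereferences -> Theta(i)."""
    derefs = 0
    node = head
    for _ in range(i):
        node = node.next
        derefs += 1
    return node.value, derefs

def ith_word_ram(array, i):
    """Word-RAM style: compute base + i and read directly.
    Cost model: one memory access -> O(1)."""
    return array[i], 1

# Usage: build both representations of [0..9] and compare the counted costs.
head = None
for v in reversed(range(10)):
    head = Node(v, head)
print(ith_pointer_machine(head, 7))          # (7, 7 dereferences)
print(ith_word_ram(list(range(10)), 7))      # (7, 1 access)
```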
Again, as mentioned in the paper, Big O there isn't being used as a count of total operations; it's only trying to *minimize the number of calls to the hashing oracle*. So your analyses can get even more fine-grained.
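To make that concrete, here's a minimal sketch (my own, not from the paper) of charging only for the quantity that analysis cares about - calls to the hash oracle - while treating everything else the table does as free:

```python
# Minimal sketch: instrument a hash table so the only thing we "charge" for
# is calls to the hash oracle. Probing, comparisons, etc. are free in this
# style of analysis. All names here are hypothetical, just for illustration.

class CountingOracle:
    def __init__(self):
        self.calls = 0

    def hash(self, key):
        self.calls += 1        # the only operation we count
        return hash(key)       # stand-in for the oracle; however long it took to build, a call is O(1)

class OracleCountingTable:
    def __init__(self, oracle, capacity=16):
        self.oracle = oracle
        self.slots = [None] * capacity

    def insert(self, key, value):
        i = self.oracle.hash(key) % len(self.slots)        # one oracle call
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)                  # linear probing: free in this analysis
        self.slots[i] = (key, value)

oracle = CountingOracle()
table = OracleCountingTable(oracle)
for k in range(10):
    table.insert(k, k * k)
print(oracle.calls)   # 10 inserts -> 10 oracle calls, regardless of probe lengths
```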
You might even be excited to learn about the External Memory Model. In that model you have two levels of memory: a small internal memory where *every single operation* is free, and a large external memory you can't compute on directly - you have to fetch data into internal memory first. Transfers happen in blocks of size B, internal memory holds M words, and the input has size N, with B <= M << N. Every operation on data already in internal memory is free; the only thing you can do with external memory is read or write a block of size B. The question then is how few block transfers your algorithm needs.
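A tiny sketch of the accounting (my own illustration, with made-up parameters): summing N values costs about N/B block transfers, and the additions themselves are never counted.

```python
# Sketch of external-memory accounting: we only count block transfers of
# size B from "external" to "internal" memory; arithmetic on fetched data
# is free. Parameters are illustrative, not from any paper.

def external_sum(data, B):
    """Sum N numbers stored 'externally', counting only block reads."""
    N = len(data)
    block_reads = 0
    total = 0
    for start in range(0, N, B):
        block = data[start:start + B]   # one block transfer of size B
        block_reads += 1
        total += sum(block)             # free: happens in internal memory
    return total, block_reads

total, io_cost = external_sum(list(range(1_000)), B=64)
print(total, io_cost)   # 499500, ceil(1000/64) = 16 block transfers
```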
I was originally confused by algorithms in this model too - sometimes the "real" number of operations is an order of magnitude more than the number of block accesses, but we just don't count them? How does that make sense?
Turns out this actually models real hardware surprisingly well - it's often faster to recompute things and do extra arithmetic in order to minimize memory accesses. It's used a lot in databases. Although I can't find the exact wording "external memory model", the FlashAttention paper uses the term "IO-aware" and analyzes block transfers like this on the GPU, so... yeah.
So going 'higher up' in the abstraction stack doesn't necessarily sacrifice practicality either; in fact, it can sometimes help.
Sure, we can view the intuition as chaining together layers of abstraction - I get that view. But each layer functions perfectly logically, independently of the others. It's just like good, decoupled software practice.
> you have to go outside standard big-O and incorporate practical considerations, which isn't done anywhere else as a matter of course.
Hopefully I've explained it better. Different models of computation are a fundamental part of theoretical computer science because, as you've pointed out, you have to be very careful about defining operations. At the same time, part of what's great about CS (and math in general) is that you can ask, "Assuming these conditions, what can I logically derive?" - and that's done everywhere.