For some reason I thought this article would be about estimating free disk space, with the problem being how to estimate (or, if possible, know exactly) the available space using the least effort, or at least a fixed amount of effort.
I was expecting a deep dive into how different OSes handle storage and indexing, which file systems/drive types make it easier or harder, the tradeoffs between truly random sampling versus a sampling scheme that takes into account typical drive fragmentation patterns and speed of access, and was very excited.
I'm hoping this comment will nerd snipe someone who likes to write.
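In the meantime, here's a toy sketch of the random-sampling idea I'm describing: estimate the free fraction by probing uniformly random blocks. The `is_block_free` callback and the block count are made-up stand-ins for however a real filesystem would let you probe allocation state; this is just the statistics, not any OS API.

    import random

    def estimate_free_fraction(is_block_free, total_blocks, samples=10_000):
        """Estimate the fraction of free blocks by uniform random sampling.

        is_block_free(i) -> bool is a hypothetical probe of block i;
        total_blocks is the device size in blocks.
        Returns (estimate, ~95% margin of error).
        """
        hits = sum(is_block_free(random.randrange(total_blocks)) for _ in range(samples))
        p = hits / samples
        margin = 1.96 * (p * (1 - p) / samples) ** 0.5   # normal approximation
        return p, margin

    # Toy usage against a fake "disk" that is ~40% free:
    bitmap = [random.random() < 0.4 for _ in range(1_000_000)]
    est, err = estimate_free_fraction(lambda i: bitmap[i], len(bitmap))
    print(f"free ~ {est:.1%} +/- {err:.1%}")

The interesting part (which I'd love someone to write up) is what happens when the samples aren't independent because of fragmentation patterns and access costs.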
Edit: to comment on the article, 2**10 = 1024 is handy to know, so 2**20 is about a million and 2**30 is about a billion. That then helps you estimate common log2 values, which lets you estimate sorting, searching, and tree-type structures that have some logarithmic aspect to their time or space complexity.
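A rough sketch of that mental arithmetic (the binary-search example is mine, not from the article):

    # Rule of thumb: 2**10 = 1024 ~ 10**3, so 2**20 ~ 10**6 and 2**30 ~ 10**9.
    # Each factor of ~1000 is therefore ~10 doublings.
    import math

    for n in (1_000, 1_000_000, 1_000_000_000):
        approx = 10 * round(math.log10(n) / 3)
        print(f"n={n:>13,}  log2 ~ {approx:>2}  exact = {math.log2(n):.1f}")

    # Example: binary search over a billion sorted items takes about
    # log2(1e9) ~ 30 comparisons -- small enough to ignore in most
    # back-of-envelope estimates.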
It sounds silly or pedantic sometimes but if you are specifying how a storage subsystem must behave, it's best to use unambiguous units for things like capacity and throughput. That way suppliers aren't confused and can't game the units.
Storage shouldn't be counted in SI units in the first place. Throughput is arguably semi-legitimate to count in SI, but for storage? There is literally no reason other than marketing.
Except that, unfortunately, because of popular conventions the SI unit names are ambiguous with the power-of-two unit names. Maybe your point is that no one should have used the "kilo" prefix to mean 1024, and yes, that would have been great. But they did, and now it's very much standard and expected to describe a kilobyte as 1024 bytes. So we are stuck with the ambiguous terms that SI standardized on. Those are consistent with the meanings of the other Greek prefixes, but if a standard unit can easily be confused with another standard, it's not as useful.
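And the ambiguity isn't trivial: the gap between the decimal and binary readings compounds with each prefix, which is why a drive sold as "500 GB" shows up as roughly 465 GiB. A quick illustration:

    # Decimal (SI) vs binary (IEC) prefixes: the mismatch grows with each step.
    for exp, prefix in enumerate("kMGT", start=1):
        ratio = 1024**exp / 1000**exp
        print(f"{prefix}: 2**{10*exp} bytes is {100 * (ratio - 1):.1f}% more than 10**{3*exp}")

    # The classic symptom: a "500 GB" (decimal) drive reported in binary units.
    print(f"500 GB = {500e9 / 2**30:.1f} GiB")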
Counterpoint: disk sector sizes are not multiples or clean fractions of 1000, and that's unlikely to change anytime soon, because we'll likely store files as octets of binary digits (bits) for the rest of time.
It makes no sense to count it in 1000-based SI units. You'll perpetually mis-align the data in whichever underlying storage technology you're using, and it's bad for performance and resource consumption.
As much as I dislike the names (kibibytes etc.), it makes sense not to reuse the SI unit prefixes. SI did come first, and it is broadly applicable.
As for disk sector alignment: we don't have disk sectors any more; we use SSDs with blocks. And the sizes don't need to be a neat number of kilobytes, just as a litre of water doesn't contain a neat number of H2O molecules.
I refuse to say them out loud, but I also refuse to misuse the SI prefixes, so I'll write KiB or MiB or GiB when I mean 2^10/2^20/2^30 bytes, and file bugs against anybody using KB to mean 1024 bytes.
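In that spirit, a minimal formatter that sticks to the binary (IEC) names; the function name and thresholds are just my choice, not any particular library's:

    def format_iec(n_bytes: int) -> str:
        """Format a byte count using binary (IEC) prefixes: KiB, MiB, GiB, ..."""
        units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
        value = float(n_bytes)
        for unit in units:
            if value < 1024 or unit == units[-1]:
                return f"{int(value)} B" if unit == "B" else f"{value:.1f} {unit}"
            value /= 1024

    print(format_iec(1536))        # 1.5 KiB
    print(format_iec(3 * 2**30))   # 3.0 GiB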