This isn’t the reason about block addressing at all since it existed before pagi...

saurik · on July 27, 2020

(I am awake now, so I can provide a more useful answer to this question.)

> This isn’t the reason about block addressing at all since it existed before paging was a thing.

I didn't say "blocks exist because pages exist", I said "blocks exist for similar reasons to why pages exist". How you thought that required temporal causality is beyond me. :/

> Addressing storage content in blocks is closely aligned with how direct access storage is organized in “sectors”, usually 512-byte blocks.

OK, great: now you have to answer why, as you are just punting the answer. The answer, FWIW, is similar to the answer for why memory is divided into pages.

> Disks don’t have byte access interfaces, so it doesn’t make sense to use one.

But, of course, they could have a byte-access interface; in fact, that was in some sense the entire question: why don't they? Note, BTW, that RAM also doesn't have a byte-access interface: you have to access it by page. (And hell: many modern disks are just giant flash memories!)

The super high-level API for working with memory provided by the CPU as part of instructions lets you access it by the byte, but again: that's also true of disks and blocks, where the super high-level API for working with files absolutely lets you work with it accessing specific bytes.

> It also lets you to address much larger data structures with limited sized variables. With block addressing, you can access terabytes of storage with a 32-bit integer.

This is true, and is a partial answer to the original question. Essentially, a filesystem is serving the same problem space as a page table, but instead of mapping virtual pages of processes to physical pages it is mapping block-sized regions of files to blocks on disk.

In both cases, it is thereby beneficial to save some bits in that map. However, as pages of memory are usually 4k and blocks on disk are usually 512 bytes--though you see larger for both of these (64k pages are becoming popular)--that really isn't that many bits you are saving.

The real reason is that there is a cost to random access: the more entries in your page-table / filesystem block list the more space it takes and the harder it is to find the one you want; and, while you can "compress" ranges, that also comes with its own cost on random access.

On disks made out of spinning rust, this random access cost is particularly ridiculous, as it might require rotating a heavy piece of metal and physically adjusting magnets to achieve the specific offset you are looking for. You thereby never want to load a single byte at a time: you want to pull a full block.

There is also another interesting cost to random access on writes, which applies to some kinds of memory and applies to most kinds of disks, which is that you just don't have the precision required to write a single byte at a time: think about trying to isolate one byte using a magnet!

So the way these things work is that they have to read large subsets of data at one time and then write it back. If you can align your filesystem to match the blocks of a file along these write boundaries, you thereby decrease the amount of work the disk has to do.

(And, again: this is very similar to the work that memory managers are doing with respect to attempting to align pages of large allocations to make the virtual memory areas efficient in the page table, which, again, is pretty much a filesystem for memory with processes as files.)

You thereby also often find yourself attempting to tune your block/page size to that of the underlying hardware, and if you build higher-level data structures (such as databases using B-trees or dictionaries using RB-trees) you will want to align those at the higher-level as well.

(And it is thereby then also fascinating that it took as long as it did for people to realize that if you want to implement an in-memory tree that you should always prefer a B-tree--a data structure designed around disk blocks--over an RB-tree, for all of the same reasons.)

Blocks (and pages) thereby exist to help all layers of the system most efficiently take advantage of linear access patterns to get efficient parallel loading and caching to solve a problem that is otherwise subject to high random access costs and potentially-ridiculous fragmentation issues.

sedatk · on July 28, 2020

Yeah, I read your answer as if disk block addressing was directly related to paging. My bad.