
> Filesystems are hard!

Huh, no. Just show how the FAT file system works. It's still used on USB drives.

Now that you have a simple model for the low level, you can gradually add complexity such as standard file operations, caching, wear-leveling, etc.
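
For instance, a minimal sketch of the 32-byte FAT16 directory entry (the field names are my own labels, and the middle fields are collapsed into one reserved blob; see the FAT spec for the authoritative layout):

    #include <stdint.h>

    /* One 32-byte FAT16 directory entry, roughly as it sits on disk. */
    #pragma pack(push, 1)
    struct fat16_dirent {
        char     name[8];        /* space-padded base name            */
        char     ext[3];         /* space-padded extension            */
        uint8_t  attr;           /* 0x10 = directory, 0x20 = archive  */
        uint8_t  reserved[10];   /* NT flags, create/access stamps    */
        uint16_t mtime;          /* packed hours/minutes/seconds      */
        uint16_t mdate;          /* packed year/month/day             */
        uint16_t first_cluster;  /* head of this file's cluster chain */
        uint32_t size;           /* file size in bytes                */
    };
    #pragma pack(pop)

Reading a file is then just chasing the chain: cluster = first_cluster, then cluster = fat[cluster], until you hit an end-of-chain marker (>= 0xFFF8 on FAT16).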

You say Minio instance, I say hex editor.


I didn't say the implementation of a filesystem was hard. You can certainly communicate, as a complete standalone fact, the brute engineering perspective on what makes a particular filesystem work. Probably even in a single lesson.

But the point isn't to teach students about a filesystem; the point is to teach students what files are, when they don't understand why you'd even want the abstraction that "files" represent, and when "files" aren't an abstraction any more familiar to them than the alternatives. You have to justify "files" in terms of things they do know: to explain what purpose files would serve if they were invented today, and why you'd introduce a filesystem into a data architecture that already had all these other things (KV storage, object storage, logical volumes) but not files.

Students would ask: Why are files streams? Why are the streams arbitrarily writable, when this means files can become corrupted from a partial write? Why are files extendable, but only at the end, not in the middle or the beginning? Why do files have names? (And, once you explain inodes+dirents, they'll ask why there's a directory tree if hard links exist; and why it's the dirents that have the names, but the files themselves that have the other metadata: xattrs, alternate data streams, etc.) What is the difference between a file stored in a directory, a file stored in a filesystem stored in a mounted loopback image stored in another file, and a file stored in a subvolume or filesystem container? Why do directories need special system calls to read and write them? Why is a virtual filesystem used in many OSes to represent kernel objects like named pipes and device nodes? Etc.
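
(One way to make that inode/dirent split concrete is a tiny POSIX demo; the paths here are placeholders:)

    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <fcntl.h>

    int main(void) {
        int fd = open("a.txt", O_CREAT | O_WRONLY, 0644);
        close(fd);
        link("a.txt", "b.txt");   /* a second dirent for the same file */

        struct stat sa, sb;
        stat("a.txt", &sa);
        stat("b.txt", &sb);

        /* Both names resolve to the same inode: one file, two dirents.
           Size, mode, and timestamps live in the inode, not in a name. */
        printf("same inode: %d, link count: %ld\n",
               sa.st_ino == sb.st_ino, (long)sa.st_nlink);

        unlink("a.txt");          /* removes a name, not the file      */
        unlink("b.txt");          /* data is freed once nlink hits 0   */
        return 0;
    }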

(This isn't even a hypothetical; engineers at AWS and GCP probably had to answer this very question when asked to justify building EFS and Cloud Filestore. Why do we need a filesystem as a service, when we already have these other services that provide these other abstractions?)

This is not altogether unlike teaching a student what a programming language is and why you'd even want one of those, when they're immersed in an environment where software is created without them. Would you just sit down and show the kids C, because it's small and easy-ish to understand?

A filesystem is a data-storage abstraction, like C is a machine abstraction. But neither is a primitive abstraction. They build on numerous other, theoretically purer abstractions. It's much easier to explain the motivation for the creation of a language like C if you already understand the motivation for the creation of a simpler language that rests upon fewer other abstractions, like ASM or Scheme. Likewise, it's much easier to understand the motivation behind the creation of the abstraction that is filesystems, if you already understand the motivation behind the creation of simpler data-storage abstractions, like logical volumes or memory-page-backed search trees.


The answer to "why do you want files as an abstraction" is that files are the units of ownership of data. If you don't control the files that represent your data, you don't own it. You might think you do, but you don't, because someone else ultimately decides the fate of those files.


My point was that there don't need to be files at all to represent data. Computers can work entirely without files. For example, your whole hard drive could consist of an RDBMS, where you'd not "download" files, but rather "download" streams of tables, which would import directly as tables into the RDBMS.

"Files" are a very specific abstraction; thinking they're the only way to transfer chunks of data around is a symptom of lack of imagination. There are very similar abstractions to files, such as object-store objects. The only practical difference between files and objects is that updates to objects are transactional+atomic, such that nobody ever sees an "in progress" object. But an object store (backed by a block device) is a simpler system than a filesystem (backed by a block device.)

You can control the objects that represent your data. You could also control, via an RBAC-like system, the K-V or tuple-store records that represent your data. Or the Merkle-tree commits. Or the blockchain transactions. Or the graph edges. Or the live actor processes holding in-memory state. You can transfer all of these things around between computers, both by replication (copying) and by migration (moving.) What do files have to do with any of this? An accident of history is what.


They are not the only way to transfer chunks of data.

They are, however, the most successful and versatile way to transfer chunks of data and define ownership in the history of computing.

I’m sure an RDBMS or a graph DB can do those things as well. But no one has succeeded in doing it even close to as effectively as files have. And many have tried. In fact, probably the greatest computer-software failure of all time, Windows Longhorn, was largely a failure of trying to replace a file-based system with a graph-DB-based system.

People very much can imagine alternatives; there is no shortage of imaginable alternatives. There is a huge shortage of successful, in-use alternatives that are as versatile or effective as files.


You're focusing on "files" as they compare to things very different from them. But imagine for a moment what an OS with an object store in place of a filesystem would be like. Pretty much exactly the same, except that the scratch buffers backing temp files and databases wouldn't hang off the object-store "tree", but rather would either be anonymous (from mmap(2)), or would be represented by a device node (i.e. a logical volume) rather than being objects themselves. All the freestanding read-only asset bundles, executables, documents, etc. would stay the same, since these were always objects being emulated under a filesystem to begin with.
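
(An anonymous scratch buffer is a one-liner on any Unix today, for reference; note that MAP_ANONYMOUS is a Linux/BSD extension rather than strict POSIX:)

    #include <stddef.h>
    #include <sys/mman.h>

    /* An anonymous scratch buffer: writable memory that hangs off no
       tree and has no name, reclaimed automatically on process exit. */
    void *scratch_buffer(size_t len) {
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return p == MAP_FAILED ? NULL : p;
    }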

And downloads would also be objects. Because, when you think about it, at least over HTTP, downloads and uploads already are of objects: the source doesn't get allocated a scratch buffer on the destination that it can then freely lseek(2) around and write(2) into; instead, the source just streams a representation to the dest, that gets buffered until it's complete, and then a new object is constructed on the dest from that full local stream-buffered copy. (WebDAV introduces some file semantics into HTTP's representational object semantics, but it doesn't actually go all the way to enabling you to mount a DBMS over WebDAV.) Other protocols are similar (e.g. FTP; SMTP.) Even BitTorrent is working with objects, once you realize that it's the pieces of your files that are the objects. Rsync is the only weird protocol that would really need to be reimplemented in terms of syscalls to allocate explicit durable scratch buffers. (That and SMB/NFS/AFP/etc., but those are protocols with the explicit goal of exposing a share on a remote host as something with filesystem semantics, so you'd kind of expect them to need filesystem support on the local machine.)

Now, want to know something interesting? We already have this. Any inherently copy-on-write filesystem, like APFS or btrfs, is actually an object store masquerading as a filesystem. You get filesystem semantics, but they're layered on top of object-storage semantics, and it's more efficient when you strip them away and use the object-storage semantics directly (like when using btrfs send/receive, or when telling APFS to directly clone a container.) And these filesystems also have exactly what I mentioned above: special syscalls (or in this case, file attributes) to allocate scratch buffers that bypass copy-on-write, for things like databases.
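
(On Linux you can poke at this directly. A sketch of cloning a file via the FICLONE ioctl, which is an O(1) metadata operation; error handling is omitted, and both paths must live on the same reflink-capable filesystem. The CoW-bypassing scratch-buffer escape hatch is the nodatacow attribute, chattr +C:)

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>    /* FICLONE (Linux >= 4.5) */

    /* "Copying" a file on a reflink-capable filesystem (btrfs, XFS)
       just constructs a new object sharing the old one's extents:
       no data is moved, only metadata is written. */
    int clone_object(const char *src, const char *dst) {
        int in  = open(src, O_RDONLY);
        int out = open(dst, O_CREAT | O_TRUNC | O_WRONLY, 0644);
        int rc  = ioctl(out, FICLONE, in);
        close(in);
        close(out);
        return rc;
    }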

There's no reason that a modern ground-up OS (e.g. Google's Fuchsia) would need to use a filesystem rather than an object store. A constructive proof's already there that it can be done, just obscured a bit behind a need for legacy compatibility; a need that wouldn't be there in a ground-up OS design.

(Or, you can take as a constructive proof any "cloud native" unikernel design that just uses IaaS object/KV/tuple/document-storage service requests as its "syscalls", and has no local persistent storage whatsoever, never even bothering to understand block devices attached to it by its hypervisor.)


> "Files" are a very specific abstraction; thinking they're the only way to transfer chunks of data around is a symptom of lack of imagination.

I didn't say they were the only way to transfer chunks of data around. I said they were the units of ownership of data. If your data is somewhere in a huge RDBMS mixed together with lots of other people's data, you don't own it, because you don't control its fate; whoever owns and manages the RDBMS does. The same goes for all the other object control and storage systems you mention: individual people who have personal data don't own any of those things.


This is true. Way back when, I was teaching a commercial course on C to a class which included a COBOL programmer. At the end of one session, he came up to me and asked, "But what is this 'memory' stuff, and why would I ever want to use it?"
