
It's worth noting that Google's internal scale-out filesystem is append-only (for various distributed-systems and operational reasons), so you end up with append-only data structures proliferating in that ecosystem. That does not necessarily mean the append-only data structures were chosen for any technical merit other than the ability to be implemented on the append-only filesystem.
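
To make "append-only data structure" concrete, here's a toy sketch in Python (the names and record format are made up, and this is nothing like the real Colossus/Bigtable stack): a log-structured key-value store where every write and delete is an append, and the only way to reclaim space is a compaction pass that rewrites the whole log.

    import json
    import os

    class AppendOnlyKV:
        """Minimal log-structured key-value store: every write (and delete)
        is an append; startup replays the log; compaction rewrites it."""

        def __init__(self, path):
            self.path = path
            self.index = {}          # key -> latest value (None == tombstone)
            if os.path.exists(path):
                self._replay()

        def _replay(self):
            with open(self.path) as f:
                for line in f:
                    rec = json.loads(line)
                    self.index[rec["k"]] = rec["v"]

        def put(self, key, value):
            # The log is only ever opened in append mode; old records are
            # never modified in place.
            with open(self.path, "a") as f:
                f.write(json.dumps({"k": key, "v": value}) + "\n")
            self.index[key] = value

        def delete(self, key):
            self.put(key, None)      # a tombstone record, not an in-place erase

        def get(self, key):
            return self.index.get(key)

        def compact(self):
            # The only operation that needs rewrite/delete rights: dump the
            # live records to a new log and atomically swap it in.
            tmp = self.path + ".compact"
            with open(tmp, "w") as f:
                for k, v in self.index.items():
                    if v is not None:
                        f.write(json.dumps({"k": k, "v": v}) + "\n")
            os.replace(tmp, self.path)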



So if we ignore the technical merits for which append-only data structures were chosen in general, then sure, we don't necessarily have any additional technical merit for any particular case where append-only was chosen.

The reasons why append-only structures were chosen in general apply surprisingly often to any particular system that scales to large data and needs to be robust to a wide variety of real-world problems. You won't see it by looking at the data structure, because the hard parts are abstracted away from you. But you'll see it if you try to reimplement the same system from scratch and then scale it up to production.


Don't forget the security implications. If only root can run the compression/deletion script, all a compromised user can do is try to write to their own log. If they get as far as writing to other users' data, that sucks, but nothing is deleted or exfiltrated, and the log is baked in. Break into root? Well, good luck with that on any system.
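
A minimal sketch of that split on Linux (hypothetical path and function names; it assumes root has marked the log append-only with chattr +a, an attribute only a process with CAP_LINUX_IMMUTABLE can clear):

    import os
    import subprocess

    LOG = "/var/log/myapp/events.log"   # hypothetical path

    def append_event(event: bytes) -> None:
        # Unprivileged writer: with the append-only attribute set on the
        # file, the kernel only allows opens with O_APPEND, so a compromised
        # account can add records but not rewrite or truncate them.
        fd = os.open(LOG, os.O_WRONLY | os.O_APPEND)
        try:
            os.write(fd, event + b"\n")
        finally:
            os.close(fd)

    def compact_log() -> None:
        # Root-only maintenance: clearing the append-only attribute requires
        # CAP_LINUX_IMMUTABLE, which ordinary (even compromised) users lack.
        if os.geteuid() != 0:
            raise PermissionError("compaction must run as root")
        subprocess.run(["chattr", "-a", LOG], check=True)
        # ... compress / prune / archive the log here ...
        subprocess.run(["chattr", "+a", LOG], check=True)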


The subtle point I read in GP is the historical correlation between storage systems and data structures. Your point is equally valid: that correlation doesn't imply the data structures lack general applicability.

Both points need to be kept in mind in design, imho.


Good point. It's also worth noting that raw flash is generally treated as roughly append-only for wear leveling. "General-purpose" may be in the eye of the beholder, but I'd say these append-only data structures are useful in at least two significant environments.


Wear leveling generally happens well below the user-level filesystem (and is quite complicated!), and altering your user-level behavior because you think it helps is a little bit silly. Zoned NVMe is an obvious exception to this, where the FTL takes advantage of the append-only zones (and even that is an abstraction exposed only to filesystems), but if you do a lot of read-modify-writes it will still frequently remap your blocks to keep the wear even.


True, which is why I called out "raw" flash in particular. I think there are embedded cases, for example, where it might make sense to skip that layer. Even on general-purpose machines I think it'd be interesting to see alternate filesystem models that avoid double logging for databases, but I don't know if that will ever happen...

> Zoned NVMe is an obvious exception to this

And host-managed SMR HDDs, as namibj pointed out. I still haven't managed to get my hands on one, though; they seem to be hyperscaler-only for now.


Even "raw" flash as exposed in most Linux servers is not anywhere near raw. For embedded use cases, it absolutely matters, but if you have Linux and an enterprise NVMe SSD, nothing you do in userland will matter.


Well, you still don't get fine-grained random deletions on Zoned NVMe. There is also another storage target that expects an append-only write style: SMR HDDs. They don't support random writes within a zone.
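
A toy model of the zone contract both device types expose (simplified; real zones also have open/close/finish states and device-specific zone sizes):

    class Zone:
        """Toy model of a zone on ZNS NVMe or host-managed SMR: writes are
        only accepted at the write pointer, and the only 'delete' is
        resetting the entire zone."""

        def __init__(self, size_blocks: int):
            self.size = size_blocks
            self.write_pointer = 0
            self.blocks = [None] * size_blocks

        def write(self, lba: int, data) -> None:
            # Random writes inside the zone are rejected, matching real
            # host-managed SMR / ZNS behaviour.
            if lba != self.write_pointer:
                raise IOError("write must land at the zone's write pointer")
            if self.write_pointer >= self.size:
                raise IOError("zone is full; reset it or open another zone")
            self.blocks[lba] = data
            self.write_pointer += 1

        def read(self, lba: int):
            return self.blocks[lba]

        def reset(self) -> None:
            # No fine-grained deletion: reclaiming space means discarding the
            # whole zone, so garbage collection (copying out live data first)
            # moves up into the host application or filesystem.
            self.write_pointer = 0
            self.blocks = [None] * self.size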



