How the append-only btree works (2010) (bzero.se)
209 points by kevmo314 10 months ago | 100 comments



The immutable b+tree is a great idea, except it generates a tremendous amount of garbage pages. Every update of a record generates several new pages along the path of the tree.

There are two additional techniques to make the immutable b+tree practical. One is to reclaim stale, unreachable pages after the transactions using them have closed. The other is to use a pair of oscillating, fixed-location meta pages.

A page is active if it’s in the latest snapshot or it is in use by an open transaction; otherwise, it’s unreachable and can be reclaimed. When a transaction closes, it hands its list of in-use pages to the reclaimer. The reclaimer tracks these pages. When no other open transactions hold on to a page, it can be reclaimed.

When the write transaction commits, it hands the list of old pages being overwritten to the reclaimer as candidates for freeing, pending no open transactions using them.

The reclaimer can batch a number of reclaimed pages together to update the free-page list, by appending to the end of the file a page of the newly freed page pointers, then updating the meta page to point to the new head page of the free-page list at the end of the file. This can be done as a write transaction to keep things consistent.

On a crash, all the pending reclamation state in memory is lost and the garbage pages on disk linger. This requires periodic garbage collection or compaction.

The most frequently updated page in the tree is the meta page. Every write transaction updates it. This creates a lot of garbage pages. The second technique addresses this problem.

One insight is that the meta page is only needed when a new transaction begins, to find the tree root of the latest committed snapshot. That means we only need two copies of the meta page, one for the last committed snapshot and one for the new pending write transaction. When the write transaction commits, the pending meta page becomes the latest committed snapshot. The other page becomes available for the next pending transaction.

We can have two fixed-location meta pages, oscillating between the latest committed snapshot and the pending new transaction. The fixed locations remove the need to search for the meta page from the end of the file.
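
A minimal sketch of the two-slot idea in Python (all class and field names are made up for illustration; real implementations write these as fixed pages at the start of the file):

  class MetaPage:
      def __init__(self, txn_id=0, root=None, free_list_head=None):
          self.txn_id = txn_id              # monotonically increasing commit id
          self.root = root                  # page number of the b+tree root
          self.free_list_head = free_list_head

  class Db:
      def __init__(self):
          # Two fixed slots; slot with the higher txn_id is the committed one.
          self.meta = [MetaPage(txn_id=0), MetaPage(txn_id=-1)]

      def latest_meta(self):
          return max(self.meta, key=lambda m: m.txn_id)

      def commit(self, new_root, new_free_head):
          latest = self.latest_meta()
          pending_slot = 1 - self.meta.index(latest)   # the "other" slot
          # Overwriting this one page (after all new tree pages are durable)
          # atomically publishes the new snapshot; the slots keep alternating.
          self.meta[pending_slot] = MetaPage(latest.txn_id + 1,
                                             new_root, new_free_head)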


The Hitchhiker Tree [1] addresses the write amplification at the cost of making lookups and scans slightly more expensive, by keeping a small array of "pending writes" in every non-leaf node. The entire thing is still immutable, but instead of writing a new leaf node + path from root to that leaf, in the majority of cases we only write a new root. When there is no space left in the array, all the pending values get pushed one level down at the same time, so the write amplification is amortized.

[1]: https://www.youtube.com/watch?v=jdn617M3-P4
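
A rough sketch of the buffering idea in Python (made-up names; node splitting, page I/O, and lookups that consult the buffers are all omitted):

  from bisect import bisect_right

  BUFFER_SIZE = 4   # tiny, for illustration; real trees use much larger buffers

  class Leaf:
      def __init__(self, entries):
          self.entries = entries            # dict: key -> value

  class Branch:
      def __init__(self, keys, children, pending):
          self.keys = keys                  # sorted separator keys
          self.children = children          # len(keys) + 1 children
          self.pending = pending            # buffered key -> value writes

  def insert(node, key, value):
      """Persistent insert: returns a new node; the old version stays readable."""
      if isinstance(node, Leaf):
          return Leaf({**node.entries, key: value})
      pending = {**node.pending, key: value}
      if len(pending) <= BUFFER_SIZE:
          # Common case: only this (root) page is rewritten.
          return Branch(node.keys, node.children, pending)
      # Buffer overflow: push every buffered write down to the child it routes
      # to, copying only the touched children (which may flush recursively).
      children = list(node.children)
      for k, v in pending.items():
          i = bisect_right(node.keys, k)
          children[i] = insert(children[i], k, v)
      return Branch(node.keys, children, {})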



Here's what I found on the idea of adding buffers to branch nodes. L. Arge described the Buffer Tree idea in 1995 with the (a,b)-tree and for other data structures. Brodal and Fagerberg formalized it for the B+Tree in 2003. There's also the more modern Bε-Tree, the B-epsilon Tree, with a similar idea, used in the BetrFS file system and various DBs. Tokutek built TokuDB using a similar idea and called it the Fractal Tree (which was derived from the Bε-Tree). The Hitchhiker Tree is the append-only version of the Fractal Tree, similar to the append-only B+Tree.

* L. Arge. The buffer tree: A new technique for optimal I/O-algorithms. In Proc. 4th Workshop on Algorithms and Data Structures (WADS), volume 955 of Lecture Notes in Computer Science, pages 334–345, Springer Verlag, Berlin. 1995. https://link.springer.com/chapter/10.1007/3-540-60220-8_74

* G. S. Brodal, R. Fagerberg. Lower Bounds for External Memory Dictionaries. 2003. https://cs.au.dk/~gerth/papers/soda03.pdf

* Bε-Tree, https://www.betrfs.org/, https://www.youtube.com/watch?v=fBt5NuNsoII. An Introduction to Bε-trees and Write-Optimization, http://supertech.csail.mit.edu/papers/BenderFaJa15.pdf, https://www.youtube.com/watch?v=v_g4eZeWAng&t=15s

* Tokutek. How Fractal Trees Work. 2011. https://www.percona.com/blog/wp-content/uploads/2011/11/how-...

* Hitchhiker Tree. 2016. https://github.com/datacrypt-project/hitchhiker-tree


can you explain the similarities and differences between arge's 'buffer tree' and the fractal tree and hitchhiker tree?

my tentative understanding is that arge's work, applied to an ordinary search tree, yields something very similar to the fractal tree, except that it will opportunistically flush buffers before they are full if the child nodes they route to must be brought in from disk anyway to answer a read query. but arge is more concerned with applying the technique to hairier data structures like segment trees and binary decision diagrams. and the hitchhiker tree seems to be precisely a copy-on-write fractal tree, no more and no less. but possibly i am misunderstanding something?

there's a clear citation path, so maybe i can untangle this: greenberg describes the hitchhiker tree as 'synthesizing fractal trees and functional data structures', kuszmaul's 02011 slide deck https://www.percona.com/blog/wp-content/uploads/2011/11/how-... cites brodal and fagerberg (02003, though he says 02002) and buchsbaum, goldwasser, venkatasubramanian, and westbrook (02000, though he says 02006), both of whom cite arge's 01996 paper


I think the main difference between the Buffer Tree (Arge, Brodal and Fagerberg) and the more modern B-epsilon/Fractal/Hitchhiker Trees is what's stored in the branch buffer. The Buffer Tree stores the data in the buffer. The newer trees store commands on the data in the buffer, i.e. storing [insert(x), delete(y), update(a: 1), upsert(b: 2)] instead of [x, y, 1, 2]. The command+value pairs are the data being migrated to the lower layers. The older papers just allude to "data" and show the asymptotic analysis.

Having commands is better because you can preserve the order of operations as multiple operations against the same record come down the layers. Also, queries can short-circuit at higher nodes, e.g. when the record is already deleted. An upsert is much faster as a single upsert command added to the root node's buffer than the typical query-modify-write cycle.
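
For illustration, a sketch of buffered commands being replayed at a leaf (names and command shapes made up):

  from dataclasses import dataclass
  from typing import Any, Callable

  @dataclass
  class Insert:
      key: str
      value: Any

  @dataclass
  class Delete:
      key: str

  @dataclass
  class Upsert:
      key: str
      update: Callable[[Any], Any]   # e.g. lambda old: (old or 0) + 1

  def apply_buffer(leaf: dict, buffer: list) -> dict:
      """Replay buffered commands, in order, when they reach a leaf."""
      out = dict(leaf)
      for cmd in buffer:
          if isinstance(cmd, Insert):
              out[cmd.key] = cmd.value
          elif isinstance(cmd, Delete):
              out.pop(cmd.key, None)
          elif isinstance(cmd, Upsert):
              out[cmd.key] = cmd.update(out.get(cmd.key))
      return out

  # An upsert costs one command appended to a buffer instead of a
  # read-modify-write of the leaf:
  buffer = [Insert("a", 1), Upsert("a", lambda old: (old or 0) + 1), Delete("b")]
  print(apply_buffer({"b": 9}, buffer))   # {'a': 2}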


thanks, this seems in agreement with what i thought, except that i hadn't thought about the upsert advantage


I believe the Bw-tree does the same thing, caching new updates in the intermediate branch nodes and only pushing the updates in batch down to the lower layers when running out of room at the branch node. These are the newer wave of algorithms to address the write amplification.


Yes, CoW trees have tremendous write magnification. This is well-known. The fix is to amortize this cost by writing transactions as a log and then doing a b-tree update operation for numerous accumulated transactions.

Think of ZFS and its ZIL (ZFS Intent Log). That's exactly what the ZIL is: a CoW tree write magnification amortization mechanism.


I'm not familiar with ZFS internals beyond the deduplication part. Is that just the traditional transaction log + btree update approach most databases used? Does ZFS support transactional reading and writing?


> Does ZFS support transactional reading and writing?

I don't know that I understand your question. ZFS supports POSIX transactional semantics, which is not ACID, though ZFS's design could support ACID.

> Is that just the traditional transaction log + btree update approach most databases used?

Basically, yes. The idea is to write a sequential log of a) data blocks and b) metadata operations (e.g., renames) to support fast crash recovery. During normal operation ZFS keeps in-memory data structures up to date but delays writing new tree transactions so as to amortize write magnification. On unclean shutdown ZFS checks that all transactions recorded in the ZIL have been written to the tree; if not, it does so by a) rewinding the ZFS state to the newest fully-written transaction and b) loading the contents of the ZIL from that point forward.

Because the ZIL was designed at a time when fast flash devices were expensive, it's really a separate thing from the rest of the tree. If one were to start over one might integrate the ZIL with the rest of the system more tightly so as to further minimize write magnification (e.g., data blocks written to the ZIL need not be re-written to the tree).


Good to know. Thanks for the explanation.

Basically ZFS maintains a ZIL-based in-memory tree for the recent updates and an on-disk btree for the complete updates, minus the recent updates in the ZIL. That's consistent with most database transaction log implementations. Some newer approaches add a Bloom filter to decide quickly between looking in the in-memory tree or the on-disk btree.


FFS. The fix for absolutely bonkers CoW costs is to stop buying in to this idiotic notion that “immutability is easier to reason about”.

If you’re having difficulty reasoning about how to deal with major performance issues then your position is not easier to reason about. Full stop.

Stop the madness!


It is easy to reason about the performance issues of CoW, and it's easy enough to reason about how to work around those issues. Ease of reasoning is a big deal when you're dealing with hundreds of millions of lines of code, as some of us do. Cognitive load is a big deal in today's world. It's why we're migrating from C to memory-safe languages.


Tell that to NVMe zoned namespaces. Outside of RAM, immutable data structures exist for reasons inherent to the device technology (e.g. MLC NAND).


it's correct that aliasing makes performance harder to reason about

but there are other things you might want to reason about, such as whether a subroutine terminates at all and what it returns, and immutability does make it easier to reason about those things


Runtime immutability does not make anything easier to reason about.

Its only result is adding bugs and performance deficits.


haha


> except it generates a tremendous amount of garbage pages

Note that this is an advantage on say, controller-less NAND Flash (JFFS-like embedded NAND Flash).

More modern Flash drives have embedded FTL (flash translation layers) that internalizes that garbage collection process. But *someone* had to write the FTL to begin with, and methinks this append-only btree would work very well for that.

-------

In NAND Flash, only 10,000-ish erase/write cycles are allowed before any block becomes unusable. (Depending on the tech, of course: it could be 1,000 on QLC or 100k on SLC.) All that garbage helps cycle the write/erase cycles across the drive more evenly / more consistently. Especially if you combine that garbage with TRIM-like commands.

That might be a little bit too low level for a lot of folks though.


There are more appropriate and better data structures for an FTL especially with the assumption of some amount of auxiliary RAM (either host or on controller). Much of this literature is open access/free.

“All that garbage helps cycle the write/erase cycles across the drive more evenly”

Well no, you still want to minimize garbage production (and related GC overhead). Wear leveling doesn’t mean produce more garbage.


> Well no, you still want to minimize garbage production (and related GC overhead). Wear leveling doesn’t mean produce more garbage.

Surely some % of garbage helps as you're garbage collecting and reliably trying to shuffle data around to wear-level more evenly?

Let's say you have 99% used data and 1% garbage. You have very little room for (static) wear leveling. Ex: If you're writing 10MB of new data, and your drive is 99% full of allocated data... you'd have to move 1000MB of data around to "find" the 10MB of garbage, on average.

In the other extreme case: 0% data used and 100% garbage (say a TRIM operation just completed), then you simply just write 10MB without having to worry about moving any data around.

The 50% data + 50% garbage scales as appropriate. 10MB of written new data will need you to find and coalesce 10MB of garbage to write the new data. This will happen after moving 20MB of overall data around.

----------

I'm oversimplifying of course. But even in a real life system, I'd expect that the more "garbage" you have sitting around, the better the various algorithms work for FTL / SSD (static or dynamic) wear leveling.


I can’t state any more clearly: minimize the production of garbage.

“Surely some % of garbage helps as you're garbage collecting ”

Garbage that doesn’t exist doesn’t need collecting.

The flaw here is confusing free space with garbage. You shouldn’t have written in the first place if you could have avoided it.

Every environmentalist knows this: RRR, the first R is reduce not recycle.


Any append-only data-structure will have data that was true (when it was written), but has become false/obsolete/garbage at a later time. This is unavoidable.

I'm not saying that we write needless garbage to the logs or filesystem or whatever. I'm saying that the amount of garbage in your stream that you leave behind aids in later steps of (static) wear-leveling. So therefore, it's not a big deal. You're going to be making plenty of this (initially true data, but later garbage data) as files get updated, moved around filesystems, or whatnot.

"Garbage" in this sense is perhaps the wrong word. Its more like "Obsolete data", or "outdated data".


If you read the article though, many of the updated nodes (which are now garbage) don't see any updates to their “data” but to internal tree pointers.

So lots of “data” is being copied, and garbage is being generated, only for the benefit of tidying up tree structure, not because the actual “data” in those pages changed.

Not generating such garbage in the first place is an obvious benefit.


I think I see what you mean now. Thanks.


I am immediately nerd-sniped by this. Is there any code out there you know of that I can see? The dual meta-data pages sound precisely like double-buffering (from graphics; my domain of expertise). I am also drawn to append-only-like and persistent data structures.


LMDB works like this. The MDB_env struct keeps pointers to both meta pages:

https://github.com/LMDB/lmdb/blob/30288c72573ceac719627183f1...

Which meta page is used is determined by the transaction ID (even IDs use the first, odd IDs the second).
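
In sketch form (just the idea, not LMDB's actual code):

  def meta_slot_for_txn(txn_id: int) -> int:
      # Even transaction IDs use the first meta page, odd IDs the second, so
      # the committed and pending meta pages alternate between two fixed
      # locations at the start of the file.
      return txn_id % 2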

This is the earliest description I can find of the "shadow paging" concept: https://dl.acm.org/doi/10.1145/320521.320540

And I believe its first implementation was in IBM's System R.


LMDB is the poster child of CoW btree implementations. That paper is a good find. Thanks.


I had never read any papers on shadow paging when I wrote LMDB. I refer to it as double-buffering in the original LMDB paper because of my own prior experience writing graphics code. You can find the (L)MDB paper here https://openldap.org/pub/

Other commenters noting that append-only isn't efficient enough for general use are correct. We found this as well in the early days of testing LMDB with append-only behavior; 99% of pages were obsolete after only 5 transactions. That's why LMDB is CoW but not append-only, and that's why it uses a double-buffered meta page.

Bzero's work on append-only Btree was educational and illuminating, but not usable in the real world.


This description reminds me of a document I read about how ZFS is implemented. In particular, how snapshots work in ZFS, and what happens when you delete a snapshot.


It's a very similar idea.

Traditional Unix-ish filesystems, with inodes and indirect nodes, are a lot like b-trees, but ones where the indices are block numbers and where you can only append indices, trim, or replace indices.

The ZIL (ZFS intent log) is a mechanism for amortizing the write magnification of copy on write trees.


Naively thinking here, how about a setup with two immutable b+trees:

The "hot" tree is the WAL: All the data is there and copies are generated.

The "cold" tree is behind: In the background move from WAL and at the same time compact it.


That's how the traditional transaction log + btree approach works. The append-only btree removes the need for a separate transaction log.


I think you could view the append-only B-tree as a deamortized version of the traditional update-in-place B-tree + WAL. It's true that it eliminates the problem of maintaining a separate WAL, but only by trading it for an arguably harder problem: compacting the B-tree. It's easy and cheap to truncate the WAL; it's difficult and expensive to compact the B-tree. I guess LMDB solves this problem by only updating at page-level granularity so it can reuse entire pages without compaction, although I haven't studied its design in much detail.


If by compacting the tree you meant compacting the space left over by users deleting records, it's about the same amount of work between the two approaches. The WAL+btree case needs to merge the near-empty pages and copy the remaining records, while the append-only btree case needs to allocate new pages to merge in the near-empty pages and fix up the parent path, which can be done as a batched garbage collection phase.

If by compacting you meant cleaning up the garbage pages, while the WAL+btree is simpler in truncating the WAL, the append-only btree is pretty easy. It's just doing upkeep on the free-page list.

There are only three places to pay attention to: 1. Any page touched by a transaction (read or write) is added to the in-use list. 2. When a transaction closes, its touched pages are removed from the in-use list by decrementing their in-use counters. 3. When a write transaction commits, all the old overwritten pages are added to a pending-delete list. Periodically check the pending-delete list against the in-use list; any page no longer in use is moved to the deleted-queue. When the deleted-queue reaches a large enough batch, create a new free-page to contain the pointers of the deleted pages from the queue. Chain it to the existing free-page list by storing the pointer to the current head of the list in the new free-page, append the new free-page to the db file, and update the meta page so the new free-page is the new head of the free-page list. That's it.
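
A rough sketch of that bookkeeping (all names made up; db.append_free_page is a hypothetical hook standing in for the free-page write and meta-page update described above):

  from collections import Counter, deque

  class Reclaimer:
      BATCH = 64   # freed pages per new free-list page (arbitrary)

      def __init__(self):
          self.in_use = Counter()       # page -> number of open txns touching it
          self.pending_delete = set()   # overwritten pages awaiting quiescence
          self.deleted_queue = deque()  # reclaimable, waiting to be batched

      def txn_touch(self, pages):                  # (1) pages read or written
          self.in_use.update(pages)

      def txn_close(self, pages):                  # (2) reader/writer finished
          self.in_use.subtract(pages)
          self.in_use += Counter()                 # drop zero counts

      def txn_commit_overwrites(self, old_pages):  # (3) old versions of pages
          self.pending_delete.update(old_pages)

      def sweep(self, db):
          """Move quiescent pages to the queue; batch them into a free-page."""
          for page in list(self.pending_delete):
              if self.in_use[page] == 0:
                  self.pending_delete.discard(page)
                  self.deleted_queue.append(page)
          while len(self.deleted_queue) >= self.BATCH:
              freed = [self.deleted_queue.popleft() for _ in range(self.BATCH)]
              # Hypothetical: append a free-page chaining to the old list head
              # and update the meta page, as a write transaction.
              db.append_free_page(freed)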


Of course, this implies a single-writer design (not necessarily a bad thing!).


Actually the immutable btree has a single writer in general, since there's no transaction log and the meta page points to one latest tree root. Even if two write transactions update different parts of the tree, they both need to update the same meta page, causing contention. This is one downside (or upside, depending on your needs, since a single writer simplifies things a lot and is great for sequential writes).


You could quite easily check if anything you care about was touched by the previous lock holder, so that they could indeed write everything but the root in parallel/concurrently. E.g. note batched operations, like "insert list elements into the immutable container", or transactions where you don't even want to see results that soon.


The two writers will have two different new root nodes which need to be merged, and that could trigger further merging of lower-level new nodes recursively. It becomes a merging of two subtrees. The subtrees can be quite different if node splitting happens during inserts by the writers. It quickly becomes a complicated problem.


I think the insight here is that garbage is garbage and the generational hypothesis holds for DB trees as well.

So if we track and treat young nodes specially, we can then promote a subset of them into the permanent tree and dispose of the rest. Per the hypothesis, nodes in the permanent tree will also die, but at a lower rate.


LMDB [0] was derived from this project, as mentioned in its copyright notice section [1].

[0] http://www.lmdb.tech/doc/ [1] https://github.com/LMDB/lmdb/blob/30288c72573ceac719627183f1...


Importantly, the LMDB API allows append mode, and it is very fast.


One of the advantages of this kind of architecture is that if one reader is reading the old tree while it is replaced, it just works. Databases are really good at having multiple versions of the same data structure be accessible at the same time.

I hand rolled some similar data structures in higher-order languages when I couldn't find any Python packages that gave me the same capability. But I couldn't figure out what a good name would be for, say, a dictionary that could have multiple concurrent versions. So I never went anywhere with that.


As Phil Karlton said, “There are only two hard things in computer science: cache validation and naming things.” Seems like you found a way to work on both!

https://skeptics.stackexchange.com/questions/19836/has-phil-...


I always have to add the "and off-by-one" at the end.

Nothing like reading memory you don't own!


There are actually only two hard problems in computer science:

0) Cache invalidation

1) Naming things

5) Asynchronous callbacks

2) Off-by-one errors

3) Scope creep

6) Bounds checking


There is only one hard problem in software engineering: people.


Let’s be pedantic here and get all religious over the words “Science” and “hard”.

Computer “science” has a difficult conceptual problem with caching. The optimal cache, this science tells us, is indistinguishable from a fortune teller who is never wrong (oracle). Fortune telling is a “hard” problem for a science based on reasoning. The best we can do is hedge bets (which is what the science of caching focuses on).

This same science also has a difficulty with naming things. Now numbering things is easy and science loves maths and maths love sciences, but science and letters have a more difficult history. Science would approve of “factoryfactoryfactoryImpl” btw ... it’s “a rational scheme of naming”. .

Here we see a “science” that is facing actual difficulties.

The rest of your list are difficult but not “hard”. The science of these matters is clear and the rest is up to the “scientists” struggling with “scope creep” and “bounds checking” ..


invalidation?


Yes. Stupid speech to text and I didn’t double check. Thanks.


These things are usually called "persistent" versions of the thing, e.g. persistent Dictionary. Sometimes "immutably persistent" to distinguish it from the "durable" meaning of persistence i.e. written to disk.


It seems odd to me to call a data structure "persistent" when its purpose is to allow you to cheaply keep many similar transitory copies around until you throw them away.

The specific application was using dynamic programming to build up a complex data structure. You wind up with many copies of very similar data structures, and doing a deep copy each time is prohibitively expensive.


Still that's what they're called: https://en.wikipedia.org/wiki/Persistent_data_structure

Precisely because previous versions remain valid (persist) under modification.


There is a key difference between that, and what I created.

The difference is that persistent data structures allow you to traverse the history of the data structure to find past versions. By contrast I only allowed you to see the current version, returning a new version of the root any time you made a modification. As a result, the memory associated with the old version could be freed once nothing would want to access it again. And any version that you had could still be manipulated.

For the example that I was dealing with, both made sense. On top of that it made it possible to create a hashable version of the data structure that had a good chance (not a guarantee) of matching hashes when you arrived at the same data structure through different histories.


the data structures people normally call 'persistent' behave like what you implemented; they're ubiquitous in languages like clojure, haskell, and even ocaml that privilege immutability. the map module in ocaml's standard library is an example, and so is almost everything built in to clojure. 'traverse the history of the data structure to find past versions' is not how these persistent data structures normally work

there is an unfortunate terminology clash with 'persistent' in the sense of 'not vanishing after a power cycle', so i typically use the term 'fp-persistent', because this sense of 'persistent' is associated with functional programming. this has the disadvantage that it's a term i made up, so nobody knows what i mean until i explain it


> The difference is that persistent data structures allow you to traverse the history of the data structure to find past versions.

That capability is not a required property of persistent data structures. The copy-on-write, collect-the-unique-parts-when-a-root-is-freed semantics that you describe are exactly the common behaviour of persistent data structures in the wild.

Libraries like Rust's "im" even do some nifty optimisations where they combine the borrow checker with reference counting to only perform copy-on-write when the reference you have is non-unique.

So based on your description, you built a path-copying persistent data structure.

I'd recommend this book if you want to compare your work with the state of the art: https://books.google.de/books/about/Purely_Functional_Data_S...


Well, the difference isn't that great. You wrote: “the memory associated with the old version could be freed once nothing would want to access it again.” So all it really takes is something wanting.


Append-only structures are not efficient enough for general use. You're paying a steep price for creating a snapshot of the data for every single operation.

I saw this when I was playing with Scala's immutable and mutable data structures - written by the same team - ages ago. The immutable structures were much slower for common operations.

The fastest databases tend to use undo logs to re-construct snapshots when they are needed.


You can build real useful systems on them. For example, Gmail used to be implemented on top of GFS as two append-only files per user: the zlog and the index. The zlog held (zlib-compressed) operations such as "add this message" and "delete this message". The index was an append-only btree as described here. It was a true index in that it didn't include the actual message contents. Those were simply pointers into the log. Thus, it was much smaller than the zlog and its efficiency was less of a concern. Also, periodically both files were "compacted" by writing a fresh log (that built the current state in the most efficient way) and matching index. This was primarily for privacy (otherwise your deleted messages would never actually go away) but also improved efficiency.

Gmail's now built on top of Spanner so uses a log-structured merge tree. Still append-only files but a bit different. Files aren't per user but per some arbitrary key range boundary; no more btree; multiple layers with the top ones getting compacted more frequently; etc.


It's worth noting that Google's internal scale-out filesystem is append-only (for various distributed systems and operational reasons), so you end up with append-only data structures proliferating in that ecosystem. That does not necessarily mean that the append-only data structures were chosen for any technical merit other than ability to be implemented on the append-only filesystem.


So if we ignore the technical merits based on which append-only data structures were chosen in general, we don't necessarily have additional technical merits for any particular time append-only was chosen.

The reasons why append-only structures were chosen in general apply surprisingly often to any particular system that scales to large data, which would like to be robust to a wide variety of real world problems. You won't see it in looking at the data structure because the hard parts are abstracted away from you. But you'll see it if you try to reimplement the same system from scratch, then scale it up to production.


Don't forget the security implications. If only root can run the compression/deletion script, all the compromised user can do is try to write to theirs. If you get into writing other users' data, that sucks, but nothing is deleted or exfiltrated, and the log is baked in. Break into root, well good luck with that on any system.


The subtle point I read in GP is the historic correlation between storage systems and data structures. Your point is equally valid in that this correlation is not indicative of non-general applicability.

Both points need to be kept in mind imho in design.


Good point. It's also worth noting that raw flash also is generally treated as roughly append-only for wear leveling. "General-purpose" may be in the eye of the beholder, but I'd say these append-only data structures are useful in at least two significant environments.


Wear leveling generally happens well below the user-level filesystem (and is quite complicated!), and altering your user-level behavior because you think it helps is a little bit silly. Zoned NVMe is an obvious exception to this, where the FTL takes advantage of the append-only zones (even that is an abstraction only shown to filesystems), but it will frequently remap your blocks if you do a lot of read-modify-writes to keep the wear even.


True, which is why I called out "raw" flash in particular. I think there are embedded cases for example where it might make sense to skip that layer. Even on general purpose machines I think it'd be interesting to see alternate filesystem models that avoid double logging for databases, but I don't know if it will ever happen...

> Zoned NVMe is an obvious exception to this

And host-managed SMR HDDs, as namibj pointed out. I still haven't managed to get my hands on one, though; they seem to be hyperscaler-only for now.


Even "raw" flash as exposed in most Linux servers is not anywhere near raw. For embedded use cases, it absolutely matters, but if you have Linux and an enterprise NVMe SSD, nothing you do in userland will matter.


Well, you still don't get fine-grained random deletions on Zoned NVMe. Also, there is another storage target which expects append-only style: SMR HDDs. They don't support random writes into a zone.


What you described is called event sourcing


LMDB is a strong counter-example to your argument. Based on its benchmarks, you're wrong. [1]

Yes, immutable in-memory data structures are slower than their mutable counterparts, but we're not talking about either of those here.

Databases are not in-memory data structures.

[1] http://www.lmdb.tech/bench/microbench/july/


Scala's immutable data structures, and especially the way they are used idiomatically and most obviously, have a lot of performance issues that are much bigger than the basic cost of persistent data structures. Namely, most commonly I will see developers default to pipelining strict operations like myList.map(...).filter(...).flatMap(...).collect(...).take(2) which forces/materializes the whole collection in memory for every operation. Better would be to first transform the list into a lazy view or iterator with myList.view or myList.iterator, and then do the pipeline.

They also lack the ability to perform multiple updates in a batch, except for some very limited cases. Other implementations like Clojure's support "transients" where you get access to mutate the data structure over and over again (as well as do reads), and then freeze the structure in place as a new persistent collection. JavaScript libraries like immer allow for the same thing. Scala's collections don't generally support this except in the form of "builders" which don't support reads and also don't support all the write access patterns (such as updating a vector at a specific index, or removing a key from a map/set, etc).


I don't think this is generally true.

Especially on mediums where sequentially writing large blocks is faster than random writes, you get much better performance to use a log-structured datastructure and put anything new/any changes at the end.

Due to the design of modern SSDs, 'write in place' is pretty much an illusion.


The biggest cost to copy-on-write data structures is write magnification. Adding a log to amortize that cost helps a lot. That's what the ZFS intent log does. Any CoW b-tree scheme can benefit from that approach.

The benefits of CoW data structures are tremendous:

  - easy to reason about
  - easy to multi-thread for reading
  - O(1) snapshots (and clones)
The downsides of CoW data structures are mainly:

  - the need to amortize write magnification
  - difficulty in multi-threading writes


A mutable approach requires a write ahead log meaning you have to copy the data twice on every write which seems worse.


There's an obvious tradeoff. With a 4K page size, the breakeven point is around 768 bytes per record. For larger records, the WAL approach is less efficient than LMDB. For smaller records, the WAL is more efficient in terms of write amplification. All thoroughly explored here http://www.lmdb.tech/bench/ondisk/


It is: two writes for the write-ahead log vs. log(n) (tree-height) writes for CoW.
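
Back-of-the-envelope, with made-up numbers (fixed-size pages, fanout around 100):

  import math

  def cow_writes(num_records, fanout=100):
      # Path copy: roughly one new page per level, plus the meta page.
      height = max(1, math.ceil(math.log(num_records, fanout)))
      return height + 1

  def wal_writes():
      # One log append plus one eventual in-place page write (simplified).
      return 2

  print(cow_writes(10_000_000))  # ~5 pages per update
  print(wal_writes())            # 2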


In a well designed database system (hardware!) you aren't going to use the same CPU, disk or controller to write the log and the data structure. So the performance penalty if any is going to be low.


Low but non 0 and you’re still duplicating the values written to disk. So your SSD lifetime is decreased and this is effectively a RAID0 layout which means data loss is risky too.


Now you’re paying for double the hardware.


You're comparing apples and oranges. Anytime you write/modify on an SSD you're writing on a new page. So you might as well leverage this behavior to your advantage. It has several benefits when it comes to implementing transactions.


You probably need something special for batch operations. Adding five records in one transaction shouldn’t pay >= 5 times as much as a single one.

You need a different API for something like that however.


Yes, for instance Clojure's "transients" as well as the JavaScript "immer" library


That usually happens when you try to work with immutable data structures like you are used to with mutable datastructures.

For example, if you append to a mutable list then it's going to be fast. But prepending to it is much slower. With immutable lists, it's the other way around. Not knowing that will make you think that immutable datastructures are generally slow, but they are not.

That being said, I would say they are generally more tricky, so it's good to understand in which cases it's worth sacrificing safety and readability and switching to mutable data structures for performance reasons.


IF reads dominate writes, and multiple threads read the data concurrently, this approach may win by eliminating the need for locks.


Some years ago I was playing around with the idea of an append-only splay tree for storing a database file. The thinking was that the algorithm would keep recently-accessed items at the business end of the log (in addition to any new/updated/deleted items).

This concept is clearly quite heavy on writes, but the tradeoff is that you would then have the ability to expire unused entries simply by picking an offset in the log and chopping everything that precedes it. Any record accessed more recently than that cutoff point would be guaranteed to have been written one or more times later on in the log.

Any access would result in a new modified tree & root node being written, but the prototype did batch IO using an MPSC queue abstraction which meant that I could amortize things a bit. Multiple transactions could fit in a single IO command if they are issued from different threads, occur within a small slice of time and are smaller than the block size.


One thing that's not explicitly mentioned: append-only trees necessarily lack `parent` pointers.

This means that your "iterator" can't be a lightweight type, it has to contain an array of parent nodes to visit (again) later. You can use a fixed-size array if you can reason about the maximum height of the tree (for a balanced binary tree, this means around `2*POINTER_BITS`, which is 1024 bytes on a 64-bit platform; it will be less for a B-tree (`2*log(POINTER_MAX, CHILDREN_PER_NODE)`) but you now have to either track or re-scan for the index of the child you just returned from). Beware buffer overflow logic bugs if your tree isn't as balanced as you thought!
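
For illustration, a sketch of in-order iteration that carries the missing parent pointers as an explicit ancestor stack (hypothetical Node with left/key/right fields):

  def in_order(root):
      stack = []                      # replaces parent pointers; its depth is
      node = root                     # bounded by the height of the tree
      while stack or node is not None:
          while node is not None:     # dive left, remembering ancestors
              stack.append(node)
              node = node.left
          node = stack.pop()          # visit the deepest unvisited ancestor
          yield node.key
          node = node.right           # then descend into its right subtree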


Every historical version of the database is available as a snapshot. The metadata block should point back to previous meta block(s), not just the new root

And of course emphasizing the closing statement:

> there is no need for a transaction log, because the database file is the transaction log


In such a system, how does a reader find the root node? I'd be concerned about a) readers seeing a partial write to disk (or page buffer) during an update or b) a crash while the writer writes to disk.

I could imagine using two files, one containing the actual b-tree, the second containing the offset to the latest root note; the second file gets overwritten only after a successful write is verifiably written to disk.

Datomic's (https://www.datomic.com/) architecture is similar to this, but uses many write-once "segments" (which could be files in EFS or S3, or rows in DynamoDB or other stores).


The pointer to the root tree node is stored in the last committed metadata page. A read transaction starts by reading the metadata page. Reaching partially written pages is impossible because the writer has not committed the latest metadata page yet. The transaction is committed when the latest metadata page is written as the last step.
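
A sketch of that commit ordering (hypothetical; assumes an already-open file descriptor and that a single meta-page write is effectively atomic):

  import os

  def commit(fd, tree_pages, meta_page_bytes, meta_offset):
      """The meta-page write is the commit point."""
      # 1. Write all new tree pages; readers can't see them yet because no
      #    committed meta page references them.
      for offset, data in tree_pages:
          os.pwrite(fd, data, offset)
      os.fsync(fd)                          # make the tree pages durable first
      # 2. Only then publish them by writing the meta page last.
      os.pwrite(fd, meta_page_bytes, meta_offset)
      os.fsync(fd)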


A read txn starts with reading the last page of the file and searching backward for a valid metapage. This can be time-consuming if the previous run crashed while a large write was in progress. Yet another reason LMDB doesn't use an append-only design.


There’s a “canonical” pointer to the root node which is updated atomically on every append.

For an in-memory database, a CAS is adequate. Persistent stores ultimately need some kind of file-level locking.

If you look at the Apache Iceberg spec, you get a good idea of how this works: The only “mutability” in that universe is the root table pointer in the catalog.


Instead of cloning the branches all the way up to the root, wouldn't it be more efficient to just write delta records, i.e. a structure per inner node that says "it's like this original node, but with the following modifications x, y, z"?


Efficient in what way?

Yes, a delta record would be smaller. But now you have to fetch both the new and the old, plus do computation to figure out what you have. This is trading off space and time.

I'm going to go to a reasonably low level to explain this.

Databases have generally found that the right tradeoff between space and time is to always work in terms of pages of fixed size. Now all reads are memory aligned, of known size. All writes are as well. And inside the CPU, processing a page in the most straightforward and predictable way possible is very fast. In particular you want to avoid complex logic that introduces too many pipeline stalls.

If you're going to work with pages anyways, you want to find ways to fetch as few pages as possible. Only have this page point to that page where you really need to. And put as much as reasonable on each page. The name of the game is to fetch as few pages as you need, and get everything you need from a page when you fetch it. Because when you're getting a new page, often you have to wait for a disk read. That's slow. It used to be really slow, you needed to wait for the right part of the disk to rotate around. Those disks went at something like 7200 rpm, but that means 120 revolutions per second, which means you're waiting for anywhere from 0 to 8.333... milliseconds for the read. Now consider reading through a million record database...

That's where a BTree comes in. It is a tree structure built around a page layout. When a page gets too full, it is split and one record is promoted to the level above. When the top page gets too full, a new level is created at top. That keeps it perfectly balanced at all times. So you can get to any record in very few reads. And if you're walking the whole data structure, a single page generally has many records.
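
For illustration, the "split and promote" step in sketch form (tiny capacity, parent and root handling omitted):

  PAGE_CAPACITY = 4   # keys per page; tiny, for illustration

  def split_leaf(entries):
      """entries: sorted list of (key, value) pairs from an over-full page."""
      mid = len(entries) // 2
      left, right = entries[:mid], entries[mid:]
      separator = right[0][0]        # smallest key of the right page gets
      return left, right, separator  # promoted (copied) into the parent

  # e.g. a page that grew past PAGE_CAPACITY:
  page = [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'e')]
  print(split_leaf(page))  # ([(1, 'a'), (2, 'b')], [(3, 'c'), (4, 'd'), (5, 'e')], 3)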


You may want to rebalance a tree, or you may want to have multiple readers while a writer exists without relying on mutexes or locking.


as long as you don't modify the original nodes that would still work, no locks needed.


One enhancement I've toyed with in the past is to have subtrees beneath each leaf node for the historical data of each record. You switch the indexing value to the timestamp of record insertion time or a TX number and keep inserting in append-only fashion. There are some enhancements you can make because the common case is for the key to be a monotonically increasing number, but nothing is strictly necessary. I'd love to build a datalog system on top of something like that at some point, even if it's just for fun.


with append-only data structures, you could turn the fact that you rewrite the path from the node to the root to your advantage and rebalance. This means you abandon the "balanced" property and use update statistics to get an optimal shape. Who cares about the length of paths to parts of the tree you never visit?


this is how splay trees work, but you might actually care about the lengths of paths to parts of the tree you rarely visit; for many real systems, worst-case response time is more important than throughput


IIRC, this is how CouchDB was implemented. The benefit is that it was resilient to crash without needing a write-ahead log. The downside was that it required running background compaction of the B+tree regularly.


LMDB works the same way. But it also stores a second data structure on disk listing all free blocks. Writes preferentially take free blocks when they’re available. The result is it doesn’t need any special compaction step. It sort of automatically compacts constantly.
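
In sketch form (made-up structures, not LMDB's actual allocator): writes draw from the free list first and only grow the file when it's empty.

  class PageAllocator:
      def __init__(self, file_size_pages, free_pages=None):
          self.next_new = file_size_pages     # first page number past end of file
          self.free = list(free_pages or [])  # pages no open reader can still see

      def alloc(self):
          if self.free:
              return self.free.pop()          # reuse: keeps the file from growing
          page = self.next_new                # otherwise append a fresh page
          self.next_new += 1
          return page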



