Counted B-Trees (2017) (greenend.org.uk)
98 points by zerojames 9 months ago | 49 comments



Some here are questioning the motivation of this data structure. It is well motivated at least in the context of time series. We (almost) all learn 2 things at very young ages: 1) the world changes and 2) even one wild point can pollute an average. In time series contexts, we also learn early solutions to both: for 1), a moving data window (flat or fancily weighted over time), and for 2), percentiles/quantiles/order statistics, which are more robust.

Combining the two solutions means you want to efficiently delete old/insert new/query quantiles over moving data windows. If you need multiple, exact quantiles (e.g. 1%, 25%, median, 75%, 99% is a classic case) over large data windows (in numbers of points), a counted B-tree is about as good as you can do, and can be many orders of magnitude less work than more naive solutions (which absolutely show up In The Wild - I have personally seen them many times).
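As a minimal sketch of the moving-window case: the Python below uses the third-party sortedcontainers package, whose SortedList offers O(log n) insert/delete and positional access much like a counted B-tree (window size and quantile levels here are arbitrary illustration values).

    from collections import deque
    from sortedcontainers import SortedList

    def moving_quantiles(stream, window=1000, qs=(0.01, 0.25, 0.5, 0.75, 0.99)):
        win = deque()          # arrival order, so we know which point to evict
        ranked = SortedList()  # order statistics over the current window
        for x in stream:
            win.append(x)
            ranked.add(x)                     # O(log n) insert
            if len(win) > window:
                ranked.remove(win.popleft())  # O(log n) delete of the oldest point
            # k-th order statistic by position: O(log n) each, no re-sorting
            yield [ranked[int(q * (len(ranked) - 1))] for q in qs]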

There is a case the article does not address, when you have many-way ties aka many duplicate keys, but that can be handled within the same general framework.

If you know something about the dynamic range of your data then you can also get more approximate dynamic quantiles with a histogram with log/exp-spaced bins (like floats) with counters backed by a Fenwick Tree (to make both PDF & CDF updates efficient). You can further boost accuracy with Parzen mid-quantile interpolation. A fully worked example in Nim is at https://github.com/c-blake/adix/blob/master/adix/lghisto.nim. In the same github repo there is also a Nim implementation of counted B-trees (though it is harder to use).
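A rough sketch of that histogram idea (bin parameters chosen only for illustration; Parzen interpolation omitted; the linked lghisto.nim is the careful version). The Fenwick tree over the bin counters makes both the count update (PDF) and the cumulative search needed for quantiles (CDF) O(log nbins):

    import math

    class LogHistQuantiles:
        def __init__(self, lo=1e-9, hi=1e9, bins_per_decade=100):
            self.lo, self.b = lo, bins_per_decade
            self.n = int(math.log10(hi / lo) * bins_per_decade) + 1
            self.tree = [0] * (self.n + 1)   # Fenwick tree, 1-indexed
            self.total = 0

        def _bin(self, x):                   # map a positive value to a log-spaced bin
            return min(self.n - 1, max(0, int(math.log10(x / self.lo) * self.b)))

        def add(self, x):
            i, self.total = self._bin(x) + 1, self.total + 1
            while i <= self.n:               # Fenwick point update
                self.tree[i] += 1
                i += i & -i

        def quantile(self, q):
            # Descend the Fenwick tree to the first bin whose cumulative
            # count reaches q*total (a CDF search in O(log nbins)).
            rem, pos, step = q * self.total, 0, 1 << self.n.bit_length()
            while step:
                nxt = pos + step
                if nxt <= self.n and self.tree[nxt] < rem:
                    rem -= self.tree[nxt]
                    pos = nxt
                step >>= 1
            return self.lo * 10 ** (pos / self.b)   # lower edge of that bin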


this is an excellent example of when you'd want to use counted b-trees, but much more generally, they provide a middle ground between arrays and linked lists

in arrays, indexing is fast (one instruction), but insertion and deletion are slow

in linked lists, indexing is slow, but insertion and deletion are fast (≈10 instructions)

(iteration is fast in both, though on modern cpus you have to store multiple items per linked-list node for that to remain true)

in counted b-trees, both indexing and insertion/deletion are fast, though not as fast as in arrays or linked lists: logarithmic time instead of constant time, with a larger constant factor
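for readers who haven't seen the trick, here is a toy sketch of the counting descent in python, shown on a plain binary tree for brevity (a real counted b-tree keeps counts for several subtrees per node and stays balanced, but the select logic is the same idea; the build helper is only there to make the demo runnable):

    class Node:
        def __init__(self, value):
            self.value, self.left, self.right = value, None, None
            self.size = 1                          # items in this subtree

    def build(sorted_items):
        # build a balanced counted tree from already-sorted items (demo only)
        if not sorted_items:
            return None
        mid = len(sorted_items) // 2
        node = Node(sorted_items[mid])
        node.left, node.right = build(sorted_items[:mid]), build(sorted_items[mid + 1:])
        node.size += (node.left.size if node.left else 0) + \
                     (node.right.size if node.right else 0)
        return node

    def select(node, k):
        # return the k-th smallest item (0-based) by descending on counts
        while node is not None:
            left_size = node.left.size if node.left else 0
            if k < left_size:                      # answer is in the left subtree
                node = node.left
            elif k == left_size:                   # this node is the k-th item
                return node.value
            else:                                  # skip left subtree and this node
                k -= left_size + 1
                node = node.right
        raise IndexError("k out of range")

    assert select(build(list("abcdefg")), 3) == "d"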


Also known as an order-statistic tree. Good for obscure leetcode problems, though I've never needed to implement one in my whole career.

[1] https://en.wikipedia.org/wiki/Order_statistic_tree


I’ve done a few implementations of this data structure recently. It’s an essential part of making a high performance CRDT for collaborative text editing.

I haven’t found any decent 3rd party order statistic tree online that I can use for CRDTs for 2 reasons:

1. I want my tree to be internally run-length encoded. Ie, each item in my tree represents a run of adjacent inserted characters in the document. In my tests, this seems to make the data structure about 20x smaller. But it adds complexity - we also need to be able to later split those runs if we discover an insert in the middle of any run.

2. I also need random access. Items have a unique ID, and I need to be able to find any item by looking up its ID. The IDs are just auto incremented integers to make the run length encoding work.

You end up with a pretty custom set of data structures, but the result is a system which performs fabulously well in practice. The fastest implementation of this stuff that I know of is the Cola CRDT library, which is written in 100% safe rust.


FYI "run-length encoded" means compressing repetitions. "123 repeats of the character 'x'" would be RLE. You seem to be talking about keeping a tree-of-strings instead of tree-of-characters, to decrease the overhead.

https://en.wikipedia.org/wiki/Run-length_encoding


It’s not “123 repeats of the character x”, and it’s not a tree of characters either. The sibling comment is right.

The data is crdt metadata. If you type a series of characters, each character has some fields. In my case, this is an id (autoincrementing integer), and origin-left and origin-right (the id of the items immediately to the left and right when the character was typed).

If a user types a continuous run of 100 characters, we don’t need to actually store all 100 IDs - since they’re just, say, the numbers 1000-1099 in order. The origin-right for all items will be the same. And we don’t need to store the origin-left for each one. The first item in the run has an unguessable left origin (say, 600). But the rest will have an origin-left equal to the id of the previous item. Ie, 1000, 1001, 1002, and so on up to 1098.

Naively you’d need 300 integers to store the metadata for 100 contiguous items. But if you’re clever, a run of 100 items only needs to store a start id, length (or end id), the origin-left of the first item in the run and the common origin-right for all the items in the run. That’s just 4 integers, which I store in a single fixed sized entry in the btree. Because of how humans actually type, this compression makes a big difference in practice.
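A hypothetical sketch of such a run entry in Python (the field names and numbers are just illustrations of the description above, not taken from any actual CRDT code):

    from dataclasses import dataclass

    @dataclass
    class Run:
        id_start: int      # id of the first character in the run
        length: int        # number of characters covered
        origin_left: int   # origin-left of the *first* item only
        origin_right: int  # shared origin-right for every item in the run

        def item(self, i):
            # reconstruct the full (id, origin-left, origin-right) of item i
            ident = self.id_start + i
            left = self.origin_left if i == 0 else ident - 1
            return (ident, left, self.origin_right)

        def split(self, i):
            # split before item i (0 < i < length), e.g. for a mid-run insert
            a = Run(self.id_start, i, self.origin_left, self.origin_right)
            b = Run(self.id_start + i, self.length - i,
                    self.id_start + i - 1, self.origin_right)
            return a, b

    run = Run(id_start=1000, length=100, origin_left=600, origin_right=2000)
    assert run.item(0) == (1000, 600, 2000)
    assert run.item(5) == (1005, 1004, 2000)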

In code, I have a pair of rust traits for Splitable and Mergable which define functions like len(), can_append(), append(), truncate(), split() and so on. Then I have a generic order statistic tree implementation which stores Splitable & Mergable items.

Run-length encoding is the best term I’ve seen for this kind of packing. It’s definitely not a tree of strings - the document text itself is stored in a completely separate data structure. (In my case, a skip list).

(Credit for the run length encoding goes to Kevin Jahns, author of Yjs.)


I would recommend not using a well-established term for something that doesn't actually match the well-established term.


I’m encoding a set of runs. Each run has a value component and a length. The only difference from traditional RLE is that, when expanded, the values are computed from base+offset rather than repeated. Eg the IDs expand to 1000, 1001, 1002, … instead of just repeating: 1000, 1000, 1000, …

What would you call that?


I can't tell from your writing if this is really a compression of an output format or an inherent part of your data structure. What determines when the repeating or sequential IDs are skipped? Is it part of the shape of the data structure, or is it a pure byte manipulation at encode time?

If it's purely in the encoder, say something like "The encoded runs omit repetitive or sequential values of IDs", and describe how the decoder knows what was done (e.g. naive RLE might have to say 1h1e2l1o, repeating lengths everywhere).

If it's part of the data structure, say something like "Within a <chunk>, only origin-left of the first item and the common origin-right need to be stored, as the individual item origin-left and origin-right values can be derived from those."


> I can't tell from your writing if this is really a compression of an output format or an inherent part of your data structure.

It’s both. It’s part of the internal in-memory data structure, and part of the encoded output format.

At runtime, the data structure also needs to support split and join operations (when possible). For example, a user types a paragraph of text then moves their cursor to the middle of the paragraph and types more text. Initially we create (and extend) a single run representing the paragraph as it’s typed. Inserting in the middle of the run splits it around the added text, so we end up with 3 entries in our order statistic tree.

I like your suggested text - but we’re talking about this in the context of order statistic trees. I have a generic order statistic tree implementation that splits and, when possible, joins the stored items. I’ve been calling that “internal run length encoding”, but you’re right that that’s not the common use of the “rle” term. But I would like a term for that that doesn’t involve explaining my actual data. My order statistic tree implementation doesn’t understand origin-left and origin-right values. That’s an implementation detail of how my particular data implements its splitting and joining operations.


That sounds like how the Zed editor's SumTree is parameterized with Summary, which is whatever means of summarizing information makes sense for that use.

https://zed.dev/blog/zed-decoded-rope-sumtree


Yes they’re quite similar. And I don’t think that’s a coincidence.

But they have some important differences too:

- Sumtree at its heart is a tree of characters, with a summary per node in the tree. I haven’t read that link in detail, but I assume summaries are derived from the stored text itself. I’m storing a tree of metadata entries, not text. The metadata is not derived from the text. A string of 100 characters in the document may be summarised by 1 metadata entry, or may need up to 100 entries - 1 for each character. Each leaf node in my tree stores up to 32 metadata entries. These entries are split when needed, and opportunistically joined. Entries describe characters which were at some point inserted. (We also need to keep around the metadata even after the text has been deleted.)

- My metadata entries also have IDs, and my data structure also needs to be able to look up any item in log(n) time via its ID.

The text itself is stored in a separate data structure. I assume any text editor which integrates a crdt will, like zed, have its own data structure for this purpose already.

You’re right that the data structures are similar, but the implementations are not interchangeable.


you're rle-compressing the first differences of the ids (x[n] - x[n-1]) rather than the ids themselves. though maybe only in the special case of a run of 1s


Yep, only in the case of runs of 1s - since that’s the predominant case when typing.

Runs with other difference values show up when typing in multi-cursor mode in vscode and other editors, but I haven’t optimised my data structure for that use case.


yeah, that seems quite reasonable

reversibly transforming a data sequence into a more easily compressible one of the same size (sometimes called 'whitening' or 'predicting') is a very general technique for data compression, and first differences are a very commonly used transformation here. others include the burrows–wheeler transform, paeth prediction, and linear predictive coding
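a tiny illustration of that, reusing the id example from upthread: taking first differences turns a run of consecutive ids into a run of 1s, which plain rle then crushes

    def deltas(xs):
        # first differences, keeping the first value so the transform is reversible
        return [xs[0]] + [b - a for a, b in zip(xs, xs[1:])]

    def rle(xs):
        out = []
        for x in xs:
            if out and out[-1][0] == x:
                out[-1][1] += 1
            else:
                out.append([x, 1])
        return out              # list of [value, run_length] pairs

    ids = list(range(1000, 1100))   # 100 consecutive ids
    print(rle(deltas(ids)))         # [[1000, 1], [1, 99]]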


I didn't know that had a name! Thanks - time to do some reading.


It's almost like a RangeMap, of which there are at least 3 notable variations:

* The basic RangeMap, where a range of adjacent keys all map to the same value

* The computed RangeMap, where a range of adjacent keys all map to adjacent values (common for e.g. case-conversion of Unicode, can also come up for IP addresses - see the sketch below)

* The DenseMap, where a range of adjacent keys map to arbitrary values, but an array is used for memory optimization.

The first two actually can be made to work for string-like keys. The last is only possible for integer-like keys; use a carefully chosen trie design for string-like keys.
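A hypothetical Python sketch of the computed RangeMap variant (the class name and the case-conversion example are just illustrations): adjacent keys map to adjacent values, so each stored range needs only a start key, a length, and the value of the first key; bisect gives O(log n) lookup.

    import bisect

    class ComputedRangeMap:
        def __init__(self, ranges):
            # ranges: sorted list of (start_key, length, first_value)
            self.starts = [r[0] for r in ranges]
            self.ranges = ranges

        def get(self, key):
            i = bisect.bisect_right(self.starts, key) - 1
            if i >= 0:
                start, length, first = self.ranges[i]
                if key < start + length:
                    return first + (key - start)   # value computed, not stored
            raise KeyError(key)

    # e.g. ASCII lower->upper case as one 26-key range mapping to adjacent values
    upper = ComputedRangeMap([(ord('a'), 26, ord('A'))])
    assert chr(upper.get(ord('q'))) == 'Q'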


> Good for obscure leetcode problems (...)

God, I hate how pointless leetcode has become, and how it's used in interviews. Smug assholes using obscure data structures and algorithms trivia that no one ever used at all in any professional setting, and which the interviewer only learned about because he spent a week searching for a coding problem, to decide whether you are competent at putting a button on a window.


It's also about your communication skills and ability to use existing knowledge even when it isn't the most suitable. A while ago in an interview I was asked a problem for which the correct solution is in fact an order statistic tree, but not knowing that, I solved the problem using more standard data structures with worse complexity, and pointed out their limitation. I still passed that interview.


> that no one ever used at all in any professional setting, (…), to decide whether you are competent at putting a button on a window.

Some software jobs involve more than putting a button on a window.


Maybe they should keep interviewing for jobs that require nothing more complex, important, or dangerous than that.


I mean tbh, a list structure with logN indexed access, insert, and removal seems like a really useful structure.

I just wish it was built into languages as one implementation of the list type


> I mean tbh, a list structure with logN indexed access, insert, and removal seems like a really useful structure.

Indeed B-trees are hardly novel.

The point is that no one has a practical need to implement them themselves, unless you're working on rolling out your own database engine. If that's the case then either you're doing it for fun or there were steps in the decision process whose justification I'd love to hear.


> I'd love to hear the justification.

This is the right data structure to use for collaborative text editing. I made another comment in this thread going into the details. The fastest implementation I know of is Cola, and the author has some great posts going into the details if you’re interested:

https://nomad.foo/blog/cola


Sometimes, we get paid to hit the high notes, even if we don't sing them all the time, or even every day. See sibling josephg comment on use in CRDT implementation.


A lot of PLangs have this outside their stdlibs. For example, in Python: https://pypi.org/project/blist/

Some C & Nim variants are elsethread (https://news.ycombinator.com/item?id=40441832 & https://news.ycombinator.com/user?id=cb321 , respectively).

--

There seems to be a lot of disagreement about how big stdlibs should be. Personally, I think a large Python one was a selling point (batteries included), much as the C++ STL and similar Java libraries won over many back in the 1990s. However, in today's modern networked-by-assumption world, stdlib maintainers seem to feel overwhelmed pretty easily and eager to "slough off" as much as they can to the ecosystem. It's likely a sub-discussion of an even more broadly contentious question: how narrow/wide should dependency trees be? What seems to be missing, IMO, are agreed upon "metrics" beyond "saleability of the PL/ecosystem".

As guidance for granularity metric relevance, it should be observed that most PL's have some kind of module/separate compilation units. A good answer would be able to work at (and across) many & various levels of the containment hierarchy - at least 3..4 of (procedure, type|module, package, developer, organization). Maybe someone has a set of such metrics in their back pocket, but if so they don't get a lot of exposure in PL chit-chat circles like this one. My guess would be there just isn't enough agreement, but it's probably been fodder for like 100s of software engineering PhD theses. Lol. ( As well as even broader human organization discussions like https://en.wikipedia.org/wiki/Conway's_law )


a relevant fact here is that python was, to a significant extent, guido's attempt to reimplement the abc teaching programming language without the design features that made it useless for anything but programming exercises; one of those, if i recall correctly, was that abc implemented lists as some kind of balanced tree structures (maybe avl trees?), in exactly the way this discussion is about, and was very slow as a result


Just to be clear, I never meant to suggest "only a b-tree", simply the option in the stdlib as per zug_zug (who I did not also take it as suggesting "only"). Different data structures for different use cases as always.

JavaScript famously implemented arrays in terms of hash tables and... In a similar vein, I've known people who mistakenly thought B-trees could replace a good hash table as the default &| go-to "associative container", but "order costs" (on one side or another) as you said elsewhere.

It's hard to say how much choice needs to be in a stdlib. Hence my exploring how to better say it. :-) E.g., Go has no b-tree in the stdlib, but has a bunch of crypto stuff. Nim has little to no crypto stuff, but has a critbits trie.

EDIT: Oh, and FWIW, at least abc-unix used B-trees for lists & tables, not AVL trees.


thank you for the correction about abc!

b-trees can sometimes replace a general hash table as the default associative array type. in part it depends on programming style and other parts of the language; in lisp, for example, the default associative array type was alists (linked lists of key-value pairs) for a long time, and emacs lisp still uses them pretty extensively. and i think you would be hard-pressed to find an sql implementation for which associative arrays are not usually b-trees—in part because in sql you tend to ask questions like 'which values correspond to these 1000 keys?' more often than 'which value corresponds to this key?', and in part for hysterical raisins. (when they wrote system r, oracle, and ingres, main memory was small, so locality of access was more important than it is now.)


FYI this is not that uncommon in functional languages and persistent data structure libraries.


And with predictably good cache locality on iteration. Given the ever-increasing gap between the fast and the slow end of our volatile memory hierarchies, I'm surprised that we don't see more of an identifiable trend of using algorithms/structures traditionally associated with spinning disks in nonpersistent memory.


B+ Trees (which store data at leaves) rather than B-Trees are what you want for good iteration performance usually.

An ordered B-Tree iteration would jump between different arrays in the tree often, but a B+ Tree would only have you jump to another array once you’re done iterating over all the elements in the current array.

The following article says the same. https://www.scylladb.com/2021/11/23/the-taming-of-the-b-tree....


I mean, fwiw sometimes "obscure data structures" really do cleanly, efficiently, and performantly solve a problem that would have been a nightmare to address with simpler, more well known structures. That said, for most of us, it's rare enough to be memorable - but it does happen.

I think you're right that requiring such knowledge off the top of one's head and in the context of a one-hour interview is almost always unreasonable, though.

What I'd really want to know is how well a person can do the research to arrive at performant solutions involving, as you say, obscure data structures. In an interview context though, I think you basically just have to ask them - would be nice if you could get a sense of it through doing, though.


Funnily enough this is (was?) used by the Scala collection library's TreeSet/TreeMap, for fast performance of the `take` and `drop` operations. This is a red-black tree though, not a B-Tree.

Commit https://github.com/scala/scala/pull/82/commits/b7e671446892c... of PR https://github.com/scala/scala/pull/82


Hard to evaluate your comment without knowing in what field your career is.


I have had to implement it but still I wouldn't use it directly for an interview.


what were you using it for


The order statistic tree was covered in the CLRS book. In the 4th edition, it is chapter 17 "Augmenting Data Structures", subchapter 1 "Dynamic order statistics". (I know for a fact it was present in the 3rd and 2nd editions too; did not check the 1st.)

I used this knowledge to implement a tree-based list data structure, with no set functionality at all: https://www.nayuki.io/page/avl-tree-list


By separating "seek" (min/max/key/nth) from "edit" (in-del) and "balance" (AVL, red-black, etc.) operations, you can indeed cover a lot of bases as elaborated here: https://github.com/c-blake/bst (including also how you handle duplicate keys as well as the perhaps less useful neither-ranked-nor-keyed access, wherein the trees are kind of a double-ended queue with just "edge access").


>perhaps less useful neither-ranked-nor-keyed access, wherein the trees are kind of a double-ended queue with just "edge access"

For normal btrees this is absolutely true. For b+trees you will see this basically everywhere.



probably if you like this data structure you will want to read raph levien's extensive treatise on its many variations and applications, specifically with respect to writing text editors, entitled "rope science": https://xi-editor.io/docs/rope_science_00.html


Scapegoat trees make use of subtree counts. But, according to Rivest's description of the algorithm, they are not stored; the counts are dynamically calculated when needed.

Possibly, a scapegoat tree implementation could be altered to store the counts in the nodes, and then similarly support indexed access. Perhaps the caching of the sizes could help the scapegoat operations also; I would have to dive into the details again to check this idea.

However, it's a virtue of the original algorithm that the nodes don't have to store any meta-data related to the algorithm. Not even one bit (like in the red-black tree case, for color).


I've been interested in this data structure for the ability to scan forward efficiently to any depth.

I am thinking of a binary protocol that can be navigated extremely efficiently and also be inserted into at any depth.


This approach is essentially what is called an "augmented data structure". Namely, in this case a B-tree is augmented to store an element count. But there are obviously many other data structures to which this can be applied, and many types of additional values that can be stored.

https://www.geeksforgeeks.org/introduction-to-augmented-data...

https://tildesites.bowdoin.edu/~ltoma/teaching/cs231/fall09/...

https://www.cs.toronto.edu/~tabrown/csc263/2014W/week4.inter...


Maybe this is useful for real time memory management? 'arrays' for instance, if dynamically sized, can always cause fragmentation. It feels like you can get relatively efficient traversal and random access in these counted b-trees with fixed size blocks all the way down.


A working example in Go: https://github.com/tidwall/btree


Good examples of this would be the "rope" used in various text editors and Clojure's persistent vector data type.


Does anyone know of a persistent/on-disk implementation of this kind of data structure?



