General Balanced Trees (1999) [pdf]

jandrewrogers · on Aug 18, 2019

Nice paper.

It is interesting to note how practical data structures have evolved over the last 20 years, since this paper was written. Balanced trees were pervasive back then and the primary workhorse in all database systems I designed or worked with. Today they are rarely used in new system designs, since the practical tradeoffs started making less and less sense over a decade ago. Balanced trees have a very high space overhead that becomes prohibitively costly as data sizes grow.

Succinct implicit tree structures are unbalanced, so in theory have higher latency variance, but they also can have a space overhead 2-3 orders of magnitude smaller. As a practical consequence, the cost of traversing a level in the tree is dramatically lower than for balanced trees. The number of levels traversed tends to be used in literature as a shorthand metric for "optimality" but it smuggles in the assumption that traversal costs are similar across algorithms in practice. Searching a nicely balanced O(logn) gigabyte metadata structure has a hard time competing with an equivalent unbalanced tree that fits the metadata entirely in L2 cache, even when the tree is very unbalanced (and the average case is much better).

I am not entirely sure when I stopped using balanced trees. I went from always using balanced trees to never using balanced trees, gradually over many years.

thickice · on Aug 18, 2019

Can you provide some references to open source implementations of Succinct implicit trees ? How different are they from a simple unbalanced BST ?

dpwm · on Aug 18, 2019

For anybody interested in these, Erlang has an implementation of both sets and trees apparently in its standard library. The documentation claims that they perform better than AVL trees.

trees: http://erlang.org/doc/man/gb_trees.html

sets: http://erlang.org/doc/man/gb_sets.html

source for trees: https://github.com/erlang/otp/blob/master/lib/stdlib/src/gb_...

brudgers · on Aug 18, 2019

I came across the paper in the Erlang documentation.

toast0 · on Aug 18, 2019

Erlang's ETS also has ordered_set tables which is a C based set, most likely better for performance, although certainly not as transparent.

jbapple · on Aug 18, 2019

These trees nearly identical to Scapegoat Trees, which were a rediscovery of the idea by different authors. https://cglab.ca/~morin/teaching/5408/refs/a99.pdf has the history:

"A preliminary version of this article is published in Proceedings of Workshop on Algorithms and Data Structures, WADS ’89, Ottawa. An independent article has also been presented by Galperin and Rivest at the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA."

nayuki · on Aug 18, 2019

I found it difficult to understand this paper. I understand the idea of balanced binary trees, but the paper gives a pile of inequalities and no clear actionable advice about how to construct efficient algorithms.

For self-balancing trees, my technique of choice is AVL trees. The implementation is short and the mathematical proof is straightforward. It is only a small constant factor worse than a completely balanced binary tree.

I searched for even simpler structures and found AA trees. I showed that their code is only a tiny bit shorter than AVL trees, but proving their correctness is noticeably more difficult. https://www.nayuki.io/page/aa-tree-set (see especially "Compared to AVL trees", BasicAaTreeSet.java, BasicAvlTreeSet.java).

In the real software applications though, B-trees are preferred because they play well with the cache/memory/disk hierarchy - the fact that sequential access is fast and random seeks are slow.

sesqu · on Aug 18, 2019

Log-height is a sufficient but not necessary condition. There are a few other ways generalize balanced trees, notably logit-weight.

jbapple · on Aug 18, 2019

But logit-weight trees have logarithmic height, yes?

sesqu · on Aug 18, 2019

Weight is a weaker condition, since you can construct a polynomial weight sequence that results in linear height.

In general, height is the easiest thing to restrict, but doing so restricts dynamic performance optimizations - you can't use splay trees, for instance

jbapple · on Aug 18, 2019

> Weight is a weaker condition, since you can construct a polynomial weight sequence that results in linear height.

I'd like to hear more. The varieties of weight-balanced trees I'm aware of all have logarithmic height.

> In general, height is the easiest thing to restrict, but doing so restricts dynamic performance optimizations - you can't use splay trees, for instance

Good point.

BTW, you might be interested in Bose et al.'s "De-amortizing Binary Search Trees" <https://arxiv.org/pdf/1111.1665.pdf>, shows how to keep height logarithmic with "essentially any Binary Search Tree" (their phrase).