A Cartesian tree (treap) with a split/merge update routine is a well-known data structure in competitive programming. The only difference from the paper would be the rank selection algorithm: uniform instead of geometric.
>> One can view a zip tree as a treap (Seidel and Aragon 1996) in which priority ties are allowed and
>> in which insertions and deletions are done by unmerging and merging paths ("unzipping" and "zipping")
Correct me if I'm wrong, but isn't this the "standard" [1] treap with split/merge (which is somewhat acknowledged in the footnote on page 8), simply with a different distribution of priorities? The result is certainly impressive, but it seems like a small modification to an otherwise known structure.
[1]: See here http://e-maxx.ru/algo/treap for an old reference of this structure. Google translate does a decent job with it, and the code is readable without translation.
I had a similar confusion when skimming the paper. The algorithm only depends on the relative ordering of node "ranks", rather than their absolute values. If you were to treat the distributions as continuous (ignoring ties) then it would make absolutely no difference what distribution you used. (Any two continuous distributions are related by a monotonic transformation, which preserves the ordering relationships between different points of the distribution.)
(A more concrete way of looking at it: a simple way of sampling from a geometric distribution is to choose a uniform random integer and count the number of leading 1 bits. The authors suggest concatenating a few random bits to the end for tie-breaking, so that the rank is a pair of a geometric random variable and a uniform random variable, sorted lexicographically. But such a pair is equivalent to just choosing a uniform random integer, and sorting by that!)
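For concreteness, here is a small Python sketch of the two rank schemes being compared above (the bit widths are arbitrary illustrative choices, not from the paper):

    import random

    GEOM_BITS = 32   # word used to derive the geometric part (illustrative)
    TIE_BITS = 8     # extra uniform tie-break bits (illustrative)

    def geometric_rank():
        # Count leading 1 bits of a uniform word: a Geometric(1/2) value
        # (capped at GEOM_BITS), then append uniform tie-break bits.
        # Pairs are compared lexicographically.
        w = random.getrandbits(GEOM_BITS)
        g = 0
        while g < GEOM_BITS and (w >> (GEOM_BITS - 1 - g)) & 1:
            g += 1
        return (g, random.getrandbits(TIE_BITS))

    def uniform_rank():
        # The "plain treap" alternative: one uniform word. Since only the
        # relative order of ranks matters, both schemes give the same
        # expected tree shape (ignoring how ties are handled).
        return random.getrandbits(GEOM_BITS + TIE_BITS)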
Upon a closer reading, it makes a bit more sense. In section 4, the authors point out that by choosing ranks from a geometric distribution, they only need to store O(log log n) bits per node, instead of O(log n). The geometric distribution of ranks, and the handling of ties, allow them to prove the same time complexity bounds as a normal treap while using slightly less space.
However, I don't see this as a major advantage because either way, it still takes O(log n) bits to store each node's child pointers.
> insertion and deletion can be done purely top-down, with O(1) expected restructuring time and exponentially infrequent occurrences of expensive restructuring. Certain kinds of deterministic balanced search trees, in particular weak AVL trees and red-black trees achieve these bounds in the amortized sense [3], but at the cost of somewhat complicated update algorithms.
It's a neat, novel data structure with O(1) typical insert and delete times. Very cool!
My read of your quoted text is "This has the same average time complexity as WAVL/red-black trees, but is a simpler algorithm", which puts the search/insert/delete time complexity at O(log(n)). Being top-down still means O(log(n)) time, since the operation has to walk down from the root and log(n) is the lower bound on the height of the tree.
> with O(1) expected restructuring time and exponentially infrequent occurrences of expensive restructuring
I was curious about this, so I threw together a quick and dirty implementation of the insert function as described in the paper, counting both the number of times the insert() function is called and the number of times a pointer is written to when inserting 65535 (2^16 - 1) random numbers:
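A sketch of what such an instrumented insert can look like, following the recursive pseudocode in the paper (the Python wrapping and counter placement are just illustrative, not the commenter's actual code):

    import random

    class Node:
        def __init__(self, key):
            self.key = key
            self.left = self.right = None
            self.rank = 0                 # Geometric(1/2): flip until tails
            while random.getrandbits(1):
                self.rank += 1

    calls = 0            # number of times insert() is invoked
    pointer_writes = 0   # number of child-pointer assignments

    def insert(x, root):
        # Recursive zip-tree insert, following the paper's pseudocode.
        global calls, pointer_writes
        calls += 1
        if root is None:
            return x
        if x.key < root.key:
            if insert(x, root.left) is x:
                if x.rank < root.rank:
                    root.left = x
                    pointer_writes += 1
                else:
                    root.left, x.right = x.right, root
                    pointer_writes += 2
                    return x
        else:
            if insert(x, root.right) is x:
                if x.rank <= root.rank:
                    root.right = x
                    pointer_writes += 1
                else:
                    root.right, x.left = x.left, root
                    pointer_writes += 2
                    return x
        return root

    root = None
    for k in random.sample(range(1 << 20), (1 << 16) - 1):  # 65535 random keys
        root = insert(Node(k), root)
    print(calls, pointer_writes)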
So it turns out that it is O(1) in terms of pointer writes but O(lg(n)) in terms of function calls. Since every function call performs at least one comparison, the algorithm is O(1) for writes but O(lg(n)) for reads; for large enough n the O(lg(n)) read term dominates, so inserts are O(lg(n)) overall.
That said, if writes are far more expensive than reads in the context this is used, this could still be useful.
Did you implement the recursive or the iterative version? Perhaps there isn't any difference between them in practice in this sense, though. I mean, the number of function calls would differ, but perhaps not the comparisons.
Ignore all the other noise about expected vs amortized vs just O(1). It's not what's relevant here.
1. Zip trees do not have O(1) insert/remove. They have O(log n) insert/remove (because they have to search the tree to find where to insert/remove), with O(1) additional restructuring time.
2. You can have an ordered data structure with O(1) insert, but not O(1) insert and O(1) find-min simultaneously. You can have O(1) insert trivially by just putting new elements into an array and deferring the actual inserts until a search happens.
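A sketch of that deferred-insert trick (the class and method names are just for illustration):

    import bisect

    class BufferedOrderedSet:
        # O(1) insert by deferring the work: new keys go into a pending
        # list and are only merged into the sorted array when a search
        # happens, which is where the real cost gets paid.
        def __init__(self):
            self.sorted_keys = []
            self.pending = []

        def insert(self, key):            # O(1): just remember it
            self.pending.append(key)

        def contains(self, key):          # pays the deferred cost here
            if self.pending:
                self.sorted_keys = sorted(self.sorted_keys + self.pending)
                self.pending = []
            i = bisect.bisect_left(self.sorted_keys, key)
            return i < len(self.sorted_keys) and self.sorted_keys[i] == key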
Even if what you meant is that the expected time complexity of a single insert operation is O(1), that surely cannot be the case. What would the expected time complexity of N inserts be then?
Consider the humble array list: a simpler example to be sure, but it's great for explaining.
Build an array of an initial basic size (16, let's say), and keep a counter of the current size. When you insert items 0 through 15, the time to insert is O(1): find the current size, insert there. When you insert item 16, however, you need to make a new array of size (currentSize * 2), move the existing items over to it, then add your new item, which is O(n).
Let's say we insert n items (be it 64, 1024 or 2^32, it doesn't matter). What was the mean, median and mode for an insert?
Well, for any non-doubling case ((k-1)/k of the time) it took O(1). For any doubling case (1/k of the time) it took O(k). This all adds up to a mode of O(1), a median of O(1) and a mean of (O(1)*(k-1) + O(k)*1)/k, or roughly O(2), which is O(1).
Your worst case scenario of an insert is indeed O(n). But that's not the most common outcome.
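A sketch of that doubling behavior with a counter for element copies, so the arithmetic above is easy to check:

    class ArrayList:
        # Growable array with doubling, as described above.
        def __init__(self, capacity=16):
            self.items = [None] * capacity
            self.size = 0
            self.copies = 0                 # element moves during regrowth

        def append(self, item):
            if self.size == len(self.items):
                bigger = [None] * (len(self.items) * 2)   # currentSize * 2
                for i in range(self.size):                # the O(n) step
                    bigger[i] = self.items[i]
                    self.copies += 1
                self.items = bigger
            self.items[self.size] = item                  # the O(1) step
            self.size += 1

    a = ArrayList()
    for i in range(1 << 16):
        a.append(i)
    # copies < 2 * n: amortized O(1) per append, even though an
    # individual append is occasionally O(n).
    print(a.size, a.copies)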
Similarly, from how I'm reading this paper, very rarely you'll have an O(ugly) restructuring during inserts or deletes, but the rest of the time you can expect O(1) while maintaining sorted order. I'm going to have to take a try at implementing it to see how it goes.
That's true for an array: the expected insert time is O(1), and it still holds for N inserts, because you double the size every time, which yields O(2N) total time. This is not the case for this data structure: the time complexity of inserting N elements is O(N log N), therefore the expected complexity of inserting a single element cannot be O(1), even if a single insert sometimes can be O(1).
The O(1) in the paper refers to restructuring, on top of the O(log n) search time.
> you'll have an O(ugly) restructuring during inserts or deletes, but the rest of the time you can expect O(1) while maintaining sorted order.
It helps to understand why ugly restructuring is rare ... it's because the rank distribution is such that 1/2 of the time the height of the inserted node is 0 (a leaf), 1/4 of the time the height is 1, 1/8 of the time it is 2, etc. The ugliness of restructuring increases as the height increases, which means that ugliness becomes exponentially rarer. But it still takes O(lg n) to locate the insertion point -- remember, half the time the node is a leaf. And this is a binary search tree, so of course it is O(lg n). If insertion time were O(1) then building the tree would be O(n) which would mean an O(n) sort. That's simply not possible.
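A quick sanity check of that height/frequency argument, just sampling the Geometric(1/2) rank distribution that inserted nodes get:

    import random

    def geometric_rank():
        r = 0
        while random.getrandbits(1):   # flip coins until the first tails
            r += 1
        return r

    N = 1_000_000
    ranks = [geometric_rank() for _ in range(N)]
    print(sum(r == 0 for r in ranks) / N)   # ~0.50: leaves
    print(sum(r == 1 for r in ranks) / N)   # ~0.25: height 1
    print(sum(ranks) / N)                   # ~1.0: the mean rank is a constant;
                                            # per the argument above, restructuring
                                            # work grows with the rank, so its
                                            # expectation is O(1) too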
> with O(1) expected restructuring time and exponentially infrequent occurrences of expensive restructuring.
Occasionally there's an expensive O(n) restructuring, but given the nature of how it increases space, and when and why it does so, it's twice (or whatever) the prior interval before it needs to happen again. I'm not up on my asymptotic notations[1] besides big-O, but perhaps this is expressed more succinctly (if more confusingly, to a general audience) by one of the additional notations?
> given the nature of how it's increasing space, when and why, and when it will need to happen again
This is really rather straightforward. The whole point of the algorithm is to randomly distribute the height of the inserted node such that half the nodes are leaves (height 0), 1/4 have height 1, 1/8 have height 2, etc. Restructuring is more expensive the larger the height, but the frequency of restructuring decreases exponentially with respect to the cost of doing it. Net result: O(1) for restructuring, + O(lg n) to find the insertion point (as is necessarily true for any binary search tree, else we would have a way to do an O(n) sort).
They might mean best-case vs average-case vs worst-case. A worst-case hash table insertion might be O(n) (rehashing the table) and in the malicious case it might even be O(n^2) (all table entries collide in a single linked list).
mabbo probably means that the average insert time is O(1), but s/he's wrong and couldn't possibly be right. An O(1) average insertion time into a binary search (i.e., ordered) tree would mean O(n) to build the tree from n elements, which would mean an O(n) sort ... no can do. And the paper clearly says that the O(1) restructuring is in addition to the O(lg n) time to find the insertion point ... which anyone familiar with BST algorithms would expect.
I'm curious about how expensive the restructuring is, and importantly, whether it can be "scheduled": can I violate some conditions and then do the expensive restructuring at a time that is convenient?
Any links to the implementation of this? The paper says that there's a concurrent implementation in the author's thesis, but Princeton charges $5 per copy of theses, so I can't view the actual implementation:
An implementation of zip tree insertion and deletion is in the paper; concurrent zip trees are another matter. I don't know why you can't view something that Princeton charges $5 for, but in any case I think you would find the thesis disappointing ... the arxiv paper says "The third author developed a preliminary, lock-based implementation of concurrent zip trees in his senior thesis [12]. We are currently developing a non-blocking implementation." -- I would wait for the latter.
I'd also love to read the thesis, especially to see how they structured the lock.
I recently wrote a readers-writers (subtree) lock for a K-ary tree and it was quite a challenge with a nontrivial implementation. It was much easier to start with a single mutex-like (subtree) lock (one reader or writer), and then extend it to the general multi-reader/writer case.
I can only imagine that a lock-free version (of even a single reader/writer) must be even more of a challenge. I hope it's not a terribly long wait, but there could be much to learn from the simpler lock-based scheme in the meantime. :)
On the topic of access to the research, I found a talk by the first author [1] a helpful companion to the paper, with slides for what appears to be the same talk elsewhere [2]. I haven't seen any implementations outside of pseudocode in the paper -- it would be nice to see if there are any tricks to generating the random ranks cheaply, or the alternative that uses a pseudo-random function of the key.
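On the pseudo-random-function idea, here is one plausible sketch (my guess, not something from the paper or thesis): hash the key, take the geometric part from the leading bits of the digest and the tie-break from the trailing bits, so equal keys always get the same rank and no per-node rank storage is needed.

    import hashlib

    TIE_BITS = 8   # illustrative

    def rank_from_key(key):
        # Derive a 64-bit pseudo-random word from the key, then reuse the
        # geometric-plus-tie-break encoding discussed earlier in the thread.
        digest = hashlib.blake2b(repr(key).encode(), digest_size=8).digest()
        word = int.from_bytes(digest, "big")
        g = 0
        while g < 64 - TIE_BITS and (word >> (63 - g)) & 1:
            g += 1
        tie = word & ((1 << TIE_BITS) - 1)
        return (g, tie)

A real implementation would presumably want something cheaper than a cryptographic hash, but the shape of the idea is the same.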
Thanks to the OP for sharing this. I'm keen to try it on another project where I was hesitant to use red-black trees due to their complexity.