Zip Trees (arxiv.org)
301 points by federicoponzi on Oct 24, 2018 | 36 comments



If you’re on a phone, here’s an HTML version of the paper: https://www.arxiv-vanity.com/papers/1806.06726/


Thanks for letting me know about Arxiv Vanity!


A Cartesian tree (treap) with an unzip/zip (split/merge) update routine is a well-known data structure in competitive programming. The only difference from the paper would be the rank-selection algorithm: uniform instead of geometric.

It's unclear whether the authors were aware of it.


  >> One can view a zip tree as a treap (Seidel and Aragon 1996) in which priority ties are allowed and 
  >> in which insertions and deletions are done by unmerging and merging paths ("unzipping" and "zipping")


> ...rather than by doing rotations.

The data structure I mentioned uses unzip/zip (called split/merge there); see [1]

[1] https://cp-algorithms.com/data_structures/treap.html
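For the curious, a minimal sketch of that split/merge treap in Python (uniform priorities as in the link; the names are my own):

    import random

    class Node:
        def __init__(self, key):
            self.key = key
            self.prio = random.random()  # uniform priority, as in a classic treap
            self.left = None
            self.right = None

    def split(t, key):
        # split t into (keys < key, keys >= key)
        if t is None:
            return None, None
        if t.key < key:
            l, r = split(t.right, key)
            t.right = l
            return t, r
        else:
            l, r = split(t.left, key)
            t.left = r
            return l, t

    def merge(l, r):
        # assumes every key in l precedes every key in r
        if l is None:
            return r
        if r is None:
            return l
        if l.prio > r.prio:
            l.right = merge(l.right, r)
            return l
        else:
            r.left = merge(l, r.left)
            return r

    def insert(t, key):
        l, r = split(t, key)
        return merge(merge(l, Node(key)), r)

Zip trees do essentially the same unzip/zip work, with geometric ranks (plus tie-breaking) instead of uniform priorities.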


Thanks for clarifying :)


Correct me if I'm wrong, but isn't this the "standard" [1] treap with split/merge (which is somewhat acknowledged in the footnote of page 8), simply with a different distribution of priorities? The result is certainly impressive, but it seems like a small modification to an otherwise known structure.

[1]: See here http://e-maxx.ru/algo/treap for an old reference of this structure. Google translate does a decent job with it, and the code is readable without translation.


I had a similar confusion when skimming the paper. The algorithm only depends on the relative ordering of node "ranks", rather than their absolute values. If you were to treat the distributions as continuous (ignoring ties) then it would make absolutely no difference what distribution you used. (Any two continuous distributions are related by a monotonic transformation, which preserves the ordering relationships between different points of the distribution.)

(A more concrete way of looking at it: a simple way of sampling from a geometric distribution is to choose a uniform random integer and count the number of leading 1 bits. The authors suggest concatenating a few random bits to the end for tie-breaking, so that the rank is a pair of a geometric random variable and a uniform random variable, sorted lexicographically. But such a pair is equivalent to just choosing a uniform random integer, and sorting by that!)
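(Here's a minimal sketch of that in Python; the function names and the 64-bit word size are my own, not the paper's:)

    import random

    def random_rank_geometric():
        # geometric(1/2) rank: count the leading 1 bits of a uniform random word
        word = random.getrandbits(64)
        rank = 0
        while rank < 64 and (word >> (63 - rank)) & 1:
            rank += 1
        return rank

    def random_rank_with_tiebreak(extra_bits=8):
        # pair of (geometric rank, a few uniform tie-break bits), compared lexicographically
        return (random_rank_geometric(), random.getrandbits(extra_bits))

    # Comparing such pairs lexicographically orders nodes the same way as simply
    # drawing one uniform random word per node and comparing those words.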

Upon a closer reading, it makes a bit more sense. In section 4, the authors point out that by choosing ranks from a geometric distribution, they only need to store O(log log n) bits per node, instead of O(log n). The geometric distribution of ranks, and the handling of ties, allow them to prove the same time complexity bounds as a normal treap while using slightly less space.

However, I don't see this as a major advantage because either way, it still takes O(log n) bits to store each node's child pointers.


> insertion and deletion can be done purely top-down, with O(1) expected restructuring time and exponentially infrequent occurrences of expensive restructuring. Certain kinds of deterministic balanced search trees, in particular weak AVL trees and red-black trees achieve these bounds in the amortized sense [3], but at the cost of somewhat complicated update algorithms.

It's a neat, novel data structure with O(1) typical insert and delete times. Very cool!


"O(1) expected restructuring" doesn't imply "O(1) expected insert/search/delete".

My read of your quoted text is "This has the same average time complexity as WAVL/red-black trees, but with a simpler algorithm", which puts the search/insert/delete time complexity at O(log(n)). Being top-down still means O(log(n)) time, since you have to walk down from the root to the insertion point and log(n) is the lower bound on the height of the tree.


It's O(1) restructuring; as I understand it, you still need to do a classical O(log n) search before any restructuring.


I don't think an ordered data structure with O(1) insert time is even possible, as that would mean you could sort in O(n).


> with O(1) expected restructuring time and exponentially infrequent occurrences of expensive restructuring

I was curious about this, so I threw together a quick and dirty implementation of the insert function as described in the paper, counting both the number of times the insert() function is called and the number of times a pointer is written while inserting 65535 (2^16 - 1) random numbers.
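Roughly, the recursive variant from the paper looks like the sketch below (illustrative Python rather than my exact instrumented code; the CALLS / PTR_WRITES counters just show what I tallied):

    import random

    CALLS = 0        # times insert() is entered
    PTR_WRITES = 0   # left/right pointer assignments

    def random_rank():
        # geometric(1/2): count coin flips until the first tails
        r = 0
        while random.getrandbits(1):
            r += 1
        return r

    class Node:
        def __init__(self, key):
            self.key = key
            self.rank = random_rank()
            self.left = None
            self.right = None

    def insert(x, root):
        global CALLS, PTR_WRITES
        CALLS += 1
        if root is None:
            return x
        if x.key < root.key:
            if insert(x, root.left) is x:
                if x.rank < root.rank:
                    root.left = x
                    PTR_WRITES += 1
                else:                      # x wins on rank: rotate it above root
                    root.left = x.right
                    x.right = root
                    PTR_WRITES += 2
                    return x
        else:
            if insert(x, root.right) is x:
                if x.rank <= root.rank:
                    root.right = x
                    PTR_WRITES += 1
                else:                      # x wins on rank: rotate it above root
                    root.right = x.left
                    x.left = root
                    PTR_WRITES += 2
                    return x
        return root

    # build: root = None; for k in random.sample(range(1 << 20), 65535): root = insert(Node(k), root)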

Below are the results:

    Inserts      0 to      1: Avg fn calls:   2.00, Avg ptr writes:   2.00 
    Inserts      1 to      3: Avg fn calls:   3.50, Avg ptr writes:   1.00 
    Inserts      3 to      7: Avg fn calls:   5.00, Avg ptr writes:   4.00 
    Inserts      7 to     15: Avg fn calls:   6.12, Avg ptr writes:   5.00 
    Inserts     15 to     31: Avg fn calls:   7.06, Avg ptr writes:   4.62 
    Inserts     31 to     63: Avg fn calls:   7.28, Avg ptr writes:   3.59 
    Inserts     63 to    127: Avg fn calls:   9.64, Avg ptr writes:   4.81 
    Inserts    127 to    255: Avg fn calls:  11.72, Avg ptr writes:   4.08 
    Inserts    255 to    511: Avg fn calls:  13.73, Avg ptr writes:   4.73 
    Inserts    511 to   1023: Avg fn calls:  17.61, Avg ptr writes:   5.30 
    Inserts   1023 to   2047: Avg fn calls:  18.95, Avg ptr writes:   4.99 
    Inserts   2047 to   4095: Avg fn calls:  19.12, Avg ptr writes:   5.05 
    Inserts   4095 to   8191: Avg fn calls:  19.67, Avg ptr writes:   4.98 
    Inserts   8191 to  16383: Avg fn calls:  20.76, Avg ptr writes:   4.94 
    Inserts  16383 to  32767: Avg fn calls:  22.38, Avg ptr writes:   4.99 
    Inserts  32767 to  65535: Avg fn calls:  24.18, Avg ptr writes:   4.99 
So it turns out that it is O(1) in terms of the number of pointer writes but O(lg(n)) in terms of the number of function calls. Since every function call performs at least one comparison, the algorithm is O(1) for writes but O(lg(n)) for reads; given large enough n, the O(lg(n)) read term dominates, so inserts are O(lg(n)) overall.

That said, if writes are far more expensive than reads in the context this is used, this could still be useful.


Did you implement the recursive or the iterative version? Perhaps there isn't any difference between them in practice in this sense, though. I mean, the number of function calls would differ, but perhaps not the comparisons.


Tail call optimization in functional programming languages comes to mind.


Ignore all the other noise about expected vs amortized vs just O(1). It's not what's relevant here.

1. Zip trees do not have O(1) insert/remove. They have O(log n) insert/remove (because they have to search the tree to find where to insert/remove), with O(1) additional restructuring time.

2. You can have an ordered data structure with O(1) insert, but not O(1) insert and O(1) find-min simultaneously. You can get O(1) insert trivially by just appending new items to an array and only actually inserting them when a search happens.
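A toy sketch of that trick (not from the paper; the names are made up):

    class LazySortedSet:
        def __init__(self):
            self.keys = []     # sorted main storage
            self.pending = []  # unsorted buffer of deferred inserts

        def insert(self, x):
            # O(1): just defer the work
            self.pending.append(x)

        def find_min(self):
            # all the deferred work gets paid for here
            if self.pending:
                self.keys = sorted(self.keys + self.pending)
                self.pending = []
            return self.keys[0] if self.keys else None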


Radix trees and radix sort can do that, though the constant is the key size.


O(1) expected is a different thing than standard O(1).


Indeed, could you expand on that a bit?

Even if what you meant is that the expected time complexity of a single insert operation is O(1), that surely cannot be the case. What would the expected time complexity of N inserts be then?


Consider the humble array-list: a simpler example to be sure, but it's great for explaining.

Build an array of an initial basic size (16, let's say), and keep a counter of the current size. When you insert items 0 through 15, the time to insert is O(1): find the current size, insert there. When you insert item 16, however, you need to make a new array of size (currentSize * 2), move the existing items over to it, then add your new item, which is O(n).
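A minimal sketch of that, using a plain Python list purely as fixed-size backing storage:

    class ArrayList:
        def __init__(self, capacity=16):
            self.items = [None] * capacity
            self.size = 0

        def append(self, x):
            if self.size == len(self.items):
                # doubling case: O(n) copy, but it only happens after
                # ~n cheap appends since the previous doubling
                bigger = [None] * (len(self.items) * 2)
                bigger[:self.size] = self.items
                self.items = bigger
            # common case: O(1)
            self.items[self.size] = x
            self.size += 1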

Let's say we insert n items (be it 64, 1024 or 2^32, it doesn't matter). What was the mean, median and mode for an insert?

Well, for any non-doubling case ((k-1)/k of the time) it took O(1). For any doubling case (1/k of the time) it took O(k). This all adds up to a mode of O(1), a median of O(1) and a mean of (O(1)*(k-1) + O(k)*1)/k, or roughly O(2), which is O(1).

Your worst case scenario of an insert is indeed O(n). But that's not the most common outcome.

Similarly, from how I'm reading this paper, very rarely you'll have an O(ugly) restructuring during inserts or deletes, but the rest of the time you can expect O(1) while maintaining sorted order. I'm going to have to take a try at implementing it to see how it goes.


That's true for an array: the expected insert time is O(1), and it still holds for N inserts, because you double the size every time, which yields O(2N) total time. This is not the case for this data structure: the time complexity of inserting N elements is O(N log N), so the expected complexity of inserting a single element cannot be O(1), even if a single insert can sometimes be O(1).

The O(1) in the paper refers to restructuring, on top of the O(log n) search time.

We could be violently agreeing, not sure :)


> you'll have an O(ugly) restructuring during inserts or deletes, but the rest of the time you can expect O(1) while maintaining sorted order.

It helps to understand why ugly restructuring is rare ... it's because the rank distribution is such that 1/2 of the time the height of the inserted node is 0 (a leaf), 1/4 of the time the height is 1, 1/8 of the time it is 2, etc. The ugliness of restructuring increases as the height increases, which means that ugliness becomes exponentially rarer. But it still takes O(lg n) to locate the insertion point -- remember, half the time the node is a leaf. And this is a binary search tree, so of course it is O(lg n). If insertion time were O(1) then building the tree would be O(n) which would mean an O(n) sort. That's simply not possible.
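To make that concrete: if the restructuring work for a node inserted at height h is roughly proportional to h, and height h occurs with probability 1/2^(h+1), then the expected work per insert is bounded by a constant. A quick numerical check of that (assumed) cost model:

    # sum over h >= 0 of h / 2^(h+1) converges to 1, i.e. O(1) expected restructuring
    print(sum(h / 2 ** (h + 1) for h in range(200)))   # ~1.0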


So, I think that's what is meant when they say:

> with O(1) expected restructuring time and exponentially infrequent occurrences of expensive restructuring.

Occasionally there's O(n) expensive restructuring, but given the nature of how it's increasing space, when and why, and when it will need to happen again, it takes twice (or whatever) the prior amount of time before it needs to happen again. I'm not up on my asymptotic notations[1] besides big-O, but perhaps this is captured more succinctly (if more confusingly, to a general audience) by one of the additional notations?

1: https://en.wikipedia.org/wiki/Big_O_notation#Related_asympto...


> given the nature of how it's increasing space, when and why, and when it will need to happen again

This is really rather straightforward. The whole point of the algorithm is to randomly distribute the height of the inserted node such that half the nodes are leaves (height 0), 1/4 have height 1, 1/8 have height 2, etc. Restructuring is more expensive the larger the height, but the frequency of restructuring decreases exponentially with respect to the cost of doing it. Net result: O(1) for restructuring, + O(lg n) to find the insertion point (as is necessarily true for any binary search tree, else we would have a way to do an O(n) sort).


Can you expand on that please?


They might mean best-case vs average-case vs worst-case. A worst-case hash table insertion might be O(n) (rehashing the table), and in the malicious case a sequence of n insertions might even be O(n^2) (all entries collide in a single linked list).


mabbo probably means that the average insert time is O(1), but s/he's wrong and couldn't possibly be right. An O(1) average insertion time into a binary search (i.e., ordered) tree would mean O(n) to build the tree from n elements, which would mean an O(n) sort ... no can do. And the paper clearly says that the O(1) restructuring is in addition to the O(lg n) time to find the insertion point ... which anyone familiar with BST algorithms would expect.


> with O(1) typical insert

This isn't the case. Half the inserts become leaves, which don't require restructuring but do require O(lg n) to find the insertion point.


I'm curious about how expensive the restructuring is and, importantly, whether it can be "scheduled": can I violate some conditions and then do the expensive restructuring at a time that is convenient?


The data structure from competitive programming is also known as a "treap with implicit keys"; related question on Stack Overflow: https://stackoverflow.com/questions/3497875/treap-with-impli...


Any links to the implementation of this? The paper says that there's a concurrent implementation in the author's thesis, but Princeton charges $5 per copy of theses, so I can't view the actual implementation:

https://dataspace.princeton.edu/jspui/handle/88435/dsp01gh93...


An implementation of zip tree insertion and deletion is in the paper. Concurrent zip trees are another matter. I don't know why you can't view something that Princeton charges $5 for, but in any case I think you would find the thesis disappointing ... the arxiv paper says "The third author developed a preliminary, lock-based implementation of concurrent zip trees in his senior thesis [12]. We are currently developing a non-blocking implementation." -- I would wait for the latter.


I'd also love to read the thesis, especially to see how they structured the lock.

I recently wrote a readers-writers (subtree) lock for a K-ary tree and it was quite a challenge with a nontrivial implementation. It was much easier to start with a single mutex-like (subtree) lock (one reader or writer), and then extend it to the general multi-reader/writer case.

I can only imagine that a lock-free version (of even a single reader/writer) must be even more of a challenge. I hope it's not a terribly long wait, but there could be much to learn from the simpler lock-based scheme in the meantime. :)

On the topic of access to the research, I found a talk by the first author [1] a helpful companion to the paper, with slides for what appears to be the same talk elsewhere [2]. I haven't seen any implementations outside of pseudocode in the paper -- it would be nice to see if there are any tricks to generating the random ranks cheaply, or the alternative that uses a pseudo-random function of the key.

Thanks to the OP for sharing this. I'm keen to try it on another project where I was hesitant to use red-black trees due to their complexity.

[1] https://www.youtube.com/watch?v=NxRXhBur6Xs

[2] http://knuth80.elfbrink.se/wp-content/uploads/2018/01/Tarjan...


I wonder if this would be a good candidate for the basis of a persistent ("functional") data structure?

O(1) pointer changes for insert/remove is a good sign, since pointer changes tend to become space overhead in a persistent data structure.


Readers already familiar with similar data structures may save time by proceeding directly to section 4, "Previous Related Work".


It's cool to see new innovation in this area.





