Wow! I independently came up with this algorithm a few years ago and wasn't even sure what to search for to find the prior art. Happy to see someone finally gave it a name and attempted to find the history.
Fun fact #1 that I also realized, which I have yet to see mentioned elsewhere (though people have almost surely realized this in various contexts):
This is not limited to binary search or mergesort.
This is a very general-purpose meta-algorithm for turning any batch algorithm into a streaming one. (!)
The merge() step is basically "redo computation on a batch 2x the size". You pay a log(n) cost for this ability. You can combine the batches in any way you want there. Here it happens to be a merge of two sorted arrays. But you can imagine this being anything, like "train your ML model again on the larger batch".
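To make that concrete, here is a minimal sketch of the meta-algorithm with the batch step abstracted out as a rebuild callback (all names here are mine, not from the article); with rebuild=sorted it degenerates to exactly the sorted-array structure under discussion:

    def make_streamer(rebuild):
        # slots[i] is either None or a batch of exactly 2**i items
        slots = []

        def insert(item):
            batch = [item]
            i = 0
            # like binary addition: carry while an equal-sized batch exists
            while i < len(slots) and slots[i] is not None:
                batch = rebuild(slots[i] + batch)  # redo work on 2x the data
                slots[i] = None
                i += 1
            if i == len(slots):
                slots.append(None)
            slots[i] = batch

        return insert, slots

    # rebuild could be any batch algorithm (e.g. retraining a model);
    # with sorted() this is the mergeable-sorted-arrays structure:
    insert, slots = make_streamer(sorted)
    for x in [5, 3, 8, 1, 9]:
        insert(x)
    # slots is now [[9], None, [1, 3, 5, 8]]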
Fun fact #2: I believe you can add deletion support as well. This can be done with a hierarchical bitmap to help you quickly search for occupied slots. It comes at the cost of another log(n) factor in some operations. I have yet to search if there is a name for the hierarchical bitmap structure I have in mind, so I'm just calling it that for now.
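The comment leaves the hierarchical bitmap unspecified, so here is a hedged two-level sketch of the kind of structure that seems intended (all names mine): a summary word marks which 64-bit words are nonzero, so find-first-occupied is two find-first-set operations; nesting more levels generalizes this to the log(n) factor mentioned.

    class HierarchicalBitmap:
        # two levels: words[] holds one bit per slot; summary has bit w
        # set iff words[w] is nonzero
        def __init__(self, nbits):
            self.words = [0] * ((nbits + 63) // 64)
            self.summary = 0

        def set(self, i):
            self.words[i // 64] |= 1 << (i % 64)
            self.summary |= 1 << (i // 64)

        def clear(self, i):
            self.words[i // 64] &= ~(1 << (i % 64))
            if not self.words[i // 64]:
                self.summary &= ~(1 << (i // 64))

        def find_first(self):
            # x & -x isolates the lowest set bit
            if not self.summary:
                return -1
            w = (self.summary & -self.summary).bit_length() - 1
            b = (self.words[w] & -self.words[w]).bit_length() - 1
            return w * 64 + b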
For deletion, I think you can augment the current data structure so that, instead of storing each value alone, you pair it with a Boolean telling whether the value is still included or not.
But you lose out on insertion and search efficiency; the worst case depends on the maximum number of values ever inserted, not on the number of currently valid values.
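A minimal sketch of that pairing idea, applied to a single sorted slot (my names, not from the article): keys keep their positions so binary search is unaffected, and deletion just clears the flag, which is exactly why the worst case tracks everything ever inserted rather than the live count.

    from bisect import bisect_left

    class Slot:
        def __init__(self, keys):
            self.keys = sorted(keys)        # positions never change...
            self.live = [True] * len(keys)  # ...only the paired flags do

        def find(self, key):
            i = bisect_left(self.keys, key)
            if i < len(self.keys) and self.keys[i] == key and self.live[i]:
                return i
            return -1

        def delete(self, key):
            i = self.find(key)
            if i >= 0:
                self.live[i] = False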
This implementation is missing a lot of the usual tricks.
By slicing a single array you only need O(1) overhead, not O(log n). This does of course mean dropping down to amortized insertion due to the occasional large reallocation (but if your OS supports `mremap` and your language lets you use it, that goes away ... though the merge means you pay anyway). It never makes sense to allow a "null" array except for the smallest items (which I tend to put at the end, not the start). Also, below some small length N (16?), you should fall back to a linear search.
Also worth noting that tombstones are a legitimate way of deleting elements. If your elements are pointers (or contain them, if structs, especially map-like entries), you can use a pointer to a private static variable as the tombstone (since you know nobody else can access it); if integers, you can use the smallest integer, except for the very first element of the array, which you can avoid ever turning into a tombstone. The trick is never to insert a tombstone at an intermediate level of the "tree", but instead to bubble them down. (Bubbling down of self-invalidating elements like weak pointers is also quite possible, but turning all reads into potential writes complicates thread-safety, if you care about that.) You should also keep track of the number of deleted elements and perform a full re-sort if it exceeds the number of live elements or so. That said, there's enough branch-predictor overhead in this that you probably want to implement the delete-allowed version separately from the no-delete version.
Sketch of the `get` code as I'm familiar with it:
    from bisect import bisect_left

    SMALL = 16
    NOT_FOUND = -1

    def round_down_to_power_of_2(n):
        return 1 << (n.bit_length() - 1)

    def linear_search(array, elem, start, end):
        for i in range(start, end):
            if array[i] == elem:
                return i
        return NOT_FOUND

    def binary_search(array, elem, start, end):
        i = bisect_left(array, elem, start, end)
        return i if i < end and array[i] == elem else NOT_FOUND

    def get(array, elem, start, end):
        while True:
            length = end - start
            if length < SMALL:
                # note: after the first iteration, length can be 0, 1, etc.
                return linear_search(array, elem, start, end)
            # recompute chunk_size from scratch on every iteration; merely
            # halving it can name a chunk that doesn't exist (see below)
            chunk_size = round_down_to_power_of_2(length)
            assert start + chunk_size <= end
            rv = binary_search(array, elem, start, start + chunk_size)
            if rv != NOT_FOUND:
                return rv
            start += chunk_size
Note that it does not suffice to do `chunk_size /= 2` at the bottom of the loop; you have to recompute it from scratch each time. Consider SMALL = 16 and len = 1050: the first chunk has 1024 items, but after searching it only 26 remain, so the next chunk has 16 items, not 512. Recomputing from the remaining length means we're effectively skipping over the "null" arrays after all.
I'm being pretty thick here. I understand that each bucket is sorted, but what distinguishes the buckets? It talks about merging into slot 0, and then merging into slot 1... with what?
I know I could spend some time and read the code, but I'm hoping someone is kind enough to post a better explanation of the structure invariant and the insertion process.
...each array (slot) is guaranteed sorted, and is guaranteed binary-searchable.
...each array (slot) is guaranteed sorted, so adjacent slots can be "merge-sorted" into the next slot "for free".
...to understand recursion, you must first understand recursion. ;-)
...the "B" and "C" steps are actually "in error", but hopefully is illustrative that they would both be valid representations and are equivalent from a searching perspective.
Basically, you're paying the pointer cost (overhead) for only log-n "slots" (heads), making it space-efficient. Dealing _only_ with sorted sub-arrays means you're always either binary-searching (time-efficient) or merge-sorting (popping previously sorted items from list A or list B), which is also time-efficient.
It's kind of neat b/c (for example) you can get "Z" up at the top when you think it should be "down below", but it doesn't really matter b/c you'll do a binary search in slot 1 (nope, it's not "Z"!), and then binary-search down below until you find what you're looking for. When it's "time" for "Z" to move down, it'll move down to the right place "automatically" as you're merging adjacent (sorted) lists.
Maybe think of it as: 1,2,4,8,16, where the slot with 16 entries is always perfectly sorted, but you have the temporary buffers of 1,2,4,8 to be pre-sorting your stuff until you need to bump up to 32 entries.
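Since the question above was about the invariant and the insertion process, here is a hedged sketch (my code, not the article's), assuming slot i holds either nothing or a sorted array of exactly 2^i elements; the insert itself is the same carry loop as in the make_streamer sketch earlier in the thread, so this shows the resulting states plus the search side:

    from bisect import bisect_left

    # worked insertion trace for values 1..5 ("-" is an empty slot):
    #   after 1: [[1]]
    #   after 2: [-, [1,2]]             two size-1 runs merged into slot 1
    #   after 3: [[3], [1,2]]
    #   after 4: [-, -, [1,2,3,4]]      [3]+[4] collide in slot 0, merge
    #                                   into slot 1, collide with [1,2],
    #                                   merge again into slot 2
    #   after 5: [[5], -, [1,2,3,4]]

    def contains(slots, value):
        # at most one binary search per non-empty slot
        for run in slots:
            if run is not None:
                i = bisect_left(run, value)
                if i < len(run) and run[i] == value:
                    return True
        return False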
Thank you so much. So it's a lot like a resizable sorted array that doubles when it gets full, but instead of copying the old one, we keep an array of arrays.
Oh nice, I've designed something similar before. You can do this within a single array by always adding new elements to the end of the array and performing an in-place merge of adjacent equal-sized bins. So the example given would look like:
1, 4, 6, 7, 8, 10, 11, 12, 2, 3, 9, 13, 5
The structure is implicit in the size (as mentioned in the article). No need for an explicit array-of-array structure.
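A hedged sketch of this single-array layout (my code, not the commenter's): since the run sizes are exactly the set bits of the element count, appending and then merging trailing equal-sized runs mirrors a binary increment. heapq.merge keeps the sketch short; a real implementation would merge in place.

    from heapq import merge

    class ImplicitSet:
        # one flat list holding sorted runs of strictly decreasing
        # power-of-two sizes; the run sizes are the set bits of len(data)
        def __init__(self):
            self.data = []

        def insert(self, value):
            self.data.append(value)
            n, size = len(self.data), 1
            # while bit `size` is clear in n, two runs of length `size`
            # sit at the tail: merge them (the carry of a binary increment)
            while n & size == 0:
                lo, mid = len(self.data) - 2 * size, len(self.data) - size
                self.data[lo:] = list(merge(self.data[lo:mid], self.data[mid:]))
                size *= 2

    s = ImplicitSet()
    for x in range(1, 14):
        s.insert(x)
    # 13 = 8 + 4 + 1: s.data now holds an 8-run, then a 4-run, then a 1-run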
----
A related data structure I've been playing around with in F# (but haven't had time to write up) --
1. Instead of merging arrays, just append them.
2. Instead of arrays, use binary trees.
3. Instead of an array of these arrays/binary trees, use a linked list (so, a linked list of complete binary trees of strictly increasing depth). Omit elements whose trees are empty.
4. Put two of these back to back ("large" ends touching).
5. Recognize that if one of these lists is empty, a binary tree (or half of one) can be moved from the back of the other list to populate it.
Now you have a purely functional double-ended list with log(n) indexing, insertion, deletion, and append, while retaining a high degree of sharing.
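A sketch of the one-sided core of steps 1-3 (in Python rather than F#, to keep one language for the sketches here; all names are mine, and the back-to-back pairing of steps 4-5 is omitted): perfect trees storing elements in preorder give log(n) indexing.

    # tree nodes: None for empty, else (value, left, right)
    def tree_get(node, size, i):
        value, left, right = node
        if i == 0:
            return value
        half = (size - 1) // 2  # size is 2**depth - 1, so halves are equal
        if i <= half:
            return tree_get(left, half, i - 1)
        return tree_get(right, half, i - 1 - half)

    def seq_get(trees, i):
        # skip at most log(n) trees, then descend at most log(n) levels
        for size, root in trees:
            if i < size:
                return tree_get(root, size, i)
            i -= size
        raise IndexError(i)

    # the values 0..9 packed as trees of sizes 3 and 7:
    trees = [(3, (0, (1, None, None), (2, None, None))),
             (7, (3, (4, (5, None, None), (6, None, None)),
                     (7, (8, None, None), (9, None, None))))]
    assert [seq_get(trees, i) for i in range(10)] == list(range(10))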
----
Similarly, another related data structure I've been playing around with in Prolog --
1. No merging or appending.
2. Instead of arrays, use binary trees.
3. Instead of an array of these arrays/binary trees, use an infinite linked list, in order of increasing depth.
4. Do not explicitly instantiate this structure. Represent everything initially as a logic variable, only expanding linked list and binary tree nodes as needed.
5. Place two of these back to back, and add one extra logic variable in-between.
Now you have a purely relational array with integer indexes, supporting log(n) indexing and assignment, no upper or lower bound, and native Prolog unification. (Which coincidentally is identical to the notion of "array" in SMTLIBv2.)
----
TL;DR: sequences of complete binary trees of increasing depth can be used for many interesting data structures.
Oh, I don't know that there's a way to do that. The G++ implementation uses free memory if available for linear merge, else it falls back to n log n merge. I don't think this affects the performance of insert though (which is already log n due to the necessary lookup).
Considering the nested arrays (indirection), this doesn't seem that attractive for membership testing/insertion compared to a hash set. What am I missing?
Compared to a hash table, the binary array set has better worst-case insertion and search times, has zero slack space, and relies on comparisons rather than hashing.
No need for a nested array. You can simply lay them end to end in decreasing size and perform in-place merges. The structure is implicit in the size of the set.