Wow! I independently came up with this algorithm a few years ago and wasn't even sure what to search for to find the prior art. Happy to see someone finally gave it a name and attempted to find the history.
Fun fact #1 that I also realized, which I have yet to see mentioned elsewhere (though people have almost surely realized this in various contexts):
This is not limited to binary search or mergesort.
This is a very general-purpose meta-algorithm for turning any batch algorithm into a streaming one. (!)
The merge() step is basically "redo computation on a batch 2x the size". You pay a log(n) cost for this ability. You can combine the batches in any way you want there. Here it happens to be a merge of two sorted arrays. But you can imagine this being anything, like "train your ML model again on the larger batch".
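To make that concrete, here is a minimal sketch of the meta-algorithm with the batch step abstracted out as a rebuild callback (all names here are mine, not from the article); with rebuild=sorted it degenerates to exactly the sorted-array structure under discussion:

    def make_streamer(rebuild):
        # slots[i] is either None or a batch of exactly 2**i items
        slots = []

        def insert(item):
            batch = [item]
            i = 0
            # like binary addition: carry while an equal-sized batch exists
            while i < len(slots) and slots[i] is not None:
                batch = rebuild(slots[i] + batch)  # redo work on 2x the data
                slots[i] = None
                i += 1
            if i == len(slots):
                slots.append(None)
            slots[i] = batch

        return insert, slots

    # rebuild could be any batch algorithm (e.g. retraining a model);
    # with sorted() this is the mergeable-sorted-arrays structure:
    insert, slots = make_streamer(sorted)
    for x in [5, 3, 8, 1, 9]:
        insert(x)
    # slots is now [[9], None, [1, 3, 5, 8]]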
Fun fact #2: I believe you can add deletion support as well. This can be done with a hierarchical bitmap to help you quickly search for occupied slots. It comes at the cost of another log(n) factor in some operations. I have yet to search if there is a name for the hierarchical bitmap structure I have in mind, so I'm just calling it that for now.
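The comment leaves the hierarchical bitmap unspecified, so here is a hedged two-level sketch of the kind of structure that seems intended (all names mine): a summary word marks which 64-bit words are nonzero, so find-first-occupied is two find-first-set operations; nesting more levels generalizes this to the log(n) factor mentioned.

    class HierarchicalBitmap:
        # two levels: words[] holds one bit per slot; summary has bit w
        # set iff words[w] is nonzero
        def __init__(self, nbits):
            self.words = [0] * ((nbits + 63) // 64)
            self.summary = 0

        def set(self, i):
            self.words[i // 64] |= 1 << (i % 64)
            self.summary |= 1 << (i // 64)

        def clear(self, i):
            self.words[i // 64] &= ~(1 << (i % 64))
            if not self.words[i // 64]:
                self.summary &= ~(1 << (i // 64))

        def find_first(self):
            # x & -x isolates the lowest set bit
            if not self.summary:
                return -1
            w = (self.summary & -self.summary).bit_length() - 1
            b = (self.words[w] & -self.words[w]).bit_length() - 1
            return w * 64 + b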
For deletion, I think you can augment the current data structure so that, instead of storing each value alone, you pair it with a Boolean telling whether the value is still included or not.
But you lose out on insertion and search efficiency; the worst case depends on the maximum number of values ever inserted, not on the number of currently valid values.
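A minimal sketch of that pairing idea, applied to a single sorted slot (my names, not from the article): keys keep their positions so binary search is unaffected, and deletion just clears the flag, which is exactly why the worst case tracks everything ever inserted rather than the live count.

    from bisect import bisect_left

    class Slot:
        def __init__(self, keys):
            self.keys = sorted(keys)        # positions never change...
            self.live = [True] * len(keys)  # ...only the paired flags do

        def find(self, key):
            i = bisect_left(self.keys, key)
            if i < len(self.keys) and self.keys[i] == key and self.live[i]:
                return i
            return -1

        def delete(self, key):
            i = self.find(key)
            if i >= 0:
                self.live[i] = False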
This implementation is missing a lot of the usual tricks.
By slicing a single array you only need O(1) overhead, not O(log n). This does of course mean dropping down to amortized insertion due to the occasional large reallocation (but if your OS supports `mremap` and your language lets you use it, that goes away ... though the merge means you pay anyway). It never makes sense to allow a "null" array except for the smallest items (which I tend to put at the end, not the start). Also, below some small length N (16?), you should fall back to a linear search.
Also worth noting that tombstones are a legitimate way of deleting elements. If your elements are pointers (or contain them, if structs, especially map-like entries), you can use a pointer to a private static variable as the tombstone (since you know nobody else can access it); if integers, you can use the smallest integer, except for the very first element of the array, which you can avoid ever turning into a tombstone. The trick is never to insert a tombstone at an intermediate level of the "tree", but instead to bubble them down. (Bubbling down of self-invalidating elements like weak pointers is also quite possible, but turning all reads into potential writes complicates thread-safety, if you care about that.) You should also keep track of the number of deleted elements and perform a full re-sort if it exceeds the number of live elements or so. That said, there's enough branch-predictor overhead in this that you probably want to implement the delete-allowed version separately from the no-delete version.
Sketch of the `get` code as I'm familiar with it:
    from bisect import bisect_left

    SMALL = 16
    NOT_FOUND = -1

    def round_down_to_power_of_2(n):
        return 1 << (n.bit_length() - 1)

    def linear_search(array, elem, start, end):
        for i in range(start, end):
            if array[i] == elem:
                return i
        return NOT_FOUND

    def binary_search(array, elem, start, end):
        i = bisect_left(array, elem, start, end)
        return i if i < end and array[i] == elem else NOT_FOUND

    def get(array, elem, start, end):
        while True:
            length = end - start
            if length < SMALL:
                # note: after the first iteration, length can be 0, 1, etc.
                return linear_search(array, elem, start, end)
            # recompute chunk_size from scratch on every iteration; merely
            # halving it can name a chunk that doesn't exist (see below)
            chunk_size = round_down_to_power_of_2(length)
            assert start + chunk_size <= end
            rv = binary_search(array, elem, start, start + chunk_size)
            if rv != NOT_FOUND:
                return rv
            start += chunk_size
Note that it does not suffice to do `chunk_size /= 2` at the bottom of the loop; you have to recompute it from scratch each time. Consider SMALL = 16 and len = 1050: the first chunk has 1024 items, but after searching it only 26 remain, so the next chunk has 16 items, not 512. Recomputing from the remaining length means we're effectively skipping over the "null" arrays after all.
I'm being pretty thick here. I understand that each bucket is sorted, but what distinguishes the buckets? It talks about merging into slot 0, and then merging into slot 1... with what?
I know I could spend some time and read the code, but I'm hoping someone is kind enough to post a better explanation of the structure invariant and the insertion process.
...each array (slot) is guaranteed sorted, and is guaranteed binary-searchable.
...each array (slot) is guaranteed sorted, so adjacent slots can be "merge-sorted" into the next slot "for free".
...to understand recursion, you must first understand recursion. ;-)
...the "B" and "C" steps are actually "in error", but hopefully is illustrative that they would both be valid representations and are equivalent from a searching perspective.
Basically, you're paying the pointer cost (overhead) for only log-n "slots" (heads), making it space-efficient. Dealing _only_ with sorted sub-arrays means you're always either binary-searching (time-efficient) or merge-sorting (popping previously sorted items from list A or list B), which is also time-efficient.
It's kind of neat b/c (for example) you can get "Z" up at the top when you think it should be "down below", but it doesn't really matter b/c you'll do a binary search in slot 1 (nope, it's not "Z"!), and then binary-search down below until you find what you're looking for. When it's "time" for "Z" to move down, it'll move down to the right place "automatically" as you're merging adjacent (sorted) lists.
Maybe think of it as: 1,2,4,8,16, where the slot with 16 entries is always perfectly sorted, but you have the temporary buffers of 1,2,4,8 to be pre-sorting your stuff until you need to bump up to 32 entries.
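Since the question above was about the invariant and the insertion process, here is a hedged sketch (my code, not the article's), assuming slot i holds either nothing or a sorted array of exactly 2^i elements; the insert itself is the same carry loop as in the make_streamer sketch earlier in the thread, so this shows the resulting states plus the search side:

    from bisect import bisect_left

    # worked insertion trace for values 1..5 ("-" is an empty slot):
    #   after 1: [[1]]
    #   after 2: [-, [1,2]]             two size-1 runs merged into slot 1
    #   after 3: [[3], [1,2]]
    #   after 4: [-, -, [1,2,3,4]]      [3]+[4] collide in slot 0, merge
    #                                   into slot 1, collide with [1,2],
    #                                   merge again into slot 2
    #   after 5: [[5], -, [1,2,3,4]]

    def contains(slots, value):
        # at most one binary search per non-empty slot
        for run in slots:
            if run is not None:
                i = bisect_left(run, value)
                if i < len(run) and run[i] == value:
                    return True
        return False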
Thank you so much. So it's a lot like a resizable sorted array that doubles when it gets full, but instead of copying the old one, we keep an array of arrays.
Oh nice, I've designed something similar before. You can do this within a single array by always adding new elements to the end of the array and performing an in-place merge of adjacent equal-sized bins. So the example given would look like:
1, 4, 6, 7, 8, 10, 11, 12, 2, 3, 9, 13, 5
The structure is implicit in the size (as mentioned in the article). No need for an explicit array-of-array structure.
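A hedged sketch of this single-array layout (my code, not the commenter's): since the run sizes are exactly the set bits of the element count, appending and then merging trailing equal-sized runs mirrors a binary increment. heapq.merge keeps the sketch short; a real implementation would merge in place.

    from heapq import merge

    class ImplicitSet:
        # one flat list holding sorted runs of strictly decreasing
        # power-of-two sizes; the run sizes are the set bits of len(data)
        def __init__(self):
            self.data = []

        def insert(self, value):
            self.data.append(value)
            n, size = len(self.data), 1
            # while bit `size` is clear in n, two runs of length `size`
            # sit at the tail: merge them (the carry of a binary increment)
            while n & size == 0:
                lo, mid = len(self.data) - 2 * size, len(self.data) - size
                self.data[lo:] = list(merge(self.data[lo:mid], self.data[mid:]))
                size *= 2

    s = ImplicitSet()
    for x in range(1, 14):
        s.insert(x)
    # 13 = 8 + 4 + 1: s.data now holds an 8-run, then a 4-run, then a 1-run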
----
A related data structure I've been playing around with in F# (but haven't had time to write up) --
1. Instead of merging arrays, just append them.
2. Instead of arrays, use binary trees.
3. Instead of an array of these arrays/binary trees, use a linked list (so, a linked list of complete binary trees of strictly increasing depth). Omit elements whose trees are empty.
4. Put two of these back to back ("large" ends touching).
5. Recognize that if one of these lists is empty, a binary tree (or half of one) can be moved from the back of the other list to populate it.
Now you have a purely functional double-ended list with log(n) indexing, insertion, deletion, and append, while retaining a high degree of sharing.
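A sketch of the one-sided core of steps 1-3 (in Python rather than F#, to keep one language for the sketches here; all names are mine, and the back-to-back pairing of steps 4-5 is omitted): perfect trees storing elements in preorder give log(n) indexing.

    # tree nodes: None for empty, else (value, left, right)
    def tree_get(node, size, i):
        value, left, right = node
        if i == 0:
            return value
        half = (size - 1) // 2  # size is 2**depth - 1, so halves are equal
        if i <= half:
            return tree_get(left, half, i - 1)
        return tree_get(right, half, i - 1 - half)

    def seq_get(trees, i):
        # skip at most log(n) trees, then descend at most log(n) levels
        for size, root in trees:
            if i < size:
                return tree_get(root, size, i)
            i -= size
        raise IndexError(i)

    # the values 0..9 packed as trees of sizes 3 and 7:
    trees = [(3, (0, (1, None, None), (2, None, None))),
             (7, (3, (4, (5, None, None), (6, None, None)),
                     (7, (8, None, None), (9, None, None))))]
    assert [seq_get(trees, i) for i in range(10)] == list(range(10))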
----
Similarly, another related data structure I've been playing around with in Prolog --
1. No merging or appending.
2. Instead of arrays, use binary trees.
3. Instead of an array of these arrays/binary trees, use an infinite linked list, in order of increasing depth.
4. Do not explicitly instantiate this structure. Represent everything initially as a logic variable, only expanding linked list and binary tree nodes as needed.
5. Place two of these back to back, and add one extra logic variable in-between.
Now you have a purely relational array with integer indexes, supporting log(n) indexing and assignment, no upper or lower bound, and native Prolog unification. (Which coincidentally is identical to the notion of "array" in SMTLIBv2.)
----
TL;DR: sequences of complete binary trees of increasing depth can be used for many interesting data structures.
Oh, I don't know that there's a way to do that. The G++ implementation uses free memory if available for linear merge, else it falls back to n log n merge. I don't think this affects the performance of insert though (which is already log n due to the necessary lookup).
Considering the nested arrays (indirection), this doesn't seem that attractive for membership testing/insertion compared to a hash set. What am I missing?
Compared to a hash table, the binary array set has better worst-case insertion and search times, has zero slack space, and relies on comparisons rather than hashing.
No need for a nested array. You can simply lay them end to end in decreasing size and perform in-place merges. The structure is implicit in the size of the set.