Multi-Array Queue (github.com/multiarrayqueue)
104 points by vitpro2213 8 months ago | 32 comments



I appreciate the detailed descriptions and it’s a neat idea, but with respect, this does not strike me as particularly novel. I’d put money down that there are implementations of this in the wild that have been written and forgotten. Kind of analogous to the concept of convergent evolution, this is a natural solution to a particular problem that pops up every once in a while.


I think it's semi-novel, as I have not seen a queue exactly like this, but the structure of the queue itself is not novel - it closely resembles Brodnik et al.'s RAOTS [1], which also uses an array of pointers to other arrays that increase geometrically in size. RAOTS offers amortized O(1) for most operations and O(√n) excess space.

There are also Bagwell's VLists [2], which were based on RAOTS; he presents an example deque for them, but it differs from OP's implementation.

A note about the VList versus RAOTS: in his paper, Bagwell claimed the VList performs better, giving a comparison against several software MSB calculations. That may have been true when it was published, but nearly all modern hardware has instructions for very quickly calculating the MSB (either a single-cycle instruction, or via count leading zeros), so it's questionable whether there's a real benefit, as the VList requires additional metadata, which also comes at the cost of power-of-2 alignment of the sub-arrays. However, the VList was designed to be used for a Lisp implementation, for which there may still be other benefits. In C, for example:

    msb = 8*sizeof(unsigned long long) - __builtin_clzll(idx)  /* idx as unsigned long long */
Or in Java

    msb = Long.SIZE - Long.numberOfLeadingZeros(idx)
This Multi-Array Queue could perhaps benefit from this previous work. In particular, if you constrain the data arrays to be powers of 2 in size, you can use the MSB calculation to very quickly determine the index of the sub-array in the "rings" array, and by masking out the MSB, you determine the index within the sub-array (see the sketch below). The RAOTS paper is a forgotten gem which every language developer should be aware of when implementing lists in their stdlib. RAOTS structures can be used for immutable lists too, in place of linked lists, as cons only requires copying the contents of one sub-array and copying the rings array. In fact, you can modify the structure slightly to make the rings array a plain old linked list, which makes consing immutable lists even cheaper, at the cost of slower random access.
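
Here is a minimal sketch of that indexing scheme (my own illustration, not code from the repo or the papers), assuming the k-th sub-array has size 2^k, so the sizes run 1, 2, 4, 8, ...:

    // Element i maps to j = i + 1: the MSB of j selects the sub-array
    // ("ring"), and the remaining bits select the slot within it.
    final class PowerOfTwoIndex {
        static int ring(long i) {
            long j = i + 1;
            return 63 - Long.numberOfLeadingZeros(j);   // MSB position of j
        }

        static long slot(long i) {
            return (i + 1) & ~(1L << ring(i));          // mask out the MSB
        }

        public static void main(String[] args) {
            // i=0 -> ring 0, slot 0; i=1 -> ring 1, slot 0; i=6 -> ring 2, slot 3
            for (long i = 0; i < 8; i++)
                System.out.println(i + " -> ring " + ring(i) + ", slot " + slot(i));
        }
    }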

[1]: https://cs.uwaterloo.ca/research/tr/1999/09/CS-99-09.pdf

[2]: https://core.ac.uk/download/pdf/147902641.pdf


Thanks, will look at it.

I actually thought the same: given e.g. the complex structures published in ACM papers, it would be a surprise if MultiArrayQueue were a completely new discovery.

We are not in the pioneer years of the 1960s anymore (unfortunately :-))


In particular, embedded stuff (with tight resource constraints) is where you're likely to find this and other "specialised" data structures in use.


The embedded space would definitely prefer the ‘fixed size’ problem that this data structure claims to solve.


Many ring-buffer implementations grow the backing storage array transparently on enqueue but do so in place, discarding the old arrays; what's the advantage of keeping the previous arrays? Naively I'd say it would reduce GC churn because you wouldn't have to free the old arrays, but I'm curious what the impact of that is in benchmarks.

Separately: the simulator is cool and very helpful!


If you discard the old array (after allocating a bigger one), you must also copy all of the enqueued material.

Also, this one enqueue will be mega expensive - a clear "fat tail" in the latency histogram.

In MultiArrayQueue you keep all already enqueued material in-place, "just" allocate the bigger array, register the new diversion, enqueue the one new element - and done.
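
A rough sketch of the contrast (hypothetical names like copyGrow/divert/rings - my illustration, not the actual MultiArrayQueue code):

    import java.util.Arrays;

    final class GrowthSketch {
        // Copy-grow (typical ring buffer): allocate bigger, copy everything,
        // drop the old array. The enqueue that triggers this is O(n).
        static Object[] copyGrow(Object[] old, Object newElement) {
            Object[] bigger = Arrays.copyOf(old, 2 * old.length); // the O(n) "fat tail"
            bigger[old.length] = newElement;
            return bigger;                                        // old array becomes garbage
        }

        // Diversion style, as described above: keep the old arrays in place,
        // register the new, bigger one, and write only the new element.
        static void divert(Object[][] rings, int nextRing, Object newElement) {
            Object[] bigger = new Object[2 * rings[nextRing - 1].length];
            bigger[0] = newElement;
            rings[nextRing] = bigger;  // previous arrays stay live; nothing copied
        }
    }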

Thanks


Why not free the previous smaller chunk after the reader has finished reading from it?

For me it would be better to allocate the new buffer but allow reading from the old one while it still contains data, and after that deallocate it and keep only the new one in use.


You might run into ABA, though that isn't an issue for managed languages.


Yeah, leaving small old buffers behind seems like a major no-no to me. It could be useful if you think you'll shrink back down, but it feels like the cache-locality suffering and iteration/tracking penalties strongly incentivize getting rid of the old buffer ASAP.

One other thing I want to shout out: I saw what I thought was a really neat multicast ring buffer the other day, where the author has an atomic for each element rather than the typical reader/writer atomics. The promise is much less contention on any given atomic, in most cases. https://github.com/rezabrizi/SPMC-Queue https://news.ycombinator.com/item?id=40410172
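
A hedged sketch of that per-slot-atomic idea (MulticastRing and its fields are my own illustration, not the linked repo's code): one writer, many readers, and each reader spins on the sequence counter of its own slot instead of a shared cursor.

    import java.util.concurrent.atomic.AtomicLongArray;

    final class MulticastRing {
        final long[] data;
        final AtomicLongArray seq;  // seq[i] = how many times slot i was written
        long writeIdx;              // writer-private; no shared atomic needed

        MulticastRing(int capacity) {
            data = new long[capacity];
            seq = new AtomicLongArray(capacity);
        }

        void publish(long value) {
            int i = (int) (writeIdx++ % data.length);
            data[i] = value;
            seq.incrementAndGet(i);  // release: readers see data before the bump
        }

        // Each reader keeps its own readIdx. A reader lapped by the writer
        // would need overrun handling, which this sketch omits.
        long read(long readIdx) {
            int i = (int) (readIdx % data.length);
            long want = readIdx / data.length + 1;  // expected write count for this slot
            while (seq.get(i) < want) Thread.onSpinWait();
            return data[i];
        }
    }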


Removing anything in non-blocking structures is problematic; see e.g. the referenced lecture by Professor Scott.

You never know how many concurrent threads still "are" at the place you wish to remove.

You would have to deal with stuff like hazard pointers, limbo lists and the like.

Better to keep the small arrays there.


Isn't this only an issue when you allow referencing data in the queue?

If the queue only allows copying data out, you can increase the reader pointer after the data has been copied to a different buffer; therefore nothing can be at the place we are removing.
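
A tiny sketch of that "copy out, then advance" argument (hypothetical single-reader code, not from the repo):

    import java.util.concurrent.atomic.AtomicLong;

    final class CopyOutReader {
        final long[] data;
        final AtomicLong readIdx = new AtomicLong();  // the index the writer watches

        CopyOutReader(long[] data) { this.data = data; }

        long take() {
            long r = readIdx.get();
            long v = data[(int) (r % data.length)];  // copy the element out first
            readIdx.set(r + 1);   // only now tell the writer the slot is free
            return v;
        }
    }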


With ConcurrentMultiArrayQueue, there can be N threads INSIDE the program code of the Queue, running or preempted (for an unpredictable time), and you cannot control it.


I'm sorry, but that doesn't seem at all like any kind of fundamental constraint. At its most basic: simply don't have your readers advance until after they're done reading? This seems trivial as fuck. I've seen a lot of protestations that made me think, OK, maybe I'm missing some factor? But no, the problem of understanding when data is done being read really isn't anywhere near as hard as the rebuttals here make it seem, and everything works much better when we can accept this constraint.

Perhaps we want any of the given multicast readers to be able to read more than one element at a time, and that does complicate things somewhat. But it's hardly impossible to handle.

Again: I deeply disagree with the premise here that this can't be done. And it isn't even really a significant penalty if your consumers do have to be async consumers that need to hold their reads open for a while. It's unclear what the protests are about.


Hi jauntywundrkind, just to make sure we have the same understanding:

The smaller arrays are not "left behind" in the garbage sense - the queue will use them again and again in the next rounds. See the simulator. Re-use, not Re-cycle - the Garbage-Free Mantra.

If the Queue re-cycled the smaller arrays, it would not be garbage-free anymore.

If you still believe that the smaller arrays should be re-cycled (I would be curious why), then comes the technical problem:

Let's imagine a reader stands immediately before reading the array (e.g. to check if the writer has already written). Now the OS preempts it. For how long? We don't know. In the meantime, everything in the queue moves forward, and the program code in some other thread (probably a writer) decides to de-allocate the array (and indeed does it).

Now the preempted reader wakes up, and the first thing it does is read from that (de-allocated) array ...


When I wrote it [0], the goal was semi-bounded [1] latency.

[0] It needs a refactor, but here's some version of the idea. https://github.com/hmusgrave/zcirc

[1] You're not _really_ bounded since you have blocking calls to the underlying allocator (usually somewhere close to the OS, and even that can do more work than you might expect when you ask for a contiguous array), but it's still much cheaper than a bulk copy.


> Do not send me Pull Requests - the code is small so I want to maintain it single-handedly.

Interesting contribution policy. Makes sense in lieu of producing a strict style guide, I suppose.


While the author probably wants to own the main code, I'm sure there are other things he may want help with from external contributors.

For example, unit tests - at least to show that it works, and how to use it. Also a build config, like Gradle/Maven, so others would be able to use this lib.


The author does say: "Reviews, tests and comments are welcome"


I thought GitHub let you turn off PRs on a repo -- the author may wanna do that.


No, you can't disable the PRs tab; you can disable any tab except Code and PRs. Torvalds famously complained about this because he handles pull requests through email, so all PRs on his Linux repo are useless.


PRs cannot be disabled.


...without archiving the repo.

Archived repos, however, have that ugly yellow warning.


Can you still push to the repo? Perhaps via --force?


Reminds me of the blist [0] package in Python, implemented using B+ trees and offering O(log n) performance for operations that are O(n) with the built-in list type - such as insertion or deletion of items at the start of the list, or somewhere in the middle.

Sadly it seems to be abandonware, with no commits in the last 10 years and compilation errors on Python 3.9 and above [1].

[0]: http://stutzbachenterprises.com/blist/implementation.html

[1]: https://github.com/DanielStutzbach/blist/issues/90


Interesting. A lot of care went into this. Thank you for sharing.

Is there a reason you chose not to implement java.util.Queue?


Implementing Queue would mean also implementing Collection and Iterable, and this would bring pains and ugliness, especially with the concurrent code.

Look, e.g., at the disclaimers on the size method of java.util.concurrent.ConcurrentLinkedQueue.


Hi @vitpro2213, it's very interesting (at least to me) to find out about this data structure a few months after I had a need for its somewhat distant cousin: https://github.com/Fusion/slotmachine

In my case, I needed a way to book and release two-port tuples really fast to accommodate an RTP simulator. So I wrote that slotmachine data structure and have been running it in production for months, and can confirm: yes, performance is good.

Note: I should mention that my approach is almost exactly the opposite of yours: I create a final backing slice, then create the traversal slices.


Will have a look at it, thanks


This reminds me of the queue used by libuv: https://stackoverflow.com/questions/61163161/how-does-the-li...


So, similar to ArrayList, where it doubles the allocation automatically when it runs out of room?


Why not also a multi-array List?



