I designed and implemented a mostly lock-free dynamic thread scheduler for streaming runtimes, and I learned some similar lessons: avoid global data, and amortize the synchronization you can't avoid. One of the main peculiarities of a streaming context is that work-stealing is counter-productive; it's more like cutting in on work that would have been done anyway. It's better to go find a part of the streaming graph that is not currently being executed than to steal some from another thread.
Running the benchmarks locally (zig is fastest, but that's probably expected)
zig (~1 week older than HEAD):
./zig-out/bin/qsort
filling
shuffling
running
took 177.46ms
rust nightly:
cargo run --release --bin qsort
warning: the cargo feature `edition2021` has been stabilized in the 1.56 release and is no longer necessary to be listed in the manifest
See https://doc.rust-lang.org/nightly/cargo/reference/manifest.html#the-edition-field for more information about using this feature.
Finished release [optimized] target(s) in 0.02s
Running `target/release/qsort`
filling
shuffling
running
took 896.91656ms
cargo run --release --bin rsort
warning: the cargo feature `edition2021` has been stabilized in the 1.56 release and is no longer necessary to be listed in the manifest
See https://doc.rust-lang.org/nightly/cargo/reference/manifest.html#the-edition-field for more information about using this feature.
Finished release [optimized] target(s) in 0.02s
Running `target/release/rsort`
filling
shuffling
running
took 212.190694ms
go 1.16.3:
go run qsort.go
filling
shuffling
running
took 222.90356ms
on macOS 11.5.2 with a 2.4 GHz 8-Core Intel Core i9
The fact that the Rust tokio version (the one that uses tokio tasks instead of threads) is slow is to be expected. tokio tasks aren't appropriate for running quicksort; they will have overhead compared to a regular threadpool because of the I/O reactor, waker, and other machinery that will never get used.
rayon or crossbeam are more appropriate since they are actually designed for general-purpose thread work. Using rayon's scopes will also get rid of the `unsafe` that's being (ab)used to create the fake `&'static mut` slices.
Though for some reason, the rayon version (rayon_qsort.rs) uses scopes, but still uses `unsafe` to create `&'static mut` slices... ( https://github.com/kprotty/zap/pull/3 )
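To make that concrete, here is a minimal sketch (my own, not the repo's code; the partition helper and cutoff are illustrative) of a recursive parallel quicksort where rayon's scoped spawns borrow both halves of the caller's slice, so no `unsafe` or fake `&'static mut` is needed:

```rust
// Hypothetical sketch; assumes the `rayon` crate.
fn par_quicksort(slice: &mut [i64]) {
    if slice.len() <= 32 {
        slice.sort_unstable(); // small-size fallback
        return;
    }
    let p = partition(slice);
    let (lo, rest) = slice.split_at_mut(p);
    let hi = &mut rest[1..]; // skip the pivot itself
    // The scope guarantees both tasks finish before `slice`'s borrow ends,
    // which is why no 'static lifetime (and no unsafe) is required.
    rayon::scope(|s| {
        s.spawn(|_| par_quicksort(lo));
        s.spawn(|_| par_quicksort(hi));
    });
}

// Simple Lomuto partition; uses the middle element as the pivot and returns
// the pivot's final index.
fn partition(slice: &mut [i64]) -> usize {
    let mid = slice.len() / 2;
    slice.swap(mid, slice.len() - 1);
    let pivot = slice[slice.len() - 1];
    let mut store = 0;
    for i in 0..slice.len() - 1 {
        if slice[i] < pivot {
            slice.swap(i, store);
            store += 1;
        }
    }
    slice.swap(store, slice.len() - 1);
    store
}

fn main() {
    let mut v: Vec<i64> = (0..1_000_000).rev().collect();
    par_quicksort(&mut v);
    assert!(v.windows(2).all(|w| w[0] <= w[1]));
}
```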
I don't think tokio's slowness here is "to be expected". There isn't much reason for tokio tasks to have that much overhead over normal thread pools. The I/O driver shouldn't be called in such a benchmark given there's no I/O work happening. Wakers only add a reference count inc/dec + an atomic CAS for wake(), which should only happen on the JoinHandle `awaits` [4], compared to just an atomic swap for the Zig case on the join handle.
Golang doesn't poll for I/O under such cases [0], and tokio should be using the `ParkThread` parker for putting threads to sleep [1] given the `net` features aren't enabled (not sure if this is actually the case), which you can force with a custom Runtime initialization instead of `#[tokio::main]` as an exercise.
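As a rough illustration of that exercise (my own sketch, not code from the post): you can build the multi-threaded runtime explicitly and simply not enable the I/O or time drivers, instead of relying on `#[tokio::main]`. Whether that ends up selecting the `ParkThread` parker is an internal detail of the tokio version in use.

```rust
// Sketch; assumes the `tokio` crate with the "rt-multi-thread" feature.
fn main() {
    let rt = tokio::runtime::Builder::new_multi_thread()
        // Deliberately not calling .enable_io() / .enable_time(), so no
        // reactor or timer driver is set up for this runtime.
        .worker_threads(8)
        .build()
        .expect("failed to build runtime");

    rt.block_on(async {
        let handle = tokio::spawn(async { 21 * 2 });
        println!("answer = {}", handle.await.unwrap());
    });
}
```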
`crossbeam-deque` requires heap allocation for the run queues, heap allocates on growth, and garbage collects said memory. This is an overhead I wished to avoid, and it's something tokio has been making improvements to avoid as well [2].
`rayon` isn't a good comparison here given `rayon::join` is optimized to hook directly into the scheduler and run the caller only until the other forked-section completes [3]. This isn't general purpose, and it takes advantage of unbounded stack allocation which can technically cause a stack overflow. Zig could do this and also take advantage of batch scheduling, but it complicates the code and is unfair here given `async` usage. Tokio, Golang, and the Zig benchmarks all require heap allocation on spawn, so I believe that makes for a fairer comparison. This is also why I used rayon scopes instead of join(): less specialization, and it reintroduces the heap allocation from the unbounded concurrency.
The `unsafe` there is from me copying the benchmark code from the tokio version to the rayon version and forgetting to remove the hack. In tokio, ownership of the array needed to be passed into the function given the lifetime was no longer linear from the spawn(), I assume (correct me if I'm wrong here, but this is what the compile error hinted at). So I needed to recreate the array after the function, hence the unsafe. If there's a better way for the tokio version, please send a PR. I see you've done so for the rayon version and I gladly merged it.
>I don't think tokio's slowness here is "to be expected".
And yet on my machine the rayon version takes ~160ms and the tokio version takes ~1350ms. This isn't at the level of some minor missed performance optimization.
>There isn't much reason for tokio tasks to have that much overhead over normal thread pools.
tokio is an async runtime. tokio tasks are meant to be for distributing I/O-bound work. It would be at least a little more correct to use spawn_blocking for CPU-bound tasks, but that still doesn't work for your recursive calls because that's not what it's meant for.
In general, if you have CPU-bound work in your tokio-using application, you run it on a different threadpool - tokio's blocking one, or a completely different one.
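For example (a sketch of the general pattern, not the benchmark code; assumes the `tokio` crate with full features plus `rayon`), CPU-bound work can be pushed onto tokio's blocking pool, or bridged to a dedicated compute pool and awaited through a channel:

```rust
#[tokio::main]
async fn main() {
    // Option 1: tokio's blocking pool for a one-off CPU-heavy chunk of work.
    let sum = tokio::task::spawn_blocking(|| (0u64..10_000_000).sum::<u64>())
        .await
        .unwrap();

    // Option 2: a separate compute pool (rayon here), bridged back to async
    // code through a oneshot channel.
    let (tx, rx) = tokio::sync::oneshot::channel();
    rayon::spawn(move || {
        let product: u64 = (1u64..=20).product();
        let _ = tx.send(product);
    });
    let product = rx.await.unwrap();

    println!("sum = {}, 20! = {}", sum, product);
}
```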
>`rayon` isn't a good comparison here given `rayon::join` is optimized to hook directly into the scheduler and run the caller only until the other forked-section completes [3]. [...] This is also why I used rayon scopes instead of join(): less specialization, and it reintroduces the heap allocation from the unbounded concurrency.
My original comment was also talking about scopes, not `rayon::join`. So yes, `rayon` is absolutely a good comparison.
This actually can be at the level of a missed optimization. A single lock-protected run queue shared among all the threads scales even worse than the tokio version. Sharding the run queues and changing the notification algorithm, even while keeping locks on the sharded queues, improves throughput drastically.
Tokio is an async runtime, but I don't see why being an async runtime should make it worse from a throughput perspective as a thread pool. I actually started on a Rust version [0] to test the theory of whether async Rust was the culprit, but realized that I was being nerd-sniped [1] at this point and should continue my Zig work instead. If you're still interested and want to see that in action, I'm open to receiving PRs and questions on it.
It's still correct to benchmark and compare tokio here given the scheduler I was designing was meant to be used with async tasks: a bunch of concurrent and small-executing work units. I mention this in the second paragraph of "Why Build Your Own?".
The thread pool in the post is meant to be used to distribute I/O-bound work. A friend of mine hooked up cross-platform I/O abstractions to the thread pool [2] and benchmarked it against tokio, finding greater throughput and slightly worse tail latency under a local load [3]. The thread pool serves its purpose, and the quicksort benchmark is there to show how schedulers behave under relatively concurrent workloads. I could've used a benchmark with smaller tasks than the CPU-bound partition()/insertion_sort(), but this worked as a common example.
I've already mentioned why rayon isn't a good comparison: 1. It doesn't support async root concurrency. 2. scoped() waits for tasks to complete by either blocking the OS thread or using similar inline-scheduler-loop optimizations. This risks stack overflow and isn't available as a use case in other async runtimes due to primarily being a fork-join optimization.
I'm not an expert on the go scheduler, but my perception is that it is more of a focused, single-purpose component, whereas tokio seems like a sprawling swiss-army-knife of a library if you browse the source.
The tokio scheduler and the go scheduler are roughly equivalent. Much of tokio's bulk is reimplementing much of the std lib in an async-compatible way (io, os, blocking).
If you browse the source, the go scheduler has complexities to deal with that tokio doesn't, as well. The thread pool is unified between worker threads & blocking threads. Go also does goroutine preemption via signaling/SuspendThread + a background monitor thread called sysmon. Go does garbage collection, and the tracing/race-detection/semantics are tightly coupled to both its allocator and its scheduler. Go also exposes and maintains its entire standard library, which includes an http client/server (the tokio org maintains their own as hyper, but it's separate from tokio the runtime). It can be fair to argue that Go is just as "swiss-army-knife" of a system as tokio.
It depends on what you want to do. If you are doing io-bound work, Tokio would be what you want -- you would use it as a runtime for the async capabilities in Rust. If you have cpu-bound work, then rayon is what you want to use. Rayon is a work-stealing parallelism crate -- it will schedule work to be done, and different threads will schedule portions of it as they become available to do work. It's very easy to get 100% CPU utilization across all cores using Rayon if your work is naturally parallelizable, and the interface is dead simple: anywhere you do a .iter(), you can turn it into a .par_iter(), and Rayon will parallelize the work.
Note there is some overhead to using Rayon -- you normally will be better off doing your work on a single thread unless you have a large number of elements in your collection... I found for my application I needed more than 1e6 elements before I saw an appreciable performance benefit to using Rayon.
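A tiny example of that `.iter()` -> `.par_iter()` swap (illustrative numbers; assumes the `rayon` crate):

```rust
use rayon::prelude::*;

fn main() {
    let data: Vec<u64> = (0..2_000_000).collect();

    // Sequential.
    let seq: u64 = data.iter().map(|x| x * x).sum();

    // Parallel: the only change is iter() -> par_iter().
    let par: u64 = data.par_iter().map(|x| x * x).sum();

    assert_eq!(seq, par);
}
```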
As others said, Crossbeam is for sending and receiving messages across threads. I use it alongside tokio and rayon.
Crossbeam is a library that provides various concurrency primitives, such as a queue type that allows stealing of work.
Rayon is a library for running code in parallel. Typically you'll give it a (parallel) iterator of sorts, and it will distribute the work across a pool of threads.
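A small sketch of that message-passing side (assumes the `crossbeam-channel` crate; the fan-out shape is just for illustration):

```rust
use crossbeam_channel::unbounded;
use std::thread;

fn main() {
    let (tx, rx) = unbounded::<u64>();

    // A few worker threads all receive from the same channel.
    let workers: Vec<_> = (0..4)
        .map(|_| {
            let rx = rx.clone();
            thread::spawn(move || rx.iter().map(|n| n * n).sum::<u64>())
        })
        .collect();

    for n in 0..1_000u64 {
        tx.send(n).unwrap();
    }
    drop(tx); // closing the sending side lets the workers drain and exit

    let total: u64 = workers.into_iter().map(|h| h.join().unwrap()).sum();
    assert_eq!(total, (0..1_000u64).map(|n| n * n).sum::<u64>());
}
```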
Erm, could somebody explain to me why I shouldn't read this as an argument in favor of Go, given that it's about as fast as the fully optimized versions of Zig and Rust?
Go's value proposition is that it has good bang-for-the-buck. It's fairly easy to write in. There's easier, but it's fairly easy. It performs fairly well. There are things that perform better, but it's fairly good. It scales fairly well. There's things that scale better, but it's pretty good.
If you draw all this out on the programming landscape, I think one of the reasons Go has succeeded as well as it did is that this was a poorly covered part of the landscape. There were languages that were much easier, but performed much worse, and languages that performed much better, but generally required a lot more developer effort.
I don't expect Go to ever top the charts on performance on a task like this, but it's often hanging around at only a bit slower, and it does that across a wide variety of tasks and on a wide variety of dimensions. It's not the best at much of anything, but it's pretty darned good for pretty cheap on a lot of measures.
Well, working with many tasks is THE thing with Go, right? If it were much worse at this, Go would fail.
In situations like this, it's important to remember that Rust/Zig are (systems) languages you could use to make "Go", but also a million other different things with different goals.
So another way to look at this is: "If I want to design a language with a runtime like Go, these two can get that close!"
I'm not sure how this implies it is flawed. It benchmarks thread pools, so on a system which allows more parallel concurrent tasks (i.e. a 32-thread AMD CPU) the throughput is expected to scale somewhat, and you see this in your results.
Also, `zig -Dc` links to libc (`-Dc`) but builds in debug mode by default, similar to Rust and C/C++ compilers. The readme contains instructions to compile it with optimizations (`-Drelease-fast`), the rust versions do so as well (`--release`) so you should re-evaluate your results.
See one of my other comments in this post on why I didn't use rayon::join() by default.
> Also, `zig -Dc` links to libc (`-Dc`) but builds in debug mode by default, similar to Rust and C/C++ compilers
Right! This was a misunderstanding of mine. Updated results (times in ms):
8-thread Intel laptop machine:
zig 0.9.0-dev.958: 230
rust 1.5.0-nightly 21-09-12/qsort: 545
rust 1.5.0-nightly 21-09-12/rsort: 255
go 1.16.8: 355
32-thread AMD workstation:
zig 0.9.0-dev.958: 125
rust 1.5.0-nightly 21-09-12/qsort: 780
rust 1.5.0-nightly 21-09-12/rsort: 135
go 1.16.8: 135
This is why I think conclusions should not be drawn: there's excessive variability between systems.
On the parent poster's system, zig is faster by a much larger margin than the two systems I've tried. And there's a ~45% difference in Go's performance compared to the fastest runner for the respective system.
The results will vary on different systems given how the combination of the CPU and OS handle thread synchronization + scheduling. On one of my desktops running Windows 10 with an i7-4790k, the Go qsort runs slower (~600ms) than the Rust qsort (~490ms) while on Linux systems the results are generally the opposite.
The zig thread pool appears consistently faster for this benchmark on different systems, and rayon::join appears consistently faster than the zig thread pool on different systems too. I believe you can somewhat conclude the advantages of their designs from this relative to each other rather than in an absolute sense.
At first glance, this looks great, almost 20% speedup!
But genuine question, this looks to be how almost every new tech begins. Initially it's FAST because it keeps things as barebones as possible, and everything else other than the specific module can be ignored.
But no feature works in isolation, so as more features get added – scheduling constraints, safety quirks, or random features that cut across the language – the 20% looks like nothing.
I did go through the blog, but how can one be sure that this 20% gap is a guaranteed design advantage, and not just a temporary deferral of everything else?
fwiw Zig is taking its role as a low-level, bare-bones "c replacement" as a core design constraint. For instance, Zig has eschewed even fairly tame features like operator overloading because they're not considered in line with Zig's identity as a minimal, explicit language.
So feature creep is always a concern, but from what I understand Zig has a better chance at avoiding this than other projects due to the language's guiding philosophy.
> how can one be sure that this 20% gap is a guaranteed design advantage
It's no guarantee, but in my experience, Zig is a language that lends itself very well to composability. I imagine that instead of adding features to stdlib, anyone who needs "extra features" in their threading system will find it quite easy to implement their own as a library outside of stdlib that can easily be swapped in and out in A/B tests by the library consumer. If some feature or another is REALLY wanted in the stdlib for some reason, it would probably be easy to drop it behind a comptime flag as part of a type constructor function, which won't compromise (runtime) performance, just a teeny tiny hit to comptime performance.
> this looks to be how almost every new tech begins
On second thought, do you think this is true? The level of detail shown in this post is fairly extraordinary for event loops just getting started.
> how can one be sure that this 20% gap is a guaranteed design advantage
I would say because the design is solid and recognizes principles of mechanical sympathy that are fundamental and unlikely to change for quite some time.
Nobody can predict the future, but one thing that might help is that Protty is in the Zig core dev team and, as the post shows, he really cares about efficiency.
it seems like the design philosophy of zig is such that it's not going for feature creep/bloat, it's keeping the language very simple and eschewing unexpected complexity
Does this already use `target-cpu=native`? Not sure if that would be apples-to-apples with the Zig implementation (which is what's important here), but I'd be surprised if it didn't yield some interesting wins.
Rust is such a bloated, insane design-by-committee language. It's actually still surprising to me that corporations are defending it and using it. I sure won't.
It's also just so dang ugly to read. If Perl and C++ had a mutant offspring, I think it'd be Rust. And that's just not respectful to Perl!
I'm really happy that Zig is doing so well. I'm planning on contributing a little chunk of money to the ZSF whenever they get a 1.0 landed. I do think there are some lessons to learn from Rust about formal methods as foundational building blocks in a language. I'm eager to see Zig mature more around the memory-safety side of things.
I'm increasingly skeptical that any language with lifetimes can be anything other than "ugly to read". People complain about syntax, but what they're really complaining about is semantics.
Was? New junk is added to Rust every few weeks, using past tense seems intentionally deceitful, as does re-characterizing my comment. I really do mean ugly as in ugly. That's why I wrote that.
One of the strongest factors behind a lot of people choosing to stay away from Rust is that so many conversations with advocates are met with toxic and subtly toxic behavior. They demean software that's not memory safe the way that politicians use their words to sow anger. C has won, and Rust blew its shot aiming at C++ instead.
As Amazon swallows Rust and spits all the refuse out, I can't help but smile, because it's a community that's been begging for obsolescence via their attitudes and behaviors since day one.
It's not just the function's body, it's any block. Blocks in rust are expressions and return their last statement's value. It's nothing like Javascript's ASI where it's just inserting ; implicitly and only `return` can return a value.
It's also nothing like Javascript because Rust is statically typed. If you accidentally insert a semi-colon (or not), you're almost guaranteed to get a compile error.
To be very pedantic, the Vyukov MPSC queue, while indeed awesome in its simplicity and performance, is not technically lock-free, not even obstruction-free (and Vyukov never claimed it as such [1]): a producer halting between the atomic swap and the atomic store will prevent further nodes enqueued by other producers from ever being dequeued by the consumer.
[1] edit: and in fact there is a note exactly on this on the linked Vyukov page.
I'm aware that it's technically considered blocking due to a producer being preempted between swap and store causing the consumer to see an invalid link and report empty. I address this in the run queue section by noting that it's OK if empty is spuriously reported since the thread waking/waiting mechanism takes care of rescheduling such a case.
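To make the window in question concrete, here's a sketch of the producer side of a Vyukov-style intrusive MPSC queue (illustrative names and layout, not the post's code; the consumer side is omitted):

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

struct Node {
    next: AtomicPtr<Node>,
    value: u64,
}

struct Queue {
    // Producers swap themselves onto this end. In the full algorithm, `head`
    // initially points at a stub node and the consumer walks `next` links
    // from its own `tail` pointer.
    head: AtomicPtr<Node>,
}

impl Queue {
    fn push(&self, node: *mut Node) {
        unsafe { (*node).next.store(ptr::null_mut(), Ordering::Relaxed) };
        // Step 1: atomically claim the head position.
        let prev = self.head.swap(node, Ordering::AcqRel);
        // If this producer is preempted *here*, `prev` has no successor yet:
        // the consumer walking the `next` links hits a null link and has to
        // report "empty", even though later pushes by other producers have
        // already landed behind `node`. That's the blocking window.
        // Step 2: publish the link so the consumer can reach `node`.
        unsafe { (*prev).next.store(node, Ordering::Release) };
    }
}

fn main() {
    // Initialize with a stub node so the first push's `prev` is valid.
    let stub = Box::into_raw(Box::new(Node {
        next: AtomicPtr::new(ptr::null_mut()),
        value: 0,
    }));
    let q = Queue { head: AtomicPtr::new(stub) };
    let n = Box::into_raw(Box::new(Node {
        next: AtomicPtr::new(ptr::null_mut()),
        value: 42,
    }));
    q.push(n);
    // (Leaks the two nodes; a real queue's consumer/Drop would reclaim them.)
}
```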
Incredible post. Ticks all the boxes and exciting to see this coming to Zig. Protty has got to be a domain specialist when it comes to atomics and threads.
Is this using the JVM/JIT, or using Graal to make a native-image binary and calling that? If the measurement includes startup time, you'd get better results with a native image, since it shaves that off.
I think .NET 6 latest preview would slightly beat out Java here as well.
This isn't exactly apples-to-apples since Zig/Rust/Go are considered "systems languages", but the JVM has GraalVM for compiling to native binaries/libraries and even importing/exporting C functions and structs. And .NET of course has .NET Native + "Dotnet Native Exports", including the LLVM experiment where it uses LLVM bitcode instead of RyuJIT.
So you can make the argument that writing a native binary, or a library which exported this benchmark function as a C-callable method with a C header, in each language would be technically equivalent.
The JVM and .NET ones would have a large size (several MB each) but would otherwise still fill this requirement.
It's plain Java (i.e. JVM/JIT), ForkJoinTask based implementation. As in the original implementation, measurement is done around the quickSort() call.
One point is actually that the parallel quicksort algorithm is a bad benchmark for task schedulers (it doesn't scale well, for one thing). Another point is, well, that one can spend two years of deep technical work and then be easily beaten by some "legacy" tech in the course of a morning exercise. Maybe those bearded guys were good for something after all :)
First of all, you, along with a few others, misunderstood the goal of the scheduler. I note in the post that it's primarily for async execution. See a previous comment of mine on how a fork-join-optimized thread pool, which hooks into the scheduler to wait for the other forked side and run the poll-loop inline, is ideal for a fork-join case, and why I'm intentionally not benchmarking that. Given rayon::join actually does this, I'd be curious to see the results of that vs Java's ForkJoinPool on your machine to see if the optimizations match up.
Second, parallel quicksort isn't a bad benchmark, and it does scale enough to stress-test the spawning and joining aspects of the scheduler. Keep in mind, the workload that scales best AFAIK is one that is embarrassingly parallel and takes enough time to offset the cost of any scheduler overhead and contention. Again, this thread pool is optimized to execute small tasks. To that end, there are indeed better benchmarks, but quicksort with a small-size optimization is one that is most widely understood.
Finally, you're in the game of trying to invalidate others' work due to novelty, lack of understanding on your part, and generalizations about culture. I'm here to learn about cool scheduler designs. I'd appreciate it if you would contribute in that aspect instead of the former.
This is most likely using HotSpot as I don’t believe Graal has released anything past Java 11.
I don’t know if native-image would perform better. I’ve mostly found that it performs worse than HotSpot overall, especially once you start generating garbage and the heap gets larger, since the Serial GC won’t keep up with G1.
It's always great to see a robust new threadpool. Most of the stock ones I've used have horrible characteristics - for example on my Threadripper lots of production apps will spawn 48 or more threadpool threads and the scheduling/throughput characteristics on that are generally pretty dire because they needlessly compete for resources, etc. They tend to allocate garbage for every work item too, which makes them a bad choice for realtime scenarios. It's genuinely frustrating that most processes on my machine have 128 threads or more due to unintelligently scaling off ProcessorCount without an upper limit. For example, GIMP's bad threading policy means that operations like resizing an image take seconds instead of 100ms or less.
I've been using a custom threadpool for a while now but writing your own without the expertise necessary can be a real pain.
I don't think that it's the threadpool's fault that an application uses it incorrectly. Also, I think there are a lot of developers who have not considered that on today's machines, just spawning as many threads as there are cores is the optimal amount of threads in a thread pool for every use case. I wouldn't say it's a case of bad software but rather software that was written for the CPUs of 5 years ago. And generally, I don't think this will be an easy problem to solve due to the variety of heterogeneous and non-heterogeneous topologies of modern SoCs. In fact, I don't see this specific threadpool doing anything to optimize for the disparate core clusters of your threadripper, or to acknowledge the disparity between core clusters on the M1.
> I don't see this specific threadpool doing anything to optimize for the disparate core clusters of your threadripper, or to acknowledge the disparity between core clusters on the M1.
To do that efficiently, you need to pin thread groups to cores based on information about data usage.
This smells like over-optimizing for specific architectures to me, if you want to go beyond separating out stuff like hyperthreads for I/O.
Additional annoyance: there is no POSIX way to distinguish hyperthreads from physical cores.
I think a general-purpose threadpool should work well on general-purpose hardware, and it seems like the most popular SoCs in consumer devices will have heterogeneous cores et al., so a good implementation would schedule the threadpool appropriately. I agree that there is no POSIX way to distinguish between hyperthreads and regular threads, and this is something that should be improved. I'm not saying that the achievements made by the threadpool implementation are lackluster, or that any of the other solutions solve the issues I outline any better. What I am saying is that the comment I was originally referring to was somewhat mistaken about the benefits of a more optimal, yet naive, threadpool library.
This isn't just about hyperthreads, by the way. As long as the workload isn't compute-heavy and often stalls on memory, hyperthreads are just as good as regular threads. On a hardware level, there is no distinction between a regular core and a hyperthread core: either you multiplex a single physical core or you don't. Anyway, there is more to it than slower threads and faster threads. Accessing memory between threads will be slower depending on which core is trying to access which bits of memory; a core stealing work from a sibling core on the same chiplet will probably be able to do that quicker than stealing work from a core across the cluster, if the data prefetcher has been doing its job correctly. Spawning more threads than necessary might force a CPU to power up more cores than necessary, resulting in slower per-core performance and worse power efficiency, especially if a fast or slow cluster needs to be turned on, where a more optimal scheduling of threads might not force that to happen.
I think a general purpose thread pool by default should no longer spawn as many threads as there are _cores_, whatever that term even means, with optional toggles to inform whether the work that'll be scheduled will be compute heavy or not.
> just spawning as many threads as there are cores is the optimal amount of threads in a thread pool for every use case
Absolutely not.
Any task with any amount of I/O will have a significant amount of blocking. A GPU kernel may take microseconds or milliseconds to respond. RDMA (a memory-access over Ethernet) may take many microseconds.
Having multiple threads per core would be more efficient: it gives the cores something to do while waiting for SSD, RDMA, or GPUs. Remember: even the earliest single-core systems from the 1970s had multiple threads on one core; it's just more efficient to have multiple terminals to read/write to at a time.
--------
One software thread per hardware thread (since SMT8 machines exist, like POWER9 / POWER10) is only efficient in the most computationally expensive situations, which are in fact a rarity in today's world. Your typical programs will be waiting on the network interface or SSD.
IIRC: there's professional thread-pools out there that are 1.5x threads per hardware thread as a default option, and then scale up/down depending on how computationally expensive things look. That is: a 64-core/128-thread Threadripper would be overloaded with 192 threads, under the assumption that at least 33% of them would be waiting on I/O at any given time.
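A back-of-the-envelope version of that sizing rule (the 1.5x factor is just the heuristic mentioned above, not anything prescribed by a particular library):

```rust
use std::thread;

fn main() {
    let hw_threads = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);

    // Oversubscribe by ~1.5x on the assumption that roughly a third of the
    // workers are blocked on I/O (SSD, RDMA, GPU completions) at any time.
    let pool_size = hw_threads + hw_threads / 2;

    println!("{hw_threads} hardware threads -> {pool_size} pool threads");
}
```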
>Any task with any amount of I/O will have a significant amount of blocking. A GPU kernel may take microseconds or milliseconds to respond. RDMA (a memory-access over Ethernet) may take many microseconds.
The argument I failed to make was that with a heterogeneous distribution of memory bandwidth and compute resources, most user applications would benefit from spawning fewer threads than all available cores. In I/O heavy workloads, the correct thing to do is to do asynchronous I/O. This can be done for SSDs and GPUs. On contemporary systems where there are heavy costs associated with context switching, avoiding them and servicing multiple tasks without blocking will always be superior to spawning more threads to do more blocking.
When it comes to hyperthreading, I assume that the threads are cores - because from the OS perspective, they are, and you cannot distinguish two hyperthreads running on a single core anyway.
Also, I apologize, but the first sentence you're quoting is not what I intended to write. My point is that most application developers might still think that spawning as many threads as there are cores is a reasonable thing to do in all cases, but with CPUs that consist of 4 core clusters with 16 cores each, it's often better to spawn far fewer than the total number of cores available.
> In I/O heavy workloads, the correct thing to do is to do asynchronous I/O
You can't async mmap into memory reads (or the GPU-equivalent: cudaMallocManaged).
Today's I/O is looking more-and-more like a memory read/write. As such, your typical "node = node->next" pointer traversals could very well be an incredible string of I/O. That indirection could be in RAM, over SSD (mmap), over RDMA (ethernet pretending to be RAM), or into GPU-RAM (cudaMallocManaged)... with an appropriate PCIe command (possibly atomic-PCIe) to boot.
Async only works on the simplest of reading/writing patterns.
EDIT: Consider the "persistent kernel" pattern on GPGPU (which starts off with a lot of the same thought process you have in CPU-land. Launch only X wavefronts, where X is the size of the GPU). You primarily communicate with a persistent kernel over RAM / atomics. You don't use the classic cuda <<<blocks, threads>>>() (which admittedly has an async interface)... Instead: you read/write to magic managed-memory locations that will be eventually sync'd over PCIe to the GPU. This is because the "persistent kernel" is always executing. You launched it upon program start; there's no event to async-trigger on. You just pass data to the GPU. The GPU operates on the data at its leisure. Then it eventually returns data to a buffer elsewhere (probably with atomic compare-and-swaps, which traverse PCIe these days, to ensure coherence).
I recently spent some time reading the kernel code for io_uring, though only for older revisions (5.4 - 5.7). There I found out that a lot of it is actually implemented on top of existing kernel functions for blocking I/O and polled I/O, and does not replace them to make the system fully asynchronous. Practically that means that with a 5.4 kernel doing lots of disk I/O, you will still have lots of threads being blocked on I/O; only in this case those will be threads inside a kernel threadpool instead of in userspace. With 5.7 that model changed, and the threadpool is no longer necessary for read/write operations on sockets. Maybe also for files, but I don't really understand the kernel code well enough to confirm or deny that. And things might obviously also have changed for newer kernel versions.
CUDA-streams probably already use io_uring under the hood.
The issue is that a CUDA-stream based execution kernel will spend lots of time pulling from the queue and spinning up threads (sound like a familiar problem?).
Instead: you sidestep this issue with persistent kernels: you launch exactly the number of kernels that matches your GPU-size, and then pass data to those kernels through an alternative means.
-------
The future of I/O is going to be atomic-operations across PCIe with stronger-and-stronger coherency models. PCIe 3.0 introduced atomic-PCIe commands. PCIe 4.0 strengthened them. PCIe 5.0 is rumored to have even stronger coherency rules and "coherent grids" are being seriously discussed / implemented in high-speed computers (see Compute eXpress Link, or CXL)
Any serious I/O (and GPUs will be part of that I/O future), will likely use PCIe atomics in some regards, quite similar to atomic-compare-and-swaps that the CPU does already between their cores.
I/O is becoming indistinguishable from memcpy + atomics.
-------
EDIT: In effect, I'm saying that GPU-programmers are more concerned about GPU-efficiency rather than CPU-efficiency. Sure, the CPU is less efficient spinning on these memory-reads/writes. But the GPU is where the bulk of the calculations are happening. Therefore, it behooves the architect to optimize the GPU (even if the CPU ends up slightly less efficient).
Whatever argument you have about "queues being more efficient in hypothetical architecture", the GPU has more calculations and more cores to feed (aka: far more overhead when it comes to Amdahl's law). That means you want to apply those principles first to the GPU, and then the CPU "makes up for it", so to speak.
I was under the impression that PCI-E was perfectly capable of sending notifications from one device to another in a somewhat efficient manner. Having said that, this is not my area of expertise - and I do see that if your main concern is to feed the GPU then blocking a thread might be the optimal solution. I assume that MSI would be too much overhead and might involve some context switching to service the interrupt from the kernel etc to allow for asynchronous completion? Also, is it possible to have overlapping memory regions between a high speed networking card and the input buffer from the GPU, which in effect just means that the CPU just has to tell the GPU to start reading once the network card is done receiving?
Having said that, I don't believe that for most application developers this is a major concern; in cases where you flood the GPU with a firehose of data to compute on, you probably also don't care about what other processes run on the machine, or whether your architectural decisions end up making people's laps uncomfortably hot. I also do not believe that the future of all I/O is just memcpy and atomics; we can already do that today, and it doesn't really bring you any speed advantages in the general case. I think the future of I/O is memcpy, atomics, and a good signaling mechanism to signal I/O task completion without costly context switches and with as little extraneous memory allocation as possible. Moreover, the future of consumer computing will probably not rely on PCI-E at all and instead have the GPU and the CPU share all of their memory. And hey, maybe Nvidia will add some cool ARM companion cores to their biggest chips, slap some DDR5 slots on their cards, and sell self-contained solutions, sidestepping PCI-E entirely, at least for feeding data from the CPU to the GPU.
Recent datacenter network controllers (Mellanox, Marvell) have 'GPU direct' capabilities, i.e. direct interactions with devices with no CPU involvement. I've also seen FPGA+network boards do that with success. And with libraries like NCCL and 200GbE links, you could almost forget you have CPUs or network links in between.
What I miss is a simple but efficient data queue between CPU and GPU. Everyone's doing manual memory reservation and cudaMemcpy; I want an async send (GPU->CPU) with an MPI- or socket-like interface. I've seen someone posting stuff about io_uring from GPU code, but just bragging, no code.
Having bought Mellanox, and with their BlueField DPU stuff (8 or 16 integrated ARM cores in the NIC), I feel Nvidia could probably go the way you're describing. Haven't seen any Mellanox/Nvidia tech convergence yet.
On windows, no. You can use affinity to prevent the application from running on some of your cores but it will still spin up a ton of threads (which makes the situation even worse sometimes)
What does tail latency for the Zig pool look like? It seems like any Task that ends up in one of the overflow queues will stay there until some ring buffer is emptied.
Put another way, during a period of more push than pop, some tasks may see a very long delay before being worked on?
I'd encourage you to try recording them yourself. The results can vary depending on your system, how many concurrent tasks it can run in parallel, whether scheduling resources are being used elsewhere, etc. The Zig code contains an example of using timers, and the spawning and joining is in the quickSort() function, so it should hopefully be easy to add the timing logic. I can answer questions regarding it if you hop on the IRC or Discord.
In regards to the overflow queue: yes, some pathological tasks may see long delays, but this is true for any mostly-FIFO or sharded queue scenario. Both Golang and tokio overflow from their local buffer into a shared lock-protected linked list (tokio is a bit more eager in this regard), so they can suffer similar fates. They actually do an optimization which is to check the shared queue before the local buffer every few local scheduling ticks (% 61 or 64 for each task run, iirc) to decrease the chance of local starvation. You could try adding that to Zig's thread pool after the timing logic and see if that helps tail latencies. I'm curious about the outcome either way, but I may not have time to work on that.
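Roughly what that fairness tweak looks like (a simplified sketch with illustrative types, not the actual Go/tokio/Zig code):

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

struct Task;

struct Worker {
    tick: u32,
    local: VecDeque<Task>,
    shared: Arc<Mutex<VecDeque<Task>>>,
}

// Every Nth scheduling tick, look at the shared/overflow queue *before* the
// local buffer so overflowed tasks aren't starved indefinitely.
const SHARED_POLL_INTERVAL: u32 = 61;

impl Worker {
    fn next_task(&mut self) -> Option<Task> {
        self.tick = self.tick.wrapping_add(1);
        if self.tick % SHARED_POLL_INTERVAL == 0 {
            if let Some(task) = self.shared.lock().unwrap().pop_front() {
                return Some(task);
            }
        }
        if let Some(task) = self.local.pop_front() {
            return Some(task);
        }
        // Local buffer empty: fall back to the shared queue.
        self.shared.lock().unwrap().pop_front()
    }
}

fn main() {
    let shared = Arc::new(Mutex::new((0..3).map(|_| Task).collect::<VecDeque<Task>>()));
    let mut w = Worker { tick: 0, local: VecDeque::new(), shared };
    while let Some(_task) = w.next_task() {
        // run the task...
    }
}
```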
Yes, as others have said, it's based on the same open source CMS.
I've created zig.news for people who want to write about Zig but who don't necessarily want to maintain their own blog and do the work necessary to publicize it.
This should hopefully help more people write down their experiences and opinions on using Zig, and alleviate the problem of having a "blogger aristocracy" that has a much stronger voice than anyone else.
A pet peeve of mine is calling something that uses atomic operations lockless. It's very common, but once it clicks that atomic operations are just locks at the instruction level, it feels very wrong to call most of these algorithms lockless. There are still the same concerns about avoiding contention when using atomic operations as with regular mutexes. A mutex lock in the uncontended case never enters the kernel and is just an atomic operation... The main strategy, regardless of whether you use locks or atomics directly, is to avoid contention.
The only truly lockless algorithm that I'm aware of (for most CPU archs) is RCU in the read case.
”Lock free” has a very precise meaning in computer science, and it’s this: ”given some number of threads of execution, an algorithm is lock free if in some finite number of steps, at least one thread is guaranteed to make forward progress”. When using mutexes, this is not guaranteed under any amount of contention: the scheduler is perfectly free to preempt a thread holding a lock, which means no threads make progress.
However, if you use a CAS loop (like this code does) to update some value (i.e. a while loop that repeatedly tries to execute a compare-and-swap), that loop will only fail to make progress if some other thread DID make progress. This means that such an algorithm is in fact lock-free: if thread X didn’t progress, it is necessarily because some other thread did. In addition, a scheduler preempting a thread at any point will not stop other threads making progress.
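For reference, the CAS-loop shape being described, as a standalone toy example (not code from the post):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

static COUNTER: AtomicU64 = AtomicU64::new(0);

// Lock-free update: the loop only retries when the compare-and-swap observed
// a different value, i.e. when some other thread already made progress.
fn add_three() {
    let mut current = COUNTER.load(Ordering::Relaxed);
    loop {
        match COUNTER.compare_exchange_weak(
            current,
            current + 3,
            Ordering::AcqRel,
            Ordering::Relaxed,
        ) {
            Ok(_) => break,
            Err(observed) => current = observed, // someone else progressed; retry
        }
    }
}

fn main() {
    let handles: Vec<_> = (0..8)
        .map(|_| {
            thread::spawn(|| {
                for _ in 0..100_000 {
                    add_three();
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(COUNTER.load(Ordering::Relaxed), 8 * 100_000 * 3);
}
```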
Whether an algorithm is lock free or not is not a matter of opinion, it is a mathematical statement. The fact that it annoys you is… weird.
Yes, I'm probably wrong on the semantics of lock-free; I should read up on that. My point was that maybe you can gain a factor of 2 by using atomic ops in a "lockless" fashion compared to mutexes, but it still scales extremely poorly compared to what you could do if the hardware had real async mechanisms that did not rely on cache coherency. The cache line that is subjected to an atomic operation will be evicted from all other CPUs' caches, back and forth.
A simple example is just letting all cores in a multi-core CPU do an atomic add on the same memory address. The scaling will be exactly 1, no matter the number of available cores. I realize this is very off topic w.r.t. this cool article and implementation in Zig. It's just that this problem can't really be solved in an efficient manner with today's hardware.
It's a lock associated with the cache line that the atomic operation operates on. This is because atomic operations are built on top of the cache coherency mechanism: a synchronous, blocking operation at the hardware level used to implement an asynchronous mechanism at the software level.
The big frustration with today's multi-core CPUs is that there's simply no efficient way to communicate using message-passing mechanisms. I think this is something the hardware guys should focus on :-) Provide an async mechanism to communicate between cores that doesn't rely on cache coherency.
The coherency protocol guarantees that a core can own a cache line in exclusive mode for a bounded number of cycles; this guarantees forward progress and is different from an unbounded critical section. It also has nothing to do with the lock prefix and applies to normal non-atomic writes as well.
What the lock prefix does is delay the load associated with the RMW so that it is executed together with the store before the core has a chance to lose the ownership of the line (technically this can also be implemented optimistically with speculation and replays, but you still need a pessimistic fallback to maintain forward progress).
A message-passing feature would be nice, but it would necessarily be non-coherent, which means you can only use it to pass serialised values to other threads, not pointers.
An alternative, more usable solution would be to aggressively speculate around memory barriers, as the memory subsystem is already highly asynchronous.
The effect is the same. If someone touches the cache line, it's evicted from all other caches that have it, triggering a cache miss when other cores touch it. Everyone knows this. I just think it's a bit depressing that if you try to optimize message passing on many-core CPUs, you'll realize you can't make it fast. No one has been able to make it fast (I've checked Intel's message-passing code as well). If you get more than 20M events per second through some kind of shared queue, you are lucky. That is slow compared to how many instructions a CPU can retire.
So all these event-loop frameworks try to implement async behaviour on top of a hardware mechanism that is synchronous at its core. The main issue is that the lowest software construct, the queue used to communicate, is built on this cache-line ping-pong match. What the software wants is still a queue; you could still send pointers if the memory they point to has been committed to memory by the time the receiving core sees them. Just remove the inefficient atomic-operations way of synchronizing the data between CPUs and send it over some kind of network pipe instead :-) As you say, the memory subsystem is really asynchronous.
I'm convinced this is coming; it should just have arrived 10 years ago...
>What the software wants is still a queue; you could still send pointers if the memory they point to has been committed to memory by the time the receiving core sees them
Which means that the sender will need to pessimistically flush anything that the receiver might need to access, which is likely more expensive than optimistic cache coherency.
Non-CC systems were a thing in the past, and still are for things like GPUs and distributed systems, but are much harder to program and not necessarily more efficient.
As the saying goes there are only two hard things in programming...
The former (Inversion of Control) is a general method and too unspecific to answer. I can only hint that one of the biggest caveats of async is brittle composability due to missing cancellation routines. This will eventually be addressed.
Weellll... I guess you could get rid of the OS and switch to a stack-only ABI with a callee saves-everything policy? Then "thread switching" and "multiprocess" both map onto saving/restoring the stack (on ARM you'd have to have SP/LR motion to stack), and process communication would just be over a unified stack.
What'd be cool is if every core (and, thus, affinitized process) had its own virtual address space. That way if any single core went down, its neighbors could stand it back up. Of course, you'd have to have some sort of hideous last-level shared virtual addressing scheme — probably with large last-level pages — to "migrate" a "process" between cores.
EDIT: And security be damned! We need that last .5%!
GPUs are practically the "OS-free" compute device of choice these days.
The way GPUs work in most cases is through work-queues. The CPU submits work to GPUs (either OpenCL queues or CUDA Streams), and the GPU "plucks" work off of the queue to do. There are dozens of SMs (NVidia) or CUs (AMD GCN) or WGPs (AMD RDNA) that pluck the work off concurrently.
The kernels are then run to completion. Upon completion, the stream gets an event that triggers (which often times, automatically enqueues another task). If the CPU has work to do, it can get an async message (or maybe be waiting through blocking behavior).
--------
There are other models of course, not everyone runs the standard methodology. "Persistent kernels" start up early on and infinite-loop. You interact with those by passing data over PCIe and the kernel itself contains a queue/load-balancing logic to pickup the data somehow (rather than letting the device drivers / hardware logic do that)
------
The thing about GPU-level tasks is that you have so many little tasks (ex: shoot a ray, or render pixel (0,0), render pixel(1,0), etc. etc.) that just pulling tasks off of a queue is your most efficient methodology. Lots of small, roughly equal-sized tasks can be load balanced with simple methodologies.
The paper describing my design is "Low-Synchronization, Mostly Lock-Free, Elastic Scheduling for Streaming Runtimes", https://www.scott-a-s.com/files/pldi2017_lf_elastic_scheduli.... The source code for the product implementation is now open source. Most is in https://github.com/IBMStreams/OSStreams/blob/main/src/cpp/SP... and https://github.com/IBMStreams/OSStreams/blob/main/src/cpp/SP....