I designed and implemented a mostly lock-free dynamic thread scheduler for streaming runtimes, and I learned some similar lessons: avoid global data, and amortize the synchronization you can't avoid. One of the main peculiarities of a streaming context is that work-stealing is counter-productive; it's more like cutting in on work that would have been done anyway. It's better to go find a part of the streaming graph that is not currently being executed than to steal some from another thread.
Running the benchmarks locally (zig is fastest, but that's probably expected)
zig (~1 week older than HEAD):
./zig-out/bin/qsort
filling
shuffling
running
took 177.46ms
rust nightly:
cargo run --release --bin qsort
warning: the cargo feature `edition2021` has been stabilized in the 1.56 release and is no longer necessary to be listed in the manifest
See https://doc.rust-lang.org/nightly/cargo/reference/manifest.html#the-edition-field for more information about using this feature.
Finished release [optimized] target(s) in 0.02s
Running `target/release/qsort`
filling
shuffling
running
took 896.91656ms
cargo run --release --bin rsort
warning: the cargo feature `edition2021` has been stabilized in the 1.56 release and is no longer necessary to be listed in the manifest
See https://doc.rust-lang.org/nightly/cargo/reference/manifest.html#the-edition-field for more information about using this feature.
Finished release [optimized] target(s) in 0.02s
Running `target/release/rsort`
filling
shuffling
running
took 212.190694ms
go 1.16.3:
go run qsort.go
filling
shuffling
running
took 222.90356ms
on macOS 11.5.2 with a 2.4 GHz 8-Core Intel Core i9
The fact that the Rust tokio version (the one that uses tokio tasks instead of threads) is slow is to be expected. tokio tasks aren't appropriate for running quicksort; they will have overhead compared to a regular threadpool because of the I/O reactor, waker, and other machinery that will never get used.
rayon or crossbeam are more appropriate since they are actually designed for general-purpose thread work. Using rayon's scopes will also get rid of the `unsafe` that's being (ab)used to create the fake `&'static mut` slices.
Though for some reason, the rayon version (rayon_qsort.rs) uses scopes, but still uses `unsafe` to create `&'static mut` slices... ( https://github.com/kprotty/zap/pull/3 )
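To make that concrete, here is a minimal sketch (my own, not the repo's code; the partition helper and cutoff are illustrative) of a recursive parallel quicksort where rayon's scoped spawns borrow both halves of the caller's slice, so no `unsafe` or fake `&'static mut` is needed:

```rust
// Hypothetical sketch; assumes the `rayon` crate.
fn par_quicksort(slice: &mut [i64]) {
    if slice.len() <= 32 {
        slice.sort_unstable(); // small-size fallback
        return;
    }
    let p = partition(slice);
    let (lo, rest) = slice.split_at_mut(p);
    let hi = &mut rest[1..]; // skip the pivot itself
    // The scope guarantees both tasks finish before `slice`'s borrow ends,
    // which is why no 'static lifetime (and no unsafe) is required.
    rayon::scope(|s| {
        s.spawn(|_| par_quicksort(lo));
        s.spawn(|_| par_quicksort(hi));
    });
}

// Simple Lomuto partition; uses the middle element as the pivot and returns
// the pivot's final index.
fn partition(slice: &mut [i64]) -> usize {
    let mid = slice.len() / 2;
    slice.swap(mid, slice.len() - 1);
    let pivot = slice[slice.len() - 1];
    let mut store = 0;
    for i in 0..slice.len() - 1 {
        if slice[i] < pivot {
            slice.swap(i, store);
            store += 1;
        }
    }
    slice.swap(store, slice.len() - 1);
    store
}

fn main() {
    let mut v: Vec<i64> = (0..1_000_000).rev().collect();
    par_quicksort(&mut v);
    assert!(v.windows(2).all(|w| w[0] <= w[1]));
}
```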
I don't think tokio's slowness here is "to be expected". There isn't much reason for tokio tasks to have that much overhead over normal thread pools. The I/O driver shouldn't be called in such a benchmark given there's no I/O work happening. Wakers only add a reference count inc/dec + an atomic CAS for wake(), which should only happen on the JoinHandle `awaits` [4], compared to just an atomic swap for the Zig case on the join handle.
Golang doesn't poll for I/O under such cases [0], and tokio should be using the `ParkThread` parker for putting threads to sleep [1] given the `net` features aren't enabled (not sure if this is actually the case), which you can force with a custom Runtime initialization instead of `#[tokio::main]` as an exercise.
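As a rough illustration of that exercise (my own sketch, not code from the post): you can build the multi-threaded runtime explicitly and simply not enable the I/O or time drivers, instead of relying on `#[tokio::main]`. Whether that ends up selecting the `ParkThread` parker is an internal detail of the tokio version in use.

```rust
// Sketch; assumes the `tokio` crate with the "rt-multi-thread" feature.
fn main() {
    let rt = tokio::runtime::Builder::new_multi_thread()
        // Deliberately not calling .enable_io() / .enable_time(), so no
        // reactor or timer driver is set up for this runtime.
        .worker_threads(8)
        .build()
        .expect("failed to build runtime");

    rt.block_on(async {
        let handle = tokio::spawn(async { 21 * 2 });
        println!("answer = {}", handle.await.unwrap());
    });
}
```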
`crossbeam-deque` requires heap allocation for the run queues, heap allocates on growth, and garbage collects said memory. This is an overhead I wished to avoid, and it's something tokio has been making improvements to avoid as well [2].
`rayon` isn't a good comparison here given `rayon::join` is optimized to hook directly into the scheduler and run the caller only until the other forked-section completes [3]. This isn't general purpose, and it takes advantage of unbounded stack allocation which can technically cause a stack overflow. Zig could do this and also take advantage of batch scheduling, but it complicates the code and is unfair here given `async` usage. Tokio, Golang, and the Zig benchmarks all require heap allocation on spawn, so I believe that makes for a fairer comparison. This is also why I used rayon scopes instead of join(): less specialization, and it reintroduces the heap allocation from the unbounded concurrency.
The `unsafe` there is from me copying the benchmark code from the tokio version to the rayon version and forgetting to remove the hack. In tokio, ownership of the array needed to be passed into the function given the lifetime was no longer linear from the spawn(), I assume (correct me if I'm wrong here, but this is what the compile error hinted at). So I needed to recreate the array after the function, hence the unsafe. If there's a better way for the tokio version, please send a PR. I see you've done so for the rayon version and I gladly merged it.
>I don't think tokio's slowness here is "to be expected".
And yet on my machine the rayon version takes ~160ms and the tokio version takes ~1350ms. This isn't at the level of some minor missed performance optimization.
>There isn't much reason for tokio tasks to have that much overhead over normal thread pools.
tokio is an async runtime. tokio tasks are meant to be for distributing I/O-bound work. It would be at least a little more correct to use spawn_blocking for CPU-bound tasks, but that still doesn't work for your recursive calls because that's not what it's meant for.
In general, if you have CPU-bound work in your tokio-using application, you run it on a different threadpool - tokio's blocking one, or a completely different one.
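For example (a sketch of the general pattern, not the benchmark code; assumes the `tokio` crate with full features plus `rayon`), CPU-bound work can be pushed onto tokio's blocking pool, or bridged to a dedicated compute pool and awaited through a channel:

```rust
#[tokio::main]
async fn main() {
    // Option 1: tokio's blocking pool for a one-off CPU-heavy chunk of work.
    let sum = tokio::task::spawn_blocking(|| (0u64..10_000_000).sum::<u64>())
        .await
        .unwrap();

    // Option 2: a separate compute pool (rayon here), bridged back to async
    // code through a oneshot channel.
    let (tx, rx) = tokio::sync::oneshot::channel();
    rayon::spawn(move || {
        let product: u64 = (1u64..=20).product();
        let _ = tx.send(product);
    });
    let product = rx.await.unwrap();

    println!("sum = {}, 20! = {}", sum, product);
}
```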
>`rayon` isn't a good comparison here given `rayon::join` is optimized to hook directly into the scheduler and run the caller only until the other forked-section completes [3]. [...] This is also why I used rayon scopes instead of join(): less specialization, and it reintroduces the heap allocation from the unbounded concurrency.
My original comment was also talking about scopes, not `rayon::join`. So yes, `rayon` is absolutely a good comparison.
This actually can be at the level of a missed optimization. A single lock-protected run queue shared among all the threads scales even worse than the tokio version. Sharding the run queues and changing the notification algorithm, even while keeping locks on the sharded queues, improves throughput drastically.
Tokio is an async runtime, but I don't see why being an async runtime should make it worse from a throughput perspective as a thread pool. I actually started on a Rust version [0] to test the theory of whether async Rust was the culprit, but realized that I was being nerd-sniped [1] at this point and should continue my Zig work instead. If you're still interested and want to see that in action, I'm open to receiving PRs and questions on it.
It's still correct to benchmark and compare tokio here given the scheduler I was designing was meant to be used with async tasks: a bunch of concurrent and small-executing work units. I mention this in the second paragraph of "Why Build Your Own?".
The thread pool in the post is meant to be used to distribute I/O-bound work. A friend of mine hooked up cross-platform I/O abstractions to the thread pool [2] and benchmarked it against tokio, finding greater throughput and slightly worse tail latency under a local load [3]. The thread pool serves its purpose, and the quicksort benchmark is there to show how schedulers behave under relatively concurrent workloads. I could've used a benchmark with smaller tasks than the CPU-bound partition()/insertion_sort(), but this worked as a common example.
I've already mentioned why rayon isn't a good comparison: 1. It doesn't support async root concurrency. 2. scoped() waits for tasks to complete by either blocking the OS thread or using similar inline-scheduler-loop optimizations. This risks stack overflow and isn't available as a use case in other async runtimes due to primarily being a fork-join optimization.
I'm not an expert on the go scheduler, but my perception is that it is more of a focused, single-purpose component, whereas tokio seems like a sprawling swiss-army-knife of a library if you browse the source.
The tokio scheduler and the go scheduler are roughly equivalent. Much of tokio's bulk is reimplementing much of the std lib in an async-compatible way (io, os, blocking).
If you browse the source, the go scheduler has complexities to deal with that tokio doesn't, as well. The thread pool is unified between worker threads & blocking threads. Go also does goroutine preemption via signaling/SuspendThread + a background monitor thread called sysmon. Go does garbage collection, and the tracing/race-detection/semantics are tightly coupled to both its allocator and its scheduler. Go also exposes and maintains its entire standard library, which includes an http client/server (the tokio org maintains their own as hyper, but it's separate from tokio the runtime). It can be fair to argue that Go is just as "swiss-army-knife" of a system as tokio.
It depends on what you want to do. If you are doing io-bound work, Tokio would be what you want -- you would use it as a runtime for the async capabilities in Rust. If you have cpu-bound work, then rayon is what you want to use. Rayon is a work-stealing parallelism crate -- it will schedule work to be done, and different threads will schedule portions of it as they become available to do work. It's very easy to get 100% CPU utilization across all cores using Rayon if your work is naturally parallelizable, and the interface is dead simple: anywhere you do a .iter(), you can turn it into a .par_iter(), and Rayon will parallelize the work.
Note there is some overhead to using Rayon -- you normally will be better off doing your work on a single thread unless you have a large number of elements in your collection... I found for my application I needed more than 1e6 elements before I saw an appreciable performance benefit to using Rayon.
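A tiny example of that `.iter()` -> `.par_iter()` swap (illustrative numbers; assumes the `rayon` crate):

```rust
use rayon::prelude::*;

fn main() {
    let data: Vec<u64> = (0..2_000_000).collect();

    // Sequential.
    let seq: u64 = data.iter().map(|x| x * x).sum();

    // Parallel: the only change is iter() -> par_iter().
    let par: u64 = data.par_iter().map(|x| x * x).sum();

    assert_eq!(seq, par);
}
```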
As others said, Crossbeam is for sending and receiving messages across threads. I use it alongside tokio and rayon.
Crossbeam is a library that provides various concurrency primitives, such as a queue type that allows stealing of work.
Rayon is a library for running code in parallel. Typically you'll give it a (parallel) iterator of sorts, and it will distribute the work across a pool of threads.
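A small sketch of that message-passing side (assumes the `crossbeam-channel` crate; the fan-out shape is just for illustration):

```rust
use crossbeam_channel::unbounded;
use std::thread;

fn main() {
    let (tx, rx) = unbounded::<u64>();

    // A few worker threads all receive from the same channel.
    let workers: Vec<_> = (0..4)
        .map(|_| {
            let rx = rx.clone();
            thread::spawn(move || rx.iter().map(|n| n * n).sum::<u64>())
        })
        .collect();

    for n in 0..1_000u64 {
        tx.send(n).unwrap();
    }
    drop(tx); // closing the sending side lets the workers drain and exit

    let total: u64 = workers.into_iter().map(|h| h.join().unwrap()).sum();
    assert_eq!(total, (0..1_000u64).map(|n| n * n).sum::<u64>());
}
```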
Erm, could somebody explain to me why I shouldn't read this as an argument in favor of Go, given that it's about as fast as the fully optimized versions of Zig and Rust?
Go's value proposition is that it has good bang-for-the-buck. It's fairly easy to write in. There's easier, but it's fairly easy. It performs fairly well. There are things that perform better, but it's fairly good. It scales fairly well. There's things that scale better, but it's pretty good.
If you draw all this out on the programming landscape, I think one of the reasons Go has succeeded as well as it did is that this was a poorly covered part of the landscape. There were languages that were much easier, but performed much worse, and languages that performed much better, but generally required a lot more developer effort.
I don't expect Go to ever top the charts on performance on a task like this, but it's often hanging around at only a bit slower, and it does that across a wide variety of tasks and on a wide variety of dimensions. It's not the best at much of anything, but it's pretty darned good for pretty cheap on a lot of measures.
Well, working with many tasks is THE thing with Go, right? If it were much worse at this, Go would fail.
In situations like this, it's important to remember that Rust/Zig are (systems) languages you could use to make "Go", but also a million other different things with different goals.
So another way to look at this is: "If I want to design a language with a runtime like Go, these two can get that close!"
I'm not sure how this implies it is flawed. It benchmarks thread pools, so on a system which allows more parallel concurrent tasks (i.e. a 32-thread AMD CPU) the throughput is expected to scale somewhat, and you see this in your results.
Also, `zig -Dc` links to libc (`-Dc`) but builds in debug mode by default, similar to Rust and C/C++ compilers. The readme contains instructions to compile it with optimizations (`-Drelease-fast`), the rust versions do so as well (`--release`) so you should re-evaluate your results.
See one of my other comments in this post on why I didn't use rayon::join() by default.
> Also, `zig -Dc` links to libc (`-Dc`) but builds in debug mode by default, similar to Rust and C/C++ compilers
Right! This was a misunderstanding of mine. Updated results (times in ms):
8-thread Intel laptop machine:
zig 0.9.0-dev.958: 230
rust 1.5.0-nightly 21-09-12/qsort: 545
rust 1.5.0-nightly 21-09-12/rsort: 255
go 1.16.8: 355
32-thread AMD workstation:
zig 0.9.0-dev.958: 125
rust 1.5.0-nightly 21-09-12/qsort: 780
rust 1.5.0-nightly 21-09-12/rsort: 135
go 1.16.8: 135
This is why I think conclusions should not be drawn: there's excessive variability between systems.
On the parent poster's system, zig is faster by a much larger margin than the two systems I've tried. And there's a ~45% difference in Go's performance compared to the fastest runner for the respective system.
The results will vary on different systems given how the combination of the CPU and OS handle thread synchronization + scheduling. On one of my desktops running Windows 10 with an i7-4790k, the Go qsort runs slower (~600ms) than the Rust qsort (~490ms) while on Linux systems the results are generally the opposite.
The zig thread pool appears consistently faster for this benchmark on different systems, and rayon::join appears consistently faster than the zig thread pool on different systems too. I believe you can somewhat conclude the advantages of their designs from this relative to each other rather than in an absolute sense.
At first glance, this looks great, almost 20% speedup!
But genuine question, this looks to be how almost every new tech begins. Initially it's FAST because it keeps things as barebones as possible, and everything else other than the specific module can be ignored.
But no feature works in isolation, so as more features get added – scheduling constraints, safety quirks, or random features that cut across the language – the 20% looks like nothing.
I did go through the blog, but how can one be sure that this 20% gap is a guaranteed design advantage, and not just a temporary deferral of everything else?
fwiw Zig is taking its role as a low-level, bare-bones "c replacement" as a core design constraint. For instance, Zig has eschewed even fairly tame features like operator overloading because they're not considered in line with Zig's identity as a minimal, explicit language.
So feature creep is always a concern, but from what I understand Zig has a better chance at avoiding this than other projects due to the language's guiding philosophy.
> how can one be sure that this 20% gap is a guaranteed design advantage
It's no guarantee, but in my experience, Zig is a language that lends itself very well to composability. I imagine that instead of adding features to stdlib, anyone who needs "extra features" in their threading system will find it quite easy to implement their own as a library outside of stdlib that can easily be swapped in and out in A/B tests by the library consumer. If some feature or another is REALLY wanted in the stdlib for some reason, it would probably be easy to drop it behind a comptime flag as part of a type constructor function, which won't compromise (runtime) performance, just a teeny tiny hit to comptime performance.
> this looks to be how almost every new tech begins
On second thought, do you think this is true? The level of detail shown in this post is fairly extraordinary for event loops just getting started.
> how can one be sure that this 20% gap is a guaranteed design advantage
I would say because the design is solid and recognizes principles of mechanical sympathy that are fundamental and unlikely to change for quite some time.
Nobody can predict the future, but one thing that might help is that Protty is in the Zig core dev team and, as the post shows, he really cares about efficiency.
it seems like the design philosophy of zig is such that it's not going for feature creep/bloat, it's keeping the language very simple and eschewing unexpected complexity
Does this already use `target-cpu=native`? Not sure if that would be apples-to-apples with the Zig implementation (which is what's important here), but I'd be surprised if it didn't yield some interesting wins.
Rust is such a bloated, insane design-by-committee language. It's actually still surprising to me that corporations are defending it and using it. I sure won't.
It's also just so dang ugly to read. If Perl and C++ had a mutant offspring, I think it'd be Rust. And that's just not respectful to Perl!
I'm really happy that Zig is doing so well. I'm planning on contributing a little chunk of money to the ZSF whenever they get a 1.0 landed. I do think there are some lessons to learn from Rust about formal methods as foundational building blocks in a language. I'm eager to see Zig mature more around the memory-safety side of things.
I'm increasingly skeptical that any language with lifetimes can be anything other than "ugly to read". People complain about syntax, but what they're really complaining about is semantics.
Was? New junk is added to Rust every few weeks, using past tense seems intentionally deceitful, as does re-characterizing my comment. I really do mean ugly as in ugly. That's why I wrote that.
One of the strongest factors behind a lot of people choosing to stay away from Rust is that so many conversations with advocates are met with toxic and subtly toxic behavior. They demean software that's not memory safe the way that politicians use their words to sow anger. C has won, and Rust blew its shot aiming at C++ instead.
As Amazon swallows Rust and spits all the refuse out, I can't help but smile, because it's a community that's been begging for obsolescence via their attitudes and behaviors since day one.
It's not just the function's body, it's any block. Blocks in rust are expressions and return their last statement's value. It's nothing like Javascript's ASI where it's just inserting ; implicitly and only `return` can return a value.
It's also nothing like Javascript because Rust is statically typed. If you accidentally insert a semi-colon (or not), you're almost guaranteed to get a compile error.
To be very pedantic, the Vyukov MPSC queue, while indeed awesome in its simplicity and performance, is not technically lock-free, not even obstruction-free (and Vyukov never claimed it as such [1]): a producer halting between the atomic swap and the atomic store will prevent further nodes enqueued by other producers from ever being dequeued by the consumer.
[1] edit: and in fact there is a note exactly on this on the linked Vyukov page.
I'm aware that it's technically considered blocking due to a producer being preempted between swap and store causing the consumer to see an invalid link and report empty. I address this in the run queue section by noting that it's OK if empty is spuriously reported since the thread waking/waiting mechanism takes care of rescheduling such a case.
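To make the window in question concrete, here's a sketch of the producer side of a Vyukov-style intrusive MPSC queue (illustrative names and layout, not the post's code; the consumer side is omitted):

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

struct Node {
    next: AtomicPtr<Node>,
    value: u64,
}

struct Queue {
    // Producers swap themselves onto this end. In the full algorithm, `head`
    // initially points at a stub node and the consumer walks `next` links
    // from its own `tail` pointer.
    head: AtomicPtr<Node>,
}

impl Queue {
    fn push(&self, node: *mut Node) {
        unsafe { (*node).next.store(ptr::null_mut(), Ordering::Relaxed) };
        // Step 1: atomically claim the head position.
        let prev = self.head.swap(node, Ordering::AcqRel);
        // If this producer is preempted *here*, `prev` has no successor yet:
        // the consumer walking the `next` links hits a null link and has to
        // report "empty", even though later pushes by other producers have
        // already landed behind `node`. That's the blocking window.
        // Step 2: publish the link so the consumer can reach `node`.
        unsafe { (*prev).next.store(node, Ordering::Release) };
    }
}

fn main() {
    // Initialize with a stub node so the first push's `prev` is valid.
    let stub = Box::into_raw(Box::new(Node {
        next: AtomicPtr::new(ptr::null_mut()),
        value: 0,
    }));
    let q = Queue { head: AtomicPtr::new(stub) };
    let n = Box::into_raw(Box::new(Node {
        next: AtomicPtr::new(ptr::null_mut()),
        value: 42,
    }));
    q.push(n);
    // (Leaks the two nodes; a real queue's consumer/Drop would reclaim them.)
}
```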
Incredible post. Ticks all the boxes and exciting to see this coming to Zig. Protty has got to be a domain specialist when it comes to atomics and threads.
Is this using the JVM/JIT, or using Graal to make a native-image binary and calling that? If the measurement includes startup time, you'd get better results with a native image, since it shaves that off.
I think .NET 6 latest preview would slightly beat out Java here as well.
This isn't exactly apples-to-apples since Zig/Rust/Go are considered "systems languages", but the JVM has GraalVM for compiling to native binaries/libraries and even importing/exporting C functions and structs. And .NET of course has .NET Native + "Dotnet Native Exports", including the LLVM experiment where it uses LLVM bitcode instead of RyuJIT.
So you can make the argument that writing a native binary, or a library which exported this benchmark function as a C-callable method with a C header, in each language would be technically equivalent.
The JVM and .NET ones would have a large size (several MB each) but would otherwise still fill this requirement.
It's plain Java (i.e. JVM/JIT), ForkJoinTask based implementation. As in the original implementation, measurement is done around the quickSort() call.
One point is actually that the parallel quicksort algorithm is a bad benchmark for task schedulers (it doesn't scale well, for one thing). Another point is, well, that one can spend two years of deep technical work and then be easily beaten by some "legacy" tech in the course of a morning exercise. Maybe those bearded guys were good for something after all :)
First of all, you, along with a few others, misunderstood the goal of the scheduler. I note in the post that it's primarily for async execution. See a previous comment of mine on how a fork-join-optimized thread pool, which hooks into the scheduler to wait for the other forked side and run the poll-loop inline, is ideal for a fork-join case, and why I'm intentionally not benchmarking that. Given rayon::join actually does this, I'd be curious to see the results of that vs Java's ForkJoinPool on your machine to see if the optimizations match up.
Second, parallel quicksort isn't a bad benchmark, and it does scale enough to stress-test the spawning and joining aspects of the scheduler. Keep in mind, the workload that scales best AFAIK is one that is embarrassingly parallel and takes enough time to offset the cost of any scheduler overhead and contention. Again, this thread pool is optimized to execute small tasks. To that end, there are indeed better benchmarks, but quicksort with a small-size optimization is one that is most widely understood.
Finally, you're in the game of trying to invalidate others' work due to novelty, lack of understanding on your part, and generalizations about culture. I'm here to learn about cool scheduler designs. I'd appreciate it if you would contribute in that aspect instead of the former.
This is most likely using HotSpot as I don’t believe Graal has released anything past Java 11.
I don’t know if native-image would perform better. I’ve mostly found that it performs worse than HotSpot overall, especially once you start generating garbage and the heap gets larger, since the Serial GC won’t keep up with G1.
It's always great to see a robust new threadpool. Most of the stock ones I've used have horrible characteristics - for example on my Threadripper lots of production apps will spawn 48 or more threadpool threads and the scheduling/throughput characteristics on that are generally pretty dire because they needlessly compete for resources, etc. They tend to allocate garbage for every work item too, which makes them a bad choice for realtime scenarios. It's genuinely frustrating that most processes on my machine have 128 threads or more due to unintelligently scaling off ProcessorCount without an upper limit. For example, GIMP's bad threading policy means that operations like resizing an image take seconds instead of 100ms or less.
I've been using a custom threadpool for a while now but writing your own without the expertise necessary can be a real pain.
I don't think that it's the threadpool's fault that an application uses it incorrectly. Also, I think there are a lot of developers who have not considered that on today's machines, just spawning as many threads as there are cores is the optimal amount of threads in a thread pool for every use case. I wouldn't say it's a case of bad software but rather software that was written for the CPUs of 5 years ago. And generally, I don't think this will be an easy problem to solve due to the variety of heterogeneous and non-heterogeneous topologies of modern SoCs. In fact, I don't see this specific threadpool doing anything to optimize for the disparate core clusters of your threadripper, or to acknowledge the disparity between core clusters on the M1.
> I don't see this specific threadpool doing anything to optimize for the disparate core clusters of your threadripper, or to acknowledge the disparity between core clusters on the M1.
To do that efficiently, you need to pin thread groups to cores based on information about data usage.
This smells like over-optimizing for specific architectures to me, if you want to go beyond separating out stuff like hyperthreads for I/O.
Additional annoyance: there is no POSIX way to distinguish hyperthreads from physical cores.
I think a general-purpose threadpool should work well on general-purpose hardware, and it seems like the most popular SoCs in consumer devices will have heterogeneous cores et al., so a good implementation would schedule the threadpool appropriately. I agree that there is no POSIX way to distinguish between hyperthreads and regular threads, and this is something that should be improved. I'm not saying that the achievements made by the threadpool implementation are lackluster, or that any of the other solutions solve the issues I outline any better. What I am saying is that the comment I was originally referring to was somewhat mistaken about the benefits of a more optimal, yet naive, threadpool library.
This isn't just about hyperthreads, by the way. As long as the workload isn't compute-heavy and often stalls on memory, hyperthreads are just as good as regular threads. On a hardware level, there is no distinction between a regular core and a hyperthread core: either you multiplex a single physical core or you don't. Anyway, there is more to it than slower threads and faster threads. Accessing memory between threads will be slower depending on which core is trying to access which bits of memory; a core stealing work from a sibling core on the same chiplet will probably be able to do that quicker than stealing work from a core across the cluster, if the data prefetcher has been doing its job correctly. Spawning more threads than necessary might force a CPU to power up more cores than necessary, resulting in slower per-core performance and worse power efficiency, especially if a fast or slow cluster needs to be turned on, where a more optimal scheduling of threads might not force that to happen.
I think a general purpose thread pool by default should no longer spawn as many threads as there are _cores_, whatever that term even means, with optional toggles to inform whether the work that'll be scheduled will be compute heavy or not.
> just spawning as many threads as there are cores is the optimal amount of threads in a thread pool for every use case
Absolutely not.
Any task with any amount of I/O will have a significant amount of blocking. A GPU kernel may take microseconds or milliseconds to respond. RDMA (a memory-access over Ethernet) may take many microseconds.
Having multiple threads per core would be more efficient: it gives the cores something to do while waiting for SSD, RDMA, or GPUs. Remember: even the earliest single-core systems from the 1970s had multiple threads on one core; it's just more efficient to have multiple terminals to read/write to at a time.
--------
One software thread per hardware thread (since SMT8 machines exist, like POWER9 / POWER10) is only efficient in the most computationally expensive situations, which are in fact a rarity in today's world. Your typical programs will be waiting on the network interface or SSD.
IIRC: there's professional thread-pools out there that are 1.5x threads per hardware thread as a default option, and then scale up/down depending on how computationally expensive things look. That is: a 64-core/128-thread Threadripper would be overloaded with 192 threads, under the assumption that at least 33% of them would be waiting on I/O at any given time.
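A back-of-the-envelope version of that sizing rule (the 1.5x factor is just the heuristic mentioned above, not anything prescribed by a particular library):

```rust
use std::thread;

fn main() {
    let hw_threads = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);

    // Oversubscribe by ~1.5x on the assumption that roughly a third of the
    // workers are blocked on I/O (SSD, RDMA, GPU completions) at any time.
    let pool_size = hw_threads + hw_threads / 2;

    println!("{hw_threads} hardware threads -> {pool_size} pool threads");
}
```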
>Any task with any amount of I/O will have a significant amount of blocking. A GPU kernel may take microseconds or milliseconds to respond. RDMA (a memory-access over Ethernet) may take many microseconds.
The argument I failed to make was that with a heterogeneous distribution of memory bandwidth and compute resources, most user applications would benefit from spawning fewer threads than all available cores. In I/O heavy workloads, the correct thing to do is to do asynchronous I/O. This can be done for SSDs and GPUs. On contemporary systems where there are heavy costs associated with context switching, avoiding them and servicing multiple tasks without blocking will always be superior to spawning more threads to do more blocking.
When it comes to hyperthreading, I assume that the threads are cores - because from the OS perspective, they are, and you cannot distinguish two hyperthreads running on a single core anyway.
Also, I apologize, but the first sentence you're quoting is not what I intended to write. My point is that most application developers might still think that spawning as many threads as there are cores is a reasonable thing to do in all cases, but with CPUs that consist of 4 core clusters with 16 cores each, it's often better to spawn far fewer than the total number of cores available.
> In I/O heavy workloads, the correct thing to do is to do asynchronous I/O
You can't async mmap into memory reads (or the GPU-equivalent: cudaMallocManaged).
Today's I/O is looking more-and-more like a memory read/write. As such, your typical "node = node->next" pointer traversals could very well be an incredible string of I/O. That indirection could be in RAM, over SSD (mmap), over RDMA (ethernet pretending to be RAM), or into GPU-RAM (cudaMallocManaged)... with an appropriate PCIe command (possibly atomic-PCIe) to boot.
Async only works on the simplest of reading/writing patterns.
EDIT: Consider the "persistent kernel" pattern on GPGPU (which starts off with a lot of the same thought process you have in CPU-land. Launch only X wavefronts, where X is the size of the GPU). You primarily communicate with a persistent kernel over RAM / atomics. You don't use the classic cuda <<<blocks, threads>>>() (which admittedly has an async interface)... Instead: you read/write to magic managed-memory locations that will be eventually sync'd over PCIe to the GPU. This is because the "persistent kernel" is always executing. You launched it upon program start; there's no event to async-trigger on. You just pass data to the GPU. The GPU operates on the data at its leisure. Then it eventually returns data to a buffer elsewhere (probably with atomic compare-and-swaps, which traverse PCIe these days, to ensure coherence).
I recently spent some time reading the kernel code for io_uring, though only for older revisions (5.4 - 5.7). There I found out that a lot of it is actually implemented on top of existing kernel functions for blocking I/O and polled I/O, and does not replace them to make the system fully asynchronous. Practically that means that with a 5.4 kernel doing lots of disk I/O, you will still have lots of threads being blocked on I/O; only in this case those will be threads inside a kernel threadpool instead of in userspace. With 5.7 that model changed, and the threadpool is no longer necessary for read/write operations on sockets. Maybe also for files, but I don't really understand the kernel code well enough to confirm or deny that. And things might obviously also have changed for newer kernel versions.
CUDA-streams probably already use io_uring under the hood.
The issue is that a CUDA-stream based execution kernel will spend lots of time pulling from the queue and spinning up threads (sound like a familiar problem?).
Instead: you sidestep this issue with persistent kernels: you launch exactly the number of kernels that matches your GPU-size, and then pass data to those kernels through an alternative means.
-------
The future of I/O is going to be atomic-operations across PCIe with stronger-and-stronger coherency models. PCIe 3.0 introduced atomic-PCIe commands. PCIe 4.0 strengthened them. PCIe 5.0 is rumored to have even stronger coherency rules and "coherent grids" are being seriously discussed / implemented in high-speed computers (see Compute eXpress Link, or CXL)
Any serious I/O (and GPUs will be part of that I/O future), will likely use PCIe atomics in some regards, quite similar to atomic-compare-and-swaps that the CPU does already between their cores.
I/O is becoming indistinguishable from memcpy + atomics.
-------
EDIT: In effect, I'm saying that GPU-programmers are more concerned about GPU-efficiency rather than CPU-efficiency. Sure, the CPU is less efficient spinning on these memory-reads/writes. But the GPU is where the bulk of the calculations are happening. Therefore, it behooves the architect to optimize the GPU (even if the CPU ends up slightly less efficient).
Whatever argument you have about "queues being more efficient in hypothetical architecture", the GPU has more calculations and more cores to feed (aka: far more overhead when it comes to Amdahl's law). That means you want to apply those principles first to the GPU, and then the CPU "makes up for it", so to speak.
I was under the impression that PCI-E was perfectly capable of sending notifications from one device to another in a somewhat efficient manner. Having said that, this is not my area of expertise - and I do see that if your main concern is to feed the GPU then blocking a thread might be the optimal solution. I assume that MSI would be too much overhead and might involve some context switching to service the interrupt from the kernel etc to allow for asynchronous completion? Also, is it possible to have overlapping memory regions between a high speed networking card and the input buffer from the GPU, which in effect just means that the CPU just has to tell the GPU to start reading once the network card is done receiving?
Having said that, I don't believe that for most application developers this is a major concern; in cases where you flood the GPU with a firehose of data to compute on, you probably also don't care about what other processes run on the machine, or whether your architectural decisions end up making people's laps uncomfortably hot. I also do not believe that the future of all I/O is just memcpy and atomics; we can already do that today, and it doesn't really bring you any speed advantages in the general case. I think the future of I/O is memcpy, atomics, and a good signaling mechanism to signal I/O task completion without costly context switches and with as little extraneous memory allocation as possible. Moreover, the future of consumer computing will probably not rely on PCI-E at all and instead have the GPU and the CPU share all of their memory. And hey, maybe Nvidia will add some cool ARM companion cores to their biggest chips, slap some DDR5 slots on their cards, and sell self-contained solutions, sidestepping PCI-E entirely, at least for feeding data from the CPU to the GPU.
Recent datacenter network controllers (Mellanox, Marvell) have 'GPU direct' capabilities, i.e. direct interactions with devices with no CPU involvement. I've also seen FPGA+network boards do that with success. And with libraries like NCCL and 200GbE links, you could almost forget you have CPUs or network links in between.
What I miss is a simple but efficient data queue between CPU and GPU. Everyone's doing manual memory reservation and cudaMemcpy; I want an async send (GPU->CPU) with an MPI- or socket-like interface. I've seen someone posting stuff about io_uring from GPU code, but just bragging, no code.
Having bought Mellanox, and with their BlueField DPU stuff (8 or 16 integrated ARM cores in the NIC), I feel Nvidia could probably go the way you're describing. Haven't seen any Mellanox/Nvidia tech convergence yet.
On windows, no. You can use affinity to prevent the application from running on some of your cores but it will still spin up a ton of threads (which makes the situation even worse sometimes)
What does tail latency for the Zig pool look like? It seems like any Task that ends up in one of the overflow queues will stay there until some ring buffer is emptied.
Put another way, during a period of more push than pop, some tasks may see a very long delay before being worked on?
I'd encourage you to try recording them yourself. The results can vary depending on your system, how many concurrent tasks it can run in parallel, whether scheduling resources are being used elsewhere, etc. The Zig code contains an example of using timers, and the spawning and joining is in the quickSort() function, so it should hopefully be easy to add the timing logic. I can answer questions regarding it if you hop on the IRC or Discord.
In regards to the overflow queue: yes, some pathological tasks may see long delays, but this is true for any mostly-FIFO or sharded queue scenario. Both Golang and tokio overflow from their local buffer into a shared lock-protected linked list (tokio is a bit more eager in this regard), so they can suffer similar fates. They actually do an optimization which is to check the shared queue before the local buffer every few local scheduling ticks (% 61 or 64 for each task run, iirc) to decrease the chance of local starvation. You could try adding that to Zig's thread pool after the timing logic and see if that helps tail latencies. I'm curious about the outcome either way, but I may not have time to work on that.
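Roughly what that fairness tweak looks like (a simplified sketch with illustrative types, not the actual Go/tokio/Zig code):

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

struct Task;

struct Worker {
    tick: u32,
    local: VecDeque<Task>,
    shared: Arc<Mutex<VecDeque<Task>>>,
}

// Every Nth scheduling tick, look at the shared/overflow queue *before* the
// local buffer so overflowed tasks aren't starved indefinitely.
const SHARED_POLL_INTERVAL: u32 = 61;

impl Worker {
    fn next_task(&mut self) -> Option<Task> {
        self.tick = self.tick.wrapping_add(1);
        if self.tick % SHARED_POLL_INTERVAL == 0 {
            if let Some(task) = self.shared.lock().unwrap().pop_front() {
                return Some(task);
            }
        }
        if let Some(task) = self.local.pop_front() {
            return Some(task);
        }
        // Local buffer empty: fall back to the shared queue.
        self.shared.lock().unwrap().pop_front()
    }
}

fn main() {
    let shared = Arc::new(Mutex::new((0..3).map(|_| Task).collect::<VecDeque<Task>>()));
    let mut w = Worker { tick: 0, local: VecDeque::new(), shared };
    while let Some(_task) = w.next_task() {
        // run the task...
    }
}
```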
Yes, as others have said, it's based on the same open source CMS.
I've created zig.news for people who want to write about Zig but who don't necessarily want to maintain their own blog and do the work necessary to publicize it.
This should hopefully help more people write down their experiences and opinions on using Zig, and alleviate the problem of having a "blogger aristocracy" that has a much stronger voice than anyone else.
A pet peeve of mine is calling something that uses atomic operations lockless. It's very common, but once it clicks that atomic operations are just locks at the instruction level, it feels very wrong to call most of these algorithms lockless. There are still the same concerns about avoiding contention when using atomic operations as with regular mutexes. A mutex lock in the uncontended case never enters the kernel and is just an atomic operation... The main strategy, regardless of whether you use locks or atomics directly, is to avoid contention.
The only truly lockless algorithm that I'm aware of (for most CPU archs) is RCU in the read case.
”Lock free” has a very precise meaning in computer science, and it’s this: ”given some number of threads of execution, an algorithm is lock free if in some finite number of steps, at least one thread is guaranteed to make forward progress”. When using mutexes, this is not guaranteed under any amount of contention: the scheduler is perfectly free to preempt a thread holding a lock, which means no threads make progress.
However, if you use a CAS loop (like this code does) to update some value (i.e. a while loop that repeatedly tries to execute a compare-and-swap), that loop will only fail to make progress if some other thread DID make progress. This means that such an algorithm is in fact lock-free: if thread X didn’t progress, it is necessarily because some other thread did. In addition, a scheduler preempting a thread at any point will not stop other threads making progress.
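For reference, the CAS-loop shape being described, as a standalone toy example (not code from the post):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

static COUNTER: AtomicU64 = AtomicU64::new(0);

// Lock-free update: the loop only retries when the compare-and-swap observed
// a different value, i.e. when some other thread already made progress.
fn add_three() {
    let mut current = COUNTER.load(Ordering::Relaxed);
    loop {
        match COUNTER.compare_exchange_weak(
            current,
            current + 3,
            Ordering::AcqRel,
            Ordering::Relaxed,
        ) {
            Ok(_) => break,
            Err(observed) => current = observed, // someone else progressed; retry
        }
    }
}

fn main() {
    let handles: Vec<_> = (0..8)
        .map(|_| {
            thread::spawn(|| {
                for _ in 0..100_000 {
                    add_three();
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(COUNTER.load(Ordering::Relaxed), 8 * 100_000 * 3);
}
```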
Whether an algorithm is lock free or not is not a matter of opinion, it is a mathematical statement. The fact that it annoys you is… weird.
Yes, I'm probably wrong on the semantics of lock-free; I should read up on that. My point was that maybe you can gain a factor of 2 by using atomic ops in a "lockless" fashion compared to mutexes, but it still scales extremely poorly compared to what you could do if the hardware had real async mechanisms that did not rely on cache coherency. The cache line that is subjected to an atomic operation will be evicted from all other CPUs' caches, back and forth.
A simple example is just letting all cores in a multi-core CPU do an atomic add on the same memory address. The scaling will be exactly 1, no matter the number of available cores. I realize this is very off topic w.r.t. this cool article and implementation in Zig. It's just that this problem can't really be solved in an efficient manner with today's hardware.
It's a lock associated with the cache line that the atomic operation operates on. This is because atomic operations are built on top of the cache coherency mechanism: a synchronous, blocking operation at the hardware level used to implement an asynchronous mechanism at the software level.
The big frustration with today's multi-core CPUs is that there's simply no efficient way to communicate using message-passing mechanisms. I think this is something the hardware guys should focus on :-) Provide an async mechanism to communicate between cores that doesn't rely on cache coherency.
The coherency protocol guarantees that a core can own a cache line in exclusive mode for a bounded number of cycles; this guarantees forward progress and is different from an unbounded critical section. It also has nothing to do with the lock prefix and applies to normal non-atomic writes as well.
What the lock prefix does is delay the load associated with the RMW so that it is executed together with the store before the core has a chance to lose the ownership of the line (technically this can also be implemented optimistically with speculation and replays, but you still need a pessimistic fallback to maintain forward progress).
A message-passing feature would be nice, but it would necessarily be non-coherent, which means you can only use it to pass serialised values to other threads, not pointers.
An alternative, more usable solution would be to aggressively speculate around memory barriers, as the memory subsystem is already highly asynchronous.
The effect is the same. If someone touches the cache line, it's evicted from all other caches that have it, triggering a cache miss when other cores touch it. Everyone knows this. I just think it's a bit depressing that if you try to optimize message passing on many-core CPUs, you'll realize you can't make it fast. No one has been able to make it fast (I've checked Intel's message-passing code as well). If you get more than 20M events per second through some kind of shared queue, you are lucky. That is slow compared to how many instructions a CPU can retire.
So all these event-loop frameworks try to implement async behaviour on top of a hardware mechanism that is synchronous at its core. The main issue is that the lowest software construct, the queue used to communicate, is built on this cache-line ping-pong match. What the software wants is still a queue; you could still send pointers if the memory they point to has been committed to memory by the time the receiving core sees them. Just remove the inefficient atomic-operations way of synchronizing the data between CPUs and send it over some kind of network pipe instead :-) As you say, the memory subsystem is really asynchronous.
I'm convinced this is coming; it should just have arrived 10 years ago...
>What the software wants is still a queue; you could still send pointers if the memory they point to has been committed to memory by the time the receiving core sees them
Which means that the sender will need to pessimistically flush anything that the receiver might need to access, which is likely more expensive than optimistic cache coherency.
Non-CC systems were a thing in the past, and still are for things like GPUs and distributed systems, but are much harder to program and not necessarily more efficient.
As the saying goes there are only two hard things in programming...
The former (Inversion of Control) is a general method and too unspecific to answer. I can only hint that one of the biggest caveats of async is brittle composability due to missing cancellation routines. This will eventually be addressed.
Weellll... I guess you could get rid of the OS and switch to a stack-only ABI with a callee saves-everything policy? Then "thread switching" and "multiprocess" both map onto saving/restoring the stack (on ARM you'd have to have SP/LR motion to stack), and process communication would just be over a unified stack.
What'd be cool is if every core (and, thus, affinitized process) had its own virtual address space. That way if any single core went down, its neighbors could stand it back up. Of course, you'd have to have some sort of hideous last-level shared virtual addressing scheme — probably with large last-level pages — to "migrate" a "process" between cores.
EDIT: And security be damned! We need that last .5%!
GPUs are practically the "OS-free" compute device of choice these days.
The way GPUs work in most cases is through work-queues. The CPU submits work to GPUs (either OpenCL queues or CUDA Streams), and the GPU "plucks" work off of the queue to do. There are dozens of SMs (NVidia) or CUs (AMD GCN) or WGPs (AMD RDNA) that pluck the work off concurrently.
The kernels are then run to completion. Upon completion, the stream gets an event that triggers (which often times, automatically enqueues another task). If the CPU has work to do, it can get an async message (or maybe be waiting through blocking behavior).
--------
There are other models of course, not everyone runs the standard methodology. "Persistent kernels" start up early on and infinite-loop. You interact with those by passing data over PCIe and the kernel itself contains a queue/load-balancing logic to pickup the data somehow (rather than letting the device drivers / hardware logic do that)
------
The thing about GPU-level tasks is that you have so many little tasks (ex: shoot a ray, or render pixel (0,0), render pixel(1,0), etc. etc.) that just pulling tasks off of a queue is your most efficient methodology. Lots of small, roughly equal-sized tasks can be load balanced with simple methodologies.
The paper describing my design is "Low-Synchronization, Mostly Lock-Free, Elastic Scheduling for Streaming Runtimes", https://www.scott-a-s.com/files/pldi2017_lf_elastic_scheduli.... The source code for the product implementation is now open source. Most is in https://github.com/IBMStreams/OSStreams/blob/main/src/cpp/SP... and https://github.com/IBMStreams/OSStreams/blob/main/src/cpp/SP....