
Workgroup in Vulkan/WebGPU lingo is equivalent to "thread block" in CUDA speak; see [1] for a decoder ring.
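
For a concrete instance of the mapping (my own rough shorthand; [1] has the fuller table):

    // CUDA term              ->  Vulkan/WebGPU term (roughly)
    //   thread               ->  invocation
    //   warp                 ->  subgroup
    //   thread block         ->  workgroup
    //   kernel launch / grid ->  dispatch
    //   __shared__ memory    ->  workgroup (shared) memory
    __global__ void scale(float* data, int n) {
        // blockIdx ~ workgroup_id, threadIdx ~ local_invocation_id,
        // and the combined index below ~ global_invocation_id in WGSL.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }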

> Using atomics to solve this is rarely a good idea, atomics will make things go slowly, and there is often a way to restructure the problem so that you can let threads read data from a previous dispatch, and break your pipeline into more dispatches if necessary.

This depends on the exact workload, but I disagree. A multiple-dispatch solution to prefix sum requires reading the input at least twice, while decoupled look-back is single-pass: roughly three memory operations per element instead of two. That's a 1.5x difference if you're saturating memory bandwidth, which is a good assumption here.
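
To make the traffic argument concrete, here's a toy sketch of the decoupled look-back idea (Merrill & Garland's single-pass scan), written in CUDA since that's the terser way to show it. One element per tile, one thread per block, all names mine; nothing like a production kernel, but it shows the shape of the thing:

    // Toy single-pass (decoupled look-back) inclusive scan.
    // Each "tile" is one element and one thread block, purely for clarity.
    #include <cstdint>
    #include <cstdio>
    #include <cuda_runtime.h>

    // Status flags packed into the low 2 bits of each tile descriptor.
    #define FLAG_INVALID   0ull  // nothing published yet
    #define FLAG_AGGREGATE 1ull  // tile has published its local sum
    #define FLAG_PREFIX    2ull  // tile has published its full inclusive prefix

    __device__ unsigned int tile_counter = 0;  // dynamic tile assignment

    __global__ void single_pass_scan(const uint32_t* in, uint32_t* out,
                                     unsigned long long* desc)
    {
        // Claim the next tile. Claim order is what guarantees that any tile
        // we look back at has already started running.
        unsigned int tile = atomicAdd(&tile_counter, 1);
        uint32_t aggregate = in[tile];  // "local scan" of this 1-element tile

        // Publish our aggregate. Value and 2-bit flag share one 64-bit word,
        // so a single atomic makes both visible together.
        atomicExch(&desc[tile],
                   ((unsigned long long)aggregate << 2) | FLAG_AGGREGATE);

        // Look back across earlier tiles, summing aggregates until we find
        // one that already knows its full inclusive prefix.
        uint32_t exclusive = 0;
        for (int t = (int)tile - 1; t >= 0;) {
            unsigned long long d = atomicAdd(&desc[t], 0ull);  // atomic load
            unsigned long long flag = d & 3ull;
            if (flag == FLAG_INVALID) continue;  // spin: not published yet
            exclusive += (uint32_t)(d >> 2);
            if (flag == FLAG_PREFIX) break;      // predecessor had the rest
            --t;                                 // keep walking left
        }

        // Publish our inclusive prefix so later tiles can stop their
        // look-back here instead of walking all the way to tile 0.
        atomicExch(&desc[tile],
                   ((unsigned long long)(exclusive + aggregate) << 2) | FLAG_PREFIX);

        out[tile] = exclusive + aggregate;
    }

    int main() {
        const int n = 8;
        uint32_t h_in[n] = {1, 2, 3, 4, 5, 6, 7, 8}, h_out[n];
        uint32_t *d_in, *d_out;
        unsigned long long* d_desc;
        cudaMalloc(&d_in, n * sizeof(uint32_t));
        cudaMalloc(&d_out, n * sizeof(uint32_t));
        cudaMalloc(&d_desc, n * sizeof(unsigned long long));
        cudaMemset(d_desc, 0, n * sizeof(unsigned long long));  // all FLAG_INVALID
        cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
        single_pass_scan<<<n, 1>>>(d_in, d_out, d_desc);
        cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
        for (int i = 0; i < n; ++i) printf("%u ", h_out[i]);  // 1 3 6 10 15 21 28 36
        printf("\n");
        return 0;
    }

The look-back loop is where the atomics (and the spinning) live; in exchange, the input is read exactly once and the output written exactly once, which is where the ~1.5x over a two-dispatch reduce-then-scan comes from.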

The Nanite talk (which I linked) showed a very similar result, for very similar reasons. They have a multi-dispatch approach to their adaptive LOD resolver, and it's about 25% slower than the one that uses atomics to manage the job queue.

Thus, I think we can solidly conclude that atomics are an essential part of the toolkit for GPU compute.

You do make an important distinction between runtime and development environment, and I should fix that, but there's still a point to be made. Most people doing machine learning work need a dev environment (or use Colab), even if they're theoretically just consuming GPU code that other people wrote. And if you do distribute a CUDA binary, it only runs on Nvidia. By contrast, my stuff is a 20-second "cargo build" and you can write your own GPU code with very minimal additional setup.

[1]: https://github.com/googlefonts/compute-shader-101/blob/main/...




> Thus, I think we can solidly conclude that atomics are an essential part of the toolkit for GPU compute.

Complete agreement there! Yes, there are absolutely good use cases for atomics; I just think they shouldn’t be presented as either the best or the only approach. It’s incredibly common for there to be better approaches that avoid atomics.

Important to note that “multiple dispatch” can mean many things, and your comment suggests you’re thinking of serial dispatches in a single stream. If atomics and persistent threads are providing benefits, then it’s possible that multiple parallel dispatches would also see performance improvements over serial dispatches, because parallel dispatches can fill the exact same gap between dispatches that persistent threads are filling.
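
To be concrete about what I mean by parallel dispatches, here's the CUDA-stream flavor of it (the Vulkan analogue would be dispatches with no barriers between them, or separate queues); kernel names and sizes are made up:

    // Two independent kernels submitted to separate streams; the scheduler
    // is free to overlap them, filling the gap that back-to-back serial
    // launches in a single stream would leave.
    #include <cuda_runtime.h>

    __global__ void phase_a(float* x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    __global__ void phase_b(float* y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] += 1.0f;
    }

    int main() {
        const int n = 1 << 20;
        float *x, *y;
        cudaMalloc(&x, n * sizeof(float));
        cudaMalloc(&y, n * sizeof(float));
        cudaMemset(x, 0, n * sizeof(float));
        cudaMemset(y, 0, n * sizeof(float));

        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);

        // No dependency between these two, so they may run concurrently.
        phase_a<<<(n + 255) / 256, 256, 0, s1>>>(x, n);
        phase_b<<<(n + 255) / 256, 256, 0, s2>>>(y, n);

        cudaDeviceSynchronize();
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
        cudaFree(x);
        cudaFree(y);
        return 0;
    }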

> Most people doing machine learning need a dev environment

Correct, but your 20-second cargo build was preceded by an install of the dev environment, right? I can’t ‘cargo build’ in 20 seconds right now, because I don’t have the dev environment. On the other hand, I can build and run a CUDA app in 20 seconds. I don’t yet see this as a fair comparison.


Vulkan can't reliably do parallel dispatches, certainly not with any kind of scheduling fairness guarantee. CUDA has cooperative groups, which is a huge advantage.
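
As a concrete example of what cooperative groups buy you: a cooperative launch guarantees the whole grid is resident, so one kernel can synchronize grid-wide and behave like several dependent dispatches without a round trip to the host. A rough sketch (hypothetical kernel names; real code should check cudaDevAttrCooperativeLaunch and occupancy first, and typically needs nvcc -rdc=true):

    // Grid-wide barrier via cooperative groups: phase 2 can safely read
    // what phase 1 wrote from other blocks, inside a single kernel launch.
    #include <cooperative_groups.h>
    #include <cuda_runtime.h>
    namespace cg = cooperative_groups;

    __global__ void two_phase(float* a, float* b, int n) {
        cg::grid_group grid = cg::this_grid();
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        // Phase 1: every thread fills its slot of a.
        if (i < n) a[i] = (float)i;

        grid.sync();  // all blocks arrive here before any proceed

        // Phase 2: now safe to read values written by other blocks.
        if (i < n) b[i] = a[i] + (i > 0 ? a[i - 1] : 0.0f);
    }

    int main() {
        int n = 1 << 13;  // keep the grid small enough to be co-resident
        float *a, *b;
        cudaMalloc(&a, n * sizeof(float));
        cudaMalloc(&b, n * sizeof(float));
        void* args[] = { &a, &b, &n };
        // 32 blocks of 256 threads, launched cooperatively.
        cudaLaunchCooperativeKernel((void*)two_phase, dim3((n + 255) / 256),
                                    dim3(256), args, 0, 0);
        cudaDeviceSynchronize();
        cudaFree(a);
        cudaFree(b);
        return 0;
    }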

Okay, I see your point about dev environments. It's like cameras: the best dev toolchain is the one you already have installed on your machine. I'll fix this, but I want to think about the best way to say it. I still believe there's a case to be made that CUDA is a heavyweight dependency.


Thanks for listening, Raph! It’s a good post; I’m picking nits. CUDA is a heavyweight dependency, and I don’t have any problem with that. It’s just that most dev environments are heavyweight dependencies, so it’s mostly about what we’re comparing CUDA to. The driver is the runtime dependency, and it’s something to consider, but CUDA is pretty good about backward and forward compatibility. It’s true that CUDA code only runs on NV hardware, and I hope some of the good things CUDA has will make it to WebGPU & Vulkan. It’s not super common to build CPU code that only runs on Intel.



