I've been playing around with some low-level GPU stuff lately and trying to port some of my CPU-bound matrix multiplication code. The challenge has been that in the space I work in (quantum), the matrices are of complex numbers, and for a certain problem size floats don't cut it and you need doubles. Nearly everything I find for GPUs is targeted at matrices of real floats.
Any pointers to examples? I'd be fine sticking with floats as a first step, but would love to see some (reasonably optimized) low-level GPU code for working with matrices of complex numbers (preferably for Metal or WGPU, which is what I'm using).
In BLAS terminology this is usually called CGEMM (for single precision) or ZGEMM (for double precision). Both cuBLAS and rocBLAS support ZGEMM, and the latter is open source. rocBLAS is pretty complicated, though, and perhaps not such a good learning resource. Here is a more readable library that implements CGEMM, or at least a similar operation: https://git.astron.nl/RD/recruit/ccglib.
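To make the core of it concrete: shading languages generally have no built-in complex type, so these kernels store matrices as explicit (re, im) pairs and expand each complex multiply-add into real FMAs. A minimal sketch of the inner loop in Rust (my own illustration, not code from ccglib):

```rust
// Complex numbers stored as explicit (re, im) pairs, the way a
// C/ZGEMM kernel lays them out when the language has no complex type.
#[derive(Clone, Copy)]
struct C64 {
    re: f64,
    im: f64,
}

// acc += a * b, expanded into real arithmetic:
// (a.re + i*a.im)(b.re + i*b.im)
//   = (a.re*b.re - a.im*b.im) + i*(a.re*b.im + a.im*b.re)
fn cfma(acc: C64, a: C64, b: C64) -> C64 {
    C64 {
        re: acc.re + a.re * b.re - a.im * b.im,
        im: acc.im + a.re * b.im + a.im * b.re,
    }
}

// Innermost loop of a naive ZGEMM tile: one output element is the
// dot product of a row of A with a column of B, four real
// multiply-adds per step.
fn zgemm_inner(a_row: &[C64], b_col: &[C64]) -> C64 {
    a_row
        .iter()
        .zip(b_col)
        .fold(C64 { re: 0.0, im: 0.0 }, |acc, (&a, &b)| cfma(acc, a, b))
}
```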
The main issue is that double precision isn't very interesting for AI and graphics, so silicon is spent on those features rather than on double precision. Not so for HPC, though, and GPUs specialized for it usually have much better fp64 throughput. For example, the AMD MI210 has the same performance in single- and double-precision (matrix) operations, while graphics GPUs either run fp64 at something like 1/2, 1/4, 1/16, etc. of the fp16 rate, or don't support it at all.
For Rust GPU, there's nothing built in, but there are libraries like https://github.com/rust-num/num-complex that support `no_std` and should work on the GPU. I've never used them, so I don't know what the perf hit (if any) would be.
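For what it's worth, the core type is simple enough that any perf hit would mostly come down to whether the compiler keeps the struct in registers. A minimal sketch of the usage (assuming num-complex 0.4 with default features off for the `no_std` build; I haven't tried it under rust-gpu either):

```rust
// Cargo.toml (assumed):
// num-complex = { version = "0.4", default-features = false }
// default-features = false is what gives you the `no_std` build.

use num_complex::Complex;

// Complex<f32> is just a plain { re, im } struct with operator
// impls, so a dot product like this should lower to the same real
// multiplies and adds you'd write by hand.
fn cdot(a: &[Complex<f32>], b: &[Complex<f32>]) -> Complex<f32> {
    a.iter()
        .zip(b.iter())
        .fold(Complex::new(0.0, 0.0), |acc, (x, y)| acc + x * y)
}
```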
I'm always confused about how shader languages that target a bunch of other framework-specific shader languages manage to map all the available features correctly.
Like tensor core support, non-uniform thread groups, different data type support, a bunch of SIMD group functions, stuff like that…
I couldn't find info on this for Rust GPU, so I'm assuming it just tries to target the narrowest available feature set that's compatible across all shading languages?
The goal is to handle things similarly to how Rust on the CPU does: shared traits with different implementations, intrinsics, platform-specific modules, user-replaceable traits like alloc or the panic handler, and `asm` as a last-ditch case.
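Concretely, the pattern being described looks something like this; an illustrative sketch only, not actual rust-gpu code (though `target_arch = "spirv"` is the cfg rust-gpu compiles under):

```rust
// Sketch of the pattern: a shared trait, a platform-gated GPU
// implementation, and a portable fallback, selected at compile time
// the same way core::arch intrinsics are on the CPU.

trait FastMulAdd {
    fn mul_add_fast(self, a: Self, b: Self) -> Self;
}

// GPU build: rust-gpu compiles under target_arch = "spirv". A real
// implementation would call a fused-multiply-add intrinsic here;
// the plain expression stands in for it.
#[cfg(target_arch = "spirv")]
impl FastMulAdd for f32 {
    fn mul_add_fast(self, a: Self, b: Self) -> Self {
        self * a + b
    }
}

// CPU fallback: the standard library's fused multiply-add.
#[cfg(not(target_arch = "spirv"))]
impl FastMulAdd for f32 {
    fn mul_add_fast(self, a: Self, b: Self) -> Self {
        self.mul_add(a, b)
    }
}
```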
I'm not familiar with GPUs specifically, but I have seen this for ORMs that support multiple SQL dialects (e.g. [0]).
A great technique for this is called 'tagless final encoding' [1]. With it, you specify the capabilities of an embedded domain-specific language (eDSL) so that there is a shared (but narrow) common set of features, while specializations of the eDSL can support extra features.
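Roughly, in Rust terms; a hypothetical shader eDSL, invented here just to sketch the shape of the technique:

```rust
// Tagless-final style: the core language is a trait, and optional
// capabilities are extra traits that only some backends implement.

// Core capability every backend must provide.
trait CoreExpr {
    type Repr;
    fn lit(&mut self, x: f32) -> Self::Repr;
    fn add(&mut self, a: Self::Repr, b: Self::Repr) -> Self::Repr;
    fn mul(&mut self, a: Self::Repr, b: Self::Repr) -> Self::Repr;
}

// Optional capability: fused multiply-add. Backends without it
// simply don't implement this trait.
trait FmaExpr: CoreExpr {
    fn fma(&mut self, a: Self::Repr, b: Self::Repr, c: Self::Repr) -> Self::Repr;
}

// Programs are written against the traits, so this compiles for any
// backend...
fn axpy<B: CoreExpr>(b: &mut B, a: f32, x: B::Repr, y: B::Repr) -> B::Repr {
    let a = b.lit(a);
    let ax = b.mul(a, x);
    b.add(ax, y)
}

// ...while this only type-checks for backends that have FMA.
fn axpy_fused<B: FmaExpr>(b: &mut B, a: f32, x: B::Repr, y: B::Repr) -> B::Repr {
    let a = b.lit(a);
    b.fma(a, x, y)
}

// A toy backend that just pretty-prints the expression tree.
struct Printer;
impl CoreExpr for Printer {
    type Repr = String;
    fn lit(&mut self, x: f32) -> String { x.to_string() }
    fn add(&mut self, a: String, b: String) -> String { format!("({a} + {b})") }
    fn mul(&mut self, a: String, b: String) -> String { format!("({a} * {b})") }
}
```

A backend that doesn't implement `FmaExpr` (like `Printer` above) simply can't be passed to `axpy_fused`, so a missing feature is a compile error rather than a runtime check.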
The GPU programming language tech landscape is generally pretty low-tech on the compilers side.
Tensor cores are not supported.
(Culturally, of course, the big one is the fragmentation and the proprietary nature of everything, which is why so little gets done on GPUs and why attempting multiplatform software there is such a horror.)
I don't know, do share if there's something you can link.
At first blush it just sounds like something to allow multiple shader compute elements to work more efficiently together ("cooperate") on a single bigger matrix computation.
That is what it is afaict, but NVIDIA says this in their 2019 post "Machine Learning Acceleration in Vulkan with Cooperative Matrices"[0]:
> Additionally, if the GPU includes dedicated hardware for high-speed matrix operations, such as the Tensor Cores on Turing GPUs, then the Cooperative Matrix extension can tap into the power of this acceleration with no application changes.
The benchmark graph doesn't look too great, though: around half the "theoretical peak tensor core performance".