I've been playing around with some low-level GPU stuff lately and trying to port some of my CPU-bound matrix multiplication code. The challenge has been that in the space I work in (quantum), the matrices are of complex numbers, and for a certain problem size floats don't cut it and you need doubles. Nearly everything I find for GPUs is targeted at matrices of real floats.
Any pointers to examples? I'd be fine sticking with floats as a first step, but would love to see some (reasonably optimized) low-level GPU code for working with matrices of complex numbers (preferably for Metal or WGPU, which is what I'm using).
In BLAS terminology this is usually called CGEMM (for single precision) or ZGEMM (for double precision). Both cuBLAS and rocBLAS support ZGEMM, and the latter is open source. rocBLAS is pretty complicated, though, and perhaps not such a good learning resource. Here is a more readable library that implements CGEMM, or at least a similar operation: https://git.astron.nl/RD/recruit/ccglib.
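To make the core of it concrete: shading languages generally have no built-in complex type, so these kernels store matrices as explicit (re, im) pairs and expand each complex multiply-add into real FMAs. A minimal sketch of the inner loop in Rust (my own illustration, not code from ccglib):

```rust
// Complex numbers stored as explicit (re, im) pairs, the way a
// C/ZGEMM kernel lays them out when the language has no complex type.
#[derive(Clone, Copy)]
struct C64 {
    re: f64,
    im: f64,
}

// acc += a * b, expanded into real arithmetic:
// (a.re + i*a.im)(b.re + i*b.im)
//   = (a.re*b.re - a.im*b.im) + i*(a.re*b.im + a.im*b.re)
fn cfma(acc: C64, a: C64, b: C64) -> C64 {
    C64 {
        re: acc.re + a.re * b.re - a.im * b.im,
        im: acc.im + a.re * b.im + a.im * b.re,
    }
}

// Innermost loop of a naive ZGEMM tile: one output element is the
// dot product of a row of A with a column of B, four real
// multiply-adds per step.
fn zgemm_inner(a_row: &[C64], b_col: &[C64]) -> C64 {
    a_row
        .iter()
        .zip(b_col)
        .fold(C64 { re: 0.0, im: 0.0 }, |acc, (&a, &b)| cfma(acc, a, b))
}
```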
The main issue is that double precision isn't very interesting for AI and graphics, so silicon is spent on those features rather than on double precision. Not so for HPC, though, and GPUs specialized for it usually have much better fp64 throughput. For example, the AMD MI210 has the same performance in single- and double-precision (matrix) operations, while graphics GPUs either run fp64 at something like 1/2, 1/4, 1/16, etc. of the fp16 rate, or don't support it at all.
For Rust GPU, there's nothing built in, but there are libraries like https://github.com/rust-num/num-complex that support `no_std` and should work on the GPU. I've never used them, so I don't know what the perf hit (if any) would be.
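For what it's worth, the core type is simple enough that any perf hit would mostly come down to whether the compiler keeps the struct in registers. A minimal sketch of the usage (assuming num-complex 0.4 with default features off for the `no_std` build; I haven't tried it under rust-gpu either):

```rust
// Cargo.toml (assumed):
// num-complex = { version = "0.4", default-features = false }
// default-features = false is what gives you the `no_std` build.

use num_complex::Complex;

// Complex<f32> is just a plain { re, im } struct with operator
// impls, so a dot product like this should lower to the same real
// multiplies and adds you'd write by hand.
fn cdot(a: &[Complex<f32>], b: &[Complex<f32>]) -> Complex<f32> {
    a.iter()
        .zip(b.iter())
        .fold(Complex::new(0.0, 0.0), |acc, (x, y)| acc + x * y)
}
```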
I'm always confused about how shader languages that target a bunch of other framework-specific shader languages manage to map all the available features correctly.
Like tensor core support, non-uniform thread groups, different data type support, a bunch of SIMD group functions, stuff like that…
I couldn't find info on this for Rust GPU, so I'm assuming it just tries to target the narrowest available feature set that's compatible across all shading languages?
The goal is to handle things similarly to how Rust on the CPU does: shared traits with different implementations, intrinsics, platform-specific modules, user-replaceable traits like alloc or the panic handler, and `asm` as a last-ditch case.
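Concretely, the pattern being described looks something like this; an illustrative sketch only, not actual rust-gpu code (though `target_arch = "spirv"` is the cfg rust-gpu compiles under):

```rust
// Sketch of the pattern: a shared trait, a platform-gated GPU
// implementation, and a portable fallback, selected at compile time
// the same way core::arch intrinsics are on the CPU.

trait FastMulAdd {
    fn mul_add_fast(self, a: Self, b: Self) -> Self;
}

// GPU build: rust-gpu compiles under target_arch = "spirv". A real
// implementation would call a fused-multiply-add intrinsic here;
// the plain expression stands in for it.
#[cfg(target_arch = "spirv")]
impl FastMulAdd for f32 {
    fn mul_add_fast(self, a: Self, b: Self) -> Self {
        self * a + b
    }
}

// CPU fallback: the standard library's fused multiply-add.
#[cfg(not(target_arch = "spirv"))]
impl FastMulAdd for f32 {
    fn mul_add_fast(self, a: Self, b: Self) -> Self {
        self.mul_add(a, b)
    }
}
```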
I'm not familiar with GPUs specifically, but I have seen this for ORMs that support multiple SQL dialects (e.g. [0]).
A great technique for this is called 'tagless final encoding' [1]. With it, you specify the capabilities of an embedded domain-specific language (eDSL) so that there is a shared (but narrow) common set of features, while specializations of the eDSL can support extra features.
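Roughly, in Rust terms; a hypothetical shader eDSL, invented here just to sketch the shape of the technique:

```rust
// Tagless-final style: the core language is a trait, and optional
// capabilities are extra traits that only some backends implement.

// Core capability every backend must provide.
trait CoreExpr {
    type Repr;
    fn lit(&mut self, x: f32) -> Self::Repr;
    fn add(&mut self, a: Self::Repr, b: Self::Repr) -> Self::Repr;
    fn mul(&mut self, a: Self::Repr, b: Self::Repr) -> Self::Repr;
}

// Optional capability: fused multiply-add. Backends without it
// simply don't implement this trait.
trait FmaExpr: CoreExpr {
    fn fma(&mut self, a: Self::Repr, b: Self::Repr, c: Self::Repr) -> Self::Repr;
}

// Programs are written against the traits, so this compiles for any
// backend...
fn axpy<B: CoreExpr>(b: &mut B, a: f32, x: B::Repr, y: B::Repr) -> B::Repr {
    let a = b.lit(a);
    let ax = b.mul(a, x);
    b.add(ax, y)
}

// ...while this only type-checks for backends that have FMA.
fn axpy_fused<B: FmaExpr>(b: &mut B, a: f32, x: B::Repr, y: B::Repr) -> B::Repr {
    let a = b.lit(a);
    b.fma(a, x, y)
}

// A toy backend that just pretty-prints the expression tree.
struct Printer;
impl CoreExpr for Printer {
    type Repr = String;
    fn lit(&mut self, x: f32) -> String { x.to_string() }
    fn add(&mut self, a: String, b: String) -> String { format!("({a} + {b})") }
    fn mul(&mut self, a: String, b: String) -> String { format!("({a} * {b})") }
}
```

A backend that doesn't implement `FmaExpr` (like `Printer` above) simply can't be passed to `axpy_fused`, so a missing feature is a compile error rather than a runtime check.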
The GPU programming language tech landscape is generally pretty low-tech on the compilers side.
Tensor cores are not supported.
(Culturally, of course, the big one is the fragmentation and the proprietary nature of everything, which is why so little gets done on GPUs and why attempting multiplatform software there is such a horror.)
I don't know, do share if there's something you can link.
At first blush it just sounds like something to allow multiple shader compute elements to work more efficiently together ("cooperate") on a single bigger matrix computation.
That is what it is afaict, but NVIDIA says this in their 2019 post "Machine Learning Acceleration in Vulkan with Cooperative Matrices"[0]:
> Additionally, if the GPU includes dedicated hardware for high-speed matrix operations, such as the Tensor Cores on Turing GPUs, then the Cooperative Matrix extension can tap into the power of this acceleration with no application changes.
The benchmark graph doesn't look too great, though: around half the "theoretical peak tensor core performance".