It must be a misnomer on PyTorch's side. Clearly it's neither CUDA nor OpenCL.
AMD should just get its shit together. This is ridiculous. Not the name, but the fact that you can only do FP64 on a GPU. Everybody is moving to FP16 and AMD is stuck on doubles?
Might very well be true. I don't blame anyone for not diving deeper into figuring out why this stuff doesn't work.
But this is one of the great strengths of CUDA: I can develop a kernel on my workstation, my boss can demo it on his laptop, and we can deploy it on Jetsons or the multi-GPU cluster with minimal changes, and I can be sure that everything runs everywhere.
There is indeed something excellent about CUDA from a user perspective that is hard to beat. I do high-level DNN work and it is not clear to me what it is or why that is. Anytime I have worked on optimizing for mobile hardware (not Jetson, but actual phones or accelerators), it is just a world of hurt and incompatibilities. This notion that operators or subgraphs can be accelerated by lower-level closed blobs... I wonder if that is part of the issue. But then why doesn't OpenCL just work? I thought it gave a CUDA-kernel-like general-purpose abstraction.
I just don't know the details well enough to understand why things are problematic without CUDA :(
FP64 is what HPC is built on. FP32 works on the cards too (at the same rate or faster). I don't know the status of FP16 or FP8.
Some architectures provide fast FP16->FP32 and FP32->FP16 conversion instructions, so you can DIY the memory-bandwidth savings - that always seemed reasonable to me, but I don't know whether the AMD hardware people will go down that path.
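Roughly the idea, sketched at the PyTorch level rather than with the actual conversion instructions (nothing here is AMD- or CUDA-specific, just an illustration): keep tensors in FP16 so only half the bytes move through memory, and upcast to FP32 right before the arithmetic.

    import torch

    # Sketch of "FP16 storage, FP32 math": the tensors live in half precision
    # to cut memory traffic, and only get upcast for the arithmetic itself.
    x = torch.randn(4096, 4096, dtype=torch.float16)
    w = torch.randn(4096, 4096, dtype=torch.float16)

    # Upcast for the matmul, then store the result back as FP16.
    y = (x.float() @ w.float()).half()

Whether that actually wins anything depends on how cheap the conversions are, which is exactly where fast FP16<->FP32 instructions come in.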
Sure, but Radeon cards are not HPC accelerators. A modest 7800 XT for example, which would be a great card for SD, has 76 TFLOPS @ FP16, 37 TFLOPS @ FP32 and 1.16 TFLOPS @ FP64 - i.e. FP64 runs at roughly 1/32 of the FP32 rate, while FP16 is about twice as fast as FP32.
Keeping all those FPUs busy is another problem, and not an easy one, but in the cases where it can be done, FP32 is clearly desirable.
More importantly, if you specify FP16, yet the hardware only supports FP32, then the library should emit a warning but work anyway, doing transparent casts behind your back as necessary.
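Something like this hypothetical wrapper (not an actual PyTorch API, just a sketch of the behaviour I mean):

    import warnings
    import torch

    def matmul_prefer_fp16(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # Hypothetical helper: try the op in FP16 first; if the backend can't
        # do it, warn and transparently redo it in FP32, casting the result
        # back so the caller still gets an FP16 tensor.
        try:
            return a.half() @ b.half()
        except RuntimeError:
            warnings.warn("FP16 matmul not supported on this backend; computing in FP32 instead")
            return (a.float() @ b.float()).half()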