
It must be a misnomer on PyTorch's side. Clearly it's neither CUDA nor OpenCL.

AMD should just get its shit together. This is ridiculous. Not the name, but the fact that you can only do FP64 on a GPU. Everybody is moving to FP16 and AMD is stuck on doubles?




I believe the FP64 limitation came from the laptop-grade GPU I had rather than being inherent to AMD or ROCm.

The API level I could target was at least two or three versions behind the latest they have to offer.


Might very well be true. I don't blame anyone for not diving deeper into figuring out why this stuff doesn't work.

But this is one of the great strengths of CUDA: I can develop a kernel on my workstation, my boss can demo it on his laptop, and we can deploy it on Jetsons or the multi-GPU cluster with minimal changes, and I can be sure that everything runs everywhere.


There is indeed something excellent about CUDA from a user perspective that is hard to beat. I do high-level DNN work and it is not clear to me what it is or why that is. Any time I have worked on optimizing for mobile hardware (not Jetson, but actual phones or accelerators), it is just a world of hurt and incompatibilities. This notion that operators or subgraphs can be accelerated by lower-level closed blobs... I wonder if that is part of the issue. But then why doesn't OpenCL just work? I thought it gave a CUDA-kernel-like general-purpose abstraction.

I just don't understand the details enough to understand why things are problematic without CUDA :(


Sorry, still trying to install some dependencies for DNN and CUDA, not sure why it says my Clang version is too new (!)


FP64 is what HPC is built on. F32 works on the cards too (same rate or faster). I don't know the status of F16 or F8.

Some architectures provide fast F16->F32 and F32->F16 conversion instructions so you can DIY the memory bandwidth saving - that always seemed reasonable to me, but I don't know if the AMD hardware people are/will go down that path.
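
To make the DIY idea concrete, here is a rough sketch in plain Python rather than GPU code: values are stored as 2-byte halves and widened only when arithmetic happens. The helper names (pack_f16, unpack_f16) are made up for illustration, and Python floats (doubles) stand in for the F32 registers a real kernel would use.

```python
import struct

def pack_f16(values):
    """Store floats as IEEE 754 half precision: 2 bytes per value."""
    return struct.pack(f"<{len(values)}e", *values)

def unpack_f16(buf):
    """Widen stored halves back to Python floats before doing arithmetic."""
    return list(struct.unpack(f"<{len(buf) // 2}e", buf))

values = [0.5, 1.0, 1.5, 1000.0]

# Storage side: the half-precision buffer is half the size of the F32 one.
stored_f16 = pack_f16(values)
stored_f32 = struct.pack(f"<{len(values)}f", *values)

# Compute side: widen first, then accumulate in the wider type.
widened = unpack_f16(stored_f16)
total = sum(widened)

print(len(stored_f16), len(stored_f32), total)
```

The catch is precision: only values representable in half precision survive the round trip exactly, so something like 0.1 comes back slightly off.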


Sure but Radeon cards are not HPC accelerators. A modest 7800XT for example, which would be a great card for SD, has 76 TFlops@FP16, 37TF@FP32 and 1.16TF@FP64.

Keeping all those FPUs busy is another problem and not easy, but in cases where it can be done FP32 is clearly desirable.


More importantly, if you specify FP16, yet the hardware only supports FP32, then the library should emit a warning but work anyway, doing transparent casts behind your back as necessary.
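
A minimal sketch of that "warn but work anyway" behavior, assuming a made-up capability table and a hypothetical resolve_dtype helper (no real library's API is implied; an actual implementation would query the driver):

```python
import warnings

# Hypothetical capability table: this pretend device lacks FP16.
HARDWARE_DTYPES = {"fp32", "fp64"}

# Wider types to try, in order, when the requested one is missing.
FALLBACKS = {"fp16": ["fp32", "fp64"], "fp32": ["fp64"]}

def resolve_dtype(requested, supported=HARDWARE_DTYPES):
    """Return the dtype to compute in, warning if it differs from the request."""
    if requested in supported:
        return requested
    for candidate in FALLBACKS.get(requested, []):
        if candidate in supported:
            warnings.warn(
                f"{requested} is not supported on this device; "
                f"casting to {candidate} instead",
                RuntimeWarning,
                stacklevel=2,
            )
            return candidate
    raise ValueError(f"no usable fallback for {requested}")

print(resolve_dtype("fp16"))  # warns, then falls back to the wider type
```

Only widening fallbacks are listed here, so the cast never loses precision relative to the request; it just costs memory and throughput.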



