- Buy an AMD GPU labeled for ROCm during the Great GPU Shortage
- 6-9 months later, learn AMD has discontinued ROCm support for that GPU. When I suggest this might be a false-advertising or warranty issue, AMD and the vendor both point at each other
- Install an old version of ROCm to get work done
- See odd crashes where my system goes down hard for no reason
- Read more documentation, and learn ROCm only works headless
- Find ROCm runs more slowly on my workloads than CPU
- Find most of the libraries I want to use don't work with ROCm in the first place, but require NVidia
- At that point, I bought NVidia, and everything Just Works.
That's a shortened version. I'd take a 5x speed hit for open source, but I won't take 'not working.'
I really, really don't think this scenario is super common. Shitty, yes, definitely. But super common? Definitely not. You're talking about running GPGPU workloads on a budget, consumer-grade GPU that had a ~5-6 year old chipset when you purchased it. Yah, you can find other people complaining if you Google it, but realistically, how many people do you think this affected?
It sounds like Nvidia was a much better option for you from the start, and I'm surprised anyone purchased a Polaris 10 AMD GPU for GPGPU in 2020.
Conversely, I've had an RX580 that I purchased shortly after launch, and I've had zero issues with it. I've used it in a normal PC, a self-built Hackintosh and an eGPU enclosure that I use with my 2016 MBP in both Windows and macOS.
Most people are buying AMD GPUs for gaming or productivity (e.g. Blender), not machine learning, and it's working great for them. ROCm is currently a joke. Maybe in a few years AMD will care enough to try to participate in the ML hardware space.
Amusing introduction for what manages to look like elaborate FUD.
In particular, there are some quite telling parts.
- It seems to be very specifically about compute, which is not what most people buy their GPUs for. Interestingly, your earlier "does not work" comment didn't even mention that.
- No timeline (is this 2016? 2018? 2020? 2021?). Particularly, ROCm today has nothing to do with ROCm two years ago.
- We know nothing about your application (what are you even trying to do?).
- GPU model and vendor are omitted, so we cannot verify your story about support removal.
- Libraries "you want to use" are omitted, so we cannot check today's status of ROCm support.
- NVIDIA, everything just works. (advertisement thrown in at the end)
> It seems to be very specifically about compute, which is not what most people buy their GPUs for. Interestingly, your former "does not work" comment didn't even mention that.
It's an anecdote. There are many more like it. However, compute is increasingly common, and I suspect we're hitting a critical point with tools like Stable Diffusion.
> No timeline (is this 2016? 2018? 2020? 2021?). Particularly, ROCm today has nothing to do with ROCm two years ago.
"Great GPU shortage" places it a bit after COVID hit.
> We know nothing about your application (what are you even trying to do?)
NLP, if you care, but that's true across most compute applications.
> GPU model and Vendor are omitted, so we cannot verify your story about support removal
It's a conversation, not a jury trial. RX570, if you care.
> Libraries "you want to use" are omitted, so we cannot check today's status of ROCm support.
If it please the court, the most popular NLP library for the type of work I do is:
If it please the court, I was also using CuPy extensively; its experimental ROCm support completely didn't work, and it isn't officially supported either:
If it please the court, I just made my own library which is tied to CUDA as well, not for lack of trying to make it work with ROCm. AMD will have a bit more of a hole to dig out of if it ever tries to be competitive.
>"Great GPU shortage" places it a bit after COVID hit.
>RX570
My condolences. I was lucky enough to get a Vega 64 (Sapphire's) on launch. I'm still using it today. RDNA3, together with new electricity prices, might finally get me to upgrade.
>If it please the court, I just made my own library which is tied to CUDA as well,
HIP is meant to solve that problem. Your library might be auto-convertible: HIP is pretty much CUDA with everything renamed from cuda* to hip*, and the result can then run on both CUDA and ROCm.
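To make that concrete, here's a rough sketch with a toy kernel of my own (not anything from the parent's library): each runtime call is just the CUDA one with the prefix swapped, which is the substitution the hipify-perl / hipify-clang tools automate. The CUDA equivalents are noted in the comments.

```cpp
#include <hip/hip_runtime.h>   // CUDA: #include <cuda_runtime.h>
#include <vector>

// The same __global__ kernel source compiles under both nvcc and hipcc.
__global__ void scale(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    std::vector<float> h(n, 1.0f);
    float* d_x = nullptr;

    hipMalloc(reinterpret_cast<void**>(&d_x), n * sizeof(float));        // CUDA: cudaMalloc
    hipMemcpy(d_x, h.data(), n * sizeof(float), hipMemcpyHostToDevice);  // CUDA: cudaMemcpy / cudaMemcpyHostToDevice

    // CUDA launch syntax: scale<<<blocks, threads>>>(d_x, 2.0f, n);
    hipLaunchKernelGGL(scale, dim3((n + 255) / 256), dim3(256), 0, 0, d_x, 2.0f, n);

    hipDeviceSynchronize();                                               // CUDA: cudaDeviceSynchronize
    hipFree(d_x);                                                         // CUDA: cudaFree
    return 0;
}
```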
Here's my basic problem. Neither I, nor anyone I work with, wants to understand what "Vega 64 (Sapphire)" or any of this other stuff is. I'd just like things to work.
I bought a card advertised to work with ROCm, and got 9 months of use from it, which was just about enough to set up a development environment, since most of the real work is in data engineering, dashboarding, etc.
I did take the time to understand this when things broke, but that's not really a reasonable expectation. My recollection of this will be:
"AMD market tools at the stability and maturity of early prototypes as production-grade code" and "AMD GPUs might stop working in a few months if AMD gets bored." Experiences like this DO burn customers. If AMD had advertised this as being not-quite-ready for prime-time, I would not have felt bad. The gap between advertising, fine-print, and reality was astronomical.
You could point me to GitHub issues and say all this was public, but that's not reasonable to expect of someone buying a GPU. If I walk into a store, and walk out with a card labeled for ROCm, then ROCm should work.
One of my colleagues bought an ancient NVidia card, during the same shortage. It just works.