NVIDIA is the leader because most academic AI setups are NVIDIA-based.
When AI moves further away from academia, NVIDIA will have less of a grip.
Proprietary, hardware-specific APIs never stand the test of time. Ask 3Dfx.
Either CUDA will open up, if it is to survive, or open API use will spread.
Weirdly, NVIDIA hardware only outperforms competitors on its own API. When you compare NVIDIA on a level playing field, they aren't the clear winners. Nobody is right now.
I suspect the battleground for AI will be accuracy rather than speed in the medium term, and on paper AMD could win there...purely because they aren't shy about over-speccing the RAM in their kit at certain price points.
For me, I want to run the largest models I can with the least amount of quantization for the best bang for the buck...and AMD is right there as soon as people start picking up APIs outside of CUDA.
I work for one of their big competitors, and all my conversations with customers tend to follow the same script: "NVIDIA is milking us dry, we want an alternative, but all the alternatives require significant redesign in languages and tools people are unfamiliar with and we can't afford that overhead". It tends to be very cut and dried.
Until university labs get people working in open frameworks and not CUDA, every student joining the industry will default to NVIDIA GPUs until they're forced otherwise. The few people I've managed to convert have been forced by supply constraints, not any desire to innovate or save themselves money. As long as NVIDIA can keep the market satiated with a critical mass of compute, they'll sit on their throne for a long ol' while.
> but all the alternatives require significant redesign in languages and tools people are unfamiliar with and we can't afford that overhead
Where I work, we've made it a principle to stay OpenCL-compatible even while going with NVIDIA due to their better-performing GPUs. I even go as far as writing kernels that can be compiled as either CUDA C++ or OpenCL-C, with a bit of duct-tape adapter headers:
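Roughly along these lines, as a simplified sketch (the macro names here are invented for illustration; the real headers cover far more constructs):

    /* portable_kernels.h -- illustrative adapter header.
     * The same kernel source compiles either as OpenCL C or as CUDA C++,
     * depending on which compiler picks it up. */
    #if defined(__OPENCL_VERSION__)
      /* OpenCL C: kernels are __kernel, buffer pointers need __global,
       * and the flat work-item index comes from get_global_id(). */
      #define KERNEL        __kernel
      #define GLOBAL_MEM    __global
      #define GLOBAL_INDEX  get_global_id(0)
    #elif defined(__CUDACC__)
      /* CUDA C++: kernels are __global__, no address-space qualifier,
       * and the index is computed from block/thread coordinates. */
      #define KERNEL        __global__
      #define GLOBAL_MEM
      #define GLOBAL_INDEX  (blockIdx.x * blockDim.x + threadIdx.x)
    #endif

    /* One kernel source, two toolchains. */
    KERNEL void saxpy(GLOBAL_MEM float* y, GLOBAL_MEM const float* x,
                      float a, unsigned int n)
    {
        unsigned int i = GLOBAL_INDEX;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

The host side needs its own shim as well (clEnqueueNDRangeKernel on one path, a <<<...>>> launch or cuLaunchKernel on the other), which is where most of the duct tape tends to end up.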
Of course, if you're working with higher-level frameworks then it's more difficult, and you depend on whether or not they provide different backends. So, no Thrust for AMD GPUs, for example, but PyTorch and TensorFlow do let you use them.
Yeah - which is to say that competitors aren't actually going to create a CUDA replacement any time soon. And, correct me if I'm wrong, it would be quite possible to create such a thing - AMD had a system with a tool to do the conversion a while back, but I recall them not supporting it seriously.
The problem is that when one company has made a serious capital investment to advance a market, anyone who invests an equivalent amount won't reap the same rewards - competition would just eat away at each company's profits, so no one will challenge the incumbent.
It's being done. The Mesa project has drivers for OpenCL (RustiCL) and Vulkan under development on any hardware that can provide the underlying facilities for that kind of support. This provides the basic foundation (together with other projects like SYCL) for a high-level alternative that can be properly supported across vendors (minus the expected hardware-specific quirks).
> Either CUDA will open up, if it is to survive, or open API use will spread.
I don't really think so, at least not anytime soon, while the hardware functionality continues to evolve so much and while they seem to be concentrating on high-end devices/architectures rather than low-end stuff.
I've been more or less exclusively writing CUDA for the past decade in the AI/ML space (though I have spent some time with OpenCL, Vulkan and other things along the way too). I don't think what a GPU is or should be has reached an evolutionary end yet. CUDA is not a static thing either; it has co-evolved with the hardware rather than being locked into some static industry standard with a boatload of annoying glExtWhatever dangling off of it. Over the past decade or so, Nvidia has introduced new ways the register file can be used (Kepler shuffles), changed the memory model of GPUs and the warp execution model (breaking the lockstep behavior somewhat to avoid deadlock/starvation), slowly changed the grid/CTA model (cf. cooperative groups, CTA clusters), added more asynchronous components to the host APIs and the hardware (async DMAs), and constantly changed the underlying instruction set, all of which leaks into CUDA in some way.
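To give a concrete taste of that co-evolution, here's a minimal sketch (not from any particular codebase) of the warp shuffles mentioned above; the _sync suffix and the explicit lane mask are exactly the kind of thing the relaxed warp execution model forced onto existing code:

    // Warp-level sum using register shuffles (a Kepler-era feature).
    // __shfl_down_sync superseded the older __shfl_down once lockstep warp
    // execution was relaxed; the 0xffffffff mask names the lanes that must
    // participate in the exchange.
    __device__ float warp_sum(float v)
    {
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffffu, v, offset);
        return v;   // lane 0 ends up holding the sum over all 32 lanes
    }

None of that existed in the original CUDA model, and it doesn't map cleanly onto an API that was frozen years earlier.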
> Either CUDA will open up, if it is to survive, or open API use will spread.
CUDA won't die if open APIs take over AI inference. It's still used in so many niche industries that it can only be "replaced" in fields like AI, where companies will invest in moving digital mountains. Stuff like Microsoft's ONNX project will go a long way towards making CUDA unnecessary for AI acceleration, but it won't ever kill the demand for CUDA.
Just look at how lethargic the industry's response has been in the wake of AI, and look at how other companies like AMD and Apple abandoned OpenCL before it was ready. Now Apple is banking on CoreML as an integration feature and AMD is segmenting their consumer and server hardware like crazy.
> Weirdly, NVIDIA hardware only outperforms competitors on its own API. When you compare NVIDIA on a level playing field, they aren't the clear winners.
That does not reflect any of the benchmarks I've seen at all, unless by "level playing field" you mean comparing old Nvidia chips to modern AMD ones. The only systems comparable to the DGX pods Nvidia sells are Apple's, which lack the networking and OS support to be competitive server-side.
AMD is an amazing company for being open and transparent with their approach, but nice guys always finish last. This is a race between the highest-density TSMC customers, which means it's Apple and Nvidia laughing their respective paths to the bank.
Calling AMD a nice guy is a huge stretch in my opinion. From my understanding, they didn't even officially support ROCm on consumer GPUs until this year... CUDA is and always was very accessible to a broad audience.
CUDA runs on most recent Nvidia GPUs, which are ubiquitous on college campuses and well-supported in server software. AMD's GPGPU compute support differs from GPU to GPU, and Apple didn't start contributing acceleration patches to PyTorch and TensorFlow until stuff like Llama and Stable Diffusion took off.