I have 40+ years of HPC/AI apps/performance engineering experience, and I was one of the first people to port LAPACK and a number of other numerical libs to CUDA. Moreover, many of those major DoE + AI sites are my customers.

You should not confuse AMD's general & long-standing indifference/incompetence wrt SW with the actual difficulty of providing a portable SW path for acceleration. As Woody Allen once said: "80% of success is showing up."

But what happened in AI, where in a very short period of time almost everyone moved away from writing their models directly in CUDA to writing them in frameworks like TensorFlow & PyTorch, is all the evidence anyone needs to show just how unsound that SW obstacle is.

I'm working on a project ATM at one of the DoE sites you're likely referring to... Maybe we'll bump into each other!

Ah yes, PyTorch:

1) Check issues, PRs, etc on the torch GitHub. Relative to its market share, ROCm has many times the number of open and closed issues that CUDA does. There is still much work to be done for things as basic as overall stability.

2) torch is the bare minimum. Consider flash attention. On CUDA it just runs, of course, with sliding window attention, ALiBi, and PagedAttention. The ROCm fork? Nope. Then check out the xFormers situation on ROCm. Prepare to spend your time messing around with ROCm, spelunking GH issues/PRs/blogs, etc and going one by one through frameworks and libraries instead of running `pip install` and actually doing your work.

3) Repeat for hundreds of libraries, frameworks, etc depending on your specific use case(s).

Then, once you have a model and need to serve it up for inference so your users can actually make use of it and you can get paid? With CUDA you can choose between TorchServe, HF TEI/TGI, Nvidia Triton Inference Server, vLLM, and a number of others. On ROCm, vLLM has what I would call (at best) "early" support: it requires patches to ROCm, isn't feature complete, and regularly has commits to fix yet another show-stopping bug/crash/performance regression/whatever.

Torch support is a good start but it's just that - a start.
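
To make the "going one by one through frameworks and libraries" point concrete, here is a minimal sketch of the kind of check you end up scripting. It only tests whether each package can be found in the current Python environment; it says nothing about whether the package actually works on a given GPU stack. The import names in the list (`flash_attn`, `xformers`, `vllm`) are my assumptions about how those projects are packaged, and the `audit` helper is hypothetical, so adjust both to your own setup.

```python
import importlib.util

# Import names are assumptions, not taken from the thread; edit for your stack.
PACKAGES = ["torch", "flash_attn", "xformers", "vllm"]


def audit(packages):
    """Map each top-level module name to True if it can be found, else False.

    Finding a module only means it is installed; it does not prove the
    package runs correctly on the local accelerator.
    """
    return {name: importlib.util.find_spec(name) is not None for name in packages}


if __name__ == "__main__":
    for name, found in audit(PACKAGES).items():
        print(f"{name:12s} {'found' if found else 'MISSING'}")
```

A check like this is only the first step: each package that is found still has to be exercised on the actual hardware before you know it is usable.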

I almost spewed my coffee reading your grandparent comment.

One of the first teams to port LAPACK to CUDA, as CULA, is apparently being paid handsomely by Nvidia [1],[2].

Interestingly, DCompute is a little-known effort to support compute on CUDA and OpenCL in the D language, and it was done by a part-time undergrad student [3].

I strongly believe we need a very capable language to make advancement much easier in HPC/AI/etc, and the D language fits the bill very well and then some. Heck, it even beat the BLAS libraries that other so-called data languages, namely MATLAB and Julia, still depend on heavily for their performance to this very day. It did so in style back in 2016, more than seven years ago [4]. The DCompute implementation by the part-timer in 2017 actually depended on this native D implementation of these linear algebra routines in Mir [5].

I got paid to do the LAPACK port, back in the mid 2000s, for a federal contractor working on satellite imaging type apps. I was still a good coder, back then... Took me about a month, as I recall. Maybe 6 weeks.

But I'm one of those old-school HPC guys who believes that libraries are mostly irrelevant, and absolutely no substitute for compilers and targeted code generation.

Julia is cool, btw. It could very well end up supplanting Fortran, once they fix the poor performance code generation issues.

I think you are right about the libraries. That's why there's currently an initiative in the D ecosystem to make the D compiler, DMD, usable as a library, and the aim is probably that the compiler should be the only thing needed to run the library, without extra code [1].

I really wish a modern language would try to supplant Fortran for HPC, and personally my bet is on D.

[1] DMD Compiler as a Library: A Call to Arms:


