Might very well be true. I don't blame anyone for not diving deeper into figuring out why this stuff doesn't work.
But this is one of the great strengths of CUDA: I can develop a kernel on my workstation, my boss can demo it on his laptop, and we can deploy it on Jetsons or the multi-GPU cluster with minimal changes, and I can be sure that everything runs everywhere.
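Not from the original comment, but a minimal sketch of what that workflow looks like in practice: the same kernel source built as a fat binary for several GPU generations (the specific compute capabilities and flags below are just illustrative, e.g. a Pascal workstation card, a Turing laptop GPU, and a Jetson Orin).

    // Build one binary for several GPU generations, e.g.:
    //   nvcc -gencode arch=compute_61,code=sm_61 \
    //        -gencode arch=compute_75,code=sm_75 \
    //        -gencode arch=compute_87,code=sm_87 saxpy.cu -o saxpy
    #include <cstdio>
    #include <cuda_runtime.h>

    // Plain SAXPY kernel: y = a*x + y, one element per thread.
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;
        float *x, *y;
        // Unified memory works on both discrete GPUs and Jetson-class devices.
        cudaMallocManaged(&x, n * sizeof(float));
        cudaMallocManaged(&y, n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
        cudaDeviceSynchronize();

        printf("y[0] = %f\n", y[0]);  // expect 4.0
        cudaFree(x);
        cudaFree(y);
        return 0;
    }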
There is indeed something excellent about CUDA from a user perspective that is hard to beat. I do high-level DNN work, and it is not clear to me what it is or why that is. Any time I have worked on optimizing for mobile hardware (not Jetson, but actual phones or accelerators), it has just been a world of hurt and incompatibilities. This notion that operators or subgraphs can be accelerated by lower-level closed blobs... I wonder if that is part of the issue. But then why doesn't OpenCL just work? I thought it gave a CUDA-kernel-like general-purpose abstraction.
I just don't understand the details well enough to see why things are problematic without CUDA :(
The API level I could target was at least two or three versions behind the latest they had to offer.