Such a pity no one else can compete here presently. Would that others could gain a position where their software made them competitive on the free market.
Thanks for the tip about exllama. I've been on the lookout for a readable Python implementation to play with that is also fast and supports quantized models.
There was free competition here, a while ago. OpenCL was formed by Apple, Khronos et al. to stave off CUDA's dominance. The platform languished from a lack of commitment, though, and Apple eventually gave up on open GPU APIs entirely. Nvidia continued funding CUDA and scaling it for industry application, and the rest is history. The landscape of stakeholders is just too bitter to unseat CUDA for what it's used for - your best shot at democratizing AI inference acceleration is through something like Microsoft's ONNX[0] runtime.
CUDA had a lot of inertia, and OpenCL brought half-baked docs and half-baked support out of the gate. If they had focused on simplifying the API to be more user friendly for the 80% use case, it could've been a success. OpenCL always looked nice on the surface, but a few hours in you've exhausted the docs trying to figure out what to do, and there's no good example code around. Of course, if they really wanted it to succeed, they would've built a CUDA-to-OpenCL transpiler for the C API, or at least written a comprehensive migration guide. I'm not convinced anyone involved was trying to make it popular.