I'd want an equivalent of cuBLAS that is optimized for my specific GPU model and implements the same API.
AFAIK cuBLAS and other first-party libraries are hand-optimized by NVIDIA for different generations of their hardware, with dynamic dispatch at runtime to pick the fastest version for the GPU in use. Pretty sure none of these versions would run optimally on AMD GPUs, because AMD GPUs ideally run 64 threads per wavefront, while NVIDIA GPUs run 32 threads per warp.
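
To make the width point concrete, here's a rough host-side C++ sketch. It is not how cuBLAS actually works internally (I don't know that); `reduction_steps`, `groups_per_block` and `select_variant` are names I made up for illustration. It only shows how the 32 vs. 64 difference changes the arithmetic a tuned kernel is built around, and what a runtime dispatch on that width might look like.

```cpp
#include <string>

// Number of butterfly/shuffle steps needed to reduce one value per thread
// across a full warp or wavefront: log2(width).
constexpr int reduction_steps(int width) {
    int steps = 0;
    while (width > 1) {
        width /= 2;
        ++steps;
    }
    return steps;
}

// How many warps/wavefronts a thread block of `threads` occupies, rounding up.
constexpr int groups_per_block(int threads, int width) {
    return (threads + width - 1) / width;
}

// Hypothetical runtime dispatch: pick a kernel variant by execution width.
// A real library would key on much more (architecture generation, cache
// sizes, matrix shapes); the variant names here are placeholders.
inline std::string select_variant(int width) {
    switch (width) {
        case 32: return "tuned_for_width_32";
        case 64: return "tuned_for_width_64";
        default: return "generic_fallback";
    }
}
```

With a 256-thread block, that gives 8 groups of 32 with 5 reduction steps each, versus 4 groups of 64 with 6 steps each, so tile sizes and shuffle patterns tuned for one width don't carry over to the other unchanged.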