Such a pity no one else can compete here presently. Would that others be able to...

theaiquestion · on June 13, 2023

Compete with llama.cpp? Like transformers llama [0], exllama [1] (really fast), or lit-llama [2]?

exllama is really memory efficient and really fast

[0] https://huggingface.co/docs/transformers/main/model_doc/llam...

[1] https://github.com/turboderp/exllama

[2] https://github.com/Lightning-AI/lit-llama

EDIT: Or do you mean CUDA? Because yeah, it's such a shame AMD's ROCm is so bad that even geohot gave up. Its examples don't even run without crashing.

https://github.com/RadeonOpenCompute/ROCm/issues/2198#issuec...

kayvr · on June 13, 2023

Also https://github.com/kayvr/TokenHawk, a WebGPU implementation of LLaMA.

edit: Note that this is my project.

dTal · on June 13, 2023

Thanks for the tip about exllama. I've been on the lookout for a readable Python implementation to play with that is also fast and supports quantized models.

smoldesu · on June 13, 2023

There was free competition here, a while ago. OpenCL was formed by Apple, Khronos et al. to stave off CUDA's dominance. The platform languished from a lack of commitment though, and Apple eventually gave up on open GPU APIs entirely. Nvidia continued funding CUDA and scaling it for industry applications, and the rest is history. The landscape of stakeholders is just too bitter to unseat CUDA for what it's used for. Your best shot at democratizing AI inference acceleration is through something like Microsoft's ONNX Runtime [0].

[0] https://onnxruntime.ai/

eyegor · on June 13, 2023

CUDA had a lot of inertia, and OpenCL brought half-baked docs and half-baked support out of the gate. If they had focused on simplifying their API to be more user-friendly for the 80% use case, it could've been a success. OpenCL always looked nice on the surface, but a few hours in you've exhausted the docs trying to figure out what to do, and there's no good example code around. Of course, if they really wanted it to succeed, they would've built a CUDA-to-OpenCL transpiler for the C API, or at least a comprehensive migration guide. I'm not convinced anyone involved was trying to make it popular.

nl · on June 13, 2023

Note that llama.cpp supports acceleration on both OpenCL and Apple Metal.

chrischen · on June 13, 2023

There’s also geohot’s tiny corp betting on AMD GPUs.

pclmulqdq · on June 13, 2023

Not any more.

nromiun · on June 13, 2023

https://geohot.github.io/blog/jekyll/update/2023/06/07/a-div...

AMD gave him a binary blob driver and that fixed his problem. Also, tinygrad is the only Python framework I know that has full OpenCL acceleration.

vvladymyrov · on June 13, 2023

What do you mean? At least as of June 7, geohot was still working on AMD driver builds and stability. https://geohot.github.io/blog/jekyll/update/2023/06/07/a-div...

So far it doesn’t look like AMD is fully on board with Tiny Corp, but they are talking…

m00x · on June 13, 2023

Why not ggml?

nl · on June 13, 2023

Unclear what this is referring to, but if it means CUDA vs. other things, it is worth noting that:

a) CUDA won in a free market because Nvidia showed they cared about it

b) llama.cpp has support for OpenCL (via CLBlast) and Apple Metal

The OpenCL support already has a custom kernel for token generation.

angch · on June 13, 2023

There's Fabrice Bellard's textsynth server. https://bellard.org/ts_server/

No open source though.

ShamelessC · on June 13, 2023

This isn't a market.