Hacker News
llm.c: multi-GPU, bfloat16, flash attention, ~7% faster than PyTorch (twitter.com/karpathy)
121 points by tosh 5 months ago | 10 comments



It's much faster still compared to stable PyTorch 2.3 (46% faster on A100, per the tweet), and faster again compared to PyTorch 2.2, which was the stable version a couple of weeks ago. llm.c also pulls further ahead when the comparison is run on an H100 instead of an A100, or on multiple GPUs instead of a single one.
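For reference, here's a hedged sketch of how throughput comparisons like these are presumably made: time a fixed number of training steps and divide total tokens processed by wall-clock time. The step() function and the batch/sequence sizes below are placeholders, not llm.c's actual API:

    /* sketch: measuring training throughput in tokens/sec */
    #include <stdio.h>
    #include <time.h>

    #define BATCH 4      /* sequences per step (placeholder) */
    #define SEQ   1024   /* tokens per sequence (placeholder) */
    #define STEPS 10     /* number of steps to time */

    static volatile long sink;
    static void step(void) {
        /* stand-in for one real training step; does busywork
         * so the timed region is nonzero */
        for (long i = 0; i < 1000000; i++) sink += i;
    }

    int main(void) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < STEPS; i++) step();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double secs = (t1.tv_sec - t0.tv_sec)
                    + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%.0f tokens/sec\n", BATCH * SEQ * (double)STEPS / secs);
        return 0;
    }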


I’d be happier with 93% of PyTorch's performance if it worked across multiple GPU manufacturers.


That... wasn't the original intention of the project. It was to create a C version of the PyTorch code that could train GPT-2.
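For a sense of that style, here's a minimal sketch (not llm.c's actual code) of the dependency-free plain-C approach: a single linear layer trained with SGD to drive its output toward zero. llm.c applies the same idea, scaled up to the full GPT-2 forward/backward pass:

    /* sketch: plain-C forward/backward/SGD with no dependencies */
    #include <stdio.h>
    #include <stdlib.h>

    #define IN  4
    #define OUT 2
    #define LR  0.01f

    int main(void) {
        float w[OUT][IN], x[IN], y[OUT], grad[OUT];
        /* initialize weights and a fixed fake input */
        for (int o = 0; o < OUT; o++)
            for (int i = 0; i < IN; i++)
                w[o][i] = (float)rand() / RAND_MAX - 0.5f;
        for (int i = 0; i < IN; i++)
            x[i] = (float)rand() / RAND_MAX;

        for (int t = 0; t < 100; t++) {
            /* forward: y = W x */
            for (int o = 0; o < OUT; o++) {
                y[o] = 0.0f;
                for (int i = 0; i < IN; i++) y[o] += w[o][i] * x[i];
            }
            /* loss = 0.5 * sum(y^2), so dL/dy = y */
            for (int o = 0; o < OUT; o++) grad[o] = y[o];
            /* backward + SGD update: dL/dw[o][i] = grad[o] * x[i] */
            for (int o = 0; o < OUT; o++)
                for (int i = 0; i < IN; i++)
                    w[o][i] -= LR * grad[o] * x[i];
        }
        printf("final y[0] = %f (driven toward 0)\n", y[0]);
        return 0;
    }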


Yeah, I'm sure that's what anyone trying to build some kind of AI startup that's managed to acquire a small handful of A100s or, even better, H100s thinks too. "Those cards sure were expensive, but ethically, I'd rather the software run slower to give me imaginary future options than get the most out of the hardware I just bought."


It’s pretty impressive that PyTorch is only 7% slower than this, given it can be used so generally.


How does it compare to GGML? That's what they must be comparing against, and yet I don't see any comparison made.


CPU or CUDA? If it's C, can it be used on Apple and Intel CPUs and GPUs?


What about the CPU / GPU difference? Is this an improvement across both, or only on GPU?


Created over a period of about 4 weeks by random people all over the internet.


Tinfoil hat time. The recent gpt2-chatbot that everyone thought was a new OpenAI product - could it be?

“You start with the gpt2.c pure CPU implementation, and see how fast you can make it by the end of the course on GPU, with kernels only and no dependencies.”

Remarkably similar nomenclature. I give it a 1% chance this is related. I did play with that chatbot, and whatever it was, it was smarter than GPT-4.



