
In the README, it says:

> Important! Please don’t expect peak performance without fine-tuning the hyperparameters, such as the number of threads, kernel and block sizes, unless you are running it on a Ryzen 7700(X). More on this in the tutorial.

I think I'll need a TL;DR on what to change all these values to.

I have a Ryzen 7950X and as a first test I tried to only change NTHREADS to 32 in benchmark.c, but matmul.c performed worse than NumPy on my machine.

So I took a look at the other values in benchmark.c, but MC and NC are already calculated from the number of threads (so these are probably already 'fine-tuned'?), and I couldn't really understand why KC = 1000 fits the 7700(X) (the author's CPU) or how I'd need to adjust it for the 7950X (with the information from the article).
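For anyone else puzzling over KC, here is a rough sketch of the arithmetic, not the author's method. It assumes a BLIS-style blocking in which an MR x KC micro-panel of A and a KC x NR micro-panel of B are the working set you size against a per-core cache budget. The helper names and the MR = 16, NR = 6 values used below are placeholders for illustration, not values confirmed from matmul.c.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes occupied by one packed micro-panel of `rows` x `kc` elements. */
static inline size_t panel_bytes(size_t rows, size_t kc, size_t elem_size) {
    return rows * kc * elem_size;
}

/* Largest kc for which an (mr x kc) panel of A plus a (kc x nr) panel of B
 * fit into `budget_bytes`. Sizing against this sum is an assumed heuristic,
 * not something taken from the matmul.c sources. */
static inline size_t max_kc_for_budget(size_t mr, size_t nr,
                                       size_t elem_size, size_t budget_bytes) {
    assert(mr + nr > 0 && elem_size > 0);
    return budget_bytes / ((mr + nr) * elem_size);
}
```

With 4-byte floats and the placeholder 16 x 6 kernel, `panel_bytes(16, 1000, 4) + panel_bytes(6, 1000, 4)` comes to 88000 bytes for KC = 1000. That is more than a 32 KB L1d but well inside a 1 MB L2. Both the 7700X and the 7950X are Zen 4 parts with the same per-core L1d and L2 sizes, so under this assumption a cache-driven KC would not obviously need to differ between them.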




Actually, leaving it on 16 threads performs a bit better than setting it to 32 threads.

16 threads: https://0x0.st/XaDB.png

32 threads: https://0x0.st/XaDM.png

But still not as fast as it ran on your 7700(X) and NumPy is 2-3x faster than matmul.c on my PC.

I also changed KC to some other values (500: https://0x0.st/XaD9.png, 2000: https://0x0.st/XaDp.png), but performance barely changed.
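On the 16 vs. 32 threads result: the 7950X has 16 physical cores and 32 hardware threads, so 16 doing better is at least consistent with one thread per physical core. A minimal sketch of deriving such a starting value at runtime instead of hard-coding NTHREADS follows. The helper names are hypothetical, and dividing by 2 assumes 2-way SMT, which the OS call itself does not tell you.

```c
#include <unistd.h>

/* Logical CPUs currently online. _SC_NPROCESSORS_ONLN is a common
 * extension (glibc, macOS, BSDs), not strict POSIX. Falls back to 1. */
static inline long logical_cpus(void) {
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? n : 1;
}

/* Guess the physical core count, ASSUMING `smt_ways` hardware threads
 * per core (2 on SMT-enabled Ryzen parts). Never returns less than 1. */
static inline long guess_physical_cores(long logical, long smt_ways) {
    if (smt_ways < 1) smt_ways = 1;
    long cores = logical / smt_ways;
    return cores > 0 ? cores : 1;
}
```

`guess_physical_cores(logical_cpus(), 2)` would give 16 on a 7950X with SMT enabled. Treat it as a first guess to benchmark around, not a tuned value.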


As we discussed earlier, the code really needs Clang to attain high performance.




