Actually, leaving it on 16 threads performs a bit better than setting it to 32 threads.

16 threads: https://0x0.st/XaDB.png

32 threads: https://0x0.st/XaDM.png

It's still not as fast as it ran on your 7700(X), though, and NumPy is 2-3x faster than matmul.c on my PC.

I also tried other values for KC (500: https://0x0.st/XaD9.png, 2000: https://0x0.st/XaDp.png), but that didn't change performance much.




As we discussed earlier, the code really needs Clang to attain high performance.




