In the README, it says: > Important! Please don’t expect peak performance withou...

SushiHippie · 2024-07-04T14:36:29 1720103789

Actually, leaving it on 16 threads performs a bit better than setting it to 32 threads.

But it's still not as fast as it ran on your 7700(X), and NumPy is 2-3x faster than matmul.c on my PC.

I also changed KC to some other values (500: https://0x0.st/XaD9.png, 2000: https://0x0.st/XaDp.png), but it didn't change much performance-wise.

salykova · 2024-07-04T16:08:56 1720109336

As we discussed earlier, the code really needs Clang to attain high performance.

SushiHippie · 2024-07-04T16:27:36 1720110456