16 threads: https://0x0.st/XaDB.png
32 threads: https://0x0.st/XaDM.png
It's still not as fast as it ran on your 7700(X), though, and NumPy is 2-3x faster than matmul.c on my PC.
I also tried some other values for KC (500: https://0x0.st/XaD9.png, 2000: https://0x0.st/XaDp.png), but it didn't change much performance-wise.
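For anyone who wants to reproduce the NumPy side of that comparison, here's a rough sketch of how it could be timed. The matrix size, float32 dtype and repeat count are my assumptions and not necessarily what matmul.c's benchmark uses, and `gflops` / `bench_numpy` are just illustrative names.

```python
import time

import numpy as np


def gflops(n, seconds):
    """FLOP rate of an n x n by n x n matmul, counted as 2*n^3 operations."""
    return 2.0 * n**3 / seconds / 1e9


def bench_numpy(n=1024, repeats=5):
    """Return the best observed GFLOPS over `repeats` runs of a @ b."""
    rng = np.random.default_rng(0)
    # float32 is an assumption here; switch the dtype if the C code uses doubles.
    a = rng.random((n, n), dtype=np.float32)
    b = rng.random((n, n), dtype=np.float32)
    a @ b  # warm-up run so BLAS thread start-up is not included in the timing
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        a @ b
        best = min(best, time.perf_counter() - t0)
    return gflops(n, best)


if __name__ == "__main__":
    print(f"NumPy: {bench_numpy():.1f} GFLOPS")
```

How many threads NumPy uses depends on the BLAS backend it was built against; with OpenBLAS or an OpenMP-based backend it can usually be pinned via `OPENBLAS_NUM_THREADS` or `OMP_NUM_THREADS`, so the 16 vs 32 thread runs can be matched on both sides.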