We were actively chatting with Justine yesterday, seems like the implementation is at least 2x faster than tinyBLAS on her workstation. The whole discussion is in Mozilla AI discord: https://discord.com/invite/NSnjHmT5xY
Hi! I'm the author of the article. It's my really first time optimizing C code and using intrinsics, so I'm definitely not an expert in this area, but Im willing to learn more! Many thanks for your feedback; I truly appreciate comments that provide new perspectives.
Regarding "creating a constant global array and loading from it" - if I recall correctly, I've tested this approach and it was a bit slower than bit mask shifting. But let me re-test this to be 100% sure.
"Comparing a constant vector {0, 1, 2, 3, 4, ...} with broadcasted m and m-8" - good idea, I will try it!