salykova's comments

salykova · 2024-07-04T17:11:28 1720113088

excalidraw <3

salykova · 2024-07-04T16:08:56 1720109336

as we discussed earlier, the code really needs Clang to attain high performance

SushiHippie · 2024-07-04T16:27:36 1720110456

salykova · 2024-07-04T05:59:09 1720072749

We were actively chatting with Justine yesterday, and it seems the implementation is at least 2x faster than tinyBLAS on her workstation. The whole discussion is on the Mozilla AI Discord: https://discord.com/invite/NSnjHmT5xY

salykova · 2024-07-04T08:20:48 1720081248

"off-topic" channel

salykova · 2024-07-04T05:52:46 1720072366

Hi! I'm the author of the article. It's really my first time optimizing C code and using intrinsics, so I'm definitely not an expert in this area, but I'm willing to learn more! Many thanks for your feedback; I truly appreciate comments that provide new perspectives.

Regarding "creating a constant global array and loading from it": if I recall correctly, I tested this approach and it was a bit slower than bit mask shifting. But let me re-test it to be 100% sure.

"Comparing a constant vector {0, 1, 2, 3, 4, ...} with broadcasted m and m-8" - good idea, I will try it!
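For readers following along, here is a rough scalar sketch of the two suggested ways to build the tail masks. It is not the code from the article: it assumes a tile of up to 16 rows split into two 8-lane halves (which is what "m and m-8" suggests), and the names `MASK_TABLE`, `masks_from_table` and `masks_from_compare` are made up for illustration. Each lane is modelled as an `int32_t` that is -1 (all bits set) when the row is valid and 0 otherwise.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { LANES = 8 };  /* assumed: 8 x 32-bit lanes per vector, two vectors per tile */

/* Idea 1: constant global array. 16 "on" entries followed by 16 "off" entries;
   a window starting at offset 16 - m has exactly the first m lanes set. */
static const int32_t MASK_TABLE[32] = {
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0
};

void masks_from_table(int m, int32_t lo[LANES], int32_t hi[LANES]) {
    assert(m >= 0 && m <= 16);
    /* rows 0..7 */
    memcpy(lo, &MASK_TABLE[16 - m], LANES * sizeof(int32_t));
    /* rows 8..15: same window, shifted by one vector */
    memcpy(hi, &MASK_TABLE[16 - m + LANES], LANES * sizeof(int32_t));
}

/* Idea 2: compare the constant vector {0, 1, ..., 7} with broadcast m and m - 8.
   Lane i is valid when its row index is below the row count. */
void masks_from_compare(int m, int32_t lo[LANES], int32_t hi[LANES]) {
    assert(m >= 0 && m <= 16);
    for (int i = 0; i < LANES; i++) {
        lo[i] = (m > i) ? -1 : 0;          /* rows 0..7  vs. m     */
        hi[i] = (m - LANES > i) ? -1 : 0;  /* rows 8..15 vs. m - 8 */
    }
}
```

In actual AVX2 code the table version would presumably be two unaligned vector loads, and the compare version something like `_mm256_cmpgt_epi32` of `_mm256_set1_epi32(m)` against the constant index vector. Which one is faster would still need measuring.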