They are bad but not way off for that basic for loop, depending on which rotation is being applied.

Using their code on my Intel-based workstation at around 3 GHz, compiled with GCC 7.3, it takes around 80-100ms to rotate a 4096x4096 buffer by 90° or 270°, and 14ms to rotate it by 180°.

Max memory bandwidth of something like an i9-9900K is 41.2 GB/s. This test reads & writes 128 MiB of data, so the best theoretically achievable time here is around 3-4ms. Max theoretical. So 100x is not really feasible. 10x, though, very much is, as the quick conversion shows a best time of 14ms with a 180° rotation.
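For anyone who wants to check the back-of-envelope math, here's a rough sketch in C. It assumes 4 bytes per pixel (which is what makes a 4096x4096 buffer come out to 64 MiB each way) and takes the 41.2 GB/s figure at face value. The function names are made up for illustration.

```c
#include <stddef.h>

/* Bytes moved by one rotation: every pixel is read once and written once.
 * bytes_per_pixel = 4 is an assumption (32-bit RGBA). */
double bytes_moved(size_t width, size_t height, size_t bytes_per_pixel) {
    return 2.0 * (double)width * (double)height * (double)bytes_per_pixel;
}

/* Best-case time in milliseconds if the whole transfer ran at the given
 * bandwidth (in bytes per second) with no cache or latency penalties. */
double min_time_ms(double bytes, double bandwidth_bytes_per_s) {
    return bytes / bandwidth_bytes_per_s * 1000.0;
}
```

Calling `min_time_ms(bytes_moved(4096, 4096, 4), 41.2e9)` comes out at roughly 3.3ms, which is where the 3-4ms floor comes from.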

Of course, the major source of slowness here is that the reads/writes are not sequential. The 90° & 270° rotations achieve only a fraction of the possible bandwidth because the input reads jump around: every single one is a cache miss, and the other 60 bytes of each cache line fetched on the miss will be evicted before they're used again.
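To make the access pattern concrete, here's a minimal sketch of what a basic 90° clockwise loop looks like in C. This is not their code, just my assumption of the general shape, with 32-bit pixels and a made-up function name.

```c
#include <stddef.h>
#include <stdint.h>

/* Rotate a w-by-h image of 32-bit pixels 90 degrees clockwise.
 * src is w pixels wide and h rows tall; dst is h pixels wide and w rows tall.
 * Both buffers hold w*h pixels in row-major order. */
void rotate90_naive(const uint32_t *src, uint32_t *dst, size_t w, size_t h) {
    for (size_t y = 0; y < w; y++) {        /* destination row */
        for (size_t x = 0; x < h; x++) {    /* destination column */
            /* Writes walk dst sequentially, but each read lands one full
             * source row (w * 4 bytes) away from the previous one. */
            dst[y * h + x] = src[(h - 1 - x) * w + y];
        }
    }
}
```

With w = 4096, consecutive reads are 16 KiB apart, so each one touches a different cache line and uses only 4 bytes of it.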

Flipping the loop order would mean the writes are never utilizing a full cache line, either, though. So you can't really "fix" that, not easily at least. Either your read or your write bandwidth ends up tanking, and you can only achieve roughly 6% of the max (only ever using 4 bytes of the 64-byte cache line) for that half of the problem. Without some clever magic to handle this, your max theoretical on a 41.2 GB/s CPU drops to around 50ms.
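One common flavor of "clever magic" is to process the image in small tiles so that the cache lines pulled in by the strided side get reused before they're evicted. A minimal sketch, assuming 32-bit pixels and 64-byte cache lines; the tile size and function name are my own choices, and I haven't benchmarked this against the numbers above.

```c
#include <stddef.h>
#include <stdint.h>

/* 16 pixels * 4 bytes = 64 bytes, i.e. one assumed cache line per tile row. */
#define TILE 16

/* Rotate a w-by-h image of 32-bit pixels 90 degrees clockwise, one
 * TILE-by-TILE block of the destination at a time. Same layout as a plain
 * loop: src is w wide and h tall, dst is h wide and w tall. */
void rotate90_tiled(const uint32_t *src, uint32_t *dst, size_t w, size_t h) {
    for (size_t ty = 0; ty < w; ty += TILE) {
        for (size_t tx = 0; tx < h; tx += TILE) {
            /* Clamp the tile at the image edges. */
            size_t y_end = (ty + TILE < w) ? ty + TILE : w;
            size_t x_end = (tx + TILE < h) ? tx + TILE : h;
            for (size_t y = ty; y < y_end; y++) {
                for (size_t x = tx; x < x_end; x++) {
                    /* Within a tile the strided reads hit the same handful
                     * of source cache lines on every pass over y. */
                    dst[y * h + x] = src[(h - 1 - x) * w + y];
                }
            }
        }
    }
}
```

The per-pixel work is identical to the plain loop. The only change is the iteration order, so the cache lines touched by a tile's strided reads can be reused across the tile's rows instead of being fetched again for every pixel.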

All that said, it's clear that WASM is very far off from native levels of performance. ~5x slower isn't something to brag about. But hey, maybe the test system was a potato and the 500ms isn't as bad as it sounds.