Wow all these numbers seem insanely bad. 500 milliseconds to transpose 16 millio...

kllrnohj · on Feb 14, 2019

They are bad but not way off for that basic for loop, depending on which rotation is being applied.

Using their code on my Intel-based workstation at around 3ghz using GCC 7.3 it takes around 80-100ms to rotate a 4096x4096 buffer 90 or 270, and 14ms to rotate 180.

Max memory bandwidth of something like an i9-9900k is 41.2GB/s. This test reads & writes 128mib of data. So max theoretical achievable performance here is around 3-4ms. Max theoretical. So 100x is not really feasible. 10x, though, very much is, as the quick convert shows a peak time of 14ms with a 180* rotation.

Of course the major source of slowness here is that the reads/writes are not sequential, and the 90 & 270 rotations are achieving a fraction of the possible bandwidth they could as the input reads are jumping around, so every single one is a cache miss and the other 60 bytes in each cache line on the miss will be purged before it's used again.

Flipping it would mean the writes are never utilizing a full cache line, either, though. So you can't really "fix" that, not easily at least. So either your read or write bandwidth ends up tanking and you can only achieve roughly 6% of the max (only ever using 4 bytes of the 64-byte cache line) for that half of the problem. Without some clever magic to handle this your max theoretical on a 41.2GB/s CPU drops to around 50ms.

All that said it's clear that WASM is very far off from native levels of performance. ~5x slower isn't something to brag about. But hey maybe the test system was a potato, and the 500ms isn't as bad as it sounds.

robko · on Feb 14, 2019

You are correct. The code is using an inefficient cache access pattern, so most of the time is spent waiting.

You probably won't get 100x faster without SIMD, but 10x is certainly doable. Unfortunately, SIMD.js support has been removed from Chrome and Firefox a while ago, even though it is not available in wasm to this day.

kllrnohj · on Feb 14, 2019

How would SIMD do anything to address the problem's fundamental anti-cache-friendly access patterns? You'd need to restructure the problem to be cache-friendly, but SIMD won't really be relevant to that.

robko · on Feb 15, 2019

You can use both at once. Usually, you'd have something like 64x64 tiles in cache and use 4x4 or 8x8 tiles for SIMD.

bufferoverflow · on Feb 15, 2019

Or better yet, WebGL should be able to do this in a few ms on a GPU.

olliej · on Feb 15, 2019

Or simply use the canvas api, which has super optimized graphics libraries behind it - rather than reimplementing the wheel :)

But I get that really this was a how much can wasm help performance as % vs js - you could always write an “optimized” routine and compare those and theoretically achieve something similar.

bzbarsky · on Feb 15, 2019

The article mentions why they couldn't use canvas for this: they are running this code in a worker, and canvas support in workers is not great in browsers so far.

jaffathecake · on Feb 15, 2019

Not only that, there's a nasty bug in Chrome that makes it unusable for our use-case https://bugs.chromium.org/p/chromium/issues/detail?id=906619

olliej · on Feb 15, 2019

Ah my bad for skimming - I though most canvas stuff worked these days? (I recall many years ago when I worked on such things that fonts were the biggest problem, but also people generally wanting to be able to paint dom elements in their as well)

Technetium_Hat · on Feb 16, 2019

It is OffScreenCanvas, the variant that works in web workers, that has poor browser support.

robko · on Feb 15, 2019

In my experience, the canvas api is very slow and not well thought-out. For example, to create a native image object from raw pixels, you have to copy the pixels into an ImageData object, draw it to a canvas, create a data URL from the canvas and then load an image from that data URL.