
My guess is that the JS implementation of the worst-performing browser is having trouble with for-loop steps other than 1. Doing the 90-degree image rotation with unit steps and some index arithmetic should work better (0.18 sec vs 1.5 sec for their implementation in node.js):

    for (var y = 0; y < height; y++)
        for (var x = 0; x < width; x++)
            b[x + y*width] = a[y + (width - 1 - x)*height];
Although that's still far from the theoretical maximum throughput because the cache utilization is really bad. If you apply loop tiling, it should be even faster. This problem is closely related to matrix transpose, so there is a great deal of research you can build upon.

EDIT: 0.07 seconds with loop tiling:

    for (var y0 = 0; y0 < height; y0 += 64){
        for (var x0 = 0; x0 < width; x0 += 64){
            for (var y = y0; y < y0 + 64; y++){
                for (var x = x0; x < x0 + 64; x++){
                    b[x + y*width] = a[y + (width - 1 - x)*height];
                }
            }
        }
    }
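To make the comparison reproducible, here is a minimal Node.js harness for the two versions above (a sketch: the 4096x4096 size is taken from the C++ snippet further down this thread, and Uint32Array as pixel storage is my assumption):

```javascript
// Benchmark sketch: rotate a 4096x4096 "image" of 32-bit values,
// once with the naive double loop and once with 64x64 loop tiling.
const width = 4096, height = 4096, T = 64;
const a = new Uint32Array(width * height);
const b = new Uint32Array(width * height);
for (let i = 0; i < a.length; i++) a[i] = i; // fill with distinct values

function rotateNaive() {
    for (let y = 0; y < height; y++)
        for (let x = 0; x < width; x++)
            b[x + y * width] = a[y + (width - 1 - x) * height];
}

function rotateTiled() {
    for (let y0 = 0; y0 < height; y0 += T)
        for (let x0 = 0; x0 < width; x0 += T)
            for (let y = y0; y < y0 + T; y++)
                for (let x = x0; x < x0 + T; x++)
                    b[x + y * width] = a[y + (width - 1 - x) * height];
}

let t = Date.now(); rotateNaive(); console.log('naive:', Date.now() - t, 'ms');
t = Date.now(); rotateTiled(); console.log('tiled:', Date.now() - t, 'ms');
```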



Your 0.18 sec result is (in the units the article uses) 180 ms, and if I understand correctly their best compiled-and-executed WebAssembly result (?) is 300 ms. Beautiful.

EDIT: But it could also be that your computer is somewhat faster than theirs? Do you happen to have a very fast CPU? Can you say which? When I run C-like C++ versions of your code, I get the speeds you get with node.js. Still, you achieved much better results overall than they did; it's great work!

    #include <cstdio> // C++ source: it uses new[]
    int main(int argc, char* argv[]) {
        enum { height = 4096, width = 4096 };
        unsigned* a = new unsigned[ height*width ];
        unsigned* b = new unsigned[ height*width ];
        if ( argc < 2 ) { // call with no params
            // to measure overhead when just allocations
            // and no calculations are done
        printf( "%p %p\n", (void*)a, (void*)b );
            return 1;
        }
        if ( argv[1][0] == '1' ) // call with 1 the fastest
        for (unsigned y0 = 0; y0 < height; y0 += 64)
            for (unsigned x0 = 0; x0 < width; x0 += 64)
                for (unsigned y = y0; y < y0 + 64; y++)
                    for (unsigned x = x0; x < x0 + 64; x++)
                        b[x + y*width] = a[y + (width - 1 - x)*height];
        else
        for (unsigned y = 0; y < height; y++)
            for (unsigned x = 0; x < width; x++)
                b[x + y*width] = a[y + (width - 1 - x)*height];

        return 0;
    }


I think it's fast because of the L1 cache or something like that. I don't understand it fully, but this is what I got.


The fastest version is the fastest because it's the most cache-friendly of all the versions presented. See e.g.

https://stackoverflow.com/questions/5200338/a-cache-efficien...

But note that robko made an improvement even before making that.


> made an improvement even before

Or maybe not: my short experiments with the simplified version based on their algorithm and his JavaScript versions gave some conflicting results. I haven't verified them thoroughly; this note is just to motivate others to try.


I get 60ms in C. But in your code, the compiler might decide to remove most of the code since b is not used after being calculated. I checked the assembly code and it does not seem to be the case here, but it's still something to be aware of.


> I get 60ms in C

OK, I get circa 80 ms for my run with parameter 1 on my main computer, and 200 ms on an N3150 Celeron.

> b is not used after being calculated

I've never before seen a C compiler optimize away the call to the allocator and the accesses to arrays allocated that way. Maybe it's different now? Hm, dead-code elimination... I guess randomly initializing a few values before the loop, then reading and printing a few values after it, should always be safe. Come to think of it, so should filling the array with zeroes beforehand.


Maybe this is what you meant, but the snippet can be optimised a ton as well, unless I'm missing something:

- Move the "y * width" calculation outside of the "for x" loop.

- The multiply operations can be replaced with addition, e.g. replace "y * width" with a counter incremented by width on each y iteration, and similarly for the x loop.

Optimising inner loops is really fun.
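For illustration, the two bullet points applied by hand might look like this (a sketch; as others note below, an optimizing JIT usually does this itself, so don't expect a guaranteed win):

```javascript
// Hand-applied code motion + strength reduction on the rotation loop:
// rowBase replaces y*width (hoisted out of the x loop), and srcIdx
// replaces y + (width-1-x)*height, decreasing by `height` as x grows.
const width = 256, height = 256;
const a = new Uint32Array(width * height);
const b = new Uint32Array(width * height);
for (let i = 0; i < a.length; i++) a[i] = i;

for (let y = 0, rowBase = 0; y < height; y++, rowBase += width) {
    // source index for x = 0 in this row: y + (width - 1) * height
    let srcIdx = y + (width - 1) * height;
    for (let x = 0; x < width; x++, srcIdx -= height) {
        b[rowBase + x] = a[srcIdx];
    }
}
```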

How much of the speedup in the article is because the JS engine can't figure out how to optimise it, compared to the WebAssembly compiler?


These code motion/strength reduction optimizations are standard even in mildly optimizing compilers. I would be very surprised if an optimizing JavaScript compiler did not perform them automatically.


I tried a few micro-optimizations, but they did not make a measurable difference, so I kept the code short instead. But maybe some JIT is particularly bad at loop hoisting, so it might make a difference there.


Huh interesting! I always disliked butchering code to do processor cache optimizations and I kinda worked under the impression that a browser’s JS and wasm compilers would do these optimizations for me.

I’ll definitely give tiling a spin (although at this point we are definitely fast enough™️)


Can someone please explain why loop tiling increases performance in JS so dramatically? Is it mainly because the inner loops have a constant size (64) and get called more frequently, and thus get promoted faster into deeper stages of the JS runtime's optimization pipeline?

My guess is that if you invoke the original whole code (before tiling) in an external loop (rotating images of exactly the same size), you will get a similar perf boost (not that it has practical implications, but just to understand how the optimization works).


No, it's faster because the working set of 64 * 64 * 4 * 2 bytes can (almost) fit in a CPU core's L1 cache. Further cache levels are slower, and finally main memory is glacially slow.

The WASM example would speed up as well using the same approach. So would C, Rust, or whatever.
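For concreteness, here is the working-set arithmetic behind that number (the 4 bytes/pixel follows from 32-bit pixels; the 32-48 KiB L1 range is an assumption about typical desktop CPUs):

```javascript
// Bytes touched by one TxT tile: T*T pixels, 4 bytes each,
// in both the source and the destination array.
function tileWorkingSetBytes(T, bytesPerPixel = 4) {
    return T * T * bytesPerPixel * 2;
}

for (const T of [32, 64, 128])
    console.log(`T=${T}: ${tileWorkingSetBytes(T) / 1024} KiB`);
// 64x64 tiles touch 32 KiB, which fits a typical 32-48 KiB L1 data cache;
// 128x128 tiles already overflow it.
```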


To add background: this is a standard optimization technique that has been employed in e.g. Fortran compilers since at least the 1980s.


Doesn't this rely on the CPU prefetching memory into cache? Do current CPUs from Intel and AMD detect access patterns like this successfully, i.e. where you're accessing 64-element slices of a bigger array with a specific stride?


The idea is that the Y dimension is going to have a limited number (here 64) of hot cache lines while a tile is processed. After going through one set of 64 vertical lines, the Y accesses are going to be near the Y accesses from the previous outer-tile-loop iteration.

(Stride-detecting prefetch can help, especially on the first iteration of a tile, but is not required for a speedup.)

BTW this is the motivation for GPUs (and sometimes other graphics applications) using "swizzled" texture/image formats, where pixels are organised into various kinds of screen-locality preserving clumps. https://fgiesen.wordpress.com/2011/01/17/texture-tiling-and-...
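As a toy illustration of swizzling (not any specific GPU format), a Morton/Z-order index interleaves the bits of x and y, so pixels that are close in 2D end up close in the 1D address space:

```javascript
// Spread the low 16 bits of n out to the even bit positions.
// Classic "Part1By1" bit trick, here written for 32-bit JS integers.
function part1By1(n) {
    n &= 0x0000ffff;
    n = (n | (n << 8)) & 0x00ff00ff;
    n = (n | (n << 4)) & 0x0f0f0f0f;
    n = (n | (n << 2)) & 0x33333333;
    n = (n | (n << 1)) & 0x55555555;
    return n;
}

// Z-order (Morton) index: x bits go to even positions, y bits to odd ones.
function mortonIndex(x, y) {
    return part1By1(x) | (part1By1(y) << 1);
}
```

With this layout, any aligned 2^k x 2^k block of pixels occupies one contiguous range of addresses, which is exactly the locality property that tiling exploits.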


I tested these two pieces of code in different browsers on i7-8750H with 16GB of RAM.

Chrome: 248 ms vs 93 ms

Firefox: 552 ms vs 93 ms

MS Edge: 7486 ms vs 6186 ms

IE: 9590 ms vs 9156 ms

These are some WTF results, to be honest.



> As I understand it, the main goal was to achieve easily readable and maintainable code, even to the detriment of performance.

Seems like a tricky goal for image algorithms in general where you're performing the same action over and over on millions of pixels. Obscure inner loop optimisations are pretty much required.

In these situations, I would sometimes keep the code for the naive but slow version around next to the highly optimised but difficult to understand version. You can compare the output of them to find bugs as well.
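That cross-checking setup can be as simple as the following sketch (function names and the random-input choice are mine, not from the article):

```javascript
// Keep the naive reference implementation next to the optimized one
// and diff their outputs on random inputs to catch bugs in the fast path.
function rotateNaive(a, w, h) {
    const b = new Uint32Array(w * h);
    for (let y = 0; y < h; y++)
        for (let x = 0; x < w; x++)
            b[x + y * w] = a[y + (w - 1 - x) * h];
    return b;
}

function rotateTiled(a, w, h, T = 64) {
    const b = new Uint32Array(w * h);
    for (let y0 = 0; y0 < h; y0 += T)
        for (let x0 = 0; x0 < w; x0 += T)
            for (let y = y0; y < y0 + T; y++)
                for (let x = x0; x < x0 + T; x++)
                    b[x + y * w] = a[y + (w - 1 - x) * h];
    return b;
}

// Compare on a random square image whose side is a multiple of the tile size.
const w = 256, h = 256;
const img = Uint32Array.from({ length: w * h },
    () => (Math.random() * 0xffffffff) >>> 0);
const ref = rotateNaive(img, w, h);
const fast = rotateTiled(img, w, h);
for (let i = 0; i < ref.length; i++)
    if (ref[i] !== fast[i]) throw new Error(`mismatch at index ${i}`);
console.log('outputs match');
```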


> My guess is that the JS implementation of the worst-performing browser is having trouble with the non-1 for-loop steps.

Why would a non-1 for-loop step be slower in some browsers? Does the compiler add some sort of prefetch instruction in the faster browsers based on the loop increment?



