
> Vector API, which is useful in cases the JIT cannot automatically vectorise loops

For vertical-only operations (easily vectorizable loops), people often use GPUs instead of SIMD on CPUs. Graphics cards have far more resources, in both FLOPS and memory bandwidth.

I don't think sorting an array has anything to do with vectorizing a loop.




It would not be hard to write a quicksort which could be automatically vectorized. For example, this is a perfectly valid implementation of the partition step around a pivot.

    int partition(int[] array, int from, int to, int pivot) {
        int[] buf1 = new int[to - from];
        int[] buf2 = new int[to - from]; // buffers could be preallocated, as in the blogpost

        int idx1 = 0;
        int idx2 = 0;

        for (int i = from; i < to; i++) {
            int value = array[i];
            if (value < pivot) {
                buf1[idx1++] = value;
            }
        }

        for (int i = from; i < to; i++) {
            int value = array[i];
            if (value >= pivot) {
                buf2[idx2++] = value;
            }
        }

        System.arraycopy(buf1, 0, array, from, idx1);
        System.arraycopy(buf2, 0, array, from + idx1, idx2);
        return from + idx1; // index of the first element >= pivot
    }

It would be relatively straightforward for a compiler to automatically vectorize the two loops I've coded up. I don't know if any compiler actually does, but with the vpcompress instruction available, I think the absence of one doing it automatically is simply a function of this not being a common use case.
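
For reference, here is roughly what the explicit version of the first loop looks like with the incubating Vector API (my own sketch, not from the article; it assumes a JDK recent enough to have Vector::compress). On AVX-512 hardware the compress call should map to that same vpcompressd idea.

    import jdk.incubator.vector.*;

    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

    // Writes the elements of array[from..to) that are < pivot contiguously into buf
    // and returns how many there were. buf must hold at least (to - from) ints.
    static int partitionBelow(int[] array, int from, int to, int pivot, int[] buf) {
        int idx = 0;
        int i = from;
        int upper = from + SPECIES.loopBound(to - from);
        for (; i < upper; i += SPECIES.length()) {
            IntVector v = IntVector.fromArray(SPECIES, array, i);
            VectorMask<Integer> m = v.compare(VectorOperators.LT, pivot);
            v.compress(m).intoArray(buf, idx); // selected lanes packed to the front, rest zero
            idx += m.trueCount();              // the zero filler is overwritten or ignored
        }
        for (; i < to; i++) {                  // scalar tail
            if (array[i] < pivot) buf[idx++] = array[i];
        }
        return idx;
    }

The >= pivot half is the same thing with the comparison flipped.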


Seems that clang doesn't succeed at vectorizing that [0], instead just doing a bunch of boring regular unrolling. I'd assume it wouldn't be too hard for clang to understand that pattern, it's just that no one has bothered to (and it won't pop up much in regular code, because usually it'd be done with vector pushes rather than a preallocated buffer).

[0]: https://godbolt.org/z/hzsdhGorr


It looks like GCC and MSVC don't either, but ICC does.


Good point on ICC! I've compared against it a couple of times before, but never got any satisfying results, so I forgot about it.


I think clang 13 has the best automatic vectorizer so far, yet it failed to do the job: https://godbolt.org/z/Y9796r1T1

This is despite the fact that, unlike Java's runtime, C++ compilers work offline and don't care too much about compilation speed; a JIT compiler simply doesn't have time for expensive optimizations.


> a JIT compiler simply doesn’t have time for expensive optimizations

That is a common misconception, which might apply to some JIT compilers. For example, OpenJDK's JIT needs to compile only the hot methods, and even for those it employs tiered compilation, where more powerful optimisations are applied over time, so the JIT can take as much time as it needs (it runs concurrently with the already-compiled program) and has more profiling information than is available to an AOT compiler. I.e. the same optimisation might be much quicker for a JIT than for an AOT compiler, because the JIT doesn't need to do as deep an analysis of the program.

The advantage AOT compilers have over JITs is that they don't require warmup and don't consume resources at runtime, but JITs have a greater potential for optimisation, and AOT compilers generally require more low-level information from the programmer to produce good code. There are various projects for caching JIT output to get something close to the best of both worlds.

Having said that, compilers aren't all developed on the same schedule, and different ones might prioritise different optimisations.


Another thing that critics forget is that all modern JVM implementations provide JIT caches with PGO feedback, so the quality of the code can in theory be improved to some local maximum after several runs.

.NET Framework 4.6 did provide a mechanism for this (MPGO, Managed PGO), but it required calling into runtime APIs and was a bit clunky. Starting with .NET 6, they use an approach more similar to the JVMs'.

As far as I am aware, only the Java (including its Android stepchild) and .NET ecosystems have such JITs with PGO feedback between runs.


Also, the JVM can frequently avoid compiling large swaths of the code entirely (my methods ended up with about 10 uncommon traps each). https://shipilev.net/jvm/anatomy-quarks/29-uncommon-traps/


I don't see an easy way to vectorize those loops, given that the loop iterations have a dependency on idx1 and idx2, respectively.

Can you explain?

(Also, if you count idx2 down from the end of the buffer, I think you need only one buffer and one System.arraycopy call, at the cost of no longer having a stable sort.)
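
Something like this scalar sketch is what I have in mind (hypothetical code, not from the post):

    // One buffer, one copy back: elements < pivot fill buf from the front, the rest
    // fill it from the back (in reverse order, hence no stability).
    // buf must hold at least (to - from) ints.
    static int partitionOneBuffer(int[] array, int from, int to, int pivot, int[] buf) {
        int lo = 0, hi = to - from;
        for (int i = from; i < to; i++) {
            int value = array[i];
            if (value < pivot) buf[lo++] = value;
            else               buf[--hi] = value;
        }
        System.arraycopy(buf, 0, array, from, to - from);
        return from + lo; // first index holding an element >= pivot
    }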


In AVX-512, there's an instruction, vpcompressd, which literally does the equivalent of 16 iterations of that loop (the "store a subset of items to memory contiguously" part, at least). The compiler would have to figure out how the conditional write and increment work together, but it's definitely possible, as ICC manages to do it: https://godbolt.org/z/n68ca4aYe

For non-AVX-512, the compression part gets more complicated, but you can still do it with a shuffle and a lookup table of indexes derived from the mask.
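
Sticking with the Java Vector API rather than raw intrinsics for illustration (and assuming the 256-bit, 8-lane int species plus the same buffer sizing as in the sketch further up the thread), the lookup-table fallback could look roughly like this, as a drop-in for a compress-and-store:

    // Each row holds, for one possible 8-bit mask, the lane numbers of its set bits
    // packed to the front; the unused slots stay 0 and are ignored later.
    static final int[][] COMPRESS_LUT = new int[256][8];
    static {
        for (int m = 0; m < 256; m++)
            for (int lane = 0, k = 0; lane < 8; lane++)
                if ((m & (1 << lane)) != 0) COMPRESS_LUT[m][k++] = lane;
    }

    // Replacement for v.compress(m).intoArray(buf, idx) when no native compress exists:
    // permute the selected lanes to the front and store the whole vector.
    static int compressStore(IntVector v, VectorMask<Integer> m, int[] buf, int idx) {
        int bits = (int) m.toLong(); // 8-lane mask -> value in [0, 255]
        VectorShuffle<Integer> sh =
            VectorShuffle.fromArray(IntVector.SPECIES_256, COMPRESS_LUT[bits], 0);
        v.rearrange(sh).intoArray(buf, idx); // unselected lanes land past the count
        return idx + Integer.bitCount(bits); // and are overwritten or ignored later
    }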



