> Vector API, which is useful in cases the JIT cannot automatically vectorise loops
For vertical-only operations (vectorizable loops), people often use GPUs instead of SIMD on CPUs. Graphics cards have far more resources, in both GFLOPS and memory bandwidth.
I don't think sorting an array has anything to do with vectorizing a loop.
It would not be hard to write a quicksort that could be automatically vectorized. For example, this is a perfectly valid implementation of the partition step around a pivot.
int partition(int[] array, int from, int to, int pivot) {
    int[] buf1 = new int[to - from];
    int[] buf2 = new int[to - from]; // buffers could be preallocated, as in the blog post
    int idx1 = 0;
    int idx2 = 0;
    for (int i = from; i < to; i++) {
        int value = array[i];
        if (value < pivot) {
            buf1[idx1++] = value;
        }
    }
    for (int i = from; i < to; i++) {
        int value = array[i];
        if (value >= pivot) {
            buf2[idx2++] = value;
        }
    }
    System.arraycopy(buf1, 0, array, from, idx1);
    System.arraycopy(buf2, 0, array, from + idx1, idx2); // the >= pivot half goes after the < pivot half
    return from + idx1; // index where the >= pivot half begins
}
It would be relatively straightforward for a compiler to automatically vectorize the two loops I've coded up. I don't know if there's one that would, but given the vpcompress instruction, I think the absence of a compiler doing it automatically is simply a function of it not being a common use case.
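To illustrate, here is roughly what the first loop looks like when vectorized by hand with Java's incubating Vector API (JDK 19+, which added compress/expand; run with --add-modules jdk.incubator.vector). This is only a sketch: the partitionBelow name and the preallocated buf parameter are my own. On AVX-512 hardware, compress() should lower to vpcompressd.

import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

// Copies every element of array[from..to) below pivot into buf, packed
// contiguously, and returns how many there were. buf must have length
// at least (to - from).
static int partitionBelow(int[] array, int from, int to, int pivot, int[] buf) {
    int idx = 0;
    int i = from;
    int upper = from + SPECIES.loopBound(to - from);
    for (; i < upper; i += SPECIES.length()) {
        IntVector v = IntVector.fromArray(SPECIES, array, i);
        VectorMask<Integer> m = v.compare(VectorOperators.LT, pivot);
        v.compress(m).intoArray(buf, idx); // selected lanes packed to the front
        idx += m.trueCount();              // only advance past the selected lanes
    }
    for (; i < to; i++) {                  // scalar tail
        if (array[i] < pivot) buf[idx++] = array[i];
    }
    return idx;
}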
Seems that[0] clang doesn't succeed at vectorizing that, instead just doing a bunch of boring regular unrolling. I'd assume it wouldn't be too hard for clang to recognize that pattern; no one has bothered to implement it (and it won't pop up much in regular code, because it would usually be written with vector pushes rather than a preallocated buffer).
This is despite the fact that, unlike Java's runtime, C++ compilers work offline and don't care too much about compilation speed, whereas supposedly a JIT compiler simply doesn't have time for expensive optimizations.
> a JIT compiler simply doesn’t have time for expensive optimizations
That is a common misconception, which might apply to some JIT compilers. OpenJDK, for example, only needs to compile the hot methods, and even then it employs tiered compilation, where more powerful optimisations are applied over time. The JIT can therefore take as much time as it needs (it runs concurrently with the already-compiled program) and has more profiling information than is available to an AOT compiler. I.e. the same optimisation might be much quicker for a JIT than for an AOT compiler, because the JIT doesn't need to do as deep an analysis of the program.
The advantage AOT compilers have over JITs is that they don't require warmup and don't consume resources at runtime, but JITs have a greater potential for optimisation, and AOT compilers generally require more low-level information from the programmer to produce good code. There are various projects for caching JIT output to get something that's close to the best of both worlds.
Having said that, compilers aren't all developed on the same schedule, and different ones might prioritise different optimisations.
Another thing critics forget is that all modern JVM implementations provide JIT caches with PGO feedback, so the quality of the code can in theory improve toward some local maximum over several runs.
.NET Framework 4.6 did provide such a mechanism (MGO, Managed PGO), but it required calling into runtime APIs and was a bit clunky. Starting with .NET 6, they use an approach more similar to the JVMs'.
As far as I am aware, only the Java (including its Android stepchild) and .NET ecosystems have such JITs with PGO feedback between runs.
In AVX-512, there's an instruction vpcompressd, which literally does the equivalent of 16 iterations of that loop (the "store a subset of items to memory contiguously" part at least). The compiler would have to figure out how the conditional write & increment work together, but it's definitely possible as ICC manages to: https://godbolt.org/z/n68ca4aYe
For non-AVX-512, it gets more complicated to do the compression part, but you can still do it with a shuffle & lookup table of indexes from the mask.
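In Vector API terms, that same lookup-table trick can be sketched with rearrange: for each of the 256 possible 8-lane mask values, precompute the permutation that moves the selected lanes to the front, then shuffle and store. A rough illustration (the PERM table and compressStore helper are my own names, not a library API):

import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorShuffle;
import jdk.incubator.vector.VectorSpecies;

// An 8-lane species, so a mask fits in one byte and the table has 256 rows.
static final VectorSpecies<Integer> S8 = IntVector.SPECIES_256;
static final int[] PERM = buildTable();

static int[] buildTable() {
    int[] t = new int[256 * 8];
    for (int m = 0; m < 256; m++) {
        int k = 0;
        for (int lane = 0; lane < 8; lane++) {
            if ((m & (1 << lane)) != 0) t[m * 8 + k++] = lane;
        }
        // Unused slots stay 0; those lanes get stored but are overwritten later.
    }
    return t;
}

// Emulated compress-store: permute the selected lanes to the front, write a
// full vector, and advance idx only by the number of selected lanes. The
// caller must leave 7 ints of slack in buf after the last element it needs.
static int compressStore(IntVector v, VectorMask<Integer> m, int[] buf, int idx) {
    int bits = (int) m.toLong(); // the 8 mask bits select the table row
    VectorShuffle<Integer> sh = VectorShuffle.fromArray(S8, PERM, bits * 8);
    v.rearrange(sh).intoArray(buf, idx);
    return idx + m.trueCount();
}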