SIMD accelerated sorting in Java – how it works and why it was 3x faster

geokon · on June 12, 2022

Great article. Very comprehensive and well written, while not skipping the gotchas the author hit along the way

I don't really much about how the JVM works, but it's interesting that this is (if I understood correctly) in effect calling out to some native code. In the JVM world, when are things added to the byte code as a new instruction, and when are they wrapped up in a native code call?

I'd expect a new vector instruction for the stack machine, but I'm guessing that's less flexible and less performant..?

"Practically speaking, you end up with code that looks like it creates many objects, but actually compiles to code that allocates no memory. It is extremely disconcerting."

It's interesting comparing people working in Java vs C++. Java makes it so impractical to really reason about the code directly that users ends up relying a lot more on tooling for benchmarking and introspection (and hence the tools are top-class)

"Java’s ability to print its compiler output as assembly "

How do you do this? Is this integrated into the IDE?

While the tools are great and you can dissect all your performance problems, ultimately feedback from the compiler should be feeding back into the editor. I feel that some of the issues the author hit shouldn't happen. You should for instance be immediately able to see if a function is being inlining while you're editing and not having to run extra tools

pron · on June 12, 2022

> In the JVM world, when are things added to the byte code as a new instruction, and when are they wrapped up in a native code call?

These aren't native code calls but compiler intrinsics. The Java compiler (the JIT compiler, that is) translates certain Java method invocations into specific machine instruction patterns.

> Java makes it so impractical to really reason about the code directly that users ends up relying a lot more on tooling for benchmarking and introspection (and hence the tools are top-class)

As others have mentioned, you can dump the generated machine code, but you are correct that Java is, by design, higher level than C++ in the sense that low level details are less under the programmer's control, but the compiler's. So when you see a `new Foo()` in Java, you cannot tell whether an object will actually be allocated on the heap, or perhaps "scalarized" and stored directly in registers. In fact, what happens is not even decided statically and could vary dynamically, depending on what other code does, as compilation is done JIT, and optimisations are decided based on the program's current execution profile. As a result, even though all instance methods in Java are semantically virtual, in practice they are very often inlined, and don't compile even to a call instruction, let alone a vtable call.

Unlike C++, Java's goal — generally speaking — is to get very good average-case performance for little effort rather than give the programmer a lot of knobs to control the worst case. Of course, sometimes there are knobs — just like the Vector API, which is useful in cases the JIT cannot automatically vectorise loops — but the balance is different from C++.

geokon · on June 12, 2022

Thanks for the explanation. So I'm gathering that the distinction is inlined native code will be explicit while an intrinsic is up to the compiler. The intrinsic provides a fallback if you run on a dumb/old JVM. That does seem much cleaner than adding new byte codes

pron · on June 12, 2022

Yes. Also, the compiler knows what the intrinsics do (or don't do), so they participate in various compiler analyses, unlike an opaque native calls, that the compiler is not usually able to move around.

Const-me · on June 12, 2022

> Vector API, which is useful in cases the JIT cannot automatically vectorise loops

For vertical-only operations (vectorizable loops), people often using GPUs instead of SIMD in CPUs. Graphics cards have way more resources, both GFlops and memory bandwidth.

I don't think sorting an array has anything to do with vectorizing a loop.

CHY872 · on June 12, 2022

It would not be hard to write a quicksort which could be automatically vectorized. For example, this is a perfectly valid partition implementation of the pivot operation.

    int partition(int[] array, int from, int to, int pivot) {
        int[] buf1 = new int[to - from];
        int[] buf2 = new int[to - from]; // buffers could be preallocated, as in the blogpost

        int idx1 = 0;
        int idx2 = 0;

        for (int i = from; i < to; i++) {
            int value = array[i];
            if (value < pivot) {
                buf1[idx1++] = value;
            }
        }

        for (int i = from; i < to; i++) {
            int value = array[i];
            if (value >= pivot) {
                buf2[idx2++] = value;
            }
        }

        System.arraycopy(buf1, 0, array, from, idx1);
        System.arraycopy(buf2, 0, array, from, idx2);
    }

It would be relatively straightforward for a compiler to automatically vectorize the two loops I've coded up. I don't know if there's one that would, but with the vpcompress instruction I think the absence of a compiler doing it automatically would simply be a function of it not being a common use case.

dzaima · on June 12, 2022

Seems that[0] clang doesn't succeed at vectorizing that, instead just doing a bunch of boring regular unrolling. I'd assume it wouldn't be too hard for clang to be able to understand that pattern, just noone has bothered to (and it won't pop up much in regular code because usually it'd be done with vector pushes and not a preallocated buffer).

[0]: https://godbolt.org/z/hzsdhGorr

Jap2-0 · on June 12, 2022

It looks like GCC and MSVC don't either, but ICC does.

dzaima · on June 12, 2022

good point on ICC! I've compared against it a couple times before, but never got any satisfying results so forgot about it.

Const-me · on June 12, 2022

I thing clang 13 has the best automatic vectorizer so far, yet it failed to do the job: https://godbolt.org/z/Y9796r1T1

This is despite unlike Java’s runtime, C++ compilers work offline, they don’t care too much about compilation speed, a JIT compiler simply doesn’t have time for expensive optimizations.

pron · on June 12, 2022

> a JIT compiler simply doesn’t have time for expensive optimizations

That is a common misconception, which might apply to some JIT compilers. For example, OpenJDK does not only need to compile only the hot methods, but even then it employs tiered compilation, where more powerful optimisations are made over time, and so the JIT can take as much time as needed (it runs concurrently with the already-compiled program) and has more profiling information than available to an AOT compiler. I.e. the same optimisation might be much quicker for a JIT than for an AOT compiler because the JIT doesn't need to do as deep an analysis of the program.

The advantage AOT compilers have over JITs is that they don't require warmup and they don't consume resources at runtime, but JITs have a greater potential for optimisation, and AOT compilers generally require more low-level information from the programmer to produce good code. There are various projects for caching JIT output and get something that's close to the best of both worlds.

Having said that, compilers aren't all developed on the same schedule, and different ones might prioritise different optimisations.

pjmlp · on June 12, 2022

Also another thing that critics forget is that all modern JVM implementations provide JIT caches with PGO feedback, so the quality of code can in theory be improved to some local maximum, after several runs.

.NET Framework 4.6 did provide some mechanism (MGO, Managed PGO), but did required calling into runtime APIs and was a bit clunky. Starting with .NET 6 they are using an approach more similar to the JVMs.

As far as I am aware, only Java (including the Android stepchild) and .NET ecosystems do have such JITs with PGO feedback between runs.

CHY872 · on June 12, 2022

Also, the JVM can frequently totally avoid compiling large swaths of the code (my methods ended up with about 10 uncommon traps in each method). https://shipilev.net/jvm/anatomy-quarks/29-uncommon-traps/

Someone · on June 12, 2022

I don’t see an easy way to vectorize those loops, given that the loop iterations have a dependency on idx1, respectively idx2.

Can you explain?

(Also, if you loop idx2 down from to, I think you need only one buffer and one System.arraycopy call, at the cost of no longer having a stable sort)

dzaima · on June 13, 2022

In AVX-512, there's an instruction vpcompressd, which literally does the equivalent of 16 iterations of that loop (the "store a subset of items to memory contiguously" part at least). The compiler would have to figure out how the conditional write & increment work together, but it's definitely possible as ICC manages to: https://godbolt.org/z/n68ca4aYe

For non-AVX-512, it gets more complicated to do the compression part, but you can still do it with a shuffle & lookup table of indexes from the mask.

pjmlp · on June 12, 2022

If you use Oracle's own IDE, it will support it out of the box, as it already did on Sun's days.

Then there are other ways depending on which JVM implementation is used.

On OpenJDK's case you can load runtime plugin to do it

https://github.com/AdoptOpenJDK/jitwatch

geokon · on June 12, 2022

Oh very cool. I'm really glad these things exist. I skimmed the intro video and it does looks like a separate tool though. Would be extra slick if these things got highlighted as you coded. But I get that usually you only really bring out the big guns when you're optimizing a tight loop somewhere - and then you probably want the whole toolbox and not just some annotations.

Thanks for letting me know about this :)

pjmlp · on June 12, 2022

The reasoning is that these are development tools and thus shouldn't be shipped by default.

I don't agree, but it is as it is, at least it does exist in some form.

Here .NET is better, you can just press F12 (Show Assembly) during debug session, and it will show it just like for C and C++ sessions.

Compiler explorer and sharplab make use of it.

Node allows to show some Assembly stuff, but if you want all the possible debugging options for Assembly code, you will need a debug build of nodejs.

dzaima · on June 12, 2022

You can't really have bytecode instructions for the vector API functions because they're generic. They're instead methods, which the JIT compiler can handle specially if it so chooses to. Effectively, for pure computational operations (i.e. not ones doing complex stack operations or jumps), there's no real difference between a method call and a new bytecode (except that adding a new bytecode is a lot more work because everything must be updated to support it, whereas a new method is just a new method).

chrisseaton · on June 12, 2022

> You should for instance be immediately able to see if a function is being inlining while you're editing

But it’s a dynamic decision, except in the most trivial cases. How can you know it without running the program?

grishka · on June 12, 2022

This whole "the JVM inlines things for you and avoids the allocation of objects altogether sometimes" goes so against my intuition after more than a decade of writing Java, mostly for Android. The idea that you should allocate as few short-lived objects as possible has stuck with me. So these new APIs do make me feel uncomfortable a bit, they make me think "oh man, this does look like it's gonna have some performance issues due to too much time spent in GC deleting all those objects".

I do understand that the regular (Hotspot?) JVM must be much smarter with optimizations than the Android runtime, but still.

pjmlp · on June 12, 2022

Android Java is not like proper Java, including how ART works. At least they dropped Dalvik which was comparable to early JVM implementations pre-JIT days, think Java 1.2.

Then there is the whole issue that Google isn't that up the game into making the same kind of optimizations as the Java community, and rather calls into C++ via native methods than optimizing ART more than good enough.

Finally the sore state of proper Java support comes from their beef with Sun and Oracle initially, and now their agenda to push Kotlin no matter what, and every modern Java feature that becomes one red less against Kotlin doesn't help.

Hence why most of their Java vs Kotlin samples are mostly based on Java 7.

If the Android team really cared, they would provide equal support for both languages and then let the community pick up which one they were found of.

Even the recently announced improved support for Android 13 is based on a Java 11 LTS subset, when the more recent LTS version is 17, and Java 19 is a couple of months away.

vips7L · on June 12, 2022

Fwiw, as a Java fanboy, I do think Kotlin/Compose is way better for creating GUI’s than what Java currently has to offer. I recently went and tried some swing again and it wasn’t enjoyable, but like you said it’s like this for a reason.

pjmlp · on June 12, 2022

Yeah, let Google pretend Groovy didn't do it first.

https://groovy-lang.org/swing.html

tpxl · on June 12, 2022

IIRC, the 'modern' way to create java GUIs is using JavaFX, which should have better development experience.

vips7L · on June 12, 2022

It’s a little better, but I still don’t think it’s on par with compose.

layer8 · on June 12, 2022

> The idea that you should allocate as few short-lived objects as possible has stuck with me.

I don’t know about Android, but since generational garbage collectors have become the default, the rule is that short-lived objects are dirt cheap, because they’ll be collected all at once in O(1) with the young generation. And since allocating them is usually just a pointer increment, they can be as cheap as stack objects.

snek_case · on June 12, 2022

> since allocating them is usually just a pointer bump, they can be as cheap as stack objects.

I don't think that's quite true. Even with a copying/moving GC, you still need to traverse all the live objects and then copy all of them. Asymptotically as cheap as stack objects maybe, but in reality the overhead is greater. GCs also have some amount of overhead due to all the synchronization they need to do.

layer8 · on June 12, 2022

> you still need to traverse all the live objects and then copy all of them.

The live objects traversed are generally not short-lived objects, because those aren’t live anymore! Meaning, short-lived objects typically aren’t traversed and therefore don’t contribute to the GC cost like longer-lived objects do.

dc-programmer · on June 12, 2022

I’m curious how the JVM efficiently allocates many small objects. How does it avoid memory fragmentation? Or does it request a large block of memory from the OS at the start? (last q is probably easily googled)

layer8 · on June 12, 2022

The JVM allocates small objects by incrementing a pointer in the “young generation” region. The GC later moves all objects that are still alive from that region to a different region. The “young” region can then be reused from scratch. The moving of objects effectively defragments (compacts) that region of the heap. Modern GCs use multiple per-thread and/or per-core regions, i.e. there are generally multiple “young” regions, not just a single one. Memory is allocated from the OS in large chunks.

There is intermediate fragmentation due to dead-but-not-yet-collected objects. Together with the use of different generational regions, GC languages thus require more memory (a rule of thumb is twice the memory of a non-GC program), but memory is cheap, and not having to reference-count and deallocate each object individually can conversely have performance benefits.

dc-programmer · on June 12, 2022

Ok thanks, that helped clear some things up. I didn’t realize generational GC reuses the same contiguous memory blocks for the young regions. Makes sense from a fragmentation and resident page locality perspective.

Do you know if the early generation region settings can be tweaked (alignment, size, number of threads/regions)? I’m wondering what happens if you “overflow” these areas by generating too many objects

layer8 · on June 12, 2022

There are many parameters you can tweak, for sure. Overflowing an area probably triggers an immediate GC run, and if that doesn’t free up enough space, an additional “young” region is used. You only need to specify a high enough maximum value for the total memory consumption of all generations.

One situation you can run into is if you generate and quickly drop objects faster than the GC can collect them, or rather, the process spends most of its time with GC rather than with actual program execution, then after a certain while an exception will be raised. That situation almost always indicates a bug in your program, and the exception helps to see the cause of the program stalling. Of course that behavior is also configurable.

dc-programmer · on June 12, 2022

Excellent. I appreciate the information. I’m just getting into the JVM dev ecosystem so I have a lot to learn

tpxl · on June 12, 2022

The latest GCs are _fast_. I benchmarked ZGC and IIRC it ate up something like 4GB/s of garbage (24 threads doing nothing but allocating) with about 2 milliseconds of stop-the-world pause time over around 3 minutes of runtime.

adgjlsfhk1 · on June 12, 2022

Note that benchmarking GCs properly is really hard. Changes in size distribution, tree shape, and lifespan can lead to drastically different results (and to make matters worse, the type of code that is easiest to write as a benchmark tends to be types of code that GC can handle really well).

dc-programmer · on June 12, 2022

Looks like it’s complicated. It preallocates memory on the heap, but on Linux this would be virtual memory whose pages have to loaded later. Also more memory will have be requested if the initial amount is insufficient. And this doesn’t include any stack or swap resources the JVM might use.

dale_glass · on June 12, 2022

I think this problem occurs at many places in the industry. Developer learns a thing, then keeps trying to apply it two decades later, long after it became completely obsolete.

That's how you get things like people unrolling a loop by hand. Compilers have known how to do it for ages, and have built-in data about what works best on what CPU. 99% of the time they do a better job than a programmer that developed a pattern back when they were writing for a 386.

These days it often works better to write very straightforward, non-clever code and let the compiler figure out how to optimize it.

Const-me · on June 12, 2022

Here’s an example of a simple C++ function, manually vectorized + unrolled for optimal performance on modern processors: https://stackoverflow.com/a/59495197/126995

Of all C++ compilers, only clang generates similar machine code from a straightforward source. And only with some special compiler switches, most importantly the -fast-math. Dot product of 2 vectors is a trivially simple algorithm, sum( a[ i ] * b[ i ], i = 0..N-1 ), I wouldn’t expect clang to auto-vectorize more complicated stuff. Finally, C++ compilers are designed to work offline so the optimizer can afford to spend substantial CPU time searching for a best optimization, a JIT compiler like Java runtime simply doesn’t have time for that.

dale_glass · on June 12, 2022

That's not quite what I'm saying though. There's of course going to be exceptions, especially in highly specialized cases.

> Finally, C++ compilers are designed to work offline so the optimizer can afford to spend substantial CPU time searching for a best optimization, a JIT compiler like Java runtime simply doesn’t have time for that.

Not necessarily. A JIT can also see exactly what the program is doing, see that 1000 out of a million lines of code are performance critical, and throw all the effort on optimizing that, while armed with stats about how the program works in practice and which branches are taken how often.

You also don't need to wait for it, since there's no reason why that work can't be done on a separate thread without blocking execution.

It can also generate native code for your specific CPU, so it may well do much better than GCC there.

Const-me · on June 12, 2022

I agree about the scalar code, I know from experience JIT compilers can be awesome, sometimes they indeed outperform AOT.

Automatic vectorization, on the other hand…

Modern SIMD arrived in 1999, SSE1 in Pentium III. For the 20+ years which followed, very smart compiler developers tried to improve their automatic vectorizers. Yet they only achieved very limited success so far.

They do a good job when all of the following is true: (1) pure vertical operations (2) no branches or conditions (3) the numbers being handled are either FP32 or FP64.

I think building a sufficiently good automatic vectorizer is borderline impossible task. Even when the runtime is very sophisticated like modern Java, with several of these progressively better versions based on the real-time performance profiler data, the problem is still extremely hard to solve.

For instance, here’s a fast way to sort 4 floats with SSE https://godbolt.org/z/c97Yf5js8 I don’t believe a compiler could have possibly figured out these shuffles and blends from any reasonable scalar implementation.

kaba0 · on June 12, 2022

Android’s Java is pretty much a bastard child of OpenJDK - it is so behind on features that it is truly a shame.

my123 · on June 12, 2022

Android 13 started to give some OpenJDK 11 features... but ugh - nothing that we can widely rely on yet.

Core Java 11 features will be backported to Android 12 with ART as an APEX module, but that didn't happen quite yet as far as I can see.

izacus · on June 12, 2022

Which is why the language of choice for Android is Kotlin which has all modern features without the necessity of changing the runtime.

pjmlp · on June 13, 2022

It might look like it, yet it is deceptive.

Unless Kotlin decides to stop following JVM capabilities and interoperability with modern Java features, those capabilities need to exist in Android as well.

Otherwise, the JVM Kotlin libraries won't have a place on Android, unless they get coded twice, with Kotlin's version of #ifdef.

Google's surprise decision to update Java support to core Java 11 LTS, was most likely triggered by the Java ecosystem now basing on Java 11 LTS as the minimum version, and they want to keep those libraries somehow available on Android.

grishka · on June 13, 2022

Actually, "desugaring" is a thing. It's a backport of those new parts of the standard library that the dex converter inserts into your app. Currently, it's only Java 8, but I had no trouble building an app that uses Java 17. You just can't use any features that rely on new classes or methods, like records. But purely syntactical ones do work flawlessly.

I don't like Kotlin because of how much it tries to do in the language itself with maximally terse syntax, and because of its asinine defaults like final-by-default classes and immutable-by-default everything. Java's sugar is much better thought out.

ErikCorry · on June 12, 2022

The section on "dictionary encoding" didn't make much sense to me. Is there an explanation of this technique somewhere? The example code took a list called "buffer" and then didn't use it for anything - perhaps something got lost when writing the code?

The explanation was pretty opaque: Here’s one simple example I encountered recently. It’s common in columnar database code to ‘dictionary encode’ lists of values. Here, for some subset of the stored data, we store a dictionary containing the unique values that exist, assigning unique indices to each value, and we store a list of pointers into the list to represent each element.

We store a list of pointers into which list? The dictionary contains the unique values as keys or as values? How do we pick which subset of the stored data we do this for?

kwillets · on June 13, 2022

The buffer in this case is the input to a max() function; his point is simply that sorting the dictionary indices before looking them up is a good idea for common data structures like arrays, disk blocks, or anything with caching or readahead.

kaba0 · on June 12, 2022

Would some of the mentioned drawbacks be solved by having Project Panama's other half be available? What I mean is allocating memory manually, instead of using Java primitive arrays inside the Vector API. Would the (de)serialization overhead be too much for it to be worth it?

kwizzt · on June 12, 2022

I'm really looking forward to when SIMD support is released in later versions of Java. If anyone here follows handmade hero (by Casey), he explained SIMD in some of his videos when implementing the software renderer. The speedup from regular code to SIMD is very impressive. You essentially get speedup of 2x-3x or even more for free.

The downside is it makes the code much harder to read, so I think it should be applied only in very small sections of the code.

hereforphone · on June 12, 2022

[flagged]

kwizzt · on June 12, 2022

What does Java have to do with Indian university? Java/JVM is widely popular in the enterprise world for backend services, and a ton of open source software, e.g. Cassandra, Hadoop. You need to learn more stuff for sure.

ecshafer · on June 12, 2022

Java is a very performant language. You get an awesome Jit, static type checking, some best in class stdlib libraries. Performance issues in Java typically say more about the programmer than the language.