If you want to deep dive, Travis Downs has written an amazing piece about it (https://travisdowns.github.io/blog/2020/08/19/icl-avx512-fre...); follow the links down to his avx-turbo project on GitHub, which actually runs all sorts of 'frequency reducing' benchmarks. It's just clearer seeing it measured like that.
For example, I noticed that on some Intel Gold 6230 processors (2000+ USD each at the time), not only did AVX-512 bring down the frequency, but so did 'dense AVX2' (sequences of heavy FMAs). That explained a lot about the measured latencies of some real-time processes.
Can't thank Travis enough for cutting through all the vagueness there.
I believe the guy occasionally posts here, or has done, under the name BeeOnRope, and also on Stack Overflow. I printed out some of his posts here about memory bandwidth. The guy really, really knows his stuff.
I'm still salty that we needed him to put time and energy into this when Intel should have done the job... This license stuff was a major engineering tradeoff, and a pain in the ass for many people, that should have been explained as clearly as Travis did from the get-go.
That was only on Skylake-X, which I admit was like 4 generations of Intel boxes.
But it's no longer true today. AMD's Zen 4 doesn't downclock at all with AVX-512 instructions, and modern Intel chips show only about 100MHz (i.e., 0.1GHz) of downclock measured in practice.
In any case, AVX-512 is certainly a win from a power-performance perspective. You get more compute work done with far less decoding and other internal-core overhead. Even on Skylake-X, I'd expect that AVX-512 sorting routines will heat up the CPU less than a non-AVX-512 sorting routine... and otherwise be superior from a heat- and/or power-limited perspective.
Sounds like you’d still want a non-AVX-512 path for sufficiently small arrays, though. As an extreme case, sorting a 4-element array presumably doesn’t get enough speedup from AVX-512 to be worth downclocking by 100MHz. I wonder where the cutoff is.
The 100MHz downclocking is only seen on Icelake, and not on Rocketlake (no perceptible downclocking from any AVX-512).
So it's a lot of effort to think about a problem that's only been identified on a relatively small number of server processors, and (probably) won't be a problem moving forward.
I don't know the details, but for example, "x265 disables AVX-512 support by default due to clock speed regressions on Intel Skylake-X systems"[¹], which means that yes, this is an existing problem.
That's surprising, I would expect that codecs are doing enough work that the faster processing from AVX-512 more than pays for the downclocking.
In JPEG XL we saw 1.4x end to end speedup (including this downclocking) and that was for images, which are often less data than videos.
A policy of AVX-512-off-by-default seems pessimistic now that Intel Icelake+ have basically no throttling, and AMD Zen4 definitely none.
The thing is, now your codec (or libc, as it was once!) has a complex side effect. The clamping and ramping back up of the frequency is not immediate, and it has an impact on the overall latency of the core it's running on. I even saw funny traces once, on a heavily loaded machine, where the AVX-512 code was rescheduled onto other cores almost all the time: because the core running the AVX-512 code seemed to 'be more loaded', the scheduler kept finding a new home for the AVX-512-crunching process. Kind of a 'fun' runaway process.
Fair point :) I wonder why the kernel is so aggressive about migrating threads, it also loses the cache contents. FWIW our command line tools at one point pinned themselves to the current core, which did help performance.
OS kernels should be relatively sticky. Possibly it was power/temperature management where the load was moved to an unloaded core to even out power dissipation?
In most cases even beside this issue, microbenchmarks are in general very difficult to interpret for a slew of reasons. I get the impetus to try and isolate a single variable, but in practice you're often isolating away real world applicability of the result as well.
This is especially true for CPU and I/O benchmarks that typically have an onion's worth of caching layers in hardware and OS.
What you say may be true, but both Oracle and Google have significant investments in parallel microbenchmark tools: Java Microbenchmark Harness and Caliper. Aleksey Shipilëv (@shipilev) famously worked on the Oracle one for a while. He used it extensively in his analysis of core Java code changes. And, I recall that the Google Guava team used Caliper similarly.
Mr Shipilev is pretty well aware of this. He used to give a talk "JMH: The Lesser of Two Evils" that goes into some detail about a lot of the problems and pitfalls with microbenchmarks.
The argument isn't that microbenchmarks are never useful, but rather that they are usually not very applicable and extremely hard to get right, and even harder to interpret.
If you're doing JVM optimizations, it may be a useful tool, but in practice it's questionable whether these 12x optimizations will result in even a 0.1% improvement in performance in the median Java application.
It's pretty simple when you think about it. If you're optimizing effects that are hard to isolate and in many ways barely measurable in the noise of process memory alignment and similar factors, then in most cases, whatever optimization you make is likewise going to be barely measurable in the noisy hardware/software landscape.
Most Zen 4 consumer reviews didn't even have any Java benchmarks at all, much less any that would have significantly changed the narrative around those CPUs.
Epyc vs. Xeon is what this is about as the datacenter is the last holdout for OpenJDK at this point.
I didn't think Caliper was still maintained - looks like there's been some recent commits, but their last release is 2 years ago and says "Caliper is currently in maintenance mode and we don't expect to make changes to it other than bug fixes."
To me, this is why AVX didn't get widespread use. The code is practically hieroglyphics.
If you want humans to use a performance feature, it needs to be easy to use. Preferably fully integrated into the compiler so that you don't need to be aware of it at all.
The level of abstraction is wrong: when they invent AVX1024, recompiling this code won't make use of it. Yet the code itself is probably harder to understand and reason about than just a few lines of assembly code.
AVX has seen very, very widespread use across mobile, desktop, gaming, and server platforms. It's over 10 years old. Probably every hand-optimized vectorized x86 routine in the past 10 years has seen AVX thrown at it. Not every programmer can write it, but it sees absolutely tons of widespread usage and can be written by many competent programmers.
> If you want humans to use a performance feature, it needs to be easy to use.
In practice for general purpose "messy" compute kernels (e.g. parse this JSON/CBOR bullshit with SIMD) there is still a very wide gap between hand-written intrinsic code and what the compiler can generate. Most compilers in popular languages don't have the leeway to automatically perform the necessary optimizations/"setup" that a human must perform to fully exploit these tools, so this is not an easy fight to win.
For limited domains, and with the correct semantic design, however, you can match or exceed hand-written performance with high-level semantics, e.g. Halide.
> Yet the code itself is probably harder to understand and reason about than just a few lines of assembly code.
It's really not. Hacker news sucks shit to read code on, but the above code is in no way harder or worse than a raw assembly routine using AVX instructions. If anything it's less error prone, because the compiler takes care of a ton of incidental but necessary drudgery too, such as handling PIC (something 99% of people forget immediately) and eliminating the need for an outlined function. Not to mention you can leave the compiler to do all the annoying shit like sinking or coalescing stores/loads along the codepath, accurate DWARF/debug information without things breaking, etc.
Sure but OP's point is that that tiny fraction of code is (probably, I'm just guessing) responsible for a disproportionately large fraction of time spent actually running work. AVX is AFAIK mostly used in hand rolled, low level library code that then is used by a whole lot of consumers.
I know I've seen pretty huge speedups in my own code for "free" just from switching to an AVX version of BLAS. You can just think about how many different programs use BLAS (which is itself highly arcane internally), and AVX is definitely in a ton of other low level libraries out there.
Hieroglyphs or not, with some human cleverness you can achieve pretty amazing performance.
It's not that bad if you just study it a bit and put your mind into it. Just a sequence of relatively simple operations (+ masking when you need a partial operation), data shuffling, loads and stores.
Unfortunately to extract all of that performance, you do need to know your target architecture, be it SSE, AVX2/512, NEON, SVE or whatever.
You might also need to know some architectural details, like register renaming, out of order execution, branch prediction, load/store buffers, automatic prefetching, cache architecture, sometimes even details like CPU internal ring bus. That's just how it is when you want to extract the last drop of performance.
Sometimes compilers can be pretty great, just not always. Unfortunately sometimes beautifully autovectorized code just stops being so due to some small change in code somewhere.
Luckily most of us don't need to care. Most people can just use optimized libraries instead.
> Hieroglyphs or not, with some human cleverness you can achieve pretty amazing performance.
I'll never forget hand-optimizing some DNA analysis code into assembly and hitting the theoretical bottleneck: the memory throughput between the CPU and main RAM, on a server with 800GB of RAM.
Taking something that took minutes and turning it into something that takes seconds is a flipping amazing feeling.
Could it get even faster? Probably, but that would require very significant data structure changes with an algorithmic change to match. Would it be worth it? Well, probably when the database reaches multiple TB in size. But I've since moved on from that company so it's someone else's problem.
It's often about memory bandwidth nowadays. That and locality of reference, random access is expensive. You can compute a ton for the cost of one random memory access.
Tangential question (purely curious): how do you validate that your code produces the same (presumably correct) results as the non-optimized (presumably HLL) code for something as complex and variable as DNA analysis? I run into an analogous concern when people talk about "we need to rewrite this exhaustively tested, decades-of-track-record Fortran 77 code in (whatever the new hotness in PLT is)".
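One common answer (a generic sketch, not from the thread; optimizedSort is a hypothetical stand-in) is randomized differential testing against the trusted reference implementation:

    import java.util.Arrays;
    import java.util.Random;

    // Hypothetical sketch: check the optimized routine against the trusted
    // reference on many random inputs and fail loudly on any mismatch.
    class DiffTest {
        public static void main(String[] args) {
            Random rng = new Random(42);
            for (int trial = 0; trial < 100_000; trial++) {
                int[] input = rng.ints(rng.nextInt(1_000)).toArray();
                int[] expected = input.clone();
                Arrays.sort(expected);                    // trusted reference
                int[] actual = optimizedSort(input.clone());
                if (!Arrays.equals(expected, actual))
                    throw new AssertionError("mismatch on trial " + trial);
            }
        }

        // Stand-in for whatever hand-optimized routine is under test.
        static int[] optimizedSort(int[] a) { Arrays.sort(a); return a; }
    }

For the Fortran-rewrite case, the same idea works with recorded production inputs and a tolerance for floating-point differences.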
AVX-512 got far, far, FAR easier to read and write than previous iterations. Lots of 'missing' operations were added, and the set seems designed far more with autovectorizing compiler passes in mind.
Still, yes, it's difficult to grok because it's a low-level vector instruction set, and like any low-level assembly it can be hard to follow at first.
AVX-512 also has amazing features like mask registers, plus GFNI, bit extract/compress, vpternlog, vpopcount, etc.; features that, once you've grokked them, make you see vector code as even more magic.
My main gripe with AVX512 is that it's still hard to get amazing performance because of memory bandwidth, unoptimized streaming issues, cache locality issues. You can write amazing compact avx512 that will still have poor performance because it's all about feeding the FMA units.
But, if you want to see a simpler high-level language that will generate fast code for AVX-512 targets, check out ISPC. I've had very nice successes in the past, writing idiomatic code and getting better performance than my shitty experiments with intrinsics. It's not for every kind of code, but when it fits, it's just nice. And ISPC generates C-callable objects/libraries, so it's relatively easy to integrate into a build pipeline.
The way I see it, you could argue for, say, wing nuts to be used everywhere because they can be used tool-free, but then you can't tighten them as much, can't put them in confined spaces, etc. AVX is meant to be used through a specialist tool interface that has knowledge of the lower-level details and is made by someone else who adds a simple handle for you, Arrays.sort being this example.
You're not really supposed to write AVX yourself - the compiler should be doing that for you. And it will, if you write your code in a SIMD-compatible way and turn on the right compiler flags.
I do agree that it is the wrong level of abstraction. Explicitly stating the SIMD width leads to a compatibility nightmare. RISC-V vector instructions instead use an explicit "vector width" register, which pretty much entirely solves this problem.
> You're not really supposed to write AVX yourself - the compiler should be doing that for you. And it will, if you write your code in a SIMD-compatible way and turn on the right compiler flags.
Take it from experience: sure you can write high-level code that is SIMD-compatible. But the compiler is garbage at understanding the semantics and will write terrible SIMD code.
The best thing a current compiler can provide is probably replacing intrinsics with more conventional-looking things like GCC's Vector extensions[1] and C++'s simd<T>. Even then you'd need to do a little bit of union work for the cooler operations.
Compilers will always be terrible at vectorizing code because the required transformation is architectural. It would require the compiler to understand your code well enough to replace the scalar algorithms and data structures with new ones that are semantically equivalent in all contexts with all necessary invariants preserved (e.g. how memory is organized). The code transformation would be very non-local.
Compilers can't generally rewrite your scalar code as vector code for the same reason they can't rewrite your b-tree as a skip list.
hm, that seems optimistic for this use-case.
I heard from a compiler engineer that autovectorizing a sort (which is full of permutes/shuffles) is much harder and is likely to remain infeasible for quite some time.
GPUs have a crossbar that allows for high speed lane-to-lane permutes and bpermutes, but it's still slow compared to butterfly shuffles.
I do believe that compilers can optimize any movement pattern into the right butterfly shuffles (not today in the general case. Modern compilers in CUDA are impressive but this is a hard problem) but I'm convinced that the programmer needs to be aware of the low level difficult nature of many-to-many data movements on a 16-wide AVX512 register, or a 32-wide GPU block / warp / wavefront.
--------
EDIT: I'm like 90% sure some dude at Bell Labs in the 1950s working on Clos network or Beneš network design probably had an efficient representation for many-to-many data shuffles on a parallel architecture. But I'm not Ph.D. enough to have read all those papers or to keep up with those old designs.
Many-to-many data movement is traditionally a networking and routing problem. But SIMD programmers and SIMD chip designers are starting to run up against it... because a ton of parallel programming is about efficiently moving data between conceptual lanes and/or threads.
How would you write a fixed 16 element bitonic sort? In Python or whatever?
I dunno, the recursive template seems brilliant to me. Bitonic sort is innately a recursive process, but here we get compile time recursion and optimal assembly code at the end...
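For reference, a minimal scalar sketch of that recursion in plain Java (hypothetical helper names; the real AVX-512 versions run each compare-exchange stage on whole vectors at once):

    // Minimal scalar sketch of a 16-element bitonic sorting network, just to
    // show the recursive structure that the C++ templates unroll at compile time.
    class BitonicSketch {
        static void sort16(int[] a) {
            bitonicSort(a, 0, 16, true);
        }

        // Build a bitonic sequence from the two halves, then merge it.
        private static void bitonicSort(int[] a, int lo, int n, boolean ascending) {
            if (n <= 1) return;
            int m = n / 2;
            bitonicSort(a, lo, m, true);        // first half ascending
            bitonicSort(a, lo + m, m, false);   // second half descending -> bitonic
            bitonicMerge(a, lo, n, ascending);
        }

        // Compare-exchange pairs at distance n/2, then merge both halves.
        private static void bitonicMerge(int[] a, int lo, int n, boolean ascending) {
            if (n <= 1) return;
            int m = n / 2;
            for (int i = lo; i < lo + m; i++) {
                if ((a[i] > a[i + m]) == ascending) {
                    int t = a[i]; a[i] = a[i + m]; a[i + m] = t;
                }
            }
            bitonicMerge(a, lo, m, ascending);
            bitonicMerge(a, lo + m, m, ascending);
        }

        public static void main(String[] args) {
            int[] a = {9, 3, 14, 7, 0, 11, 6, 2, 15, 4, 13, 1, 8, 5, 12, 10};
            sort16(a);
            System.out.println(java.util.Arrays.toString(a));
        }
    }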
Writing a virtual machine or compiler is always hard! If you write code for the JVM, going forward you'll now get these instructions while writing nice clean Java…
Looking at the implementation, it is implemented in C++. I'd rather have an implementation using the new Vector API (still in incubation). It would be more Java like and provide a good demo for this new API.
It seems like you're saying you'd rather have a slower implementation given that a bunch of single instructions useful for this sort of thing aren't available in the Vector API and must be built from sequences of Vector methods that themselves must be implemented using multiple instructions.
I think he's referring to something similar to what .NET has been doing over the last few versions. They introduced a new Vector API that abstracts platform-specific SIMD instructions. The end result is the same: code using Vector128 will be compiled directly to the equivalent AVX opcodes on x86/x64 and NEON on ARM, as if you had written them directly, except that now you can add these kinds of optimizations across many architectures with a single codebase.
This post by Stephen Toub [0] goes into GREAT detail on that.
You mean this IntVector[0], which I assume is the Java experimental API anthony88 was referring to, correct? If that operation being missing is a blocker, I feel there may be some middle ground other than implementing the whole thing in C++ (like adding it or fast tracking work on this API)
Hello, thanks for pointing out that I missed this one. It looks like the methods exposed may be sufficient for implementing a sort. I also looked for vpalignr and vpshufb which seem to be missing, but I'm open to the possibility that I have missed these as well.
vpalignr is called "slice", vpshufb is "rearrange". The latter is easy to find if you search the page linked above for "shuf". The former is a bit harder, but I found it by searching for "concat" while thinking about how it might be possible to express it.
If you need the vpshufb behavior of zeroing elements where the index is negative, I think you will need to build something out of multiple operations. The compiler is of course free to match those multiple operations to one target instruction, i.e., recognize that what you are trying to say is really a vpshufb. It can do this in the same way that it can match multiple operations like x + y * 8 + 12 to a single lea instruction.
Another way that rearrange is not vpshufb is that 0 in the 17th position means "get me the first byte (from a different 16-byte lane)" rather than "get me the 17th byte (from this 16-byte lane)". It would be very impressive if the compiler could take a lookup table of values designed to be used as the shuffle argument to rearrange and transform it into a lookup table of values designed to be used as the shuffle argument to vpshufb.
vpshufb only accesses bytes within the same 16-byte lane and has latency 1. Instructions that shuffle bytes across lanes tend to have higher latency. So at the 16th position, a shuffle value of 0 (or 16, or 32...) gets you the first byte (byte 0) from the other argument, but at the 17th position, a shuffle value of 0 instead gets you the 17th byte (byte 16, or byte 0 of the corresponding 16-byte lane, if you like) from the other argument. This isn't really desirable behavior. Rather, it's somewhat awkward behavior that one has to work with to avoid doing a sequence of more expensive operations.
I am concerned that it takes like 6 instructions (2 to put copies of each lane into all lanes, 2 shuffles, cmpgt, blend) including a load and using an extra register to implement rearrange for use cases that would be fine with vpshufb for 32-byte registers and maybe 14 instructions (4 to put copies of each lane into all lanes, 4 shuffles, 3 compares, 3 blends) with 3 loads and 3 extra registers to implement rearrange for 64-byte registers. Maybe you can do better for the latter case though, I haven't had a thorough try at it.
A common use of vpshufb is to get a mask of character positions from a string matching a small set of characters, which you can do with cmpeq(in_slice, vpshufb(c, in_slice)), where the input substring in_slice is used as the shuffle and c is chosen so that input characters in the set of characters of interest (e.g. {'\n' '\r' '\t' '\"' '\\'}) map to themselves and other characters map to something other than themselves. rearrange doesn't seem to define what happens when indices in the shuffle are negative or too large, so it seems like to use rearrange we need to mask off some bits and do cmpeq(in_slice, rearrange(c, and(d, in_slice))) where the `and` is only there because of the nit about invalid indices and `rearrange` is 1 or 6 or 14 instructions depending on what register width we're working with. That could be unfortunate.
You don't need this stuff for sorting primitives though!
You are still making things up about the API. If you're actually interested, look at the VectorShuffle class, its docs will tell you all about how illegal indices are handled and how to legalize them. But I doubt that you're actually interested.
It's a pity because you clearly have a lot of knowledge that you could put to use to provide very helpful feedback to implementors of the Vector API. (I'm doing this for the GraalVM compiler.)
I just read the entire documentation for the rearrange method. My apologies for not reading some unrelated docs to find out what rearrange does. It looks like this just makes things worse though? Like you really do have to mask off the high bits of the input, but then you're also paying several instructions for the mapping of out-of-bounds indices to negative values when making the VectorShuffle and then several instructions again to check whether any of them were negative so you can throw an exception from `rearrange`.
This is the sort of area where it feels like the JVM has some underutilised potential, as these types of optimisations can take advantage of the strong guarantees of the JVM runtime.
I am curious, though, whether it can work with other SIMD/AVX versions, since AVX-512 is only selectively supported. And what's the potential to go all the way and add OpenCL or CUDA implementations?
You typically wouldn't want an OpenCL or CUDA implementation, because developers expect their code to be running on a single core. If random functions started using multiple cores or running on the GPU, it would become very difficult to manage core allocation in highly parallel programs.
On the other hand, they could provide a different function (GPUSort for instance) but they shouldn't replace the default one with different threading requirements.
I'm of a different opinion there. When I call e.g. sort on a list, I write what my intent is, the what; I don't care about the how, and I'd rather have the JVM decide to do a multicore / GPU-driven optimized sort than have to think about implementation details like that.
If I want to fine-tune my implementation and write code that works well with the underlying hardware, I'd write it in a language more appropriate for that particular goal.
Java is for business applications, where the developer expresses what the code should do; lower level languages (C, Rust, etc) give the developer more options and responsibility in the how it should do it.
Simple example, for-loops vs functional functions (map, reduce, etc); the former is the how, the latter is the what.
The Java Streams API does various functional maps and reductions and can do them in a multithreaded fashion, when asked. I don't think it makes sense to parallelize by default.
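The "when asked" part is an explicit per-pipeline opt-in; a minimal sketch:

    import java.util.List;

    class StreamsSketch {
        public static void main(String[] args) {
            List<Integer> xs = List.of(5, 3, 8, 1);
            // Sequential by default: runs on the calling thread.
            int a = xs.stream().mapToInt(Integer::intValue).sum();
            // Parallel only when asked: splits work across the common ForkJoinPool.
            int b = xs.parallelStream().mapToInt(Integer::intValue).sum();
            System.out.println(a + " " + b);
        }
    }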
Although Java is used to write all kinds of business applications, it's also used to write the platform those applications run on, like Netty for example. If I'm writing a thread pool implementation (which is eventually going to be used by "business applications"), I want to have control over concurrency. I don't want to accidentally introduce pauses in my users web apps just because I sorted an array in my library.
One could argue that web servers shouldn't be written in Java at all but you can't really support JVM applications efficiently on a non-JVM backend. All the great instrumentation tools that are available on the JVM will have trouble inspecting your library code, memory use becomes more complicated because the GC no longer knows how much memory is truly allocated, and users can't put breakpoints in your code easily in their IDE.
Even in Java, you oftentimes care about threads / core utilization. For example, in a web API you typically don't want to use multiple cores for a single client.
I agree to an extent, but you have to at least be aware of the complexity of your algorithms (time and space) and understand the trade offs. If the trade offs change depending on compiler optimisations, then you'll have some difficult debugging on your hands.
As to portability to other SIMD instruction sets: vqsort[0] is portable to >10 different instruction sets, and actually faster than this when running on AVX-512.
Disclosure: I am the main author; happy to discuss.
Thanks :) We do not have JVM bindings. I am not familiar with Java internals but would be happy to consider maintaining them if someone wants to add them (plus a test).
> And what's the potential to go all the way and add OpenCL or CUDA implementations?
Wouldn't make sense outside of environments like Apple's M-series SoCs or some gaming consoles that have unified memory between CPU and GPU. Normal Intel-based architectures would waste too much time shuffling data over PCIe.
Nearly every consumer Intel CPU sold for the last 10 years has had unified memory between the CPU & GPU. Similarly nearly every Qualcomm, ARM, etc... SoC for the last 10+ years has had unified memory.
This isn't a rare or new architecture. Apple didn't invent this or popularize this with the M-series SoCs.
Now, since you mention PCIe, you're almost certainly only thinking of discrete GPUs. Those of course are not unified memory. But the vast majority of consumer CPUs also have an integrated GPU, and that is often unified.
I wish I could easily use the 2 TFLOPS Xe GPU of the TigerLake gear I have with something other than the whole OpenCL/oneAPI/DPC++ kitchen sink. I wish Intel really made an effort on the programmability of all their offload hardware, be it GPUs or Myriad-X stuff... Either open everything and support a community, or make the kind of sustained, long-term effort CUDA got... Hopefully SYCL gets us there, but I still feel TigerLake was a grossly missed opportunity...
Impressive, but I wonder what the practical benefit will be. How often do you need to sort an array of numbers?
In practice you almost always want to sort an array of objects based on one of their properties.
> How often do you need to sort an array of numbers? In practice you almost always want to sort an array of objects based on one of their properties.
And? If your objects are immovable then one level of indirection (pointers or indices) solves that. If your objects are moveable, then there's many (!) times where having your container sorted will improve performance.
I don't see how you can add that level of indirection if you can only sort an array of numbers.
Let's say you have a Person object with property 'age' which is an Int.
I can put the Person's ages into an Int Array and sort that. Where to add this indirection to get to the sorted array of Person objects?
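One usual pattern (a sketch; Person/age are the hypothetical names from above) is to sort an array of indices by the key and then walk the objects in that order:

    import java.util.Arrays;
    import java.util.Comparator;

    // Hypothetical sketch of the indirection: sort indices by each Person's age,
    // then read people[order[0]], people[order[1]], ... in sorted order.
    record Person(String name, int age) {}

    class IndexSortSketch {
        public static void main(String[] args) {
            Person[] people = {
                new Person("Ada", 36), new Person("Bob", 19), new Person("Cyd", 52)
            };
            Integer[] order = new Integer[people.length];
            for (int i = 0; i < order.length; i++) order[i] = i;
            // Boxed indices + comparator: this goes through the object sort
            // (TimSort), not the new primitive-array intrinsic.
            Arrays.sort(order, Comparator.comparingInt(i -> people[i].age()));
            for (int i : order) System.out.println(people[i]);
        }
    }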
Eh yes, this works, but you are not using Arrays.intSort here. Which means that the optimizations that the article is about won't apply.
intSort (or any other of the sorts in the article) works purely on an Array of Ints, moving them around and comparing them to each other.
It's quite hard to come up with something that we could use in every day programming that will benefit from the new optimizations.
Sure, but what use is that? You will have your objects sorted by the absolute value of their pointer. I guess this means you'll have your objects sorted by the order in which they are allocated in the heap. Cool, but not very useful.
Nice. I'm always slightly disappointed in the amount of optimization and intrinsics in the JDK for these kinds of fundamental and frequently-used methods, despite this always being one of the main arguments for JIT.
One would think such relatively low-hanging fruit (dispatching to an existing avx512 lib) would have been picked by now, considering the massive effort to improve the JVM in other areas (virtual threads, value objects, C interop, Graal, novel GCs).
Maybe Arrays.sort() isn’t that frequently used, as data sorting is often done by the database?
The big issue here (provided that I'm reading the PR correctly) is that it's purely for arrays of the primitive number types. Whereas most real applications (that I've seen) will be sorting objects by some (potentially computed) property, be it an ID, a timestamp or a name. For all of these cases, the linked pull request won't be doing anything useful.
The most useful application for this (outside of getting nicer numbers on a benchmark) that I see is computing the median/quantiles of some property on a bunch of objects that aren't already sorted by the interesting property.
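That quantile case is also one of the few places you naturally end up with a bare primitive array to sort; a sketch (Event/timestampMillis are made-up names):

    import java.util.Arrays;

    record Event(long timestampMillis) {}

    class MedianSketch {
        // Extract the primitive key, sort the primitive array (the path the new
        // intrinsic accelerates), and read off the quantile.
        static long medianTimestamp(Event[] events) {
            long[] ts = new long[events.length];
            for (int i = 0; i < ts.length; i++) ts[i] = events[i].timestampMillis();
            Arrays.sort(ts);
            return ts[ts.length / 2];
        }

        public static void main(String[] args) {
            Event[] events = { new Event(30L), new Event(10L), new Event(20L) };
            System.out.println(medianTimestamp(events));   // prints 20
        }
    }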
You still see quite a lot of primitive number sorting in library code. If you want your Java code to go fast, you typically stick to primitives in arrays.
Most Java application code doesn't, but it typically uses libraries that do. A sorted array is a priority queue, a binary search tree, etc.
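For example, a sorted int[] plus Arrays.binarySearch often stands in for a set of boxed Integers in performance-sensitive library code; a tiny sketch:

    import java.util.Arrays;

    class SortedArraySketch {
        public static void main(String[] args) {
            int[] ids = {42, 7, 19, 3, 88};
            Arrays.sort(ids);                              // the part the new intrinsic speeds up
            boolean present = Arrays.binarySearch(ids, 19) >= 0;
            System.out.println(present);                   // true
        }
    }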
I think the number of applications that will see a measurable speed-up from these improvements is very small.
If you have an application that is truly bottlenecked by the performance of number sorting in any measurable way, then you most probably didn't write it in Java. It's not really a number-crunching language, for a variety of reasons.
Java high-performance programming would be significantly improved by generics over primitive types. Writing performant code in Java feels like writing Fortran or C.
You end up using libraries like fastutil, which is "generic code" templated by C preprocessor macros.
> Java high-performance programming would be significantly improved by generics over primitive types.
This is planned as phase two of Project Valhalla [0]:
The second phase will focus on generics, extending the generic type system
to support instantiation with inline classes (which will include primitives),
and extending the JVM to support specialized layouts.
Sure and I'll be happy to use it when available assuming I'm still working with JVM applications then. Meanwhile I'll still be stuck with fastutil-style templating for another ten years and it kinda sucks.
(Scala has specialization and value classes and opaque types so that covers a fairly big range but they make interop with Java tricky.)
Project Valhalla was announced 9 years ago and nothing relevant is in the JDK yet; I don't think this will be relevant for day-to-day Java for several more years at least.
Probably, but "nothing relevant is in the JDK yet" is not quite fair. Several sub-projects are in "Preview Feature" [0] status, which is defined as "a new feature of the Java language, Java Virtual Machine, or Java SE API that is fully specified, fully implemented, and yet impermanent".
This includes JEP 401 "Flattened Heap Layouts for Value Objects" [1], JEP 402 "Enhanced Primitive Boxing" [2], and I think also "Value Objects" [3] and "Universal Generics" [4].
It is a huge task with many dependencies and requires careful design. It feels like it might finally make it in the next long term release.
Not GP, but in almost every introductory text on JIT compilers I've read, there is a section called "advantages and disadvantages", and it almost always mentions that JIT compilers in theory have more information available than ahead-of-time compilers (i.e. all the information AOT compilers have, plus runtime information) and can use that information to safely do optimizations that AOT compilers cannot.
But then when you look at the actual code of JIT compilers, such JIT-only optimizations seem extremely rare. Even the JVM, surely one of the platforms that have had more than enough dollars and top-level CS talent thrown at them, still apparently lacks, after ~20 years, many optimizations you'd expect to be in there if you'd only read the introductory texts about it.
The linked MR about AVX optimizations for sorting is one such example IMO. AVX512 was first announced 10 years ago, and (not being deeply into JVM development myself) I might have assumed its use would be more prevalent in the JIT output than actually seems to be the case.
Autovectorization is a famously difficult optimization and runtime information doesn’t make it any easier. That is more useful for optimistic assumptions, like these pointers won’t be zero, only a single instance exists for this interface, etc.
Nonetheless, you might find the Graal compiler doing a better job at autovectorization than C2.
I believe by having deeper inlining, unrolling and specialization in a JIT you can gain more performance than by static analysis. You squeeze more performance out of stupider algorithms.
Of course, and dead code elimination can also be improved upon (compared to AOT compilers) because some inputs like command line flags will never change during the execution of the program. This means you can better predict which branches will be taken. These are optimizations that work for almost any program.
OTOH, another way in which JITs "should" generate better-performing code is by tailoring their output to the platform the program is currently running on. With AVX quite prevalent on the server-grade CPUs that many big JVM programs run on, I don't think it would be unreasonable to expect the JVM to have more support for AVX512 in its code generator than it apparently does. Is leaving low-hanging fruit like a 10x speed improvement in sorting on the table something you'd expect from a very mature platform like the JVM?
I don't mean to harp on the JVM devs here; JIT development is a Very Hard Problem. It's just that I can understand why GGP is disappointed: JITs in general don't seem to quite deliver on the excitement they generated when they were new.
I don’t know — having almost native speed with all the benefits of a fat runtime like attaching a debugger to a prod process to check the number of objects, or streaming runtime logs without almost any overhead sounds like they do deliver.
But it doesn't, unless your "almost" is very generous. Java is pretty consistently 2-10x slower than the major performance-focused AOT offerings (C, C++, Rust).
Now maybe you call 2x "almost", but let's phrase it in terms of CPU performance over time. That's equivalent to 10 years of CPU hardware advancements.
To me that's a lot of overhead. Depending on who is paying for the CPU time vs. the developer time it's regularly a cost worth paying, but at the same time don't pretend it's "almost native speed", either. It is a cost and a rather significant one at that. Just, so are engineers. They also aren't cheap.
That’s not so simple that we could reduce it to a number. Java is very close to C performance when you operate on primitives only. But it does have an overhead for objects, and it can’t “hide” it as well as languages that give the developer control over stack allocation (List<Complex> will be an array of pointers in Java, while it can be inlined values laid out sequentially in C/C++/Rust).
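To make the layout point concrete (a sketch; Complex is a hypothetical class): a List<Complex> holds references to separately allocated objects, while the structure-of-arrays workaround keeps the fields in contiguous primitive arrays:

    import java.util.List;

    class Complex {
        final double re, im;
        Complex(double re, double im) { this.re = re; this.im = im; }
    }

    class LayoutSketch {
        // List<Complex>: a list of references, one pointer chase per element.
        static double sumRe(List<Complex> zs) {
            double s = 0;
            for (Complex z : zs) s += z.re;
            return s;
        }

        // Structure-of-arrays workaround: contiguous primitives, cache/SIMD friendly.
        static double sumReSoA(double[] re) {
            double s = 0;
            for (double r : re) s += r;
            return s;
        }
    }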
Also, which 10 years of CPU advancement do you mean? It is definitely not a linear graph; we have almost reached a plateau in single-core performance.
The most recent 10 years and that's exactly the point - we're plateauing on hardware advancements, so software inefficiencies are more of an issue now than ever before. 2x is now a huge change in performance. You're not getting that "for free" from just waiting a couple of years anymore.
You did not take into account my point about code operating on primitives definitely not having that kind of overhead.
And even code that does have an overhead is not so simple to judge. Could you write that same code in a lower-level language such that it still remains correct and safe? Is the algorithm actually expressible in Rust’s much more restrictive style (to stay safe)? If you do locks and ref-counting everywhere, will that code actually still be faster? For example, a compiler might very well be faster in Java/Haskell/another managed language.
> You did not take into account my point about code operating on primitives definitely not having that kind of overhead.
Because it's not really interesting to debate. Nobody writes Java code like that, and even when they do there's still overhead to it. The exact amount of overhead is kinda irrelevant since the language very obviously doesn't want you to write code like that.
> Could you write that same code in a lower level language that it will still remain correct and safe?
Rust is safer than Java, so yes :)
But you're drifting into the productivity argument anyway, which I already pointed out is a reasonable reason to pay runtime overhead to get.
Plenty of places write Java that way; HFT might be a notable example.
Rust has data-race freedom, while Java has “safe” data races (it is tear-free, so even in case of a data race you won’t corrupt memory, which is not true of Rust once any unsafe parts are involved). So I don’t really buy the argument that Rust would be safer.
On the other hand, .NET is much closer, because it supports value types and the CLR was designed to be targeted by C++ as well, so it supports most of the crazy C and C++ stuff (not counting the UB-related parts).
Likewise if you code in C, C++ or Rust with allocations everywhere, bad algorithms or data structures, being AOT won't help.
Yes and no; the effectiveness depends on which JVM we are talking about.
GraalVM does it much better than OpenJDK, then there are OpenJ9, PTC, Aicas, Azul, ART (yeah not really, but close enough), microEJ, and a couple of nameless others from Ricoh, Xerox, Cisco, Gemalto,... on their devices.
Even if we stick to OpenJDK, the distributions based on it aren't all the same, for example Microsoft's fork has additional JIT improvements (-XX:+ReduceAllocationMerges).
AOTs consistently do much more inlining & unrolling than JITs do. JITs often have preset heuristics for figuring out where to split the code, tuned for what's practical to implement rather than what's most performant. After all, the code needs to be hot-swapped in without much disruption. And then, similarly, JITs need to optimize for compilation speed, which limits the number of optimization passes they'll do, because the JIT itself runs on a "hot path."
JITs do regularly have a lot more specialization optimizations, though, but is that really because it's a JIT instead of an AOT or is it more because JIT'd languages just often tend to also be more dynamic ones as well?
> AOTs consistently do much more inlining & unrolling than JITs do.
Nonsense. What evidence do you have for this claim?
> JITs often have preset heuristics for figuring out where to split the code that's practical to implement rather than being the most performant possible. After all, the code needs to hot-swapped in without much disruption.
You seem to be talking about JITs without on-stack replacement. So not state of the art high performance JITs.
The code changes have references to something that looks like it’s searching for a DLL… isn’t that a Windows thing? Or is that a generic search for a library?
I assume you are referring to "os::dll_locate_lib()". I'm sure that it is cross platform. While uncommon, it's not totally wrong to call a .so file a dynamically loaded library on a POSIX/UNIX platform.
What interests me the most: The developer is employed by Intel! These big, essential open source projects always have a myriad of developers employed by the Big Guns of Silicon Valley hacking away. Is the thinking that if Intel adds vectorization optimization to OpenJDK that it might help sell more chips? The connection looks so loose; I'm a bit surprised that some senior bean counters approved this expenditure!
They want their chips to beat AMD (which this might not help with) and ARM in Java benchmarks, so adding optimizations makes sense from a marketing perspective.
Why? This is pretty typical of them. They used to ship massive amounts of patches to various projects to make their CPUs look good as part of their Clear Linux project, although that seems to have stalled somewhat.
I expect they are referring to the "Vector API" described in JEP 426: Vector API [1]. From the summary:
Introduce an API to express vector computations that reliably compile at runtime to optimal vector instructions on supported CPU architectures, thus achieving performance superior to equivalent scalar computations.
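The JEP's running example gives the flavor; roughly (a sketch from memory, not verbatim):

    import jdk.incubator.vector.FloatVector;
    import jdk.incubator.vector.VectorSpecies;

    class VectorApiSketch {
        static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

        // Computes c[i] = -(a[i]*a[i] + b[i]*b[i]) one species-width at a time;
        // the JIT maps this to AVX-512, AVX2 or NEON depending on the CPU.
        static void vectorComputation(float[] a, float[] b, float[] c) {
            int i = 0;
            int upper = SPECIES.loopBound(a.length);
            for (; i < upper; i += SPECIES.length()) {
                var va = FloatVector.fromArray(SPECIES, a, i);
                var vb = FloatVector.fromArray(SPECIES, b, i);
                va.mul(va).add(vb.mul(vb)).neg().intoArray(c, i);
            }
            for (; i < a.length; i++) {            // scalar tail
                c[i] = (a[i] * a[i] + b[i] * b[i]) * -1.0f;
            }
        }
    }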
Yes. Where does it expose the AVX or ARM intrinsics? I only see generic add/neg/mul methods (thus the comparison to .NET Vector<T>) and not the intrinsics exposed themselves; here is a random pick from the .NET ones: https://learn.microsoft.com/en-us/dotnet/api/system.runtime....
x86-specific optimization for a language so focused on *portability*, heavy abstractions and business logic is kind of ehh. Especially with ARM rearing its head.
If you desire performance close to the chip, you chose the wrong language and should write code in a language closer to the chip.
Unless the abstractions and concepts required for your primary work are so different from what you use day-to-day (data science, ML: Python plus C++ bindings for interacting with the GPU).
The language is still portable; this is a change in the JVM, the runtime, which should have all the optimizations. I don't understand your issue.
Java is a higher level language, you just want to call sort on a list without having to worry about low level performance characteristics, because there's people much smarter that can polish that.
I think what he's saying is that instead of writing that in platform-specific C++ they could have worked on a Vector API and used that instead, so the same code would automatically work with other (future) SIMD implementations of the same width.
A poster in another comment mentioned such an API is being worked on, and what I described above is exactly how .NET is tackling this: they built a Vector API and are building optimizations like that in C# on top of it, also giving developers the ability to write SIMD-oriented code in C# rather than resorting to platform-specific C++ and interop/JNI.
The Vector API exists to write SIMD code in Java. This is an intrinsic inserted by the JIT compiler. HotSpot intrinsics are always written in assembly or compiler IR because they are inserted in the generated assembly.
Arrays.sort() could very conceivably be called in a hot loop, so you really don't want to allocate Java objects in it.
Yeah, that work in C# required a lot of other things to minimize allocations.
What I was thinking of is something similar to how they implemented things like IndexOf [0], which is a pure C# implementation that gets translated by the JIT into code equivalent to the C++ version. The advantage of doing this kind of thing this way is that when ARM adds a 256-bit-wide SIMD extension, they will only need to support it as a Vector256 implementation to get that code working with no other changes.
SIMD is not Intel only. ARM has SIMD support. So does AMD.
Portability is not a problem. The C/C++ compilers have nice wrappers over these instructions that let the JVM take advantage of them. And there's always the non-SIMD version to fall back to.
The JVM is the correct abstraction layer at which to implement this for portability. Any Java program doing sorting benefits from this, on all supported platforms.
A precedent for x86 SIMD in those low level performance building blocks would also set a precedent for the inclusion of ARM equivalents. A heavy abstraction environment is exactly the right spot to place a set of ergonomic, long SIMD levers, one for each architecture.