
Your linked post is not really a good example of that: escape analysis is very finicky without language-level semantic guarantees the compiler can use. With the proposed Valhalla changes, Optional will be a value class and these optimizations become trivial.

Especially when you return a value, it is more than likely to escape.
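A minimal sketch of the returned-Optional pattern in question (names are illustrative, not from the linked post):

    import java.util.Optional;

    class Lookup {
        private final int[] data = new int[1024];

        // Unless the JIT inlines this call and proves the Optional does not
        // escape, every call allocates a wrapper (plus an Integer box).
        // With a Valhalla value-class Optional, scalarizing it would not
        // depend on escape analysis succeeding.
        Optional<Integer> indexOf(int key) {
            for (int i = 0; i < data.length; i++) {
                if (data[i] == key) {
                    return Optional.of(i);
                }
            }
            return Optional.empty();
        }
    }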




Optional is only part of the picture here. It also missed:

* branch elimination with cmov

* loop unrolling

* SIMD vectorization

* turning heap allocation into stack allocation

All of those could be done without breaking any semantic guarantees of Optional, even without value types in place.
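For instance, a simple counted loop over an array is the kind of code where HotSpot can already apply unrolling and auto-vectorization today (an illustrative sketch, not the benchmark's code):

    // C2 can unroll and auto-vectorize simple counted loops like this one;
    // the reduction is kept in a plain int accumulator to stay
    // vectorizer-friendly.
    static int sum(int[] values) {
        int total = 0;
        for (int i = 0; i < values.length; i++) {
            total += values[i];
        }
        return total;
    }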

Also note how even forcing the Rust program to use references with a double Box didn't make the code any worse. So Rust/LLVM had no issue optimizing that away, even if Option were defined the way it is in Java now.


A lot of the problem stems from Java's boxing: the first n values are cached, which defeats escape analysis, so it can't remove the boxing reliably, and that cannot be fixed without breaking some applications.
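Concretely, the cache being referred to is the one mandated for Integer boxing, which makes boxing observable through object identity:

    public class BoxingCache {
        public static void main(String[] args) {
            // Boxing caches values in [-128, 127]: two boxes of the same
            // small value are the same object, so == can observe whether a
            // box was reused, which constrains what the optimizer may elide.
            Integer a = 127, b = 127;
            System.out.println(a == b); // true: same cached instance
            Integer c = 128, d = 128;
            System.out.println(c == d); // false: two distinct allocations
        }
    }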


Java is capable of all of these optimizations, though I am not an OpenJDK dev, so I'm getting out of my depth here.

Of course you have less time and fewer resources during JIT compilation (most notably, limited inlining depth), so the quality of the resulting code can at times be vastly worse than what an AOT compiler can produce. But my experience is that in real-life code bases Java's JIT compiler is really great, while this benchmark reflects a singular case where it failed.


> Java is capable of all of these optimizations though

In theory, yes.

In my experience it just repeatedly does a worse job than a C / C++ / Rust compiler, unless I'm very careful with my Java code (yes, I can often make it close, but this requires quite non-idiomatic Java; e.g. I've seen cases where manually unrolling a loop yielded 2x more performance, which is something I don't recall ever having to do in C / C++ / Rust).
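To illustrate the kind of manual unrolling meant here (a sketch, not the actual code in question):

    // Straightforward version: relies on the JIT to unroll the loop.
    static int dot(int[] a, int[] b) {
        int s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    // Manually unrolled with independent accumulators: the kind of rewrite
    // that can help when the JIT does not unroll aggressively enough.
    static int dotUnrolled(int[] a, int[] b) {
        int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int i = 0;
        for (; i + 4 <= a.length; i += 4) {
            s0 += a[i]     * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        for (; i < a.length; i++) s0 += a[i] * b[i];
        return s0 + s1 + s2 + s3;
    }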

For example, we don't use Java Streams in performance-critical code, because everybody on the team knows the JIT does not optimize them back to the level of simple for loops. We checked many times, and it simply never happened, although theoretically it could. But I can freely throw a chain of map/filter/fold calls at C++ or Rust and it runs as fast as a hand-optimized loop, with unrolling, SIMD, etc.
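I.e. the two shapes being compared are roughly these (illustrative):

    import java.util.List;

    class StreamVsLoop {
        // Idiomatic stream pipeline: map/filter/reduce over boxed Integers.
        static int viaStream(List<Integer> xs) {
            return xs.stream()
                     .filter(x -> x % 2 == 0)
                     .mapToInt(x -> x * 3)
                     .sum();
        }

        // Hand-written loop doing the same work without pipeline overhead.
        static int viaLoop(List<Integer> xs) {
            int sum = 0;
            for (int x : xs) {
                if (x % 2 == 0) sum += x * 3;
            }
            return sum;
        }
    }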


How did you measure it? Unless it is long-running production code or a JMH benchmark, measuring this correctly can be tricky.

(But I’m fairly sure you know that already)


JMH is a standard tool we use for performance comparisons.
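For anyone unfamiliar, a minimal JMH benchmark looks roughly like this (class name and workload are illustrative):

    import java.util.concurrent.TimeUnit;

    import org.openjdk.jmh.annotations.*;
    import org.openjdk.jmh.infra.Blackhole;

    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @State(Scope.Benchmark)
    public class SumBench {
        int[] data;

        @Setup
        public void setup() {
            data = new int[10_000];
            for (int i = 0; i < data.length; i++) data[i] = i;
        }

        @Benchmark
        public void plainLoop(Blackhole bh) {
            int s = 0;
            for (int x : data) s += x;
            bh.consume(s); // the Blackhole keeps the result from being dead-code eliminated
        }
    }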

For context, see Scala's battle with specialization to get reasonable performance out of collection transformations. Once you start using lambdas to define e.g. a filter condition, and once you want generic implementations working across different item types, you get pushed into boxing hell; the JVM is surprisingly reluctant to remove all that overhead, and you end up with a >10x penalty. So instead of relying on the JVM, they specialize data structures for primitive types. It is even something you are supposed to do manually in Java (see the IntStream and LongStream classes).
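The boxed vs. specialized shapes, sketched (illustrative code):

    import java.util.stream.IntStream;
    import java.util.stream.Stream;

    class Specialization {
        // Generic pipeline: every element is boxed to Integer.
        static int boxedSum(Stream<Integer> xs) {
            return xs.filter(x -> x % 2 == 0)
                     .reduce(0, Integer::sum);
        }

        // Primitive specialization: IntStream keeps values unboxed throughout.
        static int primitiveSum(IntStream xs) {
            return xs.filter(x -> x % 2 == 0).sum();
        }
    }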


What matters is the time from when I start a request until it completes. Some of my code isn't long-running; in that case HotSpot doesn't do anything for me, but it would be wrong to contrive a long-running example just to show that Java can be faster once HotSpot engages. Other processes run for a long time, and there Java may have an advantage.



