Awesome. I wonder how well this works on a stock JDK 10 using Graal.
Whenever I see a speed boost to do what is conceptually the same thing, I'm always curious where the fat was cut. What did we give up? You can dump the resulting assembly with
-XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly, and a diff might be revealing.
My hunch is that the line from the tutorial: `@CFunction(transition = Transition.NO_TRANSITION)`
makes all the difference. Explanation of NO_TRANSITION from [0]:
No prologue and epilogue is emitted. The C code must not block and must not call back to Java. Also, long running C code delays safepoints (and therefore garbage collection) of other threads until the call returns.
Which is probably great for BLAS-like calls. This lines up with my understanding from Cliff Click's great talk "Why is JNI Slow?"[1], which basically says that to be faster you need to make assumptions about what the native code can and can't do, and that developers would generally shoot themselves in the foot with that.
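For context, here's a minimal sketch of what such a declaration looks like with the GraalVM native-image C API; the library name ("cblas"), the cblas_ddot symbol, and the class name are illustrative, not taken from the article:

```java
import org.graalvm.nativeimage.c.function.CFunction;
import org.graalvm.nativeimage.c.function.CFunction.Transition;
import org.graalvm.nativeimage.c.function.CLibrary;
import org.graalvm.nativeimage.c.type.CDoublePointer;

// Hypothetical binding to a BLAS dot product; "cblas" and cblas_ddot
// are illustrative names, not from the tutorial.
@CLibrary("cblas")
final class Blas {
    // NO_TRANSITION: no thread-state prologue/epilogue is emitted, so the
    // call is nearly as cheap as a plain call, but the C code must not block
    // or call back into Java, and long-running calls delay safepoints.
    @CFunction(value = "cblas_ddot", transition = Transition.NO_TRANSITION)
    static native double ddot(int n, CDoublePointer x, int incX,
                              CDoublePointer y, int incY);
}
```

The trade-off is exactly the one the docs describe: the call itself gets very cheap, but a long-running C function holds up safepoints (and therefore GC) for the other threads.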
A team I was on in the past had a well-known bottleneck in its most performance-critical component, one that couldn't possibly be avoided or minimised. It was called with high frequency and, wall-clock wise, each call didn't take too long.
"JNI is slow", being the conventional wisdom, and knowing just how frequent the calls would be, people had ignored it as an option.
Then one of the devs who was most bothered by the bottleneck had an hour spare, threw the conventional wisdom out the window, dropped in JNI calls to a standard (highly optimised) library, and re-benchmarked: a 40% performance boost. Further experiments found that "JNI is slow" isn't as true as the conventional wisdom had it.
Back in the day, GCC's Java native compiler, GCJ, had an alternative native method interface called CNI.
GCC recognized extern "Java" in headers generated from class files. You could then call (gcj-compiled) Java classes from C++ as if they were native C++ classes, as well as implement Java "native" methods in natural C++.
The whole thing performed a lot better than JNI since it was, more or less, just using the standard platform calling conventions. Calling a native CNI method from Java had the same overhead as any regular Java virtual method call.
Ultimately, GCJ faded away because there wasn't a great deal of interest in native Java compilation back then, and there were too many compatibility challenges in the pre-OpenJDK days. But it's interesting to see many of its ideas coming back now in the form of Graal/GraalVM.
There's an effort to bring a more modern FFI to Java that works similarly to the one described in the article, called Project Panama. It has tools to convert C header files into the equivalent annotated Java definitions and is intended to help improve performance as well.
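For a rough flavour of what Panama-style bindings look like in code, here's a minimal sketch using the java.lang.foreign API that the project eventually shipped (JDK 22+); note this is a later form of the API than what existed when this thread was written, and it simply binds libc's strlen:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

// Sketch of a Panama-style downcall: bind the C library's strlen and
// call it from Java without writing any JNI stubs.
public class StrlenDemo {
    public static void main(String[] args) throws Throwable {
        Linker linker = Linker.nativeLinker();
        MethodHandle strlen = linker.downcallHandle(
                linker.defaultLookup().find("strlen").orElseThrow(),
                FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));
        try (Arena arena = Arena.ofConfined()) {
            // Copy a Java string into off-heap memory as a NUL-terminated C string.
            MemorySegment str = arena.allocateFrom("hello");
            long len = (long) strlen.invokeExact(str);
            System.out.println(len); // 5
        }
    }
}
```

The jextract tool mentioned below automates producing this kind of binding from a C header.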
I can say for a fact that Panama is not seriously targeting this space.
We implement a ton of that native code today, and it works with C++ and actual Android.
We also handle GPUs.
Project Panama is only targeting C, and even then it will only do it in a cross-platform, non-committal fashion. They aren't doing it the way they should be in order to properly target native vectorized code.
We tried seeing if we could get some of this work into the JDK, but their goals fundamentally compete with what it takes to get vector math to be fast. It's also not nearly as ambitious as it needs to be to handle real-world tensor workloads.
>Project panama is only targeting c, and even then will only do it a cross platform non committal fashion
John Rose of Oracle:
Panama is not just about C headers. It is about building a framework in which any data+function schema of APIs can be efficiently plugged into the JVM. So it's not just C or C++ but protocol specs and persistent memory structures and on-disk formats and stuff not invented yet. We've been relentless about designing the framework down to essential functionality (memory access and procedure calls), not just our (second-)favorite language or compiler.
The important deliverable of Panama is therefore not Posix bindings, but rather a language-neutral memory layout-and-access mechanism, plus a language-neutral (initially ABI-compliant) subroutine invocation mechanism. The jextract tool grovels over ANSI C (soon C++) schemas and translates to the layouts and function calls, bound helpfully to Java APIs with unsurprising names. But the jextract tool is just the first plugin of many.
We do look forward to building more plugins for more metadata formats outside the Java ecosystem, such as what you are building.
In fact, I expect that, in the long run, we will not build all of the plugins, but that people who invent new data schemas (or even data+function schemas or languages) will consider using our tools (layouts, binder, metadata annotations) to integrate with Java, instead of the standard technique, which is to write a set of Java native functions from scratch, or (if you are very clever) with tooling. The binder pattern, in particular, seems to be a great way to spin repetitive code for accessing data structures of all sorts, not just C or Java. I hope it will be used, eventually, in preference to static protocol compilers. The JVM is very good at on-line optimization, even of freshly spun code, so it is a natural framework for building a binder.
>They aren't doing it the way they should be in order to properly target native vectorized code.
Which is interesting since Intel is the one contributing the majority of the vector code changes.
Yes, that's what I stated above. I've also stated that I haven't just read the news: we've talked to that team in person.
Being language/platform neutral does not mean it is going to fulfill most use cases people would have for C bindings.
Java tends to be "good enough" for a lot of use cases out of the box.
It might help a bit with libraries like Netty and with memory management, but it's not going to work on real-world math code, which, as I stated, is our main use case.
That codegen isn't going to match what you need to do for real speed on CPUs or GPUs when writing vectorized math code.
Re: his last point. That's exactly what we talked to that team about. We don't feel those tools are going to work for real-world use cases. We already do the codegen and auto bindings/mapping ourselves, in addition to the memory management.
[0]: https://github.com/oracle/graal/blob/master/sdk/src/org.graa... [1]: https://www.youtube.com/watch?v=LoyBTqkSkZk