Static typing in Java alone does not give the compiler enough information to optimise, because of polymorphism: the static type at a call site may have many possible implementations. The JIT can observe runtime behaviour and inline method calls at monomorphic sites.
Worth noting also that Azul's Zing ReadyNow technology for AOT (and indeed their no-pause C4 collector) address some of the specific gripes raised here around the JVM...
This is absolutely correct. The primary optimization in the Java JIT is determining which concrete types are (potentially) present at a call site and thereby turning dynamic method dispatch (a vtable lookup) into static method dispatch. Static method dispatch then allows inlining.
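As a rough sketch (hypothetical types, not from the article): if the JIT only ever observes one implementation of an interface at a hot call site, it can replace the vtable dispatch with a direct, inlined call, guarded by a cheap type check that falls back to deoptimization if another implementation ever shows up.

    interface Shape {
        double area();
    }

    final class Circle implements Shape {
        final double r;
        Circle(double r) { this.r = r; }
        public double area() { return Math.PI * r * r; }
    }

    class Hot {
        // If only Circle instances ever reach this loop, the JIT can treat
        // s.area() as a monomorphic site: check the class once, then inline
        // the multiplication directly into the loop body.
        static double totalArea(Shape[] shapes) {
            double sum = 0;
            for (Shape s : shapes) {
                sum += s.area(); // virtual call, devirtualized when monomorphic
            }
            return sum;
        }
    }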
Zing's ReadyNow is useful, but it really highlights some of the problems with JITs in some domains. I find ReadyNow very interesting because its existence points to some of the consequences JITs have in performance-sensitive domains.
The driving force behind ReadyNow (as I understand it) was that many performance-sensitive systems need to be fast out of the gate at startup, which means an interpreted -> compiled -> optimised transition is not acceptable.
Developers would try to solve this by running dummy data through the system before it was open to real traffic. But this had the unfortunate consequence that the JIT would optimise for the dummy data only, including clever inlining at apparently monomorphic call sites and so on. When real traffic flowed into the system, the JVM would see that its optimisations were no longer valid, triggering de-compilation/re-compilation of many code sites and causing a very noticeable stutter.
Now we have ReadyNow. If you are really committed to Java or the JVM and you don't like these JIT stutters, this is your solution. But it is an extra layer of complexity, another thing to be managed and another thing to fail, on top of the JVM and jar-file soup you may already be struggling with.
I would prefer a good AOT compiler to remove this concern and give me quite good predictable performance. YMMV of course.
There is another use case for JIT compilers that has not been considered: speeding up programming language development.
PyPy's meta-tracing JIT-compiler framework [1] allows you to generate reasonably fast tracing JIT compilers for languages essentially by writing an interpreter. Laurie Tratt has written a nice description of this and the advantages it delivers [2]. This makes the development of new programming languages much simpler, because you don't have to invest a lot of time and effort into producing a reasonably fast compiler (whether JIT or AOT) at the beginning of language development. Yes, in many cases you can build AOT compilers that beat JITs produced by meta-tracing an interpreter, but not easily.
Tratt's team has used this approach for building impressive multi-language tools [3] that allow you to write programs in heterogeneous languages like Python, Prolog and PHP at the same time.
"Yes, you can build AOT compilers that beat JITs produced by meta-tracing an interpreter" - not for all the languages. As mentioned below, there is a certain tradeoff associated with JITs - warmup time, memory consumptions etc. But for certain class of problems (say compiling Python) and for certain class of use cases (running at top speed), I dare you to compete with PyPy. The biggest contender so far comes from the Zippy project, which is indeed built on truffle which is a meta JIT, albeit a method based one.
Not sure, I don't have much experience with that. This is a "classic" Futamura projection - you write an interpreter and the "magic" turns it into a compiler. I'm not aware of any consumer-grade compiler built like that, but there is a huge swathe of research on it.
You can very easily create a dumb one - you essentially just copy-paste the interpreter loop (which is what e.g. Cython does if not presented with annotations) - however the results just aren't very good.
Research on partial evaluation (PE) was fashionable in the 1990s, but largely fizzled out. I was told that was because they could never really get the results to run fast. I'm trying to understand why. Clearly meta-tracing and PE have a lot of overlap. Truffle is based on some variant of dynamic PE if I understand what they do correctly. Most of the 1990s work in PE was more about static PE I think. The paper [1] touches on some of these issues, but I have not studied it closely yet.
The problem with native code is platform disparity. It's absurd that so many binary variants are required when shipping code to e.g. Android or iPhone.
This can apply to x86, too. Not sure if all your users have TSX-capable CPUs? You're now shipping two binaries. Not sure if your users have AVX? Now you have at least three binaries.
This is the biggest advantage of intermediate languages - your code can at the very least execute everywhere with a single image, and in some cases automatically take advantage of platform features such as AVX.
I think ART takes the correct approach: ship IL and AOT it on the device. Hopefully some day we can get the same type of system for LLVM IR.
The Intel C/C++ compiler creates different code paths depending on the CPU's capabilities. So at least there's no duplication of non-cpu-specific code.
>Hopefully some day we can get the same type of system for LLVM IR.
Right now the LLVM IR isn't platform-agnostic, e.g. when it comes to calling conventions.
Also, when Apple transitioned from 68000 to PPC in the '90s, then from PPC to x86 and recently from 32-bit to 64-bit, they used "fat binaries" where a single executable file contained code for several platforms. That way one and the same application could run on old Macs with PPC, newer ones with 32-bit x86 CPUs and really new ones with 64-bit-capable x86 CPUs.
> Right now the LLVM IR isn't platform-agnostic, e.g. when it comes to calling conventions.
But it does give you the same benefits in terms of CPU capabilities. You can't ship intermediate code that's agnostic about e.g. x86 vs x64 (C and C++ just aren't amenable to that, no matter what your intermediate representation looks like), but you can ship intermediate code that expects some form of x64 but postpones instruction selection until the program is installed, so you can take advantage of new instructions just when they are available. As I understand it, Apple is going to use it for that purpose.
> This can apply to x86, too. Not sure if all your users have TSX-capable CPUs? You're now shipping two binaries. Not sure if your users have AVX? Now you have at least three binaries.
Three code paths, not necessarily three binaries. It's easy enough to compile different variants of performance-critical routines and then select between them at run-time.
> I think ART takes the correct approach: ship IL and AOT it on the device. Hopefully some day we can get the same type of system for LLVM IR.
This isn't necessarily optimal though. Having certain CPU features may make algorithmic changes useful -- e.g., switching between lookup tables and direct computation depending on whether you have the CPU features which make the direct computation fast. There's no way an AOT compiler will be able to handle that for you.
No, but neither will compilation on a dev machine. You'll have to insert a manual switch which depends on the compilation target anyway.
Phoronix did a test where compiling with -march=native sped up programs a substantial amount - 20% for one application. It astounds me that we throw away that sort of gain.
As someone who has regularly had to deal with OS/400 / iSeries / System i / IBM i (can't wait to see what the next name is...) over the past decade or so, I've found my respect for these systems only increasing.
Sure it has a clunky default interface, and IBM artificially cripple most systems with licensing restrictions (owning a multi-core Cell processor on which I can only access one core? WTF?), but the virtualization capability, backwards compatibility, stability, reliability and performance really leave standard "consumer" grade server architectures for dead.
I can even run PASE, or Linux / AIX under an LPAR if I want to. It's pretty crazy what these things can do.
I find using them to be almost like living in some alternate steam-punk universe. It's like putting punch-cards into a quantum computer!
In software development we routinely ignore historical good ideas - we keep re-discovering the same things over and over again. I'd say that even if we eclipse JIT in the coming years, in the distant future someone is going to repeat our same mistakes.
The Gentoo Linux distribution works that way. It has an automatic package management system (Portage) similar to apt-get or rpm (and even more similar to ports in *BSD). The source meta-packages (specifying the compiler flags, patches, etc.) are maintained by the community.
You configure system-wide flags specifying your CPU, optimization level, and the features you want to enable and disable (many libraries and apps have flags to compile in support for some features or not). There were also app-specific flags IIRC.
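For illustration only (example values, not a recommendation for any particular setup), the system-wide settings live in a file like /etc/portage/make.conf and look roughly like this:

    # /etc/portage/make.conf (example values only)
    CHOST="x86_64-pc-linux-gnu"
    CFLAGS="-march=native -O2 -pipe"   # compile everything for this exact CPU
    CXXFLAGS="${CFLAGS}"
    MAKEOPTS="-j4"                     # parallel build jobs
    USE="X alsa -gnome -kde"           # globally enable/disable optional features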
When you install a package, it downloads the sources and the meta-package from Portage, and it is compiled on your machine with your flags, for your specific processor, with only the features you enabled. It was supposed to be much faster than "bloated ubuntu", but I suspect now it was mostly placebo :)
On the other hand installing Open Office on my Celeron 2000 laptop with 256 MB RAM took 10 hours, and if I wanted to enable some feature that turned out to be useful after all - I would need to wait another 10 hours.
I used it for two years when I was at university, but in the end it was too much to wait for hours whenever I needed something.
> It was supposed to be much faster than "bloated ubuntu", but I suspect now it was mostly placebo
Gentoo predates Ubuntu by several years. I remember reading some "gentoo ricer" threads bike-shedding over ultimately inconsequential optimization flags. But don't underestimate the simple benefit of `-march` and similar. At a time when binary distributions were pushing x86 binaries which ran on i386, compiling for a newer architecture could give a substantial increase in registers available, instructions available, and instruction scheduling quality. In aggregate, this definitely can improve performance.
Since then, I believe Debian moved to an i686 base which narrows the gap.
Well, at least in Windows it was always the case with .NET.
NGEN has been there since day one, but they always kept its optimizations quite simple and only allowed dynamic linking (no static binaries). Also, it required signed binaries, which many people did not bother with.
Windows Phone 8 adopted the Singularity toolchain with AOT compilation in the store, with a dynamic linker on the phone. .NET Native just moves the WP8 linker out of the device and produces static binaries instead.
One aspect the author doesn't talk about is emulation: The Wii/GameCube emulator Dolphin uses JIT to translate the console's PPC machine code to x86 machine code.
I'm also not sure if "scripting languages" truly don't need a JIT. The development of LuaJIT for example has been sponsored by various companies, so there seems to be a need for the fast execution of Lua code: http://luajit.org/sponsors.html
Yes, but to be fair, Lua is one of those few examples where a JIT makes a lot of sense because of the extreme simplicity of the language, which greatly expands the opportunities for a JIT to generate efficient code with relatively low overhead and compiler complexity.
This approach is great for some things, but it does sacrifice the flexibility and expressiveness of the language. Anything not part of the core language (which is a lot compared to most other programming languages, e.g. anything related to OOP, or more advanced data types than strings, floats and tables) has to be re-invented and/or bolted-on, which IMO makes it unsuitable for most kinds of applications.
This should not be interpreted as criticism of Lua the language, by the way; I'm a big fan of Lua for embedded scripting and I generally love tools with a narrow focus (as opposed to kitchen-sink technology). I would not choose Lua for anything besides embedded scripting though.
Usually I end up using Python for smaller things that aren't mission-critical and don't need maximum performance, C++ for almost everything else, and Objective-C for OS X/iOS stuff. Maybe Java for things where performance, safety and ease of development/maintenance all matter (the latter mostly for other people who would have to work on the same project and are less skilled in C++; I'm not at all a fan of the language myself, but I recognize it has some properties that make it a suitable choice for many kinds of applications ;-). I don't have enough experience with any other programming languages that can be deployed easily across Linux and OS X, so I can't comment on those. Rust seems to have some good ideas, so I may want to learn more about it in the future.
If I wanted to write something for Windows platforms I'd most likely gravitate towards C#, from what I know about the language it appears to have all the good things of Java without its downsides.
Fully agree. JIT makes sense for dynamic languages or real-time conversion of op-codes; for everything else AOT is a much better solution.
Having already had quite good experience with memory-safe languages when Java came into the world (mostly Wirth- and ML-derived languages), I never understood why Sun went with the JIT approach, other than trying to sell Java to run in the browser.
There were already quite a few solutions using bytecode as a portable executable format, AOT-compiled at installation time, e.g. on mainframes and in research OSes like the Native Oberon variants.
The best is to have both in the toolchain: an interpreter/JIT for development and REPL workflows, leaving AOT for deployment/release code.
Not if "everything else" includes long-running servers that may require various kinds of fiddling at runtime, like injection, or turning monitoring/troubleshooting code on and off; nor if you want zero-cost abstractions that are actually high-level abstractions (as opposed to the other kind of zero-cost abstractions, which are basically just weak abstractions that can be offered for free).
Once upon a time I drank too much of the JIT Kool-Aid, and for a moment I thought it was the future, kind of.
However, bearing the scars of what "fiddling at runtime, like injection or turning monitoring/troubleshooting code on and off" actually means in practice changed my mind.
JITs are so damn hard to tune that there are consulting companies specializing in selling JIT-tuning services.
And in the end they offer little performance advantage over using PGO or writing code that makes better use of value types and cache-friendly algorithms.
This was one reason why Microsoft introduced PGO for .NET in .NET 4.5.
I've gone the opposite way, and I really like my long-running server code to be JITted when possible. I only wish my old C++ servers were as tweakable at runtime as my Java servers [1], and I really like being able to use languages (even for DSLs or rule engines) that have a few powerful abstractions, and let the JIT make them run nearly as fast as the infrastructure code.
If you really need careful tuning of optimizations (which is never a walk in the park), it will be made much nicer with this: http://openjdk.java.net/jeps/165
It is possible to have powerful abstractions, AOT and performance.
One just has to look at Haskell, OCaml, Common Lisp, Ada, Eiffel, SPARK, Swift, .NET Native,....
The fact that for the last decade mainstream AOT was reduced to C and C++, kind of created the wrong picture that one cannot have performance in any other way.
I always dislike how they present some of the language features (stack allocation, structs and arrays) as if there weren't any other languages with them.
For monitoring, although one doesn't get something as nice as VisualVM or JITWatch, it is possible to bundle a mini-monitoring library if really needed.
Something akin to performance counters in Windows.
Ada (/SPARK) is a complex language that makes the developer think of every optimization-related implementation detail as they code (not unlike C++ or Rust), and Haskell/OCaml would benefit if functions that do pattern-matching could be inlined into their call site and all of their branches removed but one or two (the same goes for the other languages on the list).
Just to provide some feedback that I should have written on the sibling comment.
I have used both Java and .NET since they came out; there is the occasional C++ library, but they are the core of my work, so I know their ecosystems quite well.
Some of the mentioned scars were related to replacing "legacy" C++ systems while keeping comparable performance.
The article seems to ignore a variety of very real benefits that you get from the JIT. In many real world cases, there are optimizations that the JIT can perform. For example:
final boolean runF = ...; //In *this* program run, it works out to be false.
...
if (runF) {
f(x);
}
The JIT can just eliminate this branch entirely. More interestingly:
if (cond) {
myVar = f(x); //f(x) has provably no side effects
}
Supposing cond is nearly always true, the JIT can optimistically run f and in parallel check cond. In the unlikely event that cond is false, the JIT will still compute the return value of f(x), but just not do the assignment.
It surprised me when I saw all this happen, but the JVM's JIT is actually pretty smart and works very well for long running server processes.
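Roughly, the transformation being described turns the conditional call into something like the following (a sketch of the idea, reusing the names from the snippet above, not actual compiler output):

    // Both cond and f(x) are evaluated unconditionally so they can overlap
    // in the CPU pipeline; only the assignment remains conditional. This is
    // only legal because the JIT has proved f has no side effects.
    boolean c = cond;
    T tmp = f(x);      // T stands for whatever type f returns in the example
    if (c) {
        myVar = tmp;
    }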
In the first case, either the "if" is run in a tight loop, and then the branch predictor will likely predict it correctly, or it isn't, and then branch elimination brings only a trivial improvement.
In the second, what do you mean by "run f and in parallel check cond"? How would that translate to machine code? Do you want to run and synchronize cond and f(x) on two separate CPU cores, or what?
And if it's not run in a tight loop, the JIT can often still detect it.
I don't know why you believe such things are trivial. On one occasion I altered some code which broke this, and it added about 10-20 seconds to a 15 minute HFT simulation. If that code made it to prod, those 10-20 seconds would be distributed around the hot parts of an HFT system. That's not remotely trivial, at least for performance critical applications (HFT, RTB, etc).
In the second example, basically what happens (if you get lucky - I haven't found the JVM JIT to be super consistent) is that computation of cond and f(x) will be put into the pipeline together, and only the assignment to myVar will be conditional. This is strictly NOT something the CPU can do by itself, since there are side effects at the machine level (allocating objects, and then GCing them).
> This is also something the CPU's branch predictor can figure out and do the same optimisation on
The benefit of this optimisation isn't correctly predicting the branch, which is the only thing the CPU branch predictor can do, it's removing the untaken branch from the optimisation pipeline.
if (x) {
y = rand();
} else {
y = 14;
}
y * 2;
In this code, if x is never true, then not only will the untaken branch be removed, but the code following the if statement can be optimised knowing for sure that y is always the constant 14, and then the whole thing can be constant-folded.
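In other words, if x has never been observed to be true, the JIT can speculatively reduce the whole thing to roughly this (a sketch, with an uncommon trap left behind in case the assumption is ever violated):

    if (x) {
        deoptimize(); // placeholder for the uncommon trap: recompile with the original branch restored
    }
    // y is known to be the constant 14, so y * 2 folds to 28 at compile time.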
I don't think it ignores that, just says that such benefits were always smaller in reality than promised, and not worth the lack of predictability WRT performance.
Why did Java stay with a JIT for so long, anyway? Java is hard-compilable, and GCC can compile Java. The original reason for a JIT was to speed up applets running in the browser. (Anyone remember those?) Server side, there was never much point.
> Why did Java stay with a JIT for so long, anyway?
I read in some forum once that it was political.
At a certain point it became kind of a heresy at Sun to suggest AOT compilation, hence all Java AOT compilers being available only from third-party JVM vendors.
If someone from those days at Sun can confirm or deny it, please do.
I imagine it took Oracle some beating in high-performance trading, from the companies wanting to move from C++ to Java, to really change their mind. This is what triggered the value types and JNI-replacement work, so AOT was the next checkpoint on the list, I guess.
JIT gives you more information about how the program is actually run. This allows a large number of optimistic optimizations that GCC etc... can not make.
In theory this allows faster code to be generated than ahead-of-time compilers can produce. In practice this is undermined by the pointer-chasing default data structures in Java.
If you are executing "C" using JVM techniques then a research JVM JITing C code gets within 15% of GCC on a reasonable set of benchmarks. (http://chrisseaton.com/plas15/safec.pdf).
These benchmarks are favorable for the AOT GCC as they have few function pointers, startup configuration switches and little dataflow variety.
I suspect that specific compiler optimizations on the C code for the dense matrix transform in GCC are a bigger part of why GCC is faster than the fact that it is AOT rather than JIT.
There are also a number of AOT compilers for Java, e.g. Excelsior, but I have not seen superior performance from them over an equivalent HotSpot release.
> JIT gives you more information about how the program is actually run. This allows a large number of optimistic optimizations that GCC etc... can not make.
They'd be surprised to learn that considering GCC supports profile-guided optimisation.
Modern JITs basically are very sophisticated profile-guided optimizing compilers. Unlike AOT PGOs, though, JITs can adapt to changing program phases, that are common in server applications (e.g. when certain monitoring/troubleshooting/graceful-degradation features are turned on). On the whole, JITs can produce better code.
But this, too, is a tradeoff. Some JITs (like JS's in the browser) require fast compilation, so some optimizations are ruled out. Very sophisticated optimizations require time, and therefore are often useful when you have tiered compilation (i.e. a fast JIT followed by an optimizing JIT), and tiered compilation is yet another complication that makes JITs hard to implement.
An AOT compiler can decide to compile a JIT code generator into the binary, so even in theory AOT beats JIT. In practice it almost never makes sense to do this for a language that isn't full of gratuitous dynamism.
> An AOT compiler can decide to compile a JIT code generator into the binary, so even in theory AOT beats JIT
That is not AOT. That is compiling some code AOT and some JIT[1]. It's not a contest. Those are two very common compilation strategies, each with its own pros and cons.
> In practice it almost never makes sense to do this for a language that isn't full of gratuitous dynamism.
What you call "gratuitous dynamism" others call simple, general abstractions. BTW, even Haskell/ML-style pattern matching qualifies as "dynamism" in this case, as this is something a JIT optimizes just as it does virtual dispatch (the two are duals).
[1]: Also, a compiler doesn't generally insert huge chunks of static code into its output. That's the job of a linker. What you've described is a system comprised of an AOT compiler, a JIT compiler, and a linker that links the JIT compiler and the AOT-compiled code into the binary. Yeah, such a system is at least as powerful as just the JIT component alone, but such a system is not called an AOT compiler.
JITs could only optimize pattern matching if one particular branch is picked all the time, but not statically knowable. That is a very niche case.
Really, in almost all cases where JITs are effective it's because of a language where you have to use a dynamic abstraction when it should have been a static one. The most common example being virtual dispatch that should have been compile time dispatch. Or even worse, string based dispatch in languages where values are semantically string dictionaries (JS, Python, Ruby). A JIT is unnecessary in languages with a proper distinction between compile time abstractions and run time constructs. Names should be a purely compile time affair. Instantiating an abstraction should happen at compile time in most cases. There goes 99% of the damage that a JIT tries to undo, along with a lot of the damage that a JIT doesn't undo. Sadly most languages lack these abilities. Rust is one of the few languages that has a sensible story for this, at least for the special case of type abstraction, traits, and lambdas. I'm still hoping for a mainstream language with full blown staged programming support, but I guess it will take a while.
> JITs could only optimize pattern matching if one particular branch is picked all the time, but not statically knowable.
They would optimize it if one particular branch is picked all (or even most of) the time but not statically knowable at any given (extended) call site, and by "extended" I mean up to any reasonable point on the stack. That is not a niche case at all. Think of a square-root function that returns a Maybe Double. At most call sites you can optimize away all matchings, but you can't do it statically. Statically-knowable is always the exception (see the second-to-last paragraph).
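A rough Java analogue of that example (hypothetical names, using Optional in place of Maybe): at a call site whose argument happens to always be non-negative in practice, the JIT can speculate that the result is always present and drop the empty branch behind a deoptimization guard, something an AOT compiler cannot prove in general.

    import java.util.Optional;

    class SafeMath {
        static Optional<Double> safeSqrt(double x) {
            return x >= 0 ? Optional.of(Math.sqrt(x)) : Optional.<Double>empty();
        }

        static double stddev(double variance) {
            // variance happens to be non-negative in every run seen so far, but an
            // AOT compiler can't know that for an arbitrary caller; the JIT learns
            // it from profiling and speculates that the Optional is always present,
            // compiling this down to little more than a plain sqrt plus a guard.
            return safeSqrt(variance).orElse(Double.NaN);
        }
    }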
> Names should be a purely compile time affair.
That doesn't work too well with dynamic code loading/swapping... But in any case, you're now arguing how languages should be designed. That's a whole other discussion, and I think many would disagree with you.
> There goes 99% of the damage that a JIT tries to undo, along with a lot of the damage that a JIT doesn't undo.
At least theoretically that's impossible. The problem of program optimization is closely related to the problem of program verification (to optimize something you need proof of its runtime behavior), and both face undecidable problems that simply cannot be handled statically. A JIT could eliminate any kind of branching at any particular site -- be it function dispatch, pattern matching or a simple if statement -- and eliminating branches opens the door to a whole slew of optimizations. Doing branch elimination statically is simply undecidable in many, many cases, regardless of the language you're using. If you could deduce the pertinent information for most forms of (at least theoretical) optimization, you've just solved the halting problem.
Whether or not what you say is true in practice depends on lots of factors (like how far languages can take this and still be cheaply-usable, how sophisticated JITs can be), and my bet is that it would end up simply being a bunch of tradeoffs, which is exactly where we started.
How much of the compiled code that runs day to day is shipped with a PGO build? How many packages are compiled at O2 instead of O3 in the average Linux distribution?
Even then you assume the profile resembles the real program input. Not always true.
So GCC supports PGO, but nearly no one uses it. On the JIT side, at least, all programs running on the common JVMs and .NET runtimes use PGO.
Even with PGO, one can't make all the optimizations one can in a managed runtime. e.g. null pointer exceptions -> dereferencing a null pointer is handled in the rare case in the JVM by segfaulting and dealing with it in a signal handler (after 3x this code will be deoptimized and turned into an if null check). This is an approach GCC cannot take as it cannot install signal handlers, so at best it can do the check and mark the segfault path as unlikely.
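To make that concrete, here is a minimal sketch (hypothetical names) of the kind of source code involved: the JVM can compile the field load below with no null test at all, relying on the segfault/signal-handler path to catch the rare null, and only after repeated traps recompile the method with an ordinary explicit check.

    class Holder {
        int value;
    }

    class ImplicitNullCheck {
        // Compiled optimistically as a bare memory load: no compare, no branch.
        // If h is ever null, the resulting segfault is caught by the JVM's signal
        // handler, which raises the NullPointerException and, after a few
        // occurrences, deoptimizes and re-emits this with an explicit null check.
        static int getValue(Holder h) {
            return h.value;
        }
    }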
> How much of the compiled code that runs day to day is shipped with a PGO build? How many packages are compiled at O2 instead of O3 in the average Linux distribution?
That's irrelevant to my objection. Also O3 isn't necessarily a gain over O2, so it doesn't make sense to blanket-compile everything as O3.
> Even then you assume the profile resembles the real program input. Not always true.
That is at least somewhat relevant, but your original comment stated that GCC can not make usage-based optimisations, not that it won't always make the right ones.
> Even with PGO, one can't make all the optimizations one can in a managed runtime. e.g. null pointer exceptions -> dereferencing a null pointer is handled in the rare case in the JVM by segfaulting and dealing with it in a signal handler (after 3x this code will be deoptimized and turned into an if null check). This is an approach GCC cannot take as it cannot install signal handlers, so at best it can do the check and mark the segfault path as unlikely.
GCC can do one better, given dereferencing null pointers is illegal it doesn't need to handle that at all, and thus has no need to optimise something which doesn't exist.
>> Even then you assume the profile resembles the real program input. Not always true.
>That is at least somewhat relevant, but your original comment stated that GCC can not make usage-based optimisations, not that it won't always make the right ones.
There are things that PGO can not optimize away that an optimistic JIT+managed runtime can.
> GCC can do one better, given dereferencing null pointers is illegal it doesn't need to handle that at all, and thus has no need to optimise something which doesn't exist.
Unfortunately, null pointer dereferences do exist even if their behavior is "implementation specific" in C; segfaults are depressingly common. And if you take GCJ as an AOT compiler, then GCC as a compiler collection cannot make that optimization.
I think this points to a philosophical difference between JVM compilers and GCC/ICC compilers. The JVM optimizes things away because behavior is defined in a certain way, while GCC tends to say "you are asking for undefined behavior, so I don't bother to do what you asked at all".
So I would not call it doing one "better"; it really is doing one "worse".
I'm surprised to read about this direction, however little practiced it may be. JIT was never for the fastest of applications, but just-in-time optimization should (and, on some occasions, has been shown to) reach and even exceed the speed of precompiled ('native') code, simply because the JIT optimizer has access to the actual runtime data being fed to the Tight Loop of Performance Bottleneck and can do local and specific optimization for that run alone. An AOT-optimized binary can, in theory, only do generic optimizations.
Several years ago, HP (I think) published a research paper investigating JIT on precompiled binaries, exactly for this purpose. They claimed around 5-10% speedup on x86 code. Unfortunately the project never got released.
JIT may be a bad fit for real-time phone apps. But at the same time we're seeing the return of batch processes thanks to "big data".
At my last job I worked on Spark processes that took several hours to run. In research, performance is important, but you do get to average over runs. So I don't think JIT will go away for that case; it wasn't worth hand-optimising (most jobs were only run a couple of times), but at the same time I was very glad it wasn't Python.
On the one hand, it intuitively makes sense that JIT compilation is inferior to AOT compilation for an application that will be packaged once on a developer's machine and then run on thousands or millions of devices. Doing the JIT compilation on all of those devices is a waste of energy; it's better to do the compilation just once on the developer's machine.
Also, JIT compilation requires warm-up to get to good performance. The kind of consumer who won't pay more than a few dollars for an app will also be turned off by an app that isn't snappy right away. So the first impression is everything, and a warm-up period before good performance isn't as acceptable here as it is in server applications.
On the other hand, on the major mobile platforms, native apps which can be AOT-compiled have to be distributed by gatekeepers, i.e. the app stores. On these platforms, the only alternative to the gatekeepers is the web. So I have a non-technical reason to want the last major JIT-compiled language on the client side, JavaScript, to be good enough for a variety of applications.
I was going to comment on the advantages of JIT compilation with regard to polymorphism, but then I found that the OP addressed that with his argument about predictability being more important than best-case or even average-case performance.
Finally, I'm curious if the author of the OP would use the same argument about predictability over best or average case performance to argue for reference counting over tracing GC. Maybe Android's ART and .NET Native would benefit from using reference counting in combination with a backup tracing GC to handle cycles, like CPython.
The .NET GC is quite good, and they introduced features in version 4.6 that even allow fine-grained control over when collections happen and how.
Reference counting is only helpful if directly supported by the compiler, to remove inc/dec pairs. Also it is worse than GC for multi-threaded applications.
Currently I think the only alternative to GC is substructural type systems. Even the C++ guys are now looking into this. The problem is making it palatable to the average mainstream programmer.
In my company, we develop on Windows, build on Linux and run on Solaris (Java EE). Moreover (forget devops!) the deployment team is not the development team.
Having an environment (JVM + web server + our code) that is the same across the various hardware and company's teams is really something that helps.
So besides the AOT/JIT discussion, the value of the JVM in itself is really good for us.
In the case of Android, the big downside is very slow system updates because all apps on the device have to be recompiled. But it probably can be fixed.
The Julia language seems to be designed to be nothing more than optimised glue between two Python scripts. As a dynamic language, it's certainly a good match for a JIT (run at once after a quick edit). And that also explains why there can never be a standalone executable produced from Julia, since that would require AOT. My guess is that Julia will dodge the Jitterdämmerung for a long time to come, if not forever.
Julia does very little tracing, except to infer types, which it then passes on to LLVM. All the optimizations in Julia are because Julia stresses type stability, and once types are known, LLVM can generate very good code specific for those types.
You can easily annotate functions with types in Julia, and pre-compile specialized functions. Julia does very little runtime magic to speed up code, especially compared to PyPy. The speed just comes from a very clever type system + LLVM.
There is no "shift away from JIT", just a simple observation that JITs are a tradeoff that you sometimes don't want to make. They have three drawbacks and two advantages. The three drawbacks are 1/ slow warmup, 2/ somewhat increased RAM and CPU usage and 3/ complex implementation. The two advantages are 1/ (often much) better runtime performance (i.e. more optimized code) and 2/ much better support for runtime manipulation of code (for profiling, debugging "at-full-speed", hot-patching, monitoring etc.).
The slow warmup makes JITs a bad choice for quick command-line tools, and the increased RAM/CPU makes them a bad choice for battery-powered devices and those with very limited RAM. The better performance and optimization opportunities makes them the only performant choice for some languages that are hard to optimize AOT (those that rely a lot on dynamic dispatch and/or have dynamic data-structures, i.e. maps instead of class instances). The better instrumentation support makes them a terrific choice for long-running server-side application, where the drawbacks don't matter. On the client-side, unless the language requires a JIT for decent performance (like JS), there is no compelling reason to use one, and that's why Microsoft's decision makes perfect sense, as they've decided to focus on .NET on the client. This has nothing to do with JITs' great utility in general.
One of the biggest breakthroughs in compiler technology in the last decade is Oracle Lab's Graal[1], which can also be used as an AOT, but with less-powerful optimizations. E.g. Graal does this: https://twitter.com/ChrisGSeaton/status/619885182104043520
Graal (alongside its language-construction DSL, Truffle) has yielded implementations of Ruby, Python and JS that easily rival the state-of-the-art with far, far less effort, and also a very decent implementation of C (also with orders-of-magnitude less effort than the competition).
The third drawback makes developing JITs from scratch a bad choice for anyone but the most well-resourced teams, or those that are in no hurry, or those that target very simple languages only.
JITs have another drawback -- less predictable performance -- that matters mostly for hard-realtime (or nearly hard-realtime) code, as deopts momentarily slow down execution while a better optimization strategy is sought. This is why hard-realtime JVMs offer a mixed JIT/AOT mode, where the hard-realtime kernel (which values predictability over performance) is AOT-compiled, and the soft-realtime or non-realtime support code (which you want to run as fast as possible, but where you don't mind a rare hiccup) is JITted.
> 1/ (often much) better runtime performance (i.e. more optimized code) and
Do you have a cite for that? Except for method inlining, which only requires a trivial JIT compiler, I haven't seen them beating normal compilers. The argument that you should be able to use the programs runtime data to perform specific optimizations seems to me to be oversold. Since the JIT compiler must be reasonably fast it eschews many optimization techniques that AOT compilers can afford to use.
Look at the tweet I linked to. And bear in mind that languages that are normally AOT-compiled (like C and C++) are often designed in such a way that the information pertinent to most optimizations is available at compile time, plus programmers in those languages don't make use of more general abstractions unless they have to (which makes, say, C++ more complicated, as it has two dispatch mechanisms the user needs to be aware of). Such languages obviously won't be accelerated much by a JIT, but they have a big complexity cost, as various low-level optimization considerations have to be exposed by the language. JITs make simpler, higher-level languages as performant, and there's plenty of data to support that.
But your claim was "(often much) better runtime performance". What you link to doesn't support that. I can concede that a JIT can attain performance equivalent to AOT-compiled code, but I haven't seen any evidence that JITing in practice increases performance. Note that there are many modern high-level performant languages that do not use a JIT, like Nim, Haskell, Julia and Rust.
For example, the V8 engine has both a JIT and an AOT compiler. The engine compiles all code with the fast AOT compiler and then recompiles frequently used functions with the optimizing JIT compiler. If it instead compiled all code with the optimizing compiler AOT, the JIT part wouldn't be needed and you would get just as fast code.
JIT can only beat AOT if you can exploit patterns in the dataflow of the program. Doing that profitably (i.e. the optimization must save more time than it costs to perform) is incredibly hard.
What do you mean? There is no way such optimizations could be done AOT on Ruby/JS code. Similarly, see chrisseaton's comments example on this thread.
> If it instead compiled all code with the optimizing compiler AOT, the JIT part wouldn't be needed and you would get just as fast code.
So why do you think they do that? :) The reason is that many optimizations are not available to an AOT compiler. An AOT compiler can optimize something only if it can prove that no matter what happens, the optimization will preserve semantics. A JIT can do speculative optimizations, i.e. optimizations that may change semantics and may not always be true, but would be true if the program behaves as it has so far. A JIT can do that because if its assumptions are wrong, it can deoptimize. Many, many abstractions are candidates for speculative but not definitive optimizations, such as virtual function calls, pattern matching, if statements etc.
> JIT can only beat AOT if you can exploit patterns in the dataflow of the program. Doing that profitably (i.e. the optimization must save more time than it costs to perform) is incredibly hard.
That is pretty much what all (good) modern JITs do. But, as I said, a JIT is indeed more complex than an AOT compiler. Thankfully we now have great frameworks such as Graal/Truffle and RPython that take almost all the pain out of creating a good JIT (or a good compiler in general).
Haskell and Rust both require monomorphization, right? That's one thing the JVM doesn't require. You do pay a performance penalty for megamorphic code (http://insightfullogic.com/2014/May/12/fast-and-megamorphic-...) but it's still a difference in what's allowed.
I also don't understand what you're saying with regard to V8. Can the optimizing AOT compiler actually do enough optimizations on a language as dynamic as JS?
Sure, megamorphic code requires a JIT. But it's a trivial one so I don't count it. :) Essentially, if you have the expression:
z = x + y
If you statically can't know the types of x and y, you must do an expensive method lookup: you call one variant of the plus function if x and y are floats, another if they are strings or lists, and so on. So a good runtime notes the types of x and y the first time the expression is evaluated and replaces the expensive call to the general_add() function with, say, if x and y are 32-bit ints, a quick add_32bit_ints() call.
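Expressed as a sketch in Java (hypothetical names; generalAdd stands in for the general_add() lookup described above), the rewritten call site amounts to a guarded fast path in front of the generic dispatch:

    class InlineCacheSketch {
        // Inline-cache idea: the guard comes from the types observed on the first
        // execution; if it ever fails, fall back to the generic dispatch (and
        // possibly recompile with a different specialization).
        static Object add(Object x, Object y) {
            if (x instanceof Integer && y instanceof Integer) {
                return (Integer) x + (Integer) y;   // specialized add_32bit_ints path
            }
            return generalAdd(x, y);                // slow path: full method lookup
        }

        // Stand-in for the full dynamic lookup over all supported operand types.
        static Object generalAdd(Object x, Object y) {
            if (x instanceof String || y instanceof String) {
                return String.valueOf(x) + String.valueOf(y);
            }
            if (x instanceof Double || y instanceof Double) {
                return ((Number) x).doubleValue() + ((Number) y).doubleValue();
            }
            throw new IllegalArgumentException("unsupported operand types");
        }
    }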
The semantics of those are different. With trait objects the vtable is attached to the value. You could instead imagine a language with two types of trait bounds: one specialized statically (like Rust trait bounds), and the other handled dynamically (like type classes in Haskell). The semantics of these would be identical (the vtable travels independently of the values); the only difference is performance. I'm not sure if you'd ever want Haskell's implementation strategy in practice though, unless you want to support features that can't be supported by static specialization, like polymorphic recursion or higher-rank polymorphism (not to be confused with higher kinds). Interestingly, C# does support those features and specialization, because it specializes at run time. It's a less-known but very powerful feature, at least in theory :) You can abuse it to make the CLR generate arbitrary code at run time.
In what way does the vtable being attached to the value cause different semantics (what you describe sounds like an implementation detail)? In particular, you can have Box<TraitObject> which has effectively identical semantics to TraitObject; yes, it's a fat pointer, but from the perspective of the trait itself there's no way to tell that this is the case. Anyway, the only ways I can think of to usefully differentiate the fat from a thin pointer in a parametric function are those in which Rust already fails to have proper parametricity for any type (including being able to access its type_id).
Steve Klabnik was not talking about Box<TraitObject> vs TraitObject, but about f(x:TraitObject) vs f<T:TraitObject>(x:T). In the former the vtable is attached to the value, in the latter the vtable travels separately. These do have different semantics, compare f<T:TraitObject>(x:Vec<T>) with f(x:Vec<TraitObject>). In the x:Vec<T> case there is a single vtable that gets passed to the function, and all elements of the vector share that same vtable. With x:Vec<TraitObject> each element of the vector has its own vtable.
In terms of pseudo Haskell types these two types would be:
TraitObject t => List t -> Result
List (exists t. TraitObject t) -> Result
Rust cleverly sweeps the existential under the rug.
Again, in what way is the use of fat pointers a semantic difference, outside of parametricity-breaking functions like size_of_ty and type_id? You are describing an implementation detail. I can think of times that the compiler should be able to desugar the fat trait objects into thin ones in the absence of a Reflect bound.
As I explained, the difference is that in one case you have one vtable per object in the vector, in the other case you have one vtable for the whole vector. With one vtable per object you can have one vector containing objects of two different types that implement the same trait with a different vtable. With one vtable for the whole vector you cannot. The same difference exists in many languages, e.g. C# List<IFoo> vs List<T> where T:IFoo.
I'm not sure what you're getting at regarding monomorphization, can you elaborate? What do Haskell or Rust have to do with this (and I don't think Haskell even does monomorphization)?
For anyone confused by the title - it's a reference to a part of Wagner's epic opera cycle "Der Ring des Nibelungen" called "Götterdämmerung", or "Twilight of the Gods".
The title is still confusing though: I immediately parsed "Jitterdämmerung" as "the twilight of jitter". Which almost made sense, given that JITs increase the variance of run times...
I think "Jitterdämmerung" is strictly better than "Gitterdämmerung" would be, because there really are multiple JITs but there's only one Git. (On the other hand, the Hamming distance to "Götterdämmerung" would be less.)
+1. Actually, the awaited advancement is here already (hg instead of git), but for the acolytes' adherence to the buzz. The cargo cult of `SCM == git` seems at the height of its power.
While as a Wagnerian I noticed this immediately, I feel the subject of this post was not exactly of a scale to warrant an allusion to an epic[1] tetralogy.
But then again, I could be being pedantic.
[1] Frankly, no words would do justice to describe the magnitude of Wagner’s historical opus.