Tracing JITs and modern CPUs (github.com/lukego)
78 points by jsnell on July 24, 2015 | 42 comments



Anyone interested in learning more about tracing JITs and the problems involved in implementing them should read the paper "Trace-based Just-in-time Compilation for Haskell" by Thomas Schilling [1].

In it, he explains the details of implementing a basic VM and tracing JIT based on LuaJIT, and deals with many of the issues involved. For example, the choice of where to begin a trace and how long to trace is crucial for performance. Traces that are too long will rarely complete, and poor selection (tracing something that won't actually be on the hot path) carries a significant cost. With poor trace selection, a tracing JIT can even be slower than an optimized interpreter. Interestingly, the language itself also influences the viability of tracing: a language with explicit loop instructions is easier to trace, since any loop instruction is an intuitive starting point, whereas a language that relies on recursion and TCO is less cooperative in this regard. One possibility is rewriting recursive constructions into imperative loops in a pass prior to trace selection.
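
To make the loop-anchoring idea concrete, here is a rough C sketch (my own illustration, not code from the paper; the names and the threshold are made up): the interpreter keeps a hotness counter per loop header and starts recording a trace once the counter crosses a threshold.

    /* Illustrative sketch only: a backward-branch target, i.e. a loop
     * header, is a natural trace anchor.  The interpreter bumps a counter
     * each time the header is reached and starts recording once the loop
     * looks hot. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define HOT_THRESHOLD 56  /* arbitrary; real VMs tune this heuristic */

    struct loop_header {
        uint32_t hit_count;       /* how often this loop head was reached */
        void    *compiled_trace;  /* NULL until a trace has been compiled */
    };

    /* Called by the interpreter whenever a backward jump targets `hdr`. */
    bool should_start_recording(struct loop_header *hdr)
    {
        if (hdr->compiled_trace != NULL)
            return false;         /* machine code already exists; run it */
        return ++hdr->hit_count >= HOT_THRESHOLD;
    }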

Personally, after reading the paper I think that there is the possibility of amazing performance from tracing JITs, but the unpredictability and reliance on heuristics make the practical value over method JITs or static compilation questionable. It is similar to the complexities of GC implementation: as performance gains are made, complexity shoots through the roof while predictability suffers. There's no easy answer to that problem.

[1]: http://src.acm.org/2013/ThomasSchilling.pdf


The Schilling paper is mostly a re-rendering of the original paper

   Dynamo: A Transparent Dynamic Optimization System [1]
by Bala et al., which invented (or rather, reinvented and popularised) tracing JIT compilation. It might be worth reading the original. In addition, Haskell is not such a great target for JITing, as it's statically typed, leaving less scope for optimisation at run-time.

[1] https://www.cs.virginia.edu/kim/courses/cs771/papers/bala00d...


> Haskell is not such a great target for JITing, as it's statically typed, leaving less scope for optimisation at run-time

I'd dispute this. Types are just one thing that you can speculatively optimise for at runtime which may not be practical to do at compile time. Other things include value ranges, tighter types than appear in the source code, branches taken/not-taken, contended/non-contended shared resources such as MVars and TVars, whether an Integer fits into a word or not, etc.

In my group we're looking at using a JIT for C, where we can do things such as inline through a function pointer by speculating that it is stable.
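
As a hand-wavy sketch of what that speculation means (an invented illustration, not our actual system): guard on the pointer value observed during profiling, run an inlined body on the fast path, and fall back to the generic indirect call when the guard fails.

    /* Hypothetical sketch of speculating that a function pointer is stable. */
    #include <stdio.h>

    typedef int (*op_fn)(int);

    static int add_one(int x) { return x + 1; }   /* value seen while profiling */

    static int apply(op_fn op, int x)
    {
        if (op == add_one)        /* guard: is the pointer still what we saw? */
            return x + 1;         /* "inlined" body, no indirect call */
        return op(x);             /* guard failed: generic slow path */
    }

    int main(void)
    {
        printf("%d\n", apply(add_one, 41));   /* fast path, prints 42 */
        return 0;
    }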


I agree that there is scope for JITing in Haskell. But JITing is not without cost, and in statically typed languages some of the big gains that make JITs for JavaScript or Python so powerful go away.

If you are using C in a way that requires frequent invocation through a function pointer in hot code, you are probably using an OO idiom, which means casting, and so circumventing C's typing.


In Dynamo, it was tracing actual instructions, not the source.


Yes, Dynamo traces machine code. What language to trace is orthogonal to the idea of tracing. Any language can be traced.


My comment should have added something about the trace not having access to type information from the language. When LuaJIT and V8 trace, the VM has access to a higher-level form than what Dynamo was dealing with.

You were specifically talking about tracing a static vs. a dynamic language. I agree, the opportunity for a speedup is larger for dynamic languages because there is more code bloat dispatching on object type, whereas in Haskell that should have been elided during compilation, so the Haskell code is much tighter to begin with. So the speedups seen for tracing Haskell should be similar to the gains seen by the Dynamo team when they traced native executables (most were generated from C afaik). Dynamo didn't have type information other than what it could extract from the assembly.

I'll try and make a more complete post next time.


I don't know the innards of the Haskell compiler well enough, but, as chrisseaton writes above, tracing Haskell could lead to speedups better than Dynamo's. For example, data for generic functions in hot code could be unboxed, laziness could perhaps be 'switched off' locally, etc. But the easy gains that you see for dynamically typed languages, as we agree, are unlikely to be achievable. Nevertheless, SPJ has suggested to the PyPy/RPython people that they write a meta-tracer for Haskell, just to be able to measure what's possible.

As to Dynamo, I think some of the key gains came from the ability to shortcut jumps to jumps to jumps to ... that could arise from linking separately compiled code. However, modern CPUs diminish the penalty such frivolous jumping incurs, so Dynamo became less and less competitive with tracing on higher-level languages. (BTW, Dynamo can trace everything, not just C executables.)


Thanks! As someone who hasn't worked on any JIT compiler, the Schilling paper was very easy to read and comprehend. I'll definitely check out the Dynamo paper too.

Haskell seems like an interesting language to implement, being both statically typed and lazy. Static typing provides some optimization avenues but having to evaluate thunks at runtime complicates many common optimization strategies.
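
To illustrate what that runtime work looks like, here is a rough C sketch of a thunk (the representation is invented for illustration; GHC's real closure layout is more involved): a suspended computation that is forced on first use and then memoised.

    /* Rough sketch of a thunk: a suspended computation forced on first
     * use and memoised afterwards. */
    #include <stdbool.h>

    struct thunk {
        bool  forced;               /* has the computation run yet? */
        int   value;                /* valid only once forced is true */
        int (*compute)(void *env);  /* the suspended computation */
        void *env;                  /* its captured environment */
    };

    int force(struct thunk *t)
    {
        if (!t->forced) {
            t->value  = t->compute(t->env);
            t->forced = true;       /* memoise: runs at most once */
        }
        return t->value;
    }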


I had a short conversation at ECOOP with Ben Titzer (who works on V8 at Google, and I think used to work on HotSpot at Sun), who said simply: tracing JITs are good for short programs; for large programs you need a partial-evaluation method JIT. Partial evaluation also produces better results for a larger class of programs.

Tracing JITs' advantages are that they're simpler to write than partial evaluation JITs, and produce decent code faster, which is why they're used for JavaScript.

There is great progress in partial evaluation JITs going on in the Graal/Truffle project[1] (Graal is the compiler; Truffle is a framework to write language frontends for Graal), which will be available as a pluggable compiler for HotSpot in Java 9.

Here are a couple of optimizations Graal yields for Ruby: https://twitter.com/ChrisGSeaton/status/619885182104043520

[1]: https://wiki.openjdk.java.net/display/Graal/Publications+and...


As far as I'm aware partial evaluation JITs like Graal/Truffle have atrociously bad warmup times. Warmup is the phase before the optimised code is ready to be executed. This is already a problem for tracing JITs, but is supposedly worse for partial evaluation JITs. All papers I have seen measure JIT performance post warmup.


Right, and this is why people say that the JVM is slow to start. It actually starts up pretty fast (a hello world program starts the JVM, reads the class file from a JAR archive, runs it and terminates in under 80ms), but it takes a while to achieve max performance.

This is hardly a problem for server apps, and a potentially serious issue for client code like JavaScript on a web page.

HotSpot has recently moved to a tiered-compilation scheme, where a fast compiler produces unoptimized code quickly, which continues to collect metrics for the profile-guided optimization, and the optimizing compiler kicks in later, once enough profiling information has been collected.


The JVM is not a tracing JIT compiler. Tracing is a very slow operation. Hence the warmup problem is substantially worse for tracing JITs.

I saw figures for warmup in tracing JITs that were so hilariously bad that I thought they must be a typo.

People I know are currently writing a paper on this subject. I hope it will come out in the next 2 or 3 months. That should put our discussion, and the warmup performance of tracing JITs, on a firmer basis.


> The JVM is not a tracing JIT compiler

He didn't claim or imply it was.

> Tracing is a very slow operation.

Wait a second - you were just saying that PE was slow. Now you are saying that tracing is slow instead?

> Hence the warmup problem is substantially worse for tracing JITs.

This is contrary to everything I've seen in the research. Tracing JITs like LuaJIT and PyPy are fast to warm up compared to method JITs and PEs.


I said, or was trying to say, that DPE and tracing have bad warmup.

    Tracing JITs like LuaJIT and PyPy are fast to warm up compared to method JITs and PEs.
I have heard the opposite about PyPy, which is meta-tracing.

I'm a bit uncomfortable with saying "I have heard ..." As far as I'm aware there is no actual published research on this question. But there will soon be, so I would like to withdraw from this debate until I have read that research.


We have known for a long time that the worst performance comes from iterating a "fetch value from memory -> make decision based on result about what to fetch next" pattern. Unfortunately almost everything that can be described as business logic looks like that. It's also common in object-orientated systems.
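
A minimal sketch of that pattern, using a plain binary-tree lookup as the example: every load address depends on the value just loaded, so the cache misses serialise instead of overlapping.

    /* The "fetch a value -> decide what to fetch next" pattern in its
     * simplest form: each load depends on the previous one. */
    #include <stddef.h>

    struct node {
        int          key;
        struct node *left, *right;
    };

    const struct node *lookup(const struct node *n, int key)
    {
        while (n != NULL && n->key != key)
            n = (key < n->key) ? n->left : n->right;   /* dependent load */
        return n;
    }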


Only if you program with atoms. If you program with arrays then it is my experience that very few problems look that way.

Array programming also keeps absolute program size small, which means your program isn't running in main memory but in cache, which by itself produces anywhere from 10-100x speedups for business software!

I don't know if I fully understand what you mean by object-orientated.


I don't suppose you could elaborate on what you mean by atoms vs arrays here? I've not heard anyone say "atom" outside of a LISP context.

A fairly random example of OO "business logic": https://git.eclipse.org/c/b3/b3.git/tree/org.eclipse.b3.aggr...

(I went clicking through Eclipse because I knew I could count on it to have lots and lots of this kind of abstraction-heavy bland Java. I'm not even saying this particular bit of code is slow, it's just that there's a lot of OO code that looks vaguely like this.)


I believe what he's referring to is also called AoS (Array of Structures) vs SoA (Structure of Arrays).

If you're iterating over one or two values in a struct via AoS, it's very painful from a memory caching standpoint, since you're only getting sizeof(member) / sizeof(struct) efficiency.

In a SoA situation all the data for the members you're iterating over is tightly packed, so you usually get much better cache utilization.

That said, the best approach is actually looking at your data access patterns and packing your data appropriately. Unfortunately some languages don't have value types (most managed languages except C#). This is why C/C++ is usually faster than managed languages: they can get close on generated code but can be off by a factor of 50x in memory access performance.
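
A small C sketch of the two layouts (the field names are made up): summing a single field touches every byte of every record in the AoS version, but only one densely packed array in the SoA version.

    #include <stddef.h>

    /* Array of Structures: one struct per record */
    struct OrderAoS {
        int    id;
        double price;
        char   note[48];            /* cold data dragged into cache anyway */
    };

    /* Structure of Arrays: one array per field */
    struct OrdersSoA {
        int    *id;
        double *price;
        char  (*note)[48];
        size_t  count;
    };

    double total_aos(const struct OrderAoS *o, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++)
            sum += o[i].price;      /* strided access, cache lines mostly wasted */
        return sum;
    }

    double total_soa(const struct OrdersSoA *o)
    {
        double sum = 0.0;
        for (size_t i = 0; i < o->count; i++)
            sum += o->price[i];     /* contiguous access, full cache-line use */
        return sum;
    }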


> This is why C/C++ is usually faster than managed languages: they can get close on generated code but can be off by a factor of 50x in memory access performance.

This is a common misconception that is a result of Java's current state. Control over memory layout and garbage collection are two completely orthogonal issues. At the moment, Java just happens to be both GCed and to afford little control over memory layout. Currently, the HotSpot team (HotSpot is the name of OpenJDK's JVM) is working on project Valhalla and project Panama, two efforts scheduled for Java 10 and intended to give the JVM all the memory layout control you need.


Not completely.

Most GCs want to be able to compact the heap, which means they lean in favor of reference-based models by default.

It's not exclusive, as you mention, but to my knowledge C# is the only language with explicit guarantees. I was aware of the effort around Java's value types; however, how long till we can reasonably see this in production?

You can do this in Java today, but it involves lots of nasty ByteBuffer manipulation (which FlatBuffers is fantastic for). You'll still pay for the conversions from bytes to the actual types you want.

If you're dealing with these kinds of performance problems, it's best to treat them with a language that's well suited to dealing with them. There's nothing wrong with using Java for higher-level business logic and delegating the heavy lifting to a stack that's designed for it.


> Most GCs want to be able to compact the heap, which means they lean in favor of reference-based models by default.

I don't see how this follows. Copying GCs do compact the heap, but when you use values you're basically saying "these two things go together". You're only making life easier for the GC (well, you're also creating larger objects, but that might just require a small adjustment of the GC strategy).

> how long till we can reasonably see this in production?

Four years probably... Still, that doesn't change the fact that layout and GCs are orthogonal.


Expanding on my point a bit: I'm not disputing that, in theory, GC and memory layout are orthogonal.

Just that in practice you don't see control over memory layout in managed languages. Since I'm usually not in the habit of building new languages to solve problems, it's something worth understanding when choosing a language.


D, Modula-3 and the Oberon family are examples of value types by default with a GC-enabled heap.

Eiffel also offers value types (aka expanded), but it is reference by default.


> This is why C/C++ is usually faster than managed languages, they can get close on generated code but can be off by a factor of 50x in memory access performance.

This only happens because the Pascal branch of languages with GC sadly failed in the mainstream (for several reasons).

Modula-3, Eiffel, Oberon, Oberon-2, Component Pascal all offer the same data access patterns and packing available to C and C++.


Many programming languages put boxes around the bits that the CPU actually operates on. These boxes often store things like a reference count and a type (or a dispatch table for object-oriented languages with single dispatch). The thing inside the box, the actual value, is called an atom.

Those boxes have a lot of overhead (in PHP it's around 140 bytes!), and because a major identified problem is the valuable cache space and CPU time wasted book-keeping these boxes, a lot of JIT and compiler research is about how to eliminate them.
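
To make the box/atom distinction concrete, here is an invented C sketch of what such a box might look like (not any particular VM's actual layout):

    #include <stdint.h>

    typedef enum { T_INT, T_DOUBLE, T_STRING } value_type;

    typedef struct box {
        uint32_t   refcount;   /* the book-keeping mentioned above */
        value_type type;       /* runtime type tag used for dispatch */
        union {
            int64_t i;         /* the atom: the bits the CPU operates on */
            double  d;
            char   *s;
        } payload;
    } box;

    /* A boxed array is pointers to boxes scattered around the heap;
     * an unboxed array is just the atoms, densely packed. */
    typedef box    **boxed_array;
    typedef int64_t *unboxed_array;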

You can see why they do this when considering a program like:

    $a = array();
    for($i = 0; $i < 100000; ++$i) $a[$i] = rand()*65536;
    sort($a);
which might take 200k (and fits easily in cache) in C/C++ but 15 megabytes (and doesn't) in PHP.

However.

Eliminating those boxes has proven to be very difficult, and while I have heard great things about Java 10, I am reserved because I also heard great things about Java 5. And Self. I do not expect JIT/compilers will get smart enough to actually solve this problem in my lifetime.

On the other hand, array programming languages give you the low space overhead of C/C++ in an interactive (and usually interpreted) language, so it happens that you don't have to choose between a nice interpreted programming language with introspection, automatic memory management, and so on, and high performance; it does mean you have to program the arrays instead of the atoms.

I do not understand how that source file is "OO business logic".


Take a look at Mike Acton's talks; he does a pretty good job of breaking down why OOP just isn't built for high performance (arranging data by logical association rather than by data access patterns).

Yeah, your metapoint is worthwhile: all the JIT in the world ain't gonna help you if you're cache-missing every damn fetch (hello Java).


> all the JIT in the world ain't gonna help you if you're cache-missing every damn fetch (hello Java).

Acton (sort of) has a talk about that too, except he speaks about traditional, ahead-of-time compilers. Tl;dr: because of how the vast majority of programmers arrange their data, CPUs spend the vast majority of their time stalling on memory. Therefore improvements in optimizing compilers can only marginally help the small fraction of running time actually spent exercising the ALUs. The solution is to devote more effort to changing our habits regarding data layouts.


Yup, shaving 5-10 cycles from compiler optimizations doesn't help you when you've got a deficit of 200+ cycles per cache miss.

Working with FPGAs and DRAM really makes the reasons behind this, and a lot of other "obscure" performance issues like pipelining, crystal clear.


OOP encourages this kind of memory layout, but I don't think it requires it. I don't think there's anything preventing, say, a JVM or a CLOS implementation from storing "hot" members/fields in compact homogeneous arrays and "cold" ones in more traditional heap-allocated structures, perhaps with the help of a few programmer annotations. Or a tracing VM cooperating with a copying GC to perform this kind of optimization on the fly. This seems like it could make an interesting CS research project, in the (unlikely) case that it isn't one already.


That is an interesting feature in Jon Blow's "Jai" language project. Despite being a C-like language, it sets you up to be able to move members between hot and cold storage easily without modifying existing code that references those members.

You can effectively define a struct as references to separate sub-structs in separate arrays. Member access syntax can be flattened so that it doesn't matter in the syntax which sub-struct a particular field is in. Realize that you aren't using a field as often as you expected? Move its declaration from the hot sub-struct to the cold one and recompile. That will move all instances of that field from the hot array to the cold array, but no other code needs to be edited.
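
A rough C approximation of the idea (names invented; Jai does this with language support rather than by hand): the record is split into hot and cold sub-structs kept in parallel arrays, so moving a field between them only changes which array an accessor reads.

    #include <stddef.h>

    struct entity_hot {          /* touched every frame */
        float x, y, z;
        float vx, vy, vz;
    };

    struct entity_cold {         /* rarely touched */
        char name[32];
        int  debug_flags;
    };

    struct entities {            /* parallel arrays indexed by the same id */
        struct entity_hot  *hot;
        struct entity_cold *cold;
        size_t              count;
    };

    /* Moving a field between entity_hot and entity_cold changes its storage
     * for every instance; callers like this one need no edits. */
    float speed_sq(const struct entities *e, size_t i)
    {
        const struct entity_hot *h = &e->hot[i];
        return h->vx * h->vx + h->vy * h->vy + h->vz * h->vz;
    }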


When I did "Algorithms and Data Structures" back in the mid-90s, our teacher considered the amount of memory and disk accesses in her evaluation of our solutions. Being lazy about finding optimal solutions meant worse evaluations.

How many do that to their students nowadays?


I'm grading a graduate level course in Algorithm Engineering, and my testing framework measures time, memory requirements (peak/total/number of allocations), cache behaviour (L1/L2/L3 misses, TLB misses) and other performance counters. It's very easy to do this using libPAPI and malloc instrumentation ;)


For some reason I read this as a Clojure transducer exercise.


He mentions that the demands of tracing JITs and current Intel CPUs are well aligned. I wonder, how well do they play with the instruction cache? Generating little chunks of code on-demand and then executing them seems to be a recipe for instruction cache misses?


First time I've seen someone blogging with GitHub issues and, surprisingly, it looks like it works well.

Markdown, tags, comments, plus an index (README.md) if you need one.


Also "follow" and "kudos", named "watch" and "star". Getting a feed of the blog requires filtering, though.


I think the article is making out trace compilers to be far more limited than they actually are.

> Consider these optimization tips: Avoid nested loops.

I don't think this is true at all. The article is arguing that because a trace only contains one loop, nested loops can't be trace-compiled.

As I understand it, nested loops are totally possible to trace-compile by having multiple traces that link to each other. See what Mike Pall wrote about LuaJIT (http://www.freelists.org/post/luajit/How-does-LuaJITs-trace-...):

"If the inner loop has a low iteration count, it'll be unrolled and inlined. For most loop nestings there'll be a single trace for the outer loop that goes 'around' the inner loop and joins its exit with its entry. Otherwise traces link to each other in linear pieces."

I would take what this article says with a heavy grain of salt. I don't think that tracing JITs are at all like "programming in a straight jacket."

Look at the LuaJIT performance optimization guide. There is nothing in there about avoiding nested loops, avoiding tiny loops, or avoiding loops with unpredictable iteration counts. It does however recommend reducing unpredictable branches: http://wiki.luajit.org/Numerical-Computing-Performance-Guide


> I think the article is making out trace compilers to be far more limited than they actually are.

Luke (the post's author) and others are building a high performance software Ethernet switch [1] using LuaJIT. Let me assure you that these considerations actually come from experience rather than speculation.

> As I understand it, nested loops are totally possible to trace-compile by having multiple traces that link to each other.

Sometimes it works, sometimes it doesn't. We can't really expect everybody using Snabb Switch to understand the internals of LuaJIT so restrictive guidelines are useful for us.

Disclaimer: Consulting for Snabb GmbH

[1]: https://github.com/snabbco/snabbswitch



> Let me assure you that these considerations actually come from experience rather than speculation.

Good to know. The article did not make this clear.


Two semi-tangential ideas:

I wish it was possible to do things along the lines of "fetch this data from RAM and at the same time start recomputing it, whichever is faster".

And also, I wish there was a "branch upcoming / branch execute" split. In other words, instead of just "here's a branch", it's more along the lines of "you will have to branch on <x>. <other instructions> now branch on <x>." Effectively a variable number of branch-delay slot(s). Wouldn't help for straight pointer-chasing, but in other situations it could.



