Zero-Overhead Metaprogramming (stefan-marr.de)
141 points by adamnemecek on June 18, 2015 | 22 comments



I saw this talk at FCRC/PLDI. It's both very impressive and very well-presented. What follows is what I gathered from the talk; I haven't read the paper, but it's on my reading list. The basic idea is that the old notion of a polymorphic inline cache can be nicely generalized to erase the overhead of metaprogramming features like method_missing or descriptors.

A PIC is a way to speed up method dispatch. Normally, if you call a method foo in your program but don't know the class of the object you're calling foo on, you need to look up the appropriate code to run in a virtual table or a similar data structure. This is slow, because you might do this lookup very often. A PIC caches this lookup in the code that calls foo—basically adding an instruction like

    if (obj.class == CACHED_CLASS) CACHED_CODE()
The reason this helps is that we can do the test quickly, and the branch predictor will predict it very well; then our JIT can inline CACHED_CODE, perhaps optimize it against the other code nearby, and now we have nearly-optimal machine code. After "warm-up", in other words, we've dynamically inferred the type of obj and have optimized our machine code to take that into account.
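To make the mechanism concrete, here is a minimal C++ sketch of a (monomorphic) inline cache at a single call site. Everything here is invented for illustration (the object model, lookup_method, the class ids); a real VM emits this check in generated machine code and patches the cache in place, rather than writing it in source:

    #include <cstdio>

    // Hypothetical object model: every object carries a class tag.
    struct Object { int class_id; };
    using Method = void (*)(Object*);

    void dog_speak(Object*) { std::puts("woof"); }
    void cat_speak(Object*) { std::puts("meow"); }

    // Stand-in for the slow vtable/dictionary lookup.
    Method lookup_method(int class_id) {
        return class_id == 0 ? dog_speak : cat_speak;
    }

    // One cache entry per call site: the inline cache itself.
    struct CallSite {
        int    cached_class  = -1;
        Method cached_method = nullptr;
    };

    void call_speak(Object* obj, CallSite& site) {
        if (obj->class_id == site.cached_class) {
            site.cached_method(obj);  // fast path: one well-predicted compare
        } else {
            site.cached_method = lookup_method(obj->class_id);  // slow path
            site.cached_class  = obj->class_id;
            site.cached_method(obj);
        }
    }

    int main() {
        CallSite site;
        Object dog{0}, cat{1};
        call_speak(&dog, site);  // miss: fills the cache
        call_speak(&dog, site);  // hit: compare, then a direct call
        call_speak(&cat, site);  // miss; a real PIC keeps several entries
    }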

Now on to meta-programming. Suppose that instead of calling a method on an object, we are instantiating a class object which is dynamically determined. A PIC isn't exactly applicable, because it's not like the type of the class object is in question (the class object's class is probably Class or something). Instead, this paper proposes caching the identity of the class object, then caching the initialization code and optimizing around it. You could do the same for things that override method_missing and similar: cache all the decisions those overrides are making, and then let the JIT inline the code.
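Continuing the invented sketch above (reusing its Object struct; ClassObject and make_instance are made-up names, not the paper's terminology), the guard now tests the identity of the class object itself, and the cached code is its constructor:

    // First-class class object: it carries the code to allocate/initialize.
    struct ClassObject {
        Object* (*construct)();
    };

    // One cache entry per instantiation site.
    struct AllocSite {
        const ClassObject* cached_class        = nullptr;
        Object*          (*cached_construct)() = nullptr;
    };

    Object* make_instance(const ClassObject* cls, AllocSite& site) {
        if (cls == site.cached_class)        // guard on object identity
            return site.cached_construct();  // the JIT can inline this
        site.cached_class     = cls;         // slow path: refill the cache
        site.cached_construct = cls->construct;
        return site.cached_construct();
    }

A method_missing override would get the same treatment: guard on whatever the override's decision depends on, cache the resulting target, and let the JIT inline it.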

Their evaluation suggests that this method works really well, eliminating almost all of the overhead from meta-programming. This isn't too surprising: we know that even dynamic features are usually used in very "static" ways, that JITs are really good at finding those patterns, and that after warming up, JITs can usually optimize away all that dynamism to get very fast code.

I am kicking myself for never having thought of this, despite having read several of the PIC papers. I am not a big believer in meta-programming, but this work is simple, extremely useful in practice, and also deepens our understanding of a technique both theoretically and practically.


I understand the PIC technique, but I'm having a hard time understanding how this is different from what a trace might do. That is, inline the impl of method_missing and then optimize the most common path with guards included. Or is it that I'm being too optimistic about the tracer -- and with a generalized PIC one isn't relying upon the tracer to get it right, but forcing a cache on meta-object protocol boundaries?


You are right: with a tracer, you get most of that for free. For a meta-object protocol, however, you'll still need a little bit of help to avoid creating too many guards. And the PICs, or 'dispatch chains', are very useful in the interpreter, which avoids enormous warmup penalties.


That's a good summary, thanks. Glad that all that came across :)


I appreciate your comment. Yeah, you can often tell how brilliant a solution is when you kick yourself for how obvious it sounds afterwards. I hope to see this stuff get widespread deployment.


I'm very curious what your workflow was for putting together this paper! I put together https://github.com/Hoverbear/acm-pandoc-paper for my own purposes, but there is still some work to be done.


Meta question - does anyone know if this was autogenerated from LaTeX, and if so, how?


Not 100%. I used a combination of tex4ht, tex4ht configuration, several hacky post-processing scripts, and manual editing. Converting a LaTeX document to a nice HTML site is easily the worst part of the whole writing process :-/


Looks great. Out of interest, what were the main things that tex4ht doesn't give you out of the box?


Clean HTML (no superfluous tags), proper links to references (the default rendering splits the link between author and year), ligatures that aren't broken in browsers, and footnotes that make sense in HTML (the standard ones don't). Also, the standard HTML used for headers, figures, and other elements wasn't to my liking.


One recommendation: use ragged-right rather than justified text in the HTML version. Because browsers use a line-by-line text composer instead of full-paragraph composition, and don’t do any hyphenation, fully justified text looks terrible on web pages, especially in relatively narrow columns.

In the case of this specific document, there’s all kinds of nasty gappy whitespace in the webpage version.

It’s made worse by the spaces in “et al. 19XX” getting turned into non-breaking spaces. You really want to fix that if you can. For example, look at the 3rd and 9th line in this screenshot. http://i.imgur.com/H6ROnyQ.png

But even disregarding that problem, many lines end up looking pretty bad.

[Note: don’t read this as personal criticism. The fault lies about 90% with the browser vendors who care more about saving a few microseconds of compute time than making legible documents.]


I never noticed this before; then I read your comment. Now I see it all over the place... oh man.



It probably was. It's difficult to tell, as it is extremely clean HTML, but the referencing seems to hint at BibTeX (a referencing system for LaTeX), with class names like "bibitem" and "biblabel".

> and if so how

The "how" is easy: there are many different programs; here are a few: https://enc.com.au/docs/latexhtml/. Which one in particular is a bit harder to tell: there is no <meta name="generator" content="... tag that some LaTeX->HTML converters leave in, but I suspect you could turn that feature off. I think this page uses custom CSS too, as it's using Google Fonts.


What about compile-time metaprogramming such as that found in D and C++? That's zero-overhead (discounting possible template bloat) and requires no special techniques at all. Well, besides knowing how to write compile-time code, that is.
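For instance, a minimal C++11 constexpr sketch, where the whole computation happens in the compiler and the binary just contains the constant:

    #include <cstdio>

    // Recursion is evaluated entirely at compile time here.
    constexpr long factorial(int n) {
        return n <= 1 ? 1 : n * factorial(n - 1);
    }

    int main() {
        constexpr long f10 = factorial(10);  // forced compile-time evaluation
        static_assert(f10 == 3628800, "checked by the compiler");
        std::printf("%ld\n", f10);
    }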


It is my understanding that Graal -- one of the two JITs used in the paper -- will be available for use with "stock OpenJDK HotSpot" as a plugin (as opposed to a special HotSpot build), starting in Java 9. At least, that's the plan[1]. Is that correct?

[1]: http://openjdk.java.net/jeps/243


Off topic, but does anyone know how the author generated this web version of the paper? It looks as though they've automatically generated it from the LaTeX source of the PDF. It looks brilliant, and it's much easier to read in the browser.


Haven't read it fully, but the first thing that popped into my head: wouldn't Apple benefit from this (Objective-C)? There was one very detailed article on how their object method lookup changed over the years, but I can't find it right now.


I'm not an expert in Objective-C, but I believe that all dispatch is done by a library call that accepts the receiver, the message name, and the arguments. This means that there isn't any place to store per-call-site data, such as the kind of caches used in this paper.
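For reference, that call is objc_msgSend from the Objective-C runtime; a rough sketch of its shape (signature simplified, details from memory):

    // Every Objective-C message send lowers to one C entry point,
    // declared in <objc/message.h> roughly as:
    //     id objc_msgSend(id self, SEL _cmd, ...);
    // so a send like [obj foo:x] compiles to approximately:
    //     objc_msgSend(obj, @selector(foo:), x);
    // The runtime's method cache hangs off the class, not the call site,
    // so there is no natural home for a per-call-site PIC.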


Why does the partial evaluation approach result in larger performance variance than meta-tracing in the reported results?


Yes, that's indeed a non-obvious issue, and it looks rather strange on the graph.

That's not the 'partial evaluation' per se. Instead, it is the difference between RPython and HotSpot that surfaces here. RPython currently generates single-threaded VMs, where neither the GC nor compilation runs in a separate thread. The HotSpot JVM, however, can do those things in parallel, and additionally has other infrastructure threads running. In the end, this increases the likelihood that the OS reschedules the application thread, which becomes visible as jitter.


Oh, makes sense :) Is that mentioned in the paper and I missed it? If not, it might deserve a footnote, as the difference is glaring.

BTW, would Graal really become available as a stock-HotSpot plugin in Java 9 thanks to JEP 243? I see things are ready on Graal's end[1], but are they on HotSpot's?

[1]: https://bugs.openjdk.java.net/browse/GRAAL-49



