Writing memory safe JIT compilers (medium.com/graalvm)
163 points by vips7L 3 months ago | 48 comments



I am a big fan of automatically generating optimizing VMs from interpreters, like Graal (in this article) and also PyPy and others do. But the other approach, of writing a custom JIT for each language, seems much more flexible even if it is more dangerous and time-consuming.

On small benchmarks I can believe the performance can be very similar between the two approaches, but I'd be more interested in large real-world codebases. I'm not aware of any such comparisons myself, and can't seem to find any. Does anyone know?


> and also PyPy

this is a perpetually repeated misconception:

> Why did we Abandon Partial Evaluation?

https://www.pypy.org/posts/2018/09/the-first-15-years-of-pyp...

> others do

to my knowledge graal is the only production project using futamura projections (what you're talking about)


PyPy has experimented with multiple approaches here, yes, but as far as I know they did not abandon the overall approach of automatically converting an interpreter to an optimizing VM, which is what I mentioned. That's more general than partial evaluation (which is just one way to do such a conversion).


> automatically converting an interpreter to an optimizing VM

but that is not what PyPy does

> So, how did that tracing JIT generator work? A tracing JIT generates code by observing and logging the execution of the running program. This yields a straight-line trace of operations, which are then optimized and compiled into machine code. Of course most tracing systems mostly focus on tracing loops.

> As we discovered, it's actually quite simple to apply a tracing JIT to a generic interpreter, by not tracing the execution of the user program directly, but by instead tracing the execution of the interpreter while it is running the user program (here's the paper we wrote about this approach).

So it's just a tracing JIT but applied to the interpreter. One way to put it: it's effectively the same benefit as just running jython to begin with.


The fact that they had to make up a new term for the technique (a "meta-tracing JIT") should be a hint that what they're doing is somewhat novel/unusual.

If you read the paper[1] linked in your quote, you'd see that it is not "just a tracing JIT"; the interpreter being run under the JIT has some special hooks that let it tell the JIT e.g. where the program counter is:

> Since the tracing JIT cannot know which parts of the language interpreter are the program counter, the author of the language interpreter needs to mark the relevant variables of the language interpreter with the help of a hint. The tracing interpreter will then effectively add the values of these variables to the position key. This means that the loop will only be considered to be closed if these variables that are making up the program counter at the language interpreter level are the same a second time. Loops found in this way are, by definition, user loops.

This is vastly distinct from how Jython works.

[1]: https://foss.heptapod.net/pypy/extradoc/-/blob/branch/extrad...


> The fact that they had to make up a new term for the technique (a "meta-tracing JIT") should be a hint that what they're doing is somewhat novel/unusual.

you say this and then quote directly what the novelty is, so i ask you: does that piece warrant a whole new term?

> This is vastly distinct from how Jython works.

jython isn't doing anything - it's a python interpreter that runs on the jvm. my point was that that's the same thing: an interpreter for a language that itself is being jitted.


> does that piece warrant a whole new term?

In my opinion: Yes, definitely! Without meta-tracing techniques, a JIT'd interpreter can only hope to be on par with a compiled interpreter, never significantly faster. It will never go beyond the limitations of an interpreter, because the JIT can't "see" the user code that the interpreter is running.


I'm not familiar enough with Jython, but I do consider applying a tracing JIT to an interpreter as a way to automatically convert an interpreter to an optimizing VM.

I suppose if the tracing JIT were very complex and very tailored to a single language that might not make sense, but my impression of PyPy is that the opposite is true, and PyPy can in fact run languages other than Python.


> but I do consider applying a tracing JIT to an interpreter as a way to automatically convert an interpreter to an optimizing VM.

there's nothing automatic about it. the commenter above quoted from their paper - the user (the person writing the interpreter) must annotate their code to pass hints to the tracing jit.


It is still largely automatic. Some special annotations are needed in all these related approaches. For example, in this article, @CompilationFinal is used in Graal interpreters.

(It is possible more annotations are needed in PyPy than Graal, but still, an annotated interpreter is far, far simpler than writing an optimizing JIT!)
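To give a flavour of how small those hints are, here's a minimal sketch of a Truffle node (the class name and caching logic are invented purely for illustration; @CompilationFinal and transferToInterpreterAndInvalidate() are the real directives, assuming a current Truffle API):

    import com.oracle.truffle.api.CompilerDirectives;
    import com.oracle.truffle.api.CompilerDirectives.CompilationFinal;
    import com.oracle.truffle.api.nodes.Node;

    // Hypothetical node: caches the first operand value it ever sees.
    // @CompilationFinal lets the partial evaluator treat the field as a
    // constant, so the branch folds away in compiled code.
    public final class CachedValueNode extends Node {

        @CompilationFinal private boolean seeded;
        @CompilationFinal private long cachedValue;

        public long execute(long value) {
            if (!seeded) {
                // Jump back to the interpreter and invalidate any compiled
                // code before mutating a compilation-final field.
                CompilerDirectives.transferToInterpreterAndInvalidate();
                cachedValue = value;
                seeded = true;
            }
            return cachedValue;
        }
    }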


Futamura projection refresher:

https://en.wikipedia.org/wiki/Partial_evaluation

Badass future space tech.


Does it involve robots that need to drink beer to function?


Definitely.. :)


> But the other approach, of writing a custom JIT for each language, seems much more flexible even if it is more dangerous and time-consuming.

This is also a good illustration of how we ended up with the NULL problem. I don't think it's as big of a deal in this case as interpreters/vms/compilers are designed to be fungible in ways that source code was not, but it's something worth thinking on.


The first two times, my brain wanted to read Futamura as Futurama. Silly me.


I may have made that verbal slip while giving a talk on Truffle.


I have worked on Truffle for more than 10 years, and I recently wrote a comment on hackernews using Futurama instead of Futamura. That comment had it wrong twice.


That's the way I have been saying it in my head this whole time! Think you have enough weight with the team to get them to officially change their terminology?


Feel free to drop Dr. Futamura an email: https://fi.ftmr.info/

If he says yes to changing his name you have my full support.


I am so glad I'm not the only one whose brain pulls such pranks. Thank you!


Let’s face it, there are worse typos / verbal slips we could have made.


Futamura projections are super cool, reminds me of another post recently https://news.ycombinator.com/item?id=40406194


This stuff is way over my head but if I understand correctly, PyPy is the most famous implementation of all three Futamura projections to JIT a subset of Python: https://gist.github.com/tomykaira/3159910


I don't think they ever went beyond the first projection (has anyone?) and they don't seem to use Futamura projection anymore.



Of course, this just moves the safety question from a javascript-specific JIT to the Truffle JIT compiler and the partial evaluator. This can have some benefits (only one JIT to improve/fix across many languages), but can still have safety bugs.

And the big tradeoff is that the general JIT may be less capable of doing language-specific optimizations (indeed such optimizations have a chance to introduce bugs as the linked V8 blog shows, but they also can be correct and significantly improve perf in cases where the general JIT doesn't have the necessary info to do it itself).


That's true. However, the underlying JIT compiler only has to compile Java bytecode correctly. Java is a relatively simple and regular language for a compiler to digest. It also helps that the Graal/Truffle compilers are themselves written in Java. It's memory safe all the way down*, so the only remaining safety problems are in the logical correctness of the optimizations for Java. Which, sure, can still be incorrect, but as you observe, that's a much smaller surface area to defend and you only have to get it right once.

Also the Graal team do some pretty advanced stuff to find security problems in the optimizations, in particular:

1. Lots of fuzzing.

2. Comparisons between Graal's output and the output of C2, which is a totally independent codebase. So you've got two different compilers, and if they produce very different machine code and it's not known to be an expected difference, that is used as a trigger to investigate things.

There are also small amounts of unsafe code in Truffle where checks are bypassed for speed, because it can be proven to be safe.

So overall it's a big win even if it can't eliminate 100% of all safety problems. This is similar to how Rust works, where unsafe stuff is done but in clearly defined sections that are easy to locate and audit, and the bulk of the unsafe code that most apps need is in the standard library.

BTW you can implement language-specific optimizations in Truffle. It couldn't be competitive with V8 if that wasn't possible. For example, dynamically typed scripting languages often need an optimization called object shapes. That's a part of the Truffle framework, so all scripting langs can benefit from it. It's irrelevant for Java-like langs though.
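Roughly, a sketch of what that looks like with the Truffle object model (GuestObject and the demo are made up; Shape and DynamicObjectLibrary are the framework pieces I mean, assuming a recent Truffle version):

    import com.oracle.truffle.api.object.DynamicObject;
    import com.oracle.truffle.api.object.DynamicObjectLibrary;
    import com.oracle.truffle.api.object.Shape;

    // Hypothetical guest-language object. A real implementation would add
    // language-specific fields and interop behaviour.
    final class GuestObject extends DynamicObject {
        GuestObject(Shape shape) {
            super(shape);
        }
    }

    final class ShapesDemo {
        public static void main(String[] args) {
            // Objects created from the same root shape transition through
            // shared shapes as properties are added, so compiled property
            // reads reduce to a cheap shape check plus a fixed offset load.
            Shape rootShape = Shape.newBuilder().build();
            DynamicObjectLibrary objects = DynamicObjectLibrary.getUncached();

            GuestObject point = new GuestObject(rootShape);
            objects.put(point, "x", 1);
            objects.put(point, "y", 2);

            System.out.println(objects.getOrDefault(point, "x", null)); // 1
        }
    }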

Disclosure: I wrote the article.

*exception, the GC


Is GraalVM actually competitive with V8?


It depends. In my experience GraalVM is generally slower to start, but after a few iterations it can be as fast or faster, and the same effect is amplified if you use the JVM instead of AOT (i.e. it's even slower to start, but eventually it can be much faster).

Of course this all depends on your specific code and use cases.


Truffle/Graal is also able to do some insanely cool things related to optimizing across FFI boundaries: if you have a Java program that uses the Truffle javascript engine for scripting, Truffle is able to do JIT optimization transparently across the FFI boundary so it has 0 overhead. IIRC they even have some special API to allow a Truffle->native C library to be exposed to the runtime in a way that allows it to optimize away a lot of FFI overhead or inline the native function into the trace. They were advertising this "Polyglot VM" functionality a lot a few years ago, although now their marketing mostly focuses on the NativeImage part (which helps a lot with the slow startup you mention).

TruffleRuby even had the extremely big brain idea of running a Truffle interpreter for C for native extensions instead of actually compiling them to native code, just so that Truffle can optimize transparently across FFI boundaries. https://chrisseaton.com/truffleruby/cext/


I don't have anything to contribute to the Truffle discussion, but for those not familiar: Chris Seaton was an active participant on Hacker News, until his tragic death in late 2022. Wish he was still with us.

https://news.ycombinator.com/threads?id=chrisseaton

https://news.ycombinator.com/item?id=33893120


Yes, it's extremely sad :( He was a giant in the Ruby and Truffle communities, and TruffleRuby was a monumental work for both projects.


> TruffleRuby even had the extremely big brain idea of running a Truffle interpreter for C for native extensions […]

TruffleC was a research project and the first attempt at running C code on Truffle that I'm aware of. It directly interpreted C source code and while that works for small self-contained programs, you quickly run into a lot of problems as soon as you want to run larger real-world programs. You need everything including the C library available as pure C code and you have to deal with the fact that a lot of C code uses some UB/IB. In addition, your C parser has to fully adhere to the C standard, and once you want to support C++ too, because a lot of code is written in C++, you have to start over from scratch. I don't know if TruffleC was ever released as open source.

The next / current attempt is Sulong, which uses LLVM to compile C/C++/Rust/… to LLVM IR ("bitcode") and then directly interprets that bitcode. It's a lot better, because you don't have to write your own complete C/C++/… parser/compiler, but bitcode still has various limitations. Essentially as soon as the program uses handwritten assembler code somewhere, or if it does some low-level things like setjmp/longjmp, things get hairy pretty quickly. Bitcode itself is also platform-dependent (think of constants/macros/… that get expanded during compilation), you still need all code / libraries in bitcode, every language uses an ever so slightly different set of IR nodes and requires a different runtime library so you have to explicitly support them, and even then you can't make it fully memory safe because typical programs will just break. In addition, the optimization level you choose when compiling the source program can result in very different bitcode with very different IR nodes, some of which were not supported for a long time (e.g., everything related to vectorization). Sulong can load libraries and expose them via the Truffle FFI, and it can be used for C extensions in GraalPython and TruffleRuby AFAIK. It's open source [1] and part of GraalVM, so you can play around with it.

Another research project was then to directly interpret AMD64 machine code and emulate a Linux userspace environment, because that would solve all the problems with inline assembly and language compatibility. Although that works, it has an entirely different set of problems: Graal/Truffle is simply not made for this type of code and as a result the performance is significantly worse than Sulong. You also end up re-implementing the Linux syscall interface in your interpreter, you have to deal with all the low level memory features that are available on Linux like mmap/mprotect/... and they have to behave exactly as on a real Linux system, and you can't easily export subroutines via Truffle FFI in a way that they also work with foreign language objects. It does work with various guest languages like C/C++/Rust/Go/… without modifying the interpreter, as long as the program is available as native Linux/AMD64 executable and doesn't use any of the unimplemented features. This project is also available as open source [2], but its focus somewhat shifted to using the interpreter for execution trace based program analysis.

Things that aren't supported by any of these projects AFAIK are full support for multithreading and multiprocessing, full support for IPC, and so on. Sulong partially solves it by calling into the native C library loaded in the VM for subroutines that aren't available as bitcode and aborting on certain unsupported calls like fork/clone, but then you obviously lose the advantage of having everything in the interpreter.

The conclusion is, however you try to interpret C/C++/… code, get ready for a world of pain and incompatibilities if you intend to run real-world programs.

[1] https://github.com/oracle/graal/tree/master/sulong

[2] https://github.com/pekd/tracer/tree/master/vmx86


> generally GraalVM is slower to start, but after a few iterations it can be as fast or faster

That's even true of the standard HotSpot-backed JVM. I've rewritten programs from Java to JS, which made them faster: because they're short-lived, the JVM's slow startup chews through the budget and is never offset by any of the theoretical speedups that the JVM otherwise promises.


> In my experience generally GraalVM is slower to start, but after a few iterations it can be as fast or faster

Probably having to do with the JVM being optimized for long-running server processes.

>the same results can be amplified if you use the JVM instead of AOT (i.e. it's even slower to start, but eventually it can be much faster.)

OpenJ9 has a caching JIT server that (theoretically) would work around this.


I’m assuming that's a benchmark that runs a lot of hot code? I wonder why the startup is slower, if it's based on an interpreter and all current VMs start that way too.


It's because:

• GraalJS is today based on an AST interpreter, which is in general less efficient than a bytecode interpreter (see the toy sketch below). V8 starts by compiling JavaScript to an internal bytecode, interpreting that, then JIT compiling the hot spots. It's the same architecture as the JVM except that the bytecode isn't considered to be a stable or documented format.

• The Truffle JIT compiler is slower than V8's because partial evaluation adds overhead. Slower compiler = more time spent in the interpreter = slower warmup.

• V8 is heavily optimized for great startup time because web pages typically don't live very long.

The first problem is being tackled by adding a bytecode interpreter infrastructure to Truffle itself. In other words, the Truffle library will invent a bytecode format for your language, write the bytecode interpreter for it, then partially evaluate that to create the JIT compiler! It can also handle stuff like persisting the bytecode to disk, like .pyc files or .class files do. Moving all the stuff needed to implement fast languages into the framework is very much the Truffle way.
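As a toy illustration of why that first point matters (plain Java, invented opcodes, nothing Truffle-specific): an AST interpreter pays an object allocation and a virtual call per node, while a bytecode interpreter is one flat dispatch loop over an array, which is cheaper to produce and to run cold.

    // AST style: every operation is a heap-allocated node and a virtual call.
    interface Expr { long eval(); }
    record Lit(long value) implements Expr { public long eval() { return value; } }
    record Add(Expr left, Expr right) implements Expr {
        public long eval() { return left.eval() + right.eval(); }
    }

    // Bytecode style: a flat array of opcodes and one tight dispatch loop.
    final class ToyBytecode {
        static final int PUSH = 0, ADD = 1, HALT = 2;

        static long run(int[] code) {
            long[] stack = new long[16];
            int sp = 0;
            for (int pc = 0; ; ) {
                switch (code[pc++]) {
                    case PUSH -> stack[sp++] = code[pc++];
                    case ADD  -> { sp--; stack[sp - 1] += stack[sp]; }
                    case HALT -> { return stack[sp - 1]; }
                }
            }
        }

        public static void main(String[] args) {
            System.out.println(new Add(new Lit(2), new Lit(3)).eval());      // 5
            System.out.println(run(new int[]{PUSH, 2, PUSH, 3, ADD, HALT})); // 5
        }
    }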

The second problem is harder to solve. There is supposedly a thing called (I think) the second Futamura projection, where you partially evaluate the partial evaluator, but IIRC that's very hard to actually implement and for server-side use cases, less important.


GraalVM's interpreter, AST, etc. are also written in Java, so there is a warm-up for these as well. Also, the memory representation has to be somewhat generic, and can't be over-optimized for JavaScript specifically. Add to that that JS engines have multiple tiers of interpreters and JIT compilers (I believe there are 3 at the moment?), all made specifically for JS, so it's not really possible to compete with that here while remaining general.


I haven’t tried the JS stuff in GraalVM, but I have tried it for Java. It’s often faster than the regular JVM, especially good at escape analysis.


You mean with the Graal JIT instead of C2? AFAIK Graal's Truffle (Espresso) implementation of Java is still far behind HotSpot.


Yeah. The Graal JIT. I haven’t tried Espresso, but now I got curious…

For anyone else who got curious: https://www.graalvm.org/latest/reference-manual/java-on-truf...


Their perf claim is based on ancient benchmarks. Looks like the Octane suite. I bet they also made sure to ignore startup overheads.

I’d only believe their perf claims if they used a more modern benchmark suite.


It shows that it's /competitive/ despite the authors having only written an interpreter (which is seriously impressive). Further optimizations are possible, as evidenced by newer benchmarks and further improvements to Chrome. But that required a lot more time and effort and funding from Google.

Oracle's profit motive is (AFAIK) enabling polyglot scripting inside their database. But Google has a profit motive of not having to pay Firefox or Safari a bigger chunk of the search engine ad profits.

So yes, I agree that the outdated benchmarks are a bit fishy. But if there was enough of a market I have no doubt that the Truffle/Graal devs would know how to put more funding to good use. There are alternative JVM runtimes that optimize for latency or start-up times. The Oracle JVM is optimized for their core market (long running server processes).


Memory safety does come with a price. It was ever thus.

It's competitive in some respects. It's not used in browsers, which have different code patterns to server-side JS, so it's probably not competitive there (unless you want a security upgrade more than top performance); that's due mainly to the slower warmup time, which really hurts in the web environment where code gets discarded regularly.

Also, V8 probably has better memory usage. I haven't checked.

The warmup time is being worked on but these are research problems, and the V8 team is much larger than the GraalJS team is.


> Memory safety does come with a price. It was ever thus.

The memory safety claim is about as fishy as the benchmarks.

I'd trust the JIT hardening of a modern JS engine over whatever hardening has happened in the OpenJDK any day of the week. In other words: if the JS engine was built on top of Truffle, then the exploit technique would be all about finding bugs in Truffle/Graal/OpenJDK. And I bet that's easier than finding bugs in V8 or any other JS engine, just because the JS engines have been fighting the security fire with extreme prejudice due to their security exposure and I doubt that the OpenJDK crew has had to since it's not as worth it to attack them. And if it was as worth it to attack them, then the complexity of the OpenJDK would make it an easier target to attack and a harder target to meaningfully lock down.


Related: JITs use different strategies to assemble and execute code:

- (Under W^X) User-land W->X transmutation support, e.g. pkey_mprotect(addr, len, PROT_EXEC | PROT_READ, pkey) against an aligned_alloc() region on Linux

- (Slower) Produce assembly and compile, or write an executable or library, and dlopen() it in an existing process or run it in a separate process and address space

- (Faster, least secure, not recommended) Write and execute against RWX pages directly


Is there a difference in the amount of memory used between the two methods?


No surprise this is coming from the GraalVM folks. Java (as Oak) was originally created to solve not WORA, but the memory safety problem.



