The Road to the JIT

joisig · on Dec 1, 2020

I listened to a very interesting podcast episode [1] on the Thinking Elixir Podcast on this subject recently, an interview with John Högberg and Lukas Larsson. Not covered in this blog post but discussed in some detail in the podcast is that because you get standard debugging symbols from the JIT, you will now be able to use gdb and prof and similar tools when working with the BEAM.

[1] https://podcasts.google.com/?feed=aHR0cHM6Ly90aGlua2luZ2VsaX...

liveoneggs · on Dec 2, 2020

     JAM (Joe’s Abstract Machine)

     In 1989 JAM (Joe’s Abstract Machine) was first implemented. Mike Williams wrote the runtime system in C, Joe Armstrong wrote the compiler, and Robert Virding wrote the libraries.

"Joe Armstrong" is a link to github.com/joearms -> " joearms has no activity yet for this period. "

Now I'm sad :(

yawaramin · on Dec 2, 2020

Look at it this way, he's in the Arctic Code Vault. His name and work will survive into the next millennium.

yawaramin · on Dec 2, 2020

A super well-written and easy-to-read post. It takes real skill and craft to communicate this well.

namelosw · on Dec 2, 2020

I'm actually very interested in the Prolog prototype - especially on how it's done and how do they iterate the design. Does anyone have any information about it?

icandoit · on Dec 2, 2020

I found this link to Armstrong 2003 thesis: https://web.archive.org/web/20041204143417/http://www.sics.s...

found that link here:

https://thenewstack.io/why-erlang-joe-armstrongs-legacy-of-f...

It's not working for me right now, but you may have better luck.

namelosw · on Dec 3, 2020

Thank you so much! Although I skimmed through it didn't say too much about the Prolog interpreter, it's very educative.

ngrilly · on Dec 1, 2020

Erlang (the language, the runtime and the OTP framework) is a fascinating alternative in the design space for programming languages. Always happy to read about it, even if I don't use it for my projects.

the_only_law · on Dec 2, 2020

Yes Erlang is one of the few languages I instantly liked upon reading about it and playing with it, even outside its typically parroted features it has a ton of stuff I wish was available in more languages (working with bit oriented protocols is so much more pleasant in Erlang than most languages thanks to bitstrings). However, as other have mentioned it’s not exactly a go to for general purpose stuff and even when it seems a perfect fit, I often have to look elsewhere for a variety of reasons.

brightball · on Dec 1, 2020

Agreed. After reading enough about it I’ve become convinced that there are only 2 different platforms in the world. OTP/BEAM and everything else.

Every other language that I’ve studied seems to just cope with the same problems in slightly different ways.

macintux · on Dec 1, 2020

Erlang/OTP/BEAM are so unusual within the computing world: highly opinionated components that, thanks to their constraints, are keenly complementary and optimized for each other.

No kitchen sink language here: play by their rules or choose a different environment.

idolaspecus · on Dec 2, 2020

Contrarily, this is exactly what makes Unix environments so powerful (and wonderful). The "Unix Philosophy" is typically considered unopinionated (in the "let the user choose" sense), but I'd say that it's more like it has a few very strong yet well-chosen opinions that produce a highly integrated universe where all sorts of different tools just work.

tiffanyh · on Dec 1, 2020

> "Dialyzer was only about 10 percent slower with the JIT than with HiPE"

This is the first time I've seen AsmJIT compared to HiPE. Interesting.

brokencode · on Dec 1, 2020

Just want to point out that later in the paragraph they said AsmJIT now roughly matches the performance of HiPE for Dialyzer.

kasperni · on Dec 1, 2020

Seeing how many billions have been posted into JIT runtimes such as CLR/JVM/V8. I'm questioning whether or not it is realistic to create a production ready JIT as a side project?

toast0 · on Dec 1, 2020

It's important to note that the goals of this BEAM JIT are a lot more modest than those other JITs.

There's no goal of heroic optimization; the optimization goal is really just to remove the overhead of interpretation.

This removes the need to apply the JIT only to some code, because it's fairly simple, it's fast enough to apply when the code is loaded, and so all code is JITed to native as it's loaded. Or you're on an unsupported platform and all code is interpretted.

Because it's all or nothing, testing the OTP release should uncover any JIT bugs; you won't have the hard to track bugs sometimes seen in other systems where a function's correctness depends on whether or not it was JITed and that depends on runtime state. That won't mean no JIT bugs, of course, but they should be easier to track down.

whizzter · on Dec 1, 2020

It's really a situation of the 90/10 rule, you'll get the first 90% of the benefits from 10% of the work and then it's many many incremental changes that gets progressively more expensive.

Also in the case of Erlang the language model will reward these first steps even more than most subsequent optimizations because much Erlang code in general don't have much tight loops of the kind that are prominent in benchmarks were a good register allocator would provide huge wins.

More on the 90/10 rule we need to remember that these expensive JIT's are very complicated with optimization levels and interpreters combined with tons of GC options whereas here they explicitly just dumped JIT-interpreter cross-calling to simplify the design as well as a more straightforward internal memory model with less complicated edge cases.

fiddlerwoaroof · on Dec 1, 2020

It sounds to me that “JIT” here just means “not a batch compiler”? This seems to me to be more like the way Common Lisp compilers work than what I think of a JITs

toast0 · on Dec 1, 2020

I'm not super familiar with LISPs, but what the JIT is doing here is transforming bytecode into native code, which is what other JITs do.

What it's not doing, but is commonly done in other JITs, is any sort of runtime profiling and chosing of which modules to transforming; all modules are transformed when loaded.

As described in the article, previous JIT attempts with BEAM did have that functionality, and they did meet the project goals; profiling cost too much, the compilation step was too expensive, and mixing modes between interpreted modules and native modules added too much complexity.

I haven't looked at the code behind this, but from articles, I haven't seen anything that would preclude running the bytecode to native code transformation ahead of time (or caching the just-in-time transformation for future use), but it's not part of the implementation as of now.

fiddlerwoaroof · on Dec 2, 2020

Reading through the article, it sounds similar to what most lisp compilers do: read source files into a data-structure, transform that structure in various ways to some kind of IR and then generate machine code from that IR. They generally do this the first time a file is loaded and generate a “fast load” file to speed up later attempts to load the file.

This sounds pretty similar, except there’s no fast load file.

fiddlerwoaroof · on Dec 1, 2020

I guess a difference is that Lisp implementations generally cache the generated code between runs.

hinkley · on Dec 1, 2020

The BEAM is mostly concerned with correctness and horizontal scalability. I think a little vertical scalability from a JIT raises the throughout of the system without really changing how it works or what you use it for. If anything, maybe you write less stuff in native code.

A great deal of research has been published in the last 25 years, and some of it invalidates earlier wisdom due to changes in processor design. Just following this trail and applying the 80/20 rule could get a lot done for a little effort. And a simple JIT has half a prayer of being correct.

bjoli · on Dec 1, 2020

A tracing JIT is no small feat. You could go simpler and have a template JIT, which has a lot less complexity. Guile went with that, and the speedup from the already quite fast 2.2 to 3.0 was significant.

Andy has been working at Igalia with the JS engines of chrome and Firefox, which makes me believe it might not be easily reached by mere mortals, but looking at the source it is quite easy to follow, even though I would not trust myself to make any significant changes.

artemonster · on Dec 1, 2020

LuaJIT?

MaxBarraclough · on Dec 1, 2020

You beat me to it.

If you're aiming to compete with HotSpot and .Net then you'll need to invest millions, but not all JITs are this ambitious. GNU Lightning is another example of a JIT with few people behind it.

brazzy · on Dec 1, 2020

What's "production ready"?

Works, is stable and faster than an interpreter? Sure, that is achievable for a motivated and skilled developer working on their own. At least for a reasonably simple language, maybe not for a beast like C++.

Competitive with one of those bigcorp-funded ones? Nope.

azhenley · on Dec 1, 2020

Putting multiple Erlang functions into a single C function to avoid using C’s stack sounds awesome but also horrifying.

shiny · on Dec 1, 2020

To anyone just reading comments, this is referring to the old BEAM/C compiler, not the new JIT.

chrisseaton · on Dec 1, 2020

> Putting multiple Erlang functions into a single C function to avoid using C’s stack sounds awesome but also horrifying.

Why's that horrifying? Isn't inlining a super-basic and well-understood optimisation that any serious compiler would be doing?

azhenley · on Dec 1, 2020

No, this isn't inlining. They're managing their own call stack and relying on undefined behavior. See the quote from the article:

> BEAM/C generated a single C function for each Erlang module. Local calls within the module were made by explicitly pushing the return address to the Erlang stack followed by a goto to the label of the called function. (Strictly speaking, the calling function stores the return address to BEAM register and the called function pushes that register to the stack.)

> Calls to other modules were done similarly by using the GCC extension that makes it possible to take the address of a label and later jumping to it. Thus an external call was made by pushing the return address to the stack followed by a goto to the address of a label in another C function.

> Isn’t that undefined behavior?

> Yes, it is undefined behavior even in GCC.

dnautics · on Dec 2, 2020

over in zigland in private we're saying that one of the endgoals of zig should be to reimplement the BEAM. One of the most active users of zig is doing some really neat things with task-stealing, allocation-free, lock-free schedulers that are a lot like the BEAM's techniques, except super performant: (watch this for till the benchmarks demos ~ finishing 1h:17min https://www.youtube.com/watch?v=UaF6-5BmX2I&t=1h9m)

jpcooper · on Dec 1, 2020

Why is it undefined behaviour?

IainIreland · on Dec 1, 2020

You aren't allowed to goto a label in a different function.

From the C standard:

> The identifier in a goto statement shall name a label located somewhere in the enclosing function.

`Shall` is a term of art here:

> If a "shall" or "shall not" requirement that appears outside of a constraint is violated, the behavior is undefined.

(Sections 6.8.6.1.1 and 4.2, respectively: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf)

jhoechtl · on Dec 2, 2020

It's undefined in the C standard. That doesn't mean its insecure or undefined as seen from the machine code. nothin is hilding you back to do your own code generation and clearly define this behaviour.

MaxBarraclough · on Dec 2, 2020

They weren't doing their own code-generation, BEAM/C treated C as the target language, and then relied on a standard C compiler. From the article:

> it is undefined behavior even in GCC. It happened to work with GCC on Sparc, but not on GCC for X86.

jpcooper · on Dec 1, 2020

Appreciated.

chrisseaton · on Dec 1, 2020

Ah sorry I see.

jolux · on Dec 1, 2020

Is there currently a plan to move towards making the JIT the default implementation?

di4na · on Dec 1, 2020

Yes, it will be the default on all supported platform as of OTP 24

didibus · on Dec 1, 2020

Oh, so is Erlang Beam VM code currently only running interpreted?

dnautics · on Dec 1, 2020

it's interpreted bytecode. The bytecode is a highly optimized constrained language that looks a whole lot like machine language. It would not be unreasonable to build hardware that runs it, if you wanted to. I guess it would not be unreasonable to say, it's interpreted in the same way that running arm-compiled code on qemu on an x86 is interpreted.

dasyatidprime · on Dec 1, 2020

QEMU in fact does dynamic translation, which is more like how JIT works in bytecode languages; this is why it's much faster than some earlier machine emulator software. (I think e.g. Bochs didn't have dynamic translation back when QEMU was coming into play, but the Web suggests that there's at least discussion around that feature so maybe that's not true anymore.)

The BEAM is currently a bytecode interpreter. The article describes it as using threaded code, which if you look at the C source is true in a sense on platforms where the C compiler provides the GNU-style extension of being able to take the address of labels for dynamic `goto` (https://stenmans.org/happi_blog/?p=194 agrees with my skim of `beam_emu.c`)—though if that's the only use as it seems, I might find that an edge case of the term, since I associate “threaded code” with the style used in Forth where “user-level” subroutines can be pointed to directly. If computed `goto` isn't available, then it uses a conventional switch/case loop, repeatedly dispatching the result of fetching the next bytecode instruction pointer value.

dnautics · on Dec 1, 2020

Ah sorry, indeed. Bochs is a better analogy.

Though that brings up a question... I wonder if you could write a qemu backend to the BEAM, and start working towards dynamic translation that way?

dasyatidprime · on Dec 2, 2020

I haven't looked at QEMU's TCG much, but I doubt it would be a good choice for bytecode formats even if you could bash it into doing something useful as part of a backend. I'd expect it to be optimized for hardware-CPU-alike machine instructions, whereas BEAM is Bogdan's Erlang Abstract Machine. See https://blog.erlang.org/a-brief-BEAM-primer/ which points to https://github.com/erlang/otp/blob/master/lib/compiler/src/g.... If you look into how some of those are implemented, just as one example, the integer arithmetic operations handle automatic promotion to bignums—not something you'd find on a conventional CPU, but not uncommon in HLL bytecode. So you'd still have most of the implementation to do—and if you punt and hardwire all of those to runtime library calls, you're almost back in threaded-code land anyway; you really want to take specialization opportunities etc. into account…

I also don't know of any other HLL JITs that use TCG, which seems like weak evidence that it wouldn't be a win.

didibus · on Dec 1, 2020

Ok I see. Follow up question, isn't that the same as Python then? Or at least compiled to .pyc Python code?

And, would implementing the Beam bytecode using Graal polyglot layer be a good idea then? Allowing it to leverage JVM JIT ?

ramchip · on Dec 3, 2020

CPython has no compilation step. It reads source code and generates bytecode at runtime; this bytecode is cached inside `pyc` files. The bytecode is interpreted.

Erlang (pre-JIT) has a compilation step. The compiler reads the source code and generates bytecode files (.beam). The runtime interprets this bytecode; it doesn't care about the original source.

Erlang (with JIT) has the same compilation process, but at runtime, it also converts the bytecode to machine code in memory. Then it executes that code, which removes the interpretation overhead.

amelius · on Dec 2, 2020

> Ok I see. Follow up question, isn't that the same as Python then? Or at least compiled to .pyc Python code?

Given that Python uses a GIL (global interpreter lock), and Erlang typically has highly parallel workloads, my guess as an outsider would be that it's different.

dnautics · on Dec 1, 2020

sorry, I don't know the internals of the python VM. I honestly don't know why erlang vm is faster than python vm. But it is (in single-threaded raw compute, not taking into account GC)!

Re: graal, there is erjang, but nobody uses it.

KMag · on Dec 2, 2020

> I honestly don't know why erlang vm is faster than python vm.

One reason is that on platforms that support computed goto, when a BEAM file is loaded, the list of bytecodes for each function is translated into a list of machine code addresses that implement each bytecode (a.k.a. threaded code). Bytecode dispatch then skips one layer of indirection vs. CPython. CPython looks up each bytecode's machine code address in an array every time it dispatches a bytecode. (Or, if the platform doesn't support computed goto, CPython uses a regular switch statement, which hopefully gets compiled to a jump table.)

didibus · on Dec 2, 2020

Interesting, so it's very implementation specific. Could CPython do something similar? Or is it the Erlang language semantics that allow for this optimization?

KMag · on Dec 2, 2020

There's nothing stopping CPython from using a threaded interpreter, though doing it in C requires a compiler supporting GCC's nonstandard labels-as-values[0]/computed goto and undefined behavior (using goto to jump across functions).

Threaded code is a pretty standard implementation strategy for some languages (notably Forth), and my understanding it was even a pretty common compiler implementation strategy in the 1970s/ early 1980s. It's a pretty easy optimization, particularly for a stack-based VM or a VM with a small number of registers.

The main downside is memory usage, as Python bytecode can be memory-mapped directly from disk and therefore shared across processes and discarded by the OS under memory pressure rather than being written out to swap. There's obviously a bit of startup overhead at class load time.

Though, I'm a bit surprised that most stack-based interpreters don't initially load classes with functions implemented as a dead-simple threaded implementation consisting of a pointer to the start of a regular bytecode interpreter and a pointer into a memory-mapped buffer of the on-disk bytecode. Based on performance counters, they could JIT the hotspots into regular threaded code.

[0] https://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html

rkangel · on Dec 10, 2020

Part of the issue is that because Python is so dynamic lot of lookups have to be done to do something as simple as a function call. Tight loops of arithmetic (for example) in Python are actually fairly fast, but if you call a function it has to do a load of work to work out what code you actually want to run. It might have to:

Look up which object the method belongs to, look up the namespace, check if there's an implementation of __getattr__ or equivalent, maybe call that, get the result and then actually call the function. More complicated cases get worse.

Erlang/Elixir are much simpler and get their flexibility from functional programming. There's much less indirection so less work the interpreter needs to do.

cpeterso · on Dec 1, 2020

The article says: “The modern BEAM only has the interpreter.”

paulgdp · on Dec 1, 2020

For people interested in this subject, this recent HN post might be also relevant:

https://news.ycombinator.com/item?id=25253070

My interpretation is that the end goal seems similar, but this project starts with the Rust language instead.