I listened to a very interesting podcast episode [1] on the Thinking Elixir Podcast on this subject recently, an interview with John Högberg and Lukas Larsson. Not covered in this blog post but discussed in some detail in the podcast is that because you get standard debugging symbols from the JIT, you will now be able to use gdb and prof and similar tools when working with the BEAM.
JAM (Joe’s Abstract Machine)
In 1989 JAM (Joe’s Abstract Machine) was first implemented. Mike Williams wrote the runtime system in C, Joe Armstrong wrote the compiler, and Robert Virding wrote the libraries.
"Joe Armstrong" is a link to github.com/joearms -> " joearms has no activity yet for this period. "
I'm actually very interested in the Prolog prototype - especially on how it's done and how do they iterate the design. Does anyone have any information about it?
Erlang (the language, the runtime and the OTP framework) is a fascinating alternative in the design space for programming languages. Always happy to read about it, even if I don't use it for my projects.
Yes Erlang is one of the few languages I instantly liked upon reading about it and playing with it, even outside its typically parroted features it has a ton of stuff I wish was available in more languages (working with bit oriented protocols is so much more pleasant in Erlang than most languages thanks to bitstrings). However, as other have mentioned it’s not exactly a go to for general purpose stuff and even when it seems a perfect fit, I often have to look elsewhere for a variety of reasons.
Erlang/OTP/BEAM are so unusual within the computing world: highly opinionated components that, thanks to their constraints, are keenly complementary and optimized for each other.
No kitchen sink language here: play by their rules or choose a different environment.
Contrarily, this is exactly what makes Unix environments so powerful (and wonderful). The "Unix Philosophy" is typically considered unopinionated (in the "let the user choose" sense), but I'd say that it's more like it has a few very strong yet well-chosen opinions that produce a highly integrated universe where all sorts of different tools just work.
Seeing how many billions have been posted into JIT runtimes such as CLR/JVM/V8. I'm questioning whether or not it is realistic to create a production ready JIT as a side project?
It's important to note that the goals of this BEAM JIT are a lot more modest than those other JITs.
There's no goal of heroic optimization; the optimization goal is really just to remove the overhead of interpretation.
This removes the need to apply the JIT only to some code, because it's fairly simple, it's fast enough to apply when the code is loaded, and so all code is JITed to native as it's loaded. Or you're on an unsupported platform and all code is interpretted.
Because it's all or nothing, testing the OTP release should uncover any JIT bugs; you won't have the hard to track bugs sometimes seen in other systems where a function's correctness depends on whether or not it was JITed and that depends on runtime state. That won't mean no JIT bugs, of course, but they should be easier to track down.
It's really a situation of the 90/10 rule, you'll get the first 90% of the benefits from 10% of the work and then it's many many incremental changes that gets progressively more expensive.
Also in the case of Erlang the language model will reward these first steps even more than most subsequent optimizations because much Erlang code in general don't have much tight loops of the kind that are prominent in benchmarks were a good register allocator would provide huge wins.
More on the 90/10 rule we need to remember that these expensive JIT's are very complicated with optimization levels and interpreters combined with tons of GC options whereas here they explicitly just dumped JIT-interpreter cross-calling to simplify the design as well as a more straightforward internal memory model with less complicated edge cases.
It sounds to me that “JIT” here just means “not a batch compiler”? This seems to me to be more like the way Common Lisp compilers work than what I think of a JITs
I'm not super familiar with LISPs, but what the JIT is doing here is transforming bytecode into native code, which is what other JITs do.
What it's not doing, but is commonly done in other JITs, is any sort of runtime profiling and chosing of which modules to transforming; all modules are transformed when loaded.
As described in the article, previous JIT attempts with BEAM did have that functionality, and they did meet the project goals; profiling cost too much, the compilation step was too expensive, and mixing modes between interpreted modules and native modules added too much complexity.
I haven't looked at the code behind this, but from articles, I haven't seen anything that would preclude running the bytecode to native code transformation ahead of time (or caching the just-in-time transformation for future use), but it's not part of the implementation as of now.
Reading through the article, it sounds similar to what most lisp compilers do: read source files into a data-structure, transform that structure in various ways to some kind of IR and then generate machine code from that IR. They generally do this the first time a file is loaded and generate a “fast load” file to speed up later attempts to load the file.
This sounds pretty similar, except there’s no fast load file.
The BEAM is mostly concerned with correctness and horizontal scalability. I think a little vertical scalability from a JIT raises the throughout of the system without really changing how it works or what you use it for. If anything, maybe you write less stuff in native code.
A great deal of research has been published in the last 25 years, and some of it invalidates earlier wisdom due to changes in processor design. Just following this trail and applying the 80/20 rule could get a lot done for a little effort. And a simple JIT has half a prayer of being correct.
A tracing JIT is no small feat. You could go simpler and have a template JIT, which has a lot less complexity. Guile went with that, and the speedup from the already quite fast 2.2 to 3.0 was significant.
Andy has been working at Igalia with the JS engines of chrome and Firefox, which makes me believe it might not be easily reached by mere mortals, but looking at the source it is quite easy to follow, even though I would not trust myself to make any significant changes.
If you're aiming to compete with HotSpot and .Net then you'll need to invest millions, but not all JITs are this ambitious. GNU Lightning is another example of a JIT with few people behind it.
Works, is stable and faster than an interpreter? Sure, that is achievable for a motivated and skilled developer working on their own. At least for a reasonably simple language, maybe not for a beast like C++.
Competitive with one of those bigcorp-funded ones? Nope.
No, this isn't inlining. They're managing their own call stack and relying on undefined behavior. See the quote from the article:
> BEAM/C generated a single C function for each Erlang module. Local calls within the module were made by explicitly pushing the return address to the Erlang stack followed by a goto to the label of the called function. (Strictly speaking, the calling function stores the return address to BEAM register and the called function pushes that register to the stack.)
> Calls to other modules were done similarly by using the GCC extension that makes it possible to take the address of a label and later jumping to it. Thus an external call was made by pushing the return address to the stack followed by a goto to the address of a label in another C function.
over in zigland in private we're saying that one of the endgoals of zig should be to reimplement the BEAM. One of the most active users of zig is doing some really neat things with task-stealing, allocation-free, lock-free schedulers that are a lot like the BEAM's techniques, except super performant: (watch this for till the benchmarks demos ~ finishing 1h:17min https://www.youtube.com/watch?v=UaF6-5BmX2I&t=1h9m)
It's undefined in the C standard. That doesn't mean its insecure or undefined as seen from the machine code. nothin is hilding you back to do your own code generation and clearly define this behaviour.
it's interpreted bytecode. The bytecode is a highly optimized constrained language that looks a whole lot like machine language. It would not be unreasonable to build hardware that runs it, if you wanted to. I guess it would not be unreasonable to say, it's interpreted in the same way that running arm-compiled code on qemu on an x86 is interpreted.
QEMU in fact does dynamic translation, which is more like how JIT works in bytecode languages; this is why it's much faster than some earlier machine emulator software. (I think e.g. Bochs didn't have dynamic translation back when QEMU was coming into play, but the Web suggests that there's at least discussion around that feature so maybe that's not true anymore.)
The BEAM is currently a bytecode interpreter. The article describes it as using threaded code, which if you look at the C source is true in a sense on platforms where the C compiler provides the GNU-style extension of being able to take the address of labels for dynamic `goto` (https://stenmans.org/happi_blog/?p=194 agrees with my skim of `beam_emu.c`)—though if that's the only use as it seems, I might find that an edge case of the term, since I associate “threaded code” with the style used in Forth where “user-level” subroutines can be pointed to directly. If computed `goto` isn't available, then it uses a conventional switch/case loop, repeatedly dispatching the result of fetching the next bytecode instruction pointer value.
I haven't looked at QEMU's TCG much, but I doubt it would be a good choice for bytecode formats even if you could bash it into doing something useful as part of a backend. I'd expect it to be optimized for hardware-CPU-alike machine instructions, whereas BEAM is Bogdan's Erlang Abstract Machine. See https://blog.erlang.org/a-brief-BEAM-primer/ which points to https://github.com/erlang/otp/blob/master/lib/compiler/src/g.... If you look into how some of those are implemented, just as one example, the integer arithmetic operations handle automatic promotion to bignums—not something you'd find on a conventional CPU, but not uncommon in HLL bytecode. So you'd still have most of the implementation to do—and if you punt and hardwire all of those to runtime library calls, you're almost back in threaded-code land anyway; you really want to take specialization opportunities etc. into account…
I also don't know of any other HLL JITs that use TCG, which seems like weak evidence that it wouldn't be a win.
CPython has no compilation step. It reads source code and generates bytecode at runtime; this bytecode is cached inside `pyc` files. The bytecode is interpreted.
Erlang (pre-JIT) has a compilation step. The compiler reads the source code and generates bytecode files (.beam). The runtime interprets this bytecode; it doesn't care about the original source.
Erlang (with JIT) has the same compilation process, but at runtime, it also converts the bytecode to machine code in memory. Then it executes that code, which removes the interpretation overhead.
> Ok I see. Follow up question, isn't that the same as Python then? Or at least compiled to .pyc Python code?
Given that Python uses a GIL (global interpreter lock), and Erlang typically has highly parallel workloads, my guess as an outsider would be that it's different.
sorry, I don't know the internals of the python VM. I honestly don't know why erlang vm is faster than python vm. But it is (in single-threaded raw compute, not taking into account GC)!
> I honestly don't know why erlang vm is faster than python vm.
One reason is that on platforms that support computed goto, when a BEAM file is loaded, the list of bytecodes for each function is translated into a list of machine code addresses that implement each bytecode (a.k.a. threaded code). Bytecode dispatch then skips one layer of indirection vs. CPython. CPython looks up each bytecode's machine code address in an array every time it dispatches a bytecode. (Or, if the platform doesn't support computed goto, CPython uses a regular switch statement, which hopefully gets compiled to a jump table.)
Interesting, so it's very implementation specific. Could CPython do something similar? Or is it the Erlang language semantics that allow for this optimization?
There's nothing stopping CPython from using a threaded interpreter, though doing it in C requires a compiler supporting GCC's nonstandard labels-as-values[0]/computed goto and undefined behavior (using goto to jump across functions).
Threaded code is a pretty standard implementation strategy for some languages (notably Forth), and my understanding it was even a pretty common compiler implementation strategy in the 1970s/ early 1980s. It's a pretty easy optimization, particularly for a stack-based VM or a VM with a small number of registers.
The main downside is memory usage, as Python bytecode can be memory-mapped directly from disk and therefore shared across processes and discarded by the OS under memory pressure rather than being written out to swap. There's obviously a bit of startup overhead at class load time.
Though, I'm a bit surprised that most stack-based interpreters don't initially load classes with functions implemented as a dead-simple threaded implementation consisting of a pointer to the start of a regular bytecode interpreter and a pointer into a memory-mapped buffer of the on-disk bytecode. Based on performance counters, they could JIT the hotspots into regular threaded code.
Part of the issue is that because Python is so dynamic lot of lookups have to be done to do something as simple as a function call. Tight loops of arithmetic (for example) in Python are actually fairly fast, but if you call a function it has to do a load of work to work out what code you actually want to run. It might have to:
Look up which object the method belongs to, look up the namespace, check if there's an implementation of __getattr__ or equivalent, maybe call that, get the result and then actually call the function. More complicated cases get worse.
Erlang/Elixir are much simpler and get their flexibility from functional programming. There's much less indirection so less work the interpreter needs to do.
[1] https://podcasts.google.com/?feed=aHR0cHM6Ly90aGlua2luZ2VsaX...