Does anyone understand (or have any theories on) how this actually works? I don't understand how it's possible. Surely they didn't write a Linux version of Rosetta, it must be talking to the host OS somehow—but how? Where is the boundary?
I wonder how they handled TSO mode. Do they enable it for the whole VM? Otherwise I cannot see how it would work safely, given that translated processes could be context-switched at any time by the guest kernel.
It seems pretty clear from TFA - there’s a directory share with the host, and given that Rosetta isn’t an emulator, but rather a translation layer, they don’t need a Linux version: x86 instructions go in, arm64 come out.
But surely that would be too slow? Although Rosetta is great at caching instructions ahead of time, it does need to emulate a lot of code (ie, anything generated at runtime).
It’s doing an AOT translation of x86-64 opcodes to ARM64 equivalents. There isn’t really any back and forth, it just digests the binary all at once.
This would still be pretty slow (see: Microsoft’s version of this under Windows on ARM), because the translator has to issue a ton of memory fence instructions to make ARM’s looser memory model behave like Intel’s. Apple sidestepped that by baking the ability to switch the CPU into an Intel-like (TSO) memory model directly into the silicon.
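To make the memory-model point concrete, here is the classic message-passing pattern in C (my own illustration, not anything from Rosetta): with relaxed stores and loads it is correct under x86's TSO, but a naive instruction-for-instruction ARM translation would have to add barriers to keep it correct.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int data, flag;

    static void *producer(void *arg)
    {
        (void)arg;
        /* Relaxed atomic stores compile to plain MOVs on x86-64, and TSO
         * guarantees the store to 'data' becomes visible before the store
         * to 'flag'. */
        atomic_store_explicit(&data, 42, memory_order_relaxed);
        atomic_store_explicit(&flag, 1, memory_order_relaxed);
        return NULL;
    }

    static void *consumer(void *arg)
    {
        (void)arg;
        /* On x86-64, seeing flag == 1 implies seeing data == 42. Under ARM's
         * weaker model either pair can be reordered, so a literal translation
         * of the x86 instructions would need DMB barriers around these
         * accesses -- unless the CPU can be flipped into a TSO mode, which is
         * what Apple silicon provides. */
        while (atomic_load_explicit(&flag, memory_order_relaxed) == 0)
            ;
        printf("data = %d\n", atomic_load_explicit(&data, memory_order_relaxed));
        return NULL;
    }

    int main(void)
    {
        pthread_t p, c;
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }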
> It’s doing an AOT translation of x86-64 opcodes to ARM64 equivalents. There isn’t really any back and forth, it just digests the binary all at once.
No it's not. Apple is not immune from fundamental computer science principles whatever their marketing team says, and even the original keynote acknowledged that Rosetta 2 emulates some instructions at runtime.
Imagine you're running Python under Rosetta. The original Python interpreter takes Python code, translates it into x86 assembly, and runs that x86 assembly. Those x86 instructions did not exist prior to execution! Even if Rosetta could translate the entire interpreter into ARM code, the interpreter would still be producing x86 assembly.
Other types of programs produce code at runtime as well. Rosetta 2 is able to cache a very impressive amount of instructions ahead of time, but it's still doing emulation.
Yes, it includes a runtime JIT component for apps that happen to dynamically generate x86-64, but in all other cases the binary is AOT translated. It does this by inserting (during the AOT translation) calls to a linked-in, in-process translation function wherever it sees mmap’d or malloc’d regions being marked executable and then jumped to. That data dependency on the jump instructions can be determined entirely from a static analysis of the executable, no violation of fundamental computer science principles required.
So yeah, no real back and forth to the host platform.
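To make the "data dependency on the jump" bit concrete, this is the kind of guest code the AOT pass is looking for. translate_and_enter() is a made-up name for the linked-in translator entry point, not Apple's actual API:

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    typedef int (*fn_t)(void);

    /* x86-64 machine code for: mov eax, 42; ret */
    static const unsigned char x86_stub[] = { 0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3 };

    static int run_generated_code(void)
    {
        unsigned char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return -1;
        memcpy(buf, x86_stub, sizeof x86_stub);

        /* Region gets marked executable and then jumped to... */
        mprotect(buf, 4096, PROT_READ | PROT_EXEC);

        /* Natively this is a plain indirect call. In the AOT-translated binary
         * the jump is rewritten to go through the in-process translator first,
         * conceptually:  return translate_and_enter(buf);
         * which JIT-translates the x86-64 bytes at buf into arm64 and runs them. */
        return ((fn_t)buf)();
    }

    int main(void)
    {
        printf("%d\n", run_generated_code());   /* prints 42 (on an x86-64 host) */
        return 0;
    }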
AOT (ahead of time) really does work on whole binaries. That is also why the first launch of an Intel app seems longer sometimes. Intel binary in, ARM binary out.
You're still right that that's not sufficient; for example, anything that generates Intel code at runtime will definitely need JIT (just in time) translation. But presumably a lot of code will still hit the happy AOT path.
That being said, a JIT does not have to be super slow. The early VMware products, back before Intel CPUs had hardware virtualization support, actually had to do some binary translation as well: https://www.vmware.com/pdf/asplos235_adams.pdf
> That is also why the first launch of an Intel app seems longer sometimes. Intel binary in, ARM binary out.
I mean, we can call it an ARM binary or we can call it an instruction cache. I generally prefer the latter term, because what Rosetta produces are not standalone executables, they're incomplete. I don't know how often the happy path is used, but Rosetta can always be observed doing work at runtime.
JITs are great and Rosetta 2 is incredible! I just can't imagine it working over any sort of shared filesystem, that would add an incredible amount of latency.
An interpreter is still producing x86 instructions at some point, right? Or else what does the CPU execute? Am I totally misunderstanding how interpreters work?
> An interpreter is still producing x86 instructions at some point, right?
Not dynamically. They just call predefined C (or whatever the interpreter was written in) functions based on some internal mechanism.
> Or else what does the CPU execute?
Usually either the interpreter is just walking the AST and calling C functions based on the parse tree’s node type (this is very slow), or it converts the AST into an opcode stream (not x86-64 opcodes, just internal names for integers, like OP_ADD = 0, OP_SUB = 1, etc.) when parsing the file, and then the interpreter’s “core” looks something like a gigantic switch statement with case OP_ADD: add(lhs, rhs) type cases, “add” in this case being a C function that implements the add semantics in this language.

(The latter approach, where the input file is converted to some intermediate form for more efficient execution after the parse tree is derived, is more properly termed a virtual machine, and “interpreter” generally only refers to the AST approach. People tend to use “interpreter” pretty broadly in informal conversations, but Python is strictly speaking a VM, not an interpreter.)
In either case, the only thing emitting x86-64 is the compiler that built the interpreter’s binary.
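To make that concrete, here is a toy version of the "opcode stream plus big switch" core in C (obviously not CPython's real loop, which lives in ceval.c, but it is the same shape):

    #include <stdio.h>

    enum { OP_PUSH, OP_ADD, OP_SUB, OP_PRINT, OP_HALT };

    static void run(const int *code)
    {
        int stack[64], sp = 0;

        for (const int *pc = code; ; ) {
            switch (*pc++) {                              /* dispatch on the opcode */
            case OP_PUSH:  stack[sp++] = *pc++;               break;
            case OP_ADD:   sp--; stack[sp - 1] += stack[sp];  break;
            case OP_SUB:   sp--; stack[sp - 1] -= stack[sp];  break;
            case OP_PRINT: printf("%d\n", stack[sp - 1]);     break;
            case OP_HALT:  return;
            }
        }
    }

    int main(void)
    {
        /* "print(2 + 3)" lowered to the toy opcodes */
        const int program[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
        run(program);   /* prints 5; the only machine code involved is whatever
                           the C compiler emitted for run() itself */
        return 0;
    }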
> Am I totally misunderstanding how interpreters work?
You’re confusing them with JITs.
If every interpreter had to roll their own dynamic binary generation, they’d be a hell of a lot less portable (like JITs).
Rosetta is very impressive! I just don't see how they could maintain that by passing instructions back and forth over a shared drive; that would be a ridiculous amount of latency!
The shared drive is just a licensing trick. They do an ioctl on /proc/self/exe as the licensing mechanism (and that's routed over virtio-fs to the host).
They are exporting some sort of Linux ARM binary under a virtual filesystem mount point that handles execution of x64 images.
Probably, that binary is passing the instructions into native macOS Rosetta code for translation, but it's also possible that the entire Rosetta code was ported to Linux.
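For what it's worth, the plumbing for that on the Linux side is binfmt_misc: you register the Rosetta binary from the shared mount as the handler for x86-64 ELF headers. A rough sketch of the registration, in C for consistency -- the /media/rosetta/rosetta path and the F flag are assumptions (they depend on how the share is mounted and configured); the magic/mask pair is the standard x86-64 ELF pattern also used for qemu-user registrations:

    #include <stdio.h>

    int main(void)
    {
        /* Format is :name:type(M=magic):offset:magic:mask:interpreter:flags
         * The \x escapes are interpreted by the kernel's binfmt_misc parser,
         * so they are written here as literal text. Needs root. */
        const char *rule =
            ":rosetta:M::"
            "\\x7fELF\\x02\\x01\\x01\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x02\\x00\\x3e\\x00"
            ":\\xff\\xff\\xff\\xff\\xff\\xfe\\xfe\\x00\\xff\\xff\\xff\\xff\\xff\\xff\\xff\\xff\\xfe\\xff\\xff\\xff"
            ":/media/rosetta/rosetta:F\n";

        FILE *f = fopen("/proc/sys/fs/binfmt_misc/register", "w");
        if (!f) { perror("binfmt_misc register"); return 1; }
        fputs(rule, f);   /* from now on, execve() of an x86-64 ELF binary
                             transparently runs it under the registered handler */
        fclose(f);
        return 0;
    }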
Sort of, yes-- there is `binfmt_misc` handling of PE executables and virtual filesystems (akin to VirtIOFS) involved, but no binary/architecture translation like Rosetta.