It's not emulating x86: it looks like it's assembling instructions on the fly and executing them in a mmap'd region. In other words, it's a very simple JIT.
But you probably can run it on an M1 anyways, since Apple's Rosetta will do the dynamic binary translation for you under the hood. YMMV.
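For anyone who hasn't seen the trick, here's roughly what "assembling into an mmap'd region and jumping to it" looks like in C. This is a minimal sketch of the general pattern on Linux/macOS x86-64, not the article's actual code:

```c
/* Minimal sketch: emit x86-64 machine code into an anonymous mapping and call it.
   (Illustrative only; real code should check errors and respect W^X.) */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    unsigned char code[] = { 0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3 };  /* mov eax, 42 ; ret */

    void *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return 1;

    memcpy(buf, code, sizeof code);              /* "assemble" into the region */

    int (*fn)(void) = (int (*)(void))buf;        /* jump into it */
    printf("%d\n", fn());                        /* prints 42 */
    return 0;
}
```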
It's a bit more complicated than that. Code is assembled into a shared memory buffer. The application spawns a child process that runs the code in the shared memory buffer. The parent process attaches to the child using ptrace to inspect and manipulate the CPU state and memory of the subprocess.
The app is entirely written in Ruby. So, it might run on Apple M1, but only if you're running an x86 Ruby interpreter through Rosetta.
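For a rough picture of that parent/child arrangement, something like the following (hypothetical C on Linux/x86-64, not the app's actual Ruby code, and omitting the shared code buffer):

```c
/* Rough sketch: fork a child that asks to be traced, then inspect its registers
   from the parent with ptrace. */
#include <signal.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t child = fork();
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);   /* let the parent trace us */
        raise(SIGSTOP);                          /* pause until the parent is ready */
        /* ...the child would jump into the shared code buffer here... */
        _exit(0);
    }

    int status;
    waitpid(child, &status, 0);                  /* child is now stopped */

    struct user_regs_struct regs;
    ptrace(PTRACE_GETREGS, child, NULL, &regs);  /* inspect CPU state */
    printf("child rip = %#llx\n", (unsigned long long)regs.rip);

    ptrace(PTRACE_CONT, child, NULL, NULL);      /* let it run to completion */
    waitpid(child, &status, 0);
    return 0;
}
```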
Would it? Rosetta is a JIT translator, isn't it? How would it know to translate instructions that are being generated on the fly, interactively? Unless there's hardware support in the M1 for translation, or some other interrupt that gets triggered to do translation on the fly...
In general, translation of machine code tends to revolve around compiling small dynamic traces (basically, the code from the current instruction pointer to the next branch instruction), with a lot of optimizations on top to make very common code patterns much faster than jumping back to your translation engine every couple of instructions. The interactive generation this article implies is most likely achieved with the x86 trap flag (which causes a trap interrupt after every single instruction is executed); that is infrequent enough that it's likely to be fully interpreted rather than use any sort of dynamic trace caching. As for x86 being generated by a JIT of some sort: you're already looking at code only when it's being jumped to, so whether the code comes from the program, from a dynamic library loaded later, or is generated on the fly doesn't affect its execution.
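If it helps, here's a toy sketch of the "dynamic trace" idea: decode from the entry PC up to the next branch, cache the result, and only fall back to the translator on a miss. It uses a made-up three-op bytecode and caches decoded structs instead of emitting host machine code, so it's purely illustrative:

```c
/* Toy "trace cache": decode from the entry PC up to the next branch, cache it,
   and only run the "translator" on a miss. Made-up bytecode, no real code emission. */
#include <stdio.h>
#include <stdlib.h>

enum { OP_DEC, OP_JNZ, OP_HALT };                 /* tiny guest "ISA" */
typedef struct { int op, arg; } insn;

static const insn program[] = {                   /* acc--; if (acc) goto 0; halt */
    { OP_DEC, 0 }, { OP_JNZ, 0 }, { OP_HALT, 0 }
};

typedef struct { int len; insn body[8]; } trace;  /* a decoded straight-line run */
static trace *cache[8];                           /* trace cache keyed by entry PC */

static trace *translate(int pc) {                 /* "translate" until a branch */
    trace *t = calloc(1, sizeof *t);
    for (;;) {
        insn in = program[pc++];
        t->body[t->len++] = in;
        if (in.op != OP_DEC) return t;            /* stop at the first branch/halt */
    }
}

int main(void) {
    int pc = 0, acc = 5;
    for (;;) {
        trace *t = cache[pc];
        if (!t) t = cache[pc] = translate(pc);    /* translator runs only on a miss */
        for (int i = 0; i < t->len; i++) {
            insn in = t->body[i];
            if (in.op == OP_DEC)      { acc--; pc++; }
            else if (in.op == OP_JNZ) { pc = acc ? in.arg : pc + 1; }
            else                      { printf("acc = %d\n", acc); return 0; }
        }
    }
}
```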
Rosetta contains both an AOT static binary translator and a JIT dynamic binary translator. That’s how Apple managed to get JS engines working even when the host browser was running as x86-on-M1.
I'd assume Rosetta handles newly marked executable pages by not actually flagging them as executable. When control flow attempts to transfer there, a page fault occurs since the page isn't really executable; that fault is what lets Rosetta step in, see what code was about to be executed, write an ARM equivalent of it to other memory, and redirect execution to that translated ARM code before resuming.
This basic sort of support is needed for any application targeting x86 that uses any form of dynamic code generation, which is probably a whole lot more than most people think (even some forms of dynamic linking use small amounts of generated code, since it's more efficient than calling a method through a pointer to a pointer to the method).
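You can reenact the control-transfer half of that on Linux/x86-64 with nothing but mprotect and a SIGSEGV handler. Illustrative sketch only, and not Rosetta: instead of translating, the handler just flips on PROT_EXEC where a real translator would rewrite the code and redirect execution:

```c
/* Leave the page non-executable, catch the fault when control transfers into it,
   "translate" on demand, then resume. */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

static void *page;

static void on_segv(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    if (si->si_addr == page)                        /* execution reached our page */
        mprotect(page, 4096, PROT_READ | PROT_EXEC);
}

int main(void) {
    struct sigaction sa = {0};
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,  /* note: no PROT_EXEC */
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    unsigned char code[] = { 0xb8, 0x07, 0x00, 0x00, 0x00, 0xc3 };  /* mov eax,7; ret */
    memcpy(page, code, sizeof code);

    int (*fn)(void) = (int (*)(void))page;
    printf("%d\n", fn());   /* faults once, the handler steps in, then prints 7 */
    return 0;
}
```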
x86 code is never actually marked as executable from the CPU's point of view, since that CPU does not know how to execute x86 code. The pages which contain the translated code are, but those are not something the x86 code knows about.
> x86 code is never actually marked as executable from the CPU's point of view, since that CPU does not know how to execute x86 code. The pages which contain the translated code are, but those are not something the x86 code knows about.
No, pages and the executable bit are something that the processor knows about.
Sorry, I don't understand what you are trying to say. Of course the CPU knows about pages and the executable bit? But there is no executable bit on a page filled with x86 code running on an ARM CPU, because the ARM CPU cannot execute that. It can only execute the translated ARM code that sits somewhere else, essentially out of sight for the x86 program.
The JIT'd ARM code pages are W^X, and that's not optional on macOS ARM. But W^X was opt-in on x86 macOS, so for backwards compatibility Rosetta can't require the x86 code to implement it in order to function.
So your model of how Rosetta works is off - the translation would need to support remapping the original code page read-only regardless of whether the x86 code did so, and letting a subsequent write invalidate the JIT cache of that page, instead of relying solely on the emulated process to implement W^X.
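For reference, this is the shape of the non-optional arrangement on macOS/arm64: RWX mappings need MAP_JIT, the thread toggles the per-thread write-protect state around writes, and the icache gets flushed explicitly. A sketch only (and under the hardened runtime it also needs the JIT entitlement):

```c
/* Sketch of the macOS/arm64 MAP_JIT + per-thread write protection pattern. */
#include <libkern/OSCacheControl.h>   /* sys_icache_invalidate */
#include <pthread.h>                  /* pthread_jit_write_protect_np */
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

static void install_code(void *dst, const void *code, size_t len) {
    pthread_jit_write_protect_np(0);  /* this thread: JIT region becomes writable */
    memcpy(dst, code, len);
    pthread_jit_write_protect_np(1);  /* this thread: JIT region becomes executable */
    sys_icache_invalidate(dst, len);  /* arm64 needs an explicit icache flush */
}

int main(void) {
    void *buf = mmap(NULL, 16384, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_JIT, -1, 0);
    if (buf == MAP_FAILED) return 1;

    uint32_t code[] = { 0x52800540, 0xd65f03c0 };   /* mov w0, #42 ; ret */
    install_code(buf, code, sizeof code);

    int (*fn)(void) = (int (*)(void))buf;
    return fn() == 42 ? 0 : 1;
}
```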
Systems that install new machine code without changing page permissions run an instruction cache barrier after installing and before running. Rosetta catches this instruction.
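The barrier being referred to is what compilers expose as __builtin___clear_cache; the usual install-then-barrier pattern looks something like this (a generic sketch of what an ARM-targeting runtime does, not Rosetta internals):

```c
#include <string.h>

void install_and_flush(void *dst, const void *code, size_t len) {
    memcpy(dst, code, len);
    /* Compiler builtin that emits the cache-maintenance + barrier sequence on
       targets that need it; a no-op on x86, where instruction fetch stays
       coherent with stores. */
    __builtin___clear_cache((char *)dst, (char *)dst + len);
}
```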
x86 does not require an icache flush because it has a unified cache. Rosetta emulates this correctly, which means it must be able to invalidate its code without encountering such an instruction.
The region is RWX, and code is put into it and then executed without a cache flush. This requires careful handling by the Rosetta runtime, and here's how it does it, step by step:
1. buffer is created and marked as RW-, since the next thing you do with a RWX buffer is obviously going to be to write code into it.
2. buffer is written to directly, without any traps.
3. The indirect function call is compiled to go through an indirect branch trampoline. It notices that this is a call into a RWX region and creates a native JIT entry for it. buffer is marked as R-X (although it is not actually executed from, the JIT entry is.)
4. The write to buffer traps because the memory is read-only. The Rosetta exception server catches this and maps the memory back RW- and allows the write through.
5. Repeat of step 3. (Amusingly, a fresh JIT entry is allocated even though the code is the same…)
As you can see, this allows for pretty acceptable performance for most JITs that are effectively W^X even if they don't signal their intent specifically to the processor/kernel. The first write to the RWX region "signals" (heh) an intent to do further writes to it, then the indirect branch instrumentation lets the runtime know when it's time to do a translation.
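Steps 1-5 are easy to reenact in miniature on Linux/x86-64, with mprotect and a SIGSEGV handler playing the roles of the trampoline and the exception server. This is only a toy to show the flipping, not how Rosetta is actually built:

```c
/* The program thinks it has an RWX buffer, but a shim keeps it RW- or R-X and
   uses faults plus a call wrapper to flip between the two, as in the trace above. */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

static void *buffer;

static void on_fault(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    /* Step 4: a write hit the R-X page; map it back RW- and let it through. */
    if ((char *)si->si_addr >= (char *)buffer &&
        (char *)si->si_addr <  (char *)buffer + 4096)
        mprotect(buffer, 4096, PROT_READ | PROT_WRITE);
}

static int call_jit(void *entry) {
    /* Step 3: the "indirect branch trampoline" sees a call into the region and
       flips it R-X (this is where Rosetta would create a fresh JIT entry). */
    mprotect(buffer, 4096, PROT_READ | PROT_EXEC);
    return ((int (*)(void))entry)();
}

int main(void) {
    struct sigaction sa = {0};
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    /* Step 1: the buffer starts out RW- even though the program wanted RWX. */
    buffer = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    unsigned char ret7[] = { 0xb8, 0x07, 0, 0, 0, 0xc3 };  /* mov eax,7; ret */
    memcpy(buffer, ret7, sizeof ret7);             /* step 2: plain write, no trap */
    printf("%d\n", call_jit(buffer));              /* step 3: prints 7 */

    unsigned char ret9[] = { 0xb8, 0x09, 0, 0, 0, 0xc3 };  /* mov eax,9; ret */
    memcpy(buffer, ret9, sizeof ret9);             /* step 4: write faults, handler flips RW- */
    printf("%d\n", call_jit(buffer));              /* step 5: prints 9 */
    return 0;
}
```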
Writing to an address would invalidate all JIT code associated with it, not just code that starts at that address. Lookup is done on the indirect branch, not on the write, so a new entry will be generated once execution runs through it.
> How do you think it detects a change to executable memory without a permissions change or a flush?
One way this could be implemented is the way mentioned above: by making sure all x86-executable pages are marked r/o (in the real page tables, not from "the x86 API"). Whenever any code writes into such a page, the resulting page fault can flush out the existing translation and transparently return to the x86 program, which can then proceed to write into the region without taking a write fault (the kernel will actually mark the pages as writable in the page tables now).
When the x86 program then jumps into the modified code, no translation exists anymore, and the resulting page fault from trying to execute can trigger the translation of the newly modified pages. The (real, not-pretend) writable bit is removed from the x86 code pages again.
To the x86 code, the pages still look like they are writable, but in the actual page tables they are not. So the x86 code does not (need to) change the permission of the pages.
I don't know if that's exactly how it is implemented, but it is a way.
How are you disagreeing with me, then? The actual page table entries that the ARM CPU looks at will never mark a page containing x86 code as executable. x86 execution-bit semantics are implemented, but on a different layer. From the ARM CPU's POV, the x86 code is always just data.
> The implementation of AMD64 is in software. It knows about page executable bits. The 'x86' code knows about them.
Where did I claim anything else? The thing I claimed the x86 code does not know about is the pages that contain the translated ARM code, which are distinct from the pages that contain the x86 code. The former are marked executable in the actual page tables; the latter have a software executable bit in the kernel, but are not marked as such in the actual page tables.
> Again, how do you think things like V8 and the JVM work on Rosetta otherwise?
Did I write something confusing that gave the wrong impression? My last answer says: "x86 execution bit semantics are implemented, but on a different layer".
You think that x86 pages are marked executable by the ARM processor? Probably not.
Maybe ARM pages with an ARM wrapper that calls the JIT for big literals filled with x86 code are, or ARM pages loaded with stubs that jump into the JIT to compile x86 code sitting in data pages are... but if the ARM processor cannot execute x86 pages directly, it wouldn't make a lot of sense for them to be marked executable, would it?
Ah, in this case I took "x86 execution semantics" to mean just how it behaves from user space, i.e. what permissions you can set and that they behave the same from an x86 observer's point of view (no matter what shenanigans are actually going on behind the scenes).