It's not emulating x86: it looks like it's assembling instructions on the fly and executing them in a mmap'd region. In other words, it's a very simple JIT.
But you probably can run it on an M1 anyways, since Apple's Rosetta will do the dynamic binary translation for you under the hood. YMMV.
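For anyone who hasn't seen the trick, here's roughly what "assembling into an mmap'd region and jumping to it" looks like in C. This is a minimal sketch of the general pattern on Linux/macOS x86-64, not the article's actual code:

```c
/* Minimal sketch: emit x86-64 machine code into an anonymous mapping and call it.
   (Illustrative only; real code should check errors and respect W^X.) */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    unsigned char code[] = { 0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3 };  /* mov eax, 42 ; ret */

    void *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return 1;

    memcpy(buf, code, sizeof code);              /* "assemble" into the region */

    int (*fn)(void) = (int (*)(void))buf;        /* jump into it */
    printf("%d\n", fn());                        /* prints 42 */
    return 0;
}
```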
It's a bit more complicated than that. Code is assembled into a shared memory buffer. The application spawns a child process that runs the code in the shared memory buffer. The parent process attaches to the child using ptrace to inspect and manipulate the CPU state and memory of the subprocess.
The app is entirely written in Ruby. So, it might run on Apple M1, but only if you're running an x86 Ruby interpreter through Rosetta.
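For a rough picture of that parent/child arrangement, something like the following (hypothetical C on Linux/x86-64, not the app's actual Ruby code, and omitting the shared code buffer):

```c
/* Rough sketch: fork a child that asks to be traced, then inspect its registers
   from the parent with ptrace. */
#include <signal.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t child = fork();
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);   /* let the parent trace us */
        raise(SIGSTOP);                          /* pause until the parent is ready */
        /* ...the child would jump into the shared code buffer here... */
        _exit(0);
    }

    int status;
    waitpid(child, &status, 0);                  /* child is now stopped */

    struct user_regs_struct regs;
    ptrace(PTRACE_GETREGS, child, NULL, &regs);  /* inspect CPU state */
    printf("child rip = %#llx\n", (unsigned long long)regs.rip);

    ptrace(PTRACE_CONT, child, NULL, NULL);      /* let it run to completion */
    waitpid(child, &status, 0);
    return 0;
}
```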
Would it? Rosetta is a JIT translator, isn't it? How would it know to translate instructions that are being generated on the fly, interactively? Unless there's hardware support in the M1 for translation, or some other interrupt that gets triggered to do translation on the fly...
In general, translation of machine code tends to revolve around compiling small dynamic traces (basically, the code from the current instruction pointer to the next branch instruction), with a lot of optimizations on top to make very common code patterns much faster than jumping back to your translation engine every couple of instructions. The interactive generation this article implies is most likely achieved with the x86 trap flag (which causes a trap interrupt after every single instruction is executed); that is infrequent enough that it's likely to be fully interpreted rather than use any sort of dynamic trace caching. As for x86 being generated by a JIT of some sort: you're already looking at code only when it's being jumped to, so whether the code comes from the program, from a dynamic library loaded later, or is generated on the fly doesn't affect its execution.
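If it helps, here's a toy sketch of the "dynamic trace" idea: decode from the entry PC up to the next branch, cache the result, and only fall back to the translator on a miss. It uses a made-up three-op bytecode and caches decoded structs instead of emitting host machine code, so it's purely illustrative:

```c
/* Toy "trace cache": decode from the entry PC up to the next branch, cache it,
   and only run the "translator" on a miss. Made-up bytecode, no real code emission. */
#include <stdio.h>
#include <stdlib.h>

enum { OP_DEC, OP_JNZ, OP_HALT };                 /* tiny guest "ISA" */
typedef struct { int op, arg; } insn;

static const insn program[] = {                   /* acc--; if (acc) goto 0; halt */
    { OP_DEC, 0 }, { OP_JNZ, 0 }, { OP_HALT, 0 }
};

typedef struct { int len; insn body[8]; } trace;  /* a decoded straight-line run */
static trace *cache[8];                           /* trace cache keyed by entry PC */

static trace *translate(int pc) {                 /* "translate" until a branch */
    trace *t = calloc(1, sizeof *t);
    for (;;) {
        insn in = program[pc++];
        t->body[t->len++] = in;
        if (in.op != OP_DEC) return t;            /* stop at the first branch/halt */
    }
}

int main(void) {
    int pc = 0, acc = 5;
    for (;;) {
        trace *t = cache[pc];
        if (!t) t = cache[pc] = translate(pc);    /* translator runs only on a miss */
        for (int i = 0; i < t->len; i++) {
            insn in = t->body[i];
            if (in.op == OP_DEC)      { acc--; pc++; }
            else if (in.op == OP_JNZ) { pc = acc ? in.arg : pc + 1; }
            else                      { printf("acc = %d\n", acc); return 0; }
        }
    }
}
```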
Rosetta contains both an AOT static binary translator and a JIT dynamic binary translator. That’s how Apple managed to get JS engines working even when the host browser was running as x86-on-M1.
I'd assume Rosetta handles newly marked executable pages by not actually flagging them as executable. When control flow attempts to transfer there, a page fault occurs since the page isn't really executable; that fault is what lets Rosetta step in, see what code was about to be executed, write an ARM equivalent of it to other memory, and redirect execution to that translated ARM code before resuming.
This basic sort of support is needed for any application targeting x86 that uses any form of dynamic code generation, which is probably a whole lot more than most people think (even some forms of dynamic linking use small amounts of generated code, since it's more efficient than calling a method through a pointer to a pointer to the method).
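You can reenact the control-transfer half of that on Linux/x86-64 with nothing but mprotect and a SIGSEGV handler. Illustrative sketch only, and not Rosetta: instead of translating, the handler just flips on PROT_EXEC where a real translator would rewrite the code and redirect execution:

```c
/* Leave the page non-executable, catch the fault when control transfers into it,
   "translate" on demand, then resume. */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

static void *page;

static void on_segv(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    if (si->si_addr == page)                        /* execution reached our page */
        mprotect(page, 4096, PROT_READ | PROT_EXEC);
}

int main(void) {
    struct sigaction sa = {0};
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,  /* note: no PROT_EXEC */
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    unsigned char code[] = { 0xb8, 0x07, 0x00, 0x00, 0x00, 0xc3 };  /* mov eax,7; ret */
    memcpy(page, code, sizeof code);

    int (*fn)(void) = (int (*)(void))page;
    printf("%d\n", fn());   /* faults once, the handler steps in, then prints 7 */
    return 0;
}
```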
x86 code is never actually marked as executable from the CPU's point of view, since that CPU does not know how to execute x86 code. The pages which contain the translated code are, but those are not something the x86 code knows about.
> x86 code is never actually marked as executable from the CPU's point of view, since that CPU does not know how to execute x86 code. The pages which contain the translated code are, but those are not something the x86 code knows about.
No, pages and the executable bit are something that the processor knows about.
Sorry, I don't understand what you are trying to say. Of course the CPU knows about pages and the executable bit? But there is no executable bit on a page filled with x86 code running on an ARM CPU, because the ARM CPU cannot execute that. It can only execute the translated ARM code that sits somewhere else, essentially out of sight for the x86 program.
The JIT'd ARM code pages are W^X, and that's not optional on macOS ARM. But W^X was opt-in on x86 macOS, so for backwards compatibility Rosetta can't require the x86 code to implement it in order to function.
So your model of how Rosetta works is off - the translation would need to support remapping the original code page read-only regardless of whether the x86 code did so, and letting a subsequent write invalidate the JIT cache of that page, instead of relying solely on the emulated process to implement W^X.
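For reference, this is the shape of the non-optional arrangement on macOS/arm64: RWX mappings need MAP_JIT, the thread toggles the per-thread write-protect state around writes, and the icache gets flushed explicitly. A sketch only (and under the hardened runtime it also needs the JIT entitlement):

```c
/* Sketch of the macOS/arm64 MAP_JIT + per-thread write protection pattern. */
#include <libkern/OSCacheControl.h>   /* sys_icache_invalidate */
#include <pthread.h>                  /* pthread_jit_write_protect_np */
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

static void install_code(void *dst, const void *code, size_t len) {
    pthread_jit_write_protect_np(0);  /* this thread: JIT region becomes writable */
    memcpy(dst, code, len);
    pthread_jit_write_protect_np(1);  /* this thread: JIT region becomes executable */
    sys_icache_invalidate(dst, len);  /* arm64 needs an explicit icache flush */
}

int main(void) {
    void *buf = mmap(NULL, 16384, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_JIT, -1, 0);
    if (buf == MAP_FAILED) return 1;

    uint32_t code[] = { 0x52800540, 0xd65f03c0 };   /* mov w0, #42 ; ret */
    install_code(buf, code, sizeof code);

    int (*fn)(void) = (int (*)(void))buf;
    return fn() == 42 ? 0 : 1;
}
```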
Systems that install new machine code without changing page permissions run an instruction cache barrier after installing and before running. Rosetta catches this instruction.
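The barrier being referred to is what compilers expose as __builtin___clear_cache; the usual install-then-barrier pattern looks something like this (a generic sketch of what an ARM-targeting runtime does, not Rosetta internals):

```c
#include <string.h>

void install_and_flush(void *dst, const void *code, size_t len) {
    memcpy(dst, code, len);
    /* Compiler builtin that emits the cache-maintenance + barrier sequence on
       targets that need it; a no-op on x86, where instruction fetch stays
       coherent with stores. */
    __builtin___clear_cache((char *)dst, (char *)dst + len);
}
```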
x86 does not require an icache flush because it has a unified cache. Rosetta emulates this correctly, which means it must be able to invalidate its code without encountering such an instruction.
The region is RWX, and code is put into it and then executed without a cache flush. This requires careful handling by the Rosetta runtime, and here's how it does it, step by step:
1. buffer is created and marked as RW-, since the next thing you do with a RWX buffer is obviously going to be to write code into it.
2. buffer is written to directly, without any traps.
3. The indirect function call is compiled to go through an indirect branch trampoline. It notices that this is a call into a RWX region and creates a native JIT entry for it. buffer is marked as R-X (although it is not actually executed from, the JIT entry is.)
4. The write to buffer traps because the memory is read-only. The Rosetta exception server catches this and maps the memory back RW- and allows the write through.
5. Repeat of step 3. (Amusingly, a fresh JIT entry is allocated even though the code is the same…)
As you can see, this allows for pretty acceptable performance for most JITs that are effectively W^X even if they don't signal their intent specifically to the processor/kernel. The first write to the RWX region "signals" (heh) an intent to do further writes to it, then the indirect branch instrumentation lets the runtime know when it's time to do a translation.
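Steps 1-5 are easy to reenact in miniature on Linux/x86-64, with mprotect and a SIGSEGV handler playing the roles of the trampoline and the exception server. This is only a toy to show the flipping, not how Rosetta is actually built:

```c
/* The program thinks it has an RWX buffer, but a shim keeps it RW- or R-X and
   uses faults plus a call wrapper to flip between the two, as in the trace above. */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

static void *buffer;

static void on_fault(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    /* Step 4: a write hit the R-X page; map it back RW- and let it through. */
    if ((char *)si->si_addr >= (char *)buffer &&
        (char *)si->si_addr <  (char *)buffer + 4096)
        mprotect(buffer, 4096, PROT_READ | PROT_WRITE);
}

static int call_jit(void *entry) {
    /* Step 3: the "indirect branch trampoline" sees a call into the region and
       flips it R-X (this is where Rosetta would create a fresh JIT entry). */
    mprotect(buffer, 4096, PROT_READ | PROT_EXEC);
    return ((int (*)(void))entry)();
}

int main(void) {
    struct sigaction sa = {0};
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    /* Step 1: the buffer starts out RW- even though the program wanted RWX. */
    buffer = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    unsigned char ret7[] = { 0xb8, 0x07, 0, 0, 0, 0xc3 };  /* mov eax,7; ret */
    memcpy(buffer, ret7, sizeof ret7);             /* step 2: plain write, no trap */
    printf("%d\n", call_jit(buffer));              /* step 3: prints 7 */

    unsigned char ret9[] = { 0xb8, 0x09, 0, 0, 0, 0xc3 };  /* mov eax,9; ret */
    memcpy(buffer, ret9, sizeof ret9);             /* step 4: write faults, handler flips RW- */
    printf("%d\n", call_jit(buffer));              /* step 5: prints 9 */
    return 0;
}
```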
Writing to an address would invalidate all JIT code associated with it, not just code that starts at that address. Lookup is done on the indirect branch, not on the write, so a new entry will be generated once execution runs through it.
> How do you think it detects a change to executable memory without a permissions change or a flush?
One way this could be implemented is the way mentioned above: by making sure all x86-executable pages are marked r/o (in the real page tables, not from "the x86 API"). Whenever any code writes into such a page, the resulting page fault can flush out the existing translation and transparently return to the x86 program, which can then proceed to write into the region without taking a write fault (the kernel will actually mark the pages as writable in the page tables now).
When the x86 program then jumps into the modified code, no translation exists anymore, and the resulting page fault from trying to execute can trigger the translation of the newly modified pages. The (real, not-pretend) writable bit is removed from the x86 code pages again.
To the x86 code, the pages still look like they are writable, but in the actual page tables they are not. So the x86 code does not (need to) change the permission of the pages.
I don't know if that's exactly how it is implemented, but it is a way.
How are you disagreeing with me, then? The actual page table entries that the ARM CPU looks at will never mark a page containing x86 code as executable. x86 execution-bit semantics are implemented, but on a different layer. From the ARM CPU's POV, the x86 code is always just data.
> The implementation of AMD64 is in software. It knows about page executable bits. The 'x86' code knows about them.
Where did I claim anything else? The thing I claimed the x86 code does not know about is the pages that contain the translated ARM code, which are distinct from the pages that contain the x86 code. The former are marked executable in the actual page tables; the latter have a software executable bit in the kernel, but are not marked as such in the actual page tables.
> Again, how do you think things like V8 and the JVM work on Rosetta otherwise?
Did I write something confusing that gave the wrong impression? My last answer says: "x86 execution bit semantics are implemented, but on a different layer".
You think that x86 pages are marked executable by the ARM processor? Probably not.
Maybe ARM pages with an ARM wrapper that calls the JIT for big literals filled with x86 code are, or ARM pages loaded with stubs that jump into the JIT to compile x86 code sitting in data pages are... but if the ARM processor cannot execute x86 pages directly, it wouldn't make a lot of sense for them to be marked executable, would it?
Ah, in this case I took "x86 execution semantics" to mean just how it behaves from user space, i.e. what permissions you can set and that they behave the same from an x86 observer's point of view (no matter what shenanigans are actually going on behind the scenes).