I've lost the original reference, but Joe Marshall once wrote in comp.lang.lisp:
> Here's an anecdote I heard once about Minsky. He was showing a student how to use ITS to write a program. ITS was an unusual operating system in that the 'shell' was the DDT debugger. You ran programs by loading them into memory and jumping to the entry point. But you can also just start writing assembly code directly into memory from the DDT prompt. Minsky started with the null program. Obviously, it needs an entry point, so he defined a label for that. He then told the debugger to jump to that label. This immediately raised an error of there being no code at the jump target. So he wrote a few lines of code and restarted the jump instruction. This time it succeeded and the first few instructions were executed. When the debugger again halted, he looked at the register contents and wrote a few more lines. Again proceeding from where he left off he watched the program run the few more instructions. He developed the entire program by 'debugging' the null program.
Everything needs a catchy name, so I call it Debugger Driven Development.
I write the first few lines of a function, however much I feel sure about, then add a dummy statement at the end and set a breakpoint there. When the code stops at the breakpoint, I can see exactly what my code did and what data I now have on hand.
I use that new knowledge to write the next few lines, again until I get to something I'm unsure of or where I'd just like to get a better view of the data. Set a breakpoint there and view that new data.
Repeat as needed until the function is done.
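A minimal sketch of what that looks like in practice (the function and the dummy statement here are made up for illustration): write the part you're sure of, end on a throwaway statement, and park the breakpoint there.

```c
#include <stdio.h>

/* Hypothetical example: parse a header line whose format I only half know. */
static int parse_header(const char *line) {
    int version = 0, flags = 0;
    sscanf(line, "V%d F%d", &version, &flags);   /* the part I'm sure about */

    /* Not sure what to do with `flags` yet, so stop here and look at it.
       `volatile` keeps the compiler from optimizing the dummy line away. */
    volatile int breakpoint_here = flags;
    (void)breakpoint_here;

    return version;
}

int main(void) {
    printf("version = %d\n", parse_header("V2 F7"));
    return 0;
}
```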
At my work we have many internal APIs that are "documented" but the documentation is fairly lacking. With the debugger, I can see not only what the API claims to do, but what it really does with my actual input.
I am bummed that so many developers today eschew debuggers. I even read an article recently along the lines of "These famous programmers don't use debuggers, and you shouldn't either". Why would anyone want to talk people out of using such a useful tool? It makes no sense to me.
We did that in the 1980s and the 1990s on the Commodore 64 and the Amiga, only back then the debugger was called a monitor, and it was quickly discovered that programming in a monitor was slow and error-prone because it couldn't recompute addresses when the code changed. That is why native assemblers appeared: Turbo Assembler on the Commodore 64 and ASM-One, TRASH’M-One, Seka and MasterSeka on the Amiga. ASM-One and TRASH’M-One include an excellent debugger built natively into the integrated development environment, and stepping through the code after assembling it is a joy.
I'd love to use a debugger more, but in my day job at least, the use of Docker makes it a hassle to set up.
For standalone Java applications, or when using Visual C++ though, it's so much better than printing out state.
> I am bummed that so many developers today eschew debuggers. I even read an article recently along the lines of "These famous programmers don't use debuggers, and you shouldn't either". Why would anyone want to talk people out of using such a useful tool? It makes no sense to me.
That would be because of what I call "long vs. short-term". If you're only thinking of the next few lines and doing that a lot, you will have effectively trained yourself out of looking at the bigger picture. As someone who taught programmers, I've seen what "debugger driven development" code looks like (because that's how some of them will try to start writing code.) It's not pretty. There's a reason a lot of highly productive (but not necessarily famous) programmers consider debuggers as a last-resort tool: writing code that needs debugging should be a rare occurrence.
Many PC magazines of the late 80s/early 90s had program listings (in Asm) for small utilities that you created by typing them into DEBUG, the very basic debugger that came with DOS. The C64 and ZX ones had similar listings, although I believe those were more commonly in the platform's variant of BASIC. Unfortunately, I don't think this culture existed around Apple's machines since the Macintosh (or the Lisa that came before it).
The way to low-level format early MFM drives was to run DEBUG, enter assembly and call some special routine in the controller chip on the drive. These were instructions that came in the manual with the drive. Pretty wild times. We've come a long way.
Yes, we actually had to do stuff like this. When I started my first job in the late 1980s, it was normal for hard disks to have a little label with a list of the (known) bad blocks on the drive. (Hand-written by someone at the factory on the first (circa 15-20MB) disks I used; later, on things like big ESDI disks, dot-matrix printed.) In some formatting tools, you had to manually enter them; formats took a long time, so eliminating retries on bad blocks could save half an hour.
Novell Netware came with its own low-level formatter called `COMPSURF`: COMPrehensive SURFace analysis. Dozens of people would be sharing a server's hard disk, so data losses would be extra-bad -- and might well bring down the server, losing everyone's work.
Note: the assumption in the early days of Novell was that workstations didn't have hard disks of their own and booted off the server too, making a LAN tens of thousands of £/$ cheaper than giving everyone their own HDD.
Running COMPSURF before you installed took hours. Server HDDs were big -- hundreds of megabytes! Scanning all that took ages.
I do a limited form of this with C/C++ projects that take forever to compile, with many conditional breakpoints that alter control flow to my liking that I then "solidify" into actual code when I have it looking like I want.
Before VS 2022, it used to be called edit-and-continue; now they are doubling down on it, improving the use cases that are actually supported, and it got renamed to hot reload.
That's how I learnt x86 assembly using MS-DOS and "debug" - which was a program that came with DOS. The proper assemblers at that time were Microsoft's MASM and Borland's TASM. With no access to those, the only option was to use the one bundled in DOS. Fun times, where you had to compute relative addresses of JMPs, based on the address where the JMP instruction sat. And then, you could even write to a particular cylinder/sector/offset on the hard disk and replace the boot sector with your own code.
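For anyone who never had to do that arithmetic by hand: a short JMP is the opcode EB followed by a signed 8-bit displacement measured from the end of the two-byte instruction, so every jump meant a little subtraction. A worked example (addresses chosen arbitrarily):

```c
#include <stdio.h>

int main(void) {
    unsigned jmp_addr = 0x0100;   /* offset where the JMP instruction sits */
    unsigned target   = 0x0110;   /* offset we want to jump to             */

    /* The CPU adds the displacement to the address of the *next*
       instruction, i.e. jmp_addr + 2 for a two-byte short JMP. */
    int rel8 = (int)target - (int)(jmp_addr + 2);

    printf("JMP to %04X from %04X encodes as: EB %02X\n",
           target, jmp_addr, (unsigned)rel8 & 0xFF);   /* prints EB 0E */
    return 0;
}
```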
Those who enjoy Asmrepl might also enjoy "Cheap EMUlator: lightweight multi-architecture assembly playground" [0]. It supports 32- and 64-bit variants of the Intel, ARM, MIPS and SPARC instruction sets, provides a visual experience, and supports many operating systems.
If you are on Windows and need something in a console, a nice colorful asm REPL is available in WinRepl [1], which is similar to "yrp604/rappel (Linux) and Tyilo/asm_repl".
Not exactly the same, but https://www.endbasic.dev/ tries to achieve precisely that: a REPL with built in graphics for learning purposes, albeit with BASIC instead of asm.
These two lines (`mov ax, 0x13` followed by `int 0x10`) are deeply ingrained in the minds of a whole generation of programmers. They start a 320x200 graphics mode with a 256-color palette, and you can start dumping your pixels into segment 0xA000 right away.
I have yet to find a modern graphics programming environment that is so comfortable and easy to use as this.
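For anyone who never got to play with it, here is roughly the whole thing in C rather than assembly (a sketch, assuming a 16-bit DOS compiler such as Turbo C and something like DOSBox to run it in):

```c
#include <dos.h>     /* int86, MK_FP */
#include <conio.h>   /* getch */

int main(void) {
    union REGS r;
    unsigned char far *vga = (unsigned char far *)MK_FP(0xA000, 0);
    int x, y;

    r.x.ax = 0x0013;             /* INT 10h, AX=0013h: 320x200, 256 colors */
    int86(0x10, &r, &r);

    for (y = 0; y < 200; y++)    /* dump pixels straight into video memory */
        for (x = 0; x < 320; x++)
            vga[y * 320 + x] = (unsigned char)(x ^ y);   /* XOR pattern */

    getch();                     /* wait for a key...                      */
    r.x.ax = 0x0003;             /* ...then back to 80x25 text mode        */
    int86(0x10, &r, &r);
    return 0;
}
```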
Well, you're sort of comparing heavyweight OS graphics stack APIs with old school firmware ones. Even so, things like SDL2 are dead simple: one requests a window region and it's possible to write bytes to the resulting buffer that show up in a window. That said, modern firmware interfaces are still pretty clean. If you write a UEFI hello world, it's possible to access the raw frame buffer with just a connection to the GOP, which is just a couple of lines of code in C. It's conceptually pretty close to what you're describing, except it's designed to work with a slightly more modern programming paradigm.
I wouldn't call SDL2 "dead simple", unless sarcastically. Just opening an empty SDL window requires writing about 20 lines of code that deal with several different abstractions: a "window", a "surface", a "renderer", an "event". I only want an array of pixels that I can edit and see the results in realtime. It is of course possible to do that, but it seems ridiculously overcomplicated.
I was taught as a kid to program simple graphical demos using peek and poke in basic. Then in assembler. In either case, stupid me got colored pixels on the screen after a few minutes of work. Kids these days, how do they start? Please, don't tell me "matplotlib" or I will cry myself to sleep.
It's basically: grab a window, show it, grab a render/draw buffer, update it, and make it visible.
I don't really find the base C version much more complex, although it does have a bit of boilerplate around init/window creation/grab surface/display surface/etc. I'm not sure I would consider that particularly complex. Sure, SDL can get complex when you start trying to use GL/etc., but if all you want is a buffer to write bytes that become pixels, it's pretty straightforward IMHO.
I would say it's roughly the same level of complexity as (if not less than) HTML canvas+JS.
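For concreteness, the "buffer of bytes that become pixels" version looks roughly like this (a sketch with no error handling, assuming the window surface is 32 bits per pixel, which is the common case):

```c
#include <SDL2/SDL.h>

int main(void) {
    SDL_Init(SDL_INIT_VIDEO);
    SDL_Window *win = SDL_CreateWindow("pixels", SDL_WINDOWPOS_CENTERED,
                                       SDL_WINDOWPOS_CENTERED, 320, 200, 0);
    SDL_Surface *surf = SDL_GetWindowSurface(win);

    /* The window surface is just a buffer we write pixels into. */
    Uint32 *pixels = (Uint32 *)surf->pixels;
    for (int y = 0; y < surf->h; y++)
        for (int x = 0; x < surf->w; x++)
            pixels[y * (surf->pitch / 4) + x] =
                SDL_MapRGB(surf->format, (Uint8)x, (Uint8)y, (Uint8)(x ^ y));

    SDL_UpdateWindowSurface(win);   /* make it visible */

    SDL_Event e;                    /* keep the window up until it's closed */
    while (SDL_WaitEvent(&e) && e.type != SDL_QUIT)
        ;

    SDL_DestroyWindow(win);
    SDL_Quit();
    return 0;
}
```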
Wow, this brings back memories of my final project for my "programming for Engineering students" course in the mid-80s.
I wrote a DOS TSR program (remember those?) which would pop up a window when you pressed a key sequence and present you with an ASM86 REPL.
You could selectively 'save' pieces of code, and then when you exited the window, it would paste the saved code as inline assembly code (a hex byte array surrounded by some Turbo Pascal syntax) into your keyboard buffer - the assumption being that you are running the Turbo Pascal IDE, of course.
The TSR itself was written in x86 assembly, which added a level of complexity. I would have given an arm and a leg to be able to do it in a high-level language like Ruby.
I can only take a guess: that for a TSR on an early DOS PC, you really wanted it to be small. TSRs took a significant chunk of your base memory, and as you only got 640 kB of that, you wanted to save as much as you could.
In the later days of DOS, programs grew so big that they wanted all of that 640 kB to themselves. Optional-extra type TSRs went out of fashion and DOS (first DR DOS 5, then MS played copy-cat with MS-DOS 5) gained built-in memory managers to load necessary TSRs (e.g. CD, mouse and keyboard drivers, disk cache, etc.) into Upper Memory Blocks.
UMBs were a 386 thing: you used a 386 memory manager to map any unused bits of the upper memory area in the PC's memory map (i.e. from 640 kB up to 1 MB) as RAM. Anywhere that wasn't being used for ROM or memory-mapped I/O, you could put RAM there and then load TSRs into these little chunks of RAM -- one or two dozen kB each.
Yes, we were that desperate for base memory. It didn't matter if you had 2 or 4 or 16 MB of RAM, DOS could only run programs in the first 1 MB of it, and only freely use the first 640 kB of that first meg. All the rest could only be used for data, disk caches, and other non-executable stuff.
A side-effect of having a 386 memory manager, for real DOS power users, was that fancy 3rd party ones like Quarterdeck QEMM could also offer multitasking. Quarterdeck sold a tool called DESQview that let you run multiple DOS programs side-by-side and switch between them -- radical stuff in the 1980s.
But once you had that, you didn't need TSRs any more.
I fondly remember writing my first game using assembly that I hand typed from a magazine article on an Amiga. It didn't work because of a reversed peek/poke. It took us all day to figure it out, but we got it working!
Apple //e & ][+ had one built in. It was called the "monitor". You typed "CALL -151" and you started typing assembly code. You could run, save, dump memory and read registers. When I got my first 286 I was surprised I couldn't do the same thing.
I didn't know that until about a decade later, unfortunately!
People forget that in 1984 information wasn't a click away.
The problem with owning a Hong Kong-made 286 clone in 1984, and using pirated software, is that it was extremely hard to learn things. I was limited by the books at my local "Waldenbooks" computer section, which was about 20 books. Computer shopper and Byte magazine were kinda helpful, but I learned very, very slowly. It wasn't until I entered college that I started learning rapidly, but the focus wasn't on PCs (it was still MTS mainframes). It took until my first job writing 16-bit drivers that I finally started learning the nuts and bolts of MSDOS.
The Apple ][ with Integer basic had a better one which had a built in mini-assembler. Very fun and useful. It was a real shame it got pushed out by the bloated Microsoft Basic. ;-)
I wrote it because I can never remember what the `test` instruction does to the zero flag. Every time I use the instruction I have to look up the docs. Looking up docs is fine, but running code in a REPL helps me remember things better.
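For the record (and because I keep forgetting it too): `test` ANDs its operands, throws the result away, sets ZF if that result was zero, clears CF and OF, and sets SF and PF from the result. A quick way to see it outside the REPL, assuming x86-64 and GCC/Clang-style inline assembly:

```c
#include <stdint.h>
#include <stdio.h>

/* Run `test rax, rax` with a given value and return the resulting ZF bit. */
static int zf_after_test(uint64_t value) {
    uint64_t rflags;
    __asm__ volatile (
        "test %%rax, %%rax\n\t"   /* AND rax with itself, discard the result */
        "pushfq\n\t"              /* push RFLAGS so we can look at it        */
        "pop %0"
        : "=r"(rflags)
        : "a"(value)
        : "cc");
    return (int)((rflags >> 6) & 1);   /* ZF lives in bit 6 of RFLAGS */
}

int main(void) {
    printf("ZF after `test rax, rax` with rax=0: %d\n", zf_after_test(0));  /* 1 */
    printf("ZF after `test rax, rax` with rax=7: %d\n", zf_after_test(7));  /* 0 */
    return 0;
}
```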
It's a shame that modern debuggers don't have mini-assemblers included like the original Apple II. Having a REPL would be real nice. For one, I wouldn't have to type 90 (NOP) into memory windows to blank out code like non-fatal ASSERTs.
Thanks; my assembly experience was with earlier processors, with a single argument for their test instruction (kind of like calling x86 test with two same arguments). I should have checked what the x86 test instruction does before replying.
I do a lot of program analysis work, and it's occasionally useful to see the pre- and post-machine states of arbitrary instructions. I have my own (more? less?) hacky version of this program that I use for that purpose; I know other people use GEF and similar GDB extensions for similar purposes.
Learning assembly can be a pain, especially without something like gdb (with `layout regs` and `layout asm`).
This is much simpler and doesn't require you to type 4-5 extra commands (start gdb, set a breakpoint, set the layouts, step through the code), thus avoiding the pain that gdb can be for very simple asm programs.
This reminds me of a fun project I once did, writing an x86 assembler in Lotus 123, using lookup tables. On the odd occasion when it worked, it was immensely fulfilling.
This could be implemented with Jupyter notebooks as a Jupyter kernel or maybe with just fancy use of explicitly returned objects that support the (Ruby-like, implicit) IPython.display.display() magic.
More links on how Jupyter kernels, implicit display(), and DAP (the Debug Adapter Protocol) work: "Evcxr: A Rust REPL and Jupyter Kernel" https://news.ycombinator.com/item?id=25923123
Using intrinsics correctly generally requires understanding assembly, because they are supposed to match the assembly you'd want to generate. Just sprinkling them around because you're not familiar with x86 assembly is unlikely to be productive.
A toy project I have in mind is bootstrapping a lisp in asm and then using lisp macros as assembler macros to build up a high level language that would effectively be native code.
Sounds like it'd be cool for the sake of it, but just in case you (or other readers) aren't aware (Edit -- looks like you are very aware ;) SBCL already compiles Lisp code to native code. It's not the same as (asm) macros all the way down, but still. You can even inspect the assembly of a function with the built-in function DISASSEMBLE, and see how it changes with different optimization levels or type declarations or other things. https://pvk.ca/Blog/2014/03/15/sbcl-the-ultimate-assembly-co... is worth a read too for a cool experiment in generating custom assembly for a VM idea.
My understanding of wasm (which could be very wrong) is that it's a stack-based virtual machine (like cpython), rather than a load/store or register/memory ISA.
You could probably visualize the operand stack and opcode sequence, but it wouldn't be quite as "flashy" as x86's state transitions look when visualized here.
It's not emulating x86: it looks like it's assembling instructions on the fly and executing them in a mmap'd region. In other words, it's a very simple JIT.
But you probably can run it on an M1 anyways, since Apple's Rosetta will do the dynamic binary translation for you under the hood. YMMV.
It's a bit more complicated than that. Code is assembled into a shared memory buffer. The application spawns a child process that runs the code in the shared memory buffer. The parent process attaches to the child using ptrace to inspect and manipulate the CPU state and memory of the subprocess.
The app is entirely written in Ruby. So, it might run on Apple M1, but only if you're running an x86 Ruby interpreter through Rosetta.
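A rough C sketch of that pattern (not asmrepl's actual code, which is Ruby; Linux/x86-64, error handling omitted): the parent writes machine code into a shared buffer, the child runs it under ptrace, and the parent reads the child's registers when it stops.

```c
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    /* x86-64 machine code: mov rax, 42 ; int3 */
    unsigned char code[] = { 0x48, 0xc7, 0xc0, 0x2a, 0x00, 0x00, 0x00, 0xcc };

    /* Shared mapping so the parent could keep appending code for later steps. */
    unsigned char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    memcpy(buf, code, sizeof(code));

    pid_t pid = fork();
    if (pid == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);   /* let the parent inspect us */
        ((void (*)(void))buf)();                 /* run the assembled code    */
        _exit(0);
    }

    int status;
    waitpid(pid, &status, 0);                    /* child stopped at the int3 */

    struct user_regs_struct regs;
    ptrace(PTRACE_GETREGS, pid, NULL, &regs);    /* peek at its CPU state     */
    printf("child rax = %llu\n", (unsigned long long)regs.rax);   /* 42 */

    kill(pid, SIGKILL);
    return 0;
}
```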
would it? rosetta is a jit translator isn't it? how would it know to translate the instructions that are being generated on the fly interactively? unless there's hardware support in the m1 for translation or some other interrupt that gets triggered to do translation on the fly...
In general, the way you handle translation of machine code tends to revolve around compiling small dynamic traces (basically, the code from the current instruction pointer to the next branch instruction), with a lot of optimizations on top of that to make very common code patterns much faster than having to jump back to your translation engine every couple of instructions. The interactive generation this article implies is most likely handled using the x86 trap flag (which causes a trap interrupt after every single instruction is executed), which is infrequent enough that it's likely to be fully interpreted instead of using any sort of dynamic trace caching. In the case of x86 being generated by a JIT of some sort, well, you're already looking at code only when it's being jumped to, so whether the code comes from the program, some dynamic library being loaded later, or being generated on the fly doesn't affect its execution.
Rosetta contains both an AOT static binary translator and a JIT dynamic binary translator. That’s how Apple managed to get JS engines working even when the host browser was running as x86-on-M1.
I'd assume Rosetta works for newly marked executable pages by not actually flagging them as executable. When control flow attempts to transfer there, a page fault occurs since the page is not actually executable; this is the trap that allows Rosetta to step in, see what code was about to be executed, write out an ARM equivalent of that code to other memory, and redirect execution to the new ARM code before resuming.
This basic sort of support is needed for any application targeting x86 that uses any form of dynamic code generation, which is probably a whole lot more than most people think (even some forms of dynamic linking use small amounts of generated code, since that is more efficient than calling a method through a pointer to a pointer to the method).
x86 code is never actually marked as executable from the CPU's point of view, since that CPU does not know how to execute x86 code. The pages which contain the translated code are, but those are not something the x86 code knows about.
> x86 code is never actually marked as executable from the CPU's point of view, since that CPU does not know how to execute x86 code. The pages which contain the translated code are, but those are not something the x86 code knows about.
No, pages and the executable bit are something that the processor knows about.
Sorry, I don't understand what you are trying to say. Of course the CPU knows about pages and the executable bit? But there is no executable bit on a page filled with x86 code running on an ARM CPU, because the ARM CPU cannot execute that. It can only execute the translated ARM code that sits somewhere else, essentially out of sight for the x86 program.
The JIT'd ARM code pages are W^X, and that's not optional on macOS ARM. But W^X was opt-in on x86 macOS, so for backwards compatibility Rosetta can't require the x86 code to implement it in order to function.
So your model of how Rosetta works is off - the translation would need to support remapping the original code page read-only regardless of whether the x86 code did so, and letting a subsequent write invalidate the JIT cache of that page, instead of relying solely on the emulated process to implement W^X.
Systems that install new machine code without changing page permissions run an instruction cache barrier after installing and before running. Rosetta catches this instruction.
x86 does not require an explicit icache flush because the hardware keeps the instruction cache coherent with data writes. Rosetta emulates this correctly, which means it must be able to invalidate its code without encountering such an instruction.
The region is RWX, and code is put into it and then executed without a cache flush. This requires careful setup by the runtime, and here's how Rosetta does it, line by line:
1. buffer is created and marked as RW-, since the next thing you do with a RWX buffer is obviously going to be to write code into it.
2. buffer is written to directly, without any traps.
3. The indirect function call is compiled to go through an indirect branch trampoline. It notices that this is a call into a RWX region and creates a native JIT entry for it. buffer is marked as R-X (although it is not actually executed from, the JIT entry is.)
4. The write to buffer traps because the memory is read-only. The Rosetta exception server catches this and maps the memory back RW- and allows the write through.
5. Repeat of step 3. (Amusingly, a fresh JIT entry is allocated even though the code is the same…)
As you can see, this allows for pretty acceptable performance for most JITs that are effectively W^X even if they don't signal their intent specifically to the processor/kernel. The first write to the RWX region "signals" (heh) an intent to do further writes to it, then the indirect branch instrumentation lets the runtime know when it's time to do a translation.
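For reference, the kind of toy program that produces that trace looks something like this (a sketch: plain self-modifying x86-64 code with no mprotect calls and no cache maintenance, which is exactly the case Rosetta has to cope with; the numbered comments match the steps above):

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    /* x86-64: mov eax, 1 ; ret */
    unsigned char code[] = { 0xb8, 0x01, 0x00, 0x00, 0x00, 0xc3 };

    /* step 1: ask for an RWX buffer (which Rosetta initially maps RW-) */
    unsigned char *buffer = mmap(NULL, 4096,
                                 PROT_READ | PROT_WRITE | PROT_EXEC,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    memcpy(buffer, code, sizeof(code));   /* step 2: plain write, no trap     */

    int (*fn)(void) = (int (*)(void))buffer;
    printf("%d\n", fn());                 /* step 3: indirect call, translate */

    buffer[1] = 0x02;                     /* step 4: write faults, page flips
                                             back to RW-                      */
    printf("%d\n", fn());                 /* step 5: call again, retranslate;
                                             prints 2 this time               */
    return 0;
}
```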
Writing to an address would invalidate all JIT code associated with it, not just code that starts at that address. Lookup is done on the indirect branch, not on write, so a new entry would be generated once execution runs through it.
> How do you think it detects a change to executable memory without a permissions change or a flush?
One way this could be implemented is the one mentioned above: by making sure all x86-executable pages are marked read-only (in the real page tables, not from "the x86 API"). Whenever any code writes into one, the resulting page fault can flush out the existing translation and transparently return to the x86 program, which can proceed to write into the region without taking a write fault (the kernel will actually mark the pages as writable in the page tables now).
When the x86 program then jumps into the modified code, no translation exists anymore, and the resulting page fault from trying to execute can trigger the translation of the newly modified pages. The (real, not-pretend) writable bit is removed from the x86 code pages again.
To the x86 code, the pages still look like they are writable, but in the actual page tables they are not. So the x86 code does not (need to) change the permission of the pages.
I don't know if that's exactly how it is implemented, but it is a way.
How are you disagreeing with me, then? The actual page table entries that the ARM CPU looks at will never mark a page containing x86 code as executable. x86 execution bit semantics are implemented, but on a different layer. From the ARM CPU's POV, the x86 code is always just data.
> The implementation of AMD64 is in software. It knows about page executable bits. The 'x86' code knows about them.
Where did I claim anything else? The thing I claimed the x86 code does not know about is the pages that contain the translated ARM code, which are distinct from the pages that contain the x86 code. The former pages are marked executable in the actual page tables, the latter pages have a software executable bit in the kernel, but are not marked as such in the actual page tables.
> Again, how do you think things like V8 and the JVM work on Rosetta otherwise?
Did I write something confusing that gave the wrong impression? My last answer says: "x86 execution bit semantics are implemented, but on a different layer".
you think that x86 pages are marked executable by the arm processor? probably not.
maybe arm pages with an arm wrapper that calls the jit for big literals filled with x86 code are, or arm pages loaded with stubs that jump into the jit to compile x86 code sitting in data pages are... but if the arm processor cannot execute x86 pages directly, then it wouldn't make a lot of sense for them to be marked executable, would it?
Ah, in this case I took "x86 execution semantics" just as how it behaves from user space, i.e. what permissions you can set and that they behave the same from an x86 observer (no matter what shenanigans is actually going behind the scenes).