Why is Rosetta 2 fast? (dougallj.wordpress.com)
650 points by pantalaimon on Nov 9, 2022 | 358 comments



This is a great writeup. What a clever design!

I remember Apple had a totally different but equally clever solution back in the days of the 68K-to-PowerPC migration. The 68K had 16-bit instruction words, usually with some 16-bit arguments. The emulator’s core loop would read the next instruction and branch directly into a big block of 64K x 8 bytes of PPC code. So each 68K instruction got 2 dedicated PPC instructions, typically one to set up a register and one to branch to common code.
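
For illustration, here's a rough C sketch of that dispatch-table idea (hypothetical names; the real emulator was hand-written PPC assembly branching into fixed 8-byte code stubs, one per possible 16-bit opcode, and a function-pointer table is just the closest portable analogue):

    /* Rough sketch: every 16-bit instruction word indexes directly into a
     * 64K-entry dispatch table. In the real emulator each entry was 8 bytes
     * of PPC code (set up a register, branch to common code). */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        uint32_t d[8], a[8];     /* 68K data/address registers */
        const uint16_t *pc;      /* 68K program counter */
        int running;
    } M68kState;

    typedef void (*OpHandler)(M68kState *st, uint16_t op);

    static void op_unimplemented(M68kState *st, uint16_t op) {
        printf("unhandled opcode 0x%04x\n", op);
        st->running = 0;
    }

    /* 64K entries, one per possible 16-bit instruction word. */
    static OpHandler dispatch_table[65536];

    static void run(M68kState *st) {
        st->running = 1;
        while (st->running) {
            uint16_t op = *st->pc++;       /* fetch the next instruction word */
            dispatch_table[op](st, op);    /* branch directly through the table */
        }
    }

    int main(void) {
        for (int i = 0; i < 65536; i++)
            dispatch_table[i] = op_unimplemented;
        static const uint16_t program[] = { 0x4e71 };  /* 68K NOP, just something to fetch */
        M68kState st = { .pc = program };
        run(&st);
        return 0;
    }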

What that solution and Rosetta 2 have in common is that they’re super pragmatic - fast to start up, with fairly regular and predictable performance across most workloads, even if the theoretical peak speed is much lower than a cutting-edge JIT.

Anyone know how they implemented PPC-to-x86 translation?


> Anyone know how they implemented PPC-to-x86 translation?

They licensed Transitive's retargetable binary translator and renamed it Rosetta; very Apple.

It was originally a startup, but had been bought by IBM by the time Apple was interested.


> It was originally a startup, but had been bought by IBM by the time Apple was interested.

Rosetta shipped in 2005.

IBM bought Transitive in 2008.

The last version of OS X that supported Rosetta shipped in 2009.

I always wondered if the issue was that IBM tried to alter the terms of the deal too much for Steve's taste.


A lesser known bit of trivia about this is that IBM would go on to use Transitive's technology for the exact opposite of Rosetta -- x86 to PowerPC translation, in the form of "PowerVM Lx86", released that year (2008).

It's fascinating to me, since IBM appears to have extended the PowerPC spec with this application specifically in mind. Up until POWER10, the Power/PowerPC ISA specified an optional feature called "SAO" (Strong Access Ordering), allowing individual pages of memory to be forced to use an x86-style strong memory model, comparable to the proprietary extension in Apple's CPUs but much more granular (page-level, and enforced in L1/L2 cache as opposed to the entire core).

As far as I can tell, Transitive's technology was the only application to ever use this feature, though it's mainlined in the Linux kernel, and documented in the mprotect(2) man page. IBM ditched the extension for POWER10, which makes sense, since Lx86 only ever worked on big endian releases of RHEL and SLES which are long out of support now.
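
For reference, here's a minimal sketch of what using that page marker looks like from userspace via the PROT_SAO flag documented in mprotect(2). The fallback #define is an assumption for building on non-powerpc headers; on anything other than a pre-POWER10 PowerPC kernel with SAO support, the mprotect call simply fails.

    /* Minimal sketch: ask for x86-style strong access ordering on one page
     * using the PowerPC-specific PROT_SAO flag from mprotect(2). */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #ifndef PROT_SAO
    #define PROT_SAO 0x10   /* value from the powerpc uapi headers */
    #endif

    int main(void) {
        long pagesz = sysconf(_SC_PAGESIZE);
        void *p = mmap(NULL, pagesz, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        if (mprotect(p, pagesz, PROT_READ | PROT_WRITE | PROT_SAO) != 0) {
            perror("mprotect(PROT_SAO)");  /* fails where SAO isn't available */
            return 1;
        }

        memset(p, 0, pagesz);  /* accesses to this page are now strongly ordered */
        return 0;
    }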

One mystery to me, though, is that IBM added support for this page marker to the new radix-style MMU in POWER9. It's documented in the CPU manual, but Linux has no code to use it -- unless I've missed it, Linux only has code to set the appropriate bits in HPT mode, and no reference to the new method for marking radix pages SAO that the manual describes. I can't imagine there was any application on AIX which used this mode (it only decreases performance), and unless you backported a modern kernel to a RHEL 5 userland, you couldn't use Lx86 with the new radix mode. Much strangeness...


In theory, PROT_SAO should be useful for qemu, and it would be trivial to write patches implementing it there. That's assuming the kernel actually sets it, though. The problem I encountered when I set out to do it a year or so ago was that I couldn't find a good test case that fails without it...


The kernel definitely sets the WIMG bits at https://github.com/torvalds/linux/blob/master/arch/powerpc/m... (line 336, if HN removes it), though I've never been able to "make it work" either.

I used box64 as a test case, where I had a game that would run in emulation, but only if I pinned it to a single core. On ARM64, it also worked, as the JIT translator on box64 uses manually inserted memory fences to force strongly ordered access.

The game never worked correctly, even after I patched the kernel to mark every page on the system as SAO, and confirmed this worked by checking the set memory flags. This might be a mistake in my understanding of what SAO should do, though. (or another failure in box64 on ppc64le)

One thought I've had recently is perhaps it's like the recently discovered tagged memory extension and only worked in big endian? There's nothing in the docs to suggest this, but since the only test case was BE-only, maybe?


Apple is also not tied to backward compatibility.

Their customers are not enterprise, and consequently they are probably the best company in the world at dictating well-managed, reasonable shifts in customer behavior at scale.

So they likely had no need for Rosetta as of 2009.


Right, thanks for correcting my faulty memory on the timing.

It is possible that IBM tried to squeeze Apple, but given that IBM's interest in Transitive was for enterprise server migration, I suspect it is more likely that Apple got tired of paying whatever small royalty they'd contracted for with Transitive, and decided enough people had fully migrated to native x86 apps that they wouldn't alienate too many customers.


>"The last version of OS X that supported Rosetta shipped in 2009."

Interesting so was Rosetta 2 written from the ground up then? Did Apple manage to hire any of the former Transitive engineers after IBM acquired them? It seems like this would be a niche group of engineers that worked in this area no?


>"The last version of OS X that supported Rosetta shipped in 2009."

Interesting, so was Rosetta 2 written from the ground up then? Did Apple manage to hire any of the former Transitive engineers after IBM acquired them? It seems like this would be a niche group of engineers that worked in this area, no?


I agree it was a bit worryingly short-lived. However the first version of Mac OS X that shipped without Rosetta 1 support was 10.7 Lion in summer 2011 (and many people avoided it since it was problematic). So nearly-modern Mac OS X with Rosetta support was realistic for a while longer.


> However the first version of Mac OS X that shipped without Rosetta 1 support was 10.7 Lion

Yes, but I was pointing out when the last version of OS X that did support Rosetta shipped.

I have no concrete evidence that Apple dropped Rosetta because IBM wanted to alter the terms of the deal after they bought Transitive, but I've always found that timing interesting.

In comparison, the emulator used during the 68k to PPC transition was never removed from Classic MacOS, so the change stood out.


The Classic environment was removed from OS X and all the IP involved was Apple’s.

The timing is interesting, but I wouldn't put it beyond Apple to remove a feature simply to cement a transition (and decrease support costs).


> In comparison, the emulator used during the 68k to PPC transition was never removed from Classic MacOS, so the change stood out.

It was never removed because Classic MacOS itself was never fully native.


> It was never removed because Classic MacOS itself was never fully native.

Are there any current OSes that have the same level of historical cruft that Mac OS Classic had?


Windows.

Depending on which API you are calling, you have to represent a "string" differently. This is just one example:

https://learn.microsoft.com/en-us/cpp/text/how-to-convert-be...


There are still dialog boxes that date back to Windows 3.1 that show up in Windows 10 and 11.


I agree. And I suppose, since it was so intrinsic to the operating system, that if a 68k app worked in Mac OS 9 (some would, some might not), you could continue to run it in the Classic Environment (on a PPC Mac, not an Intel Mac) under Mac OS X 10.4 Tiger in the mid-2000s!


I could have sworn that a unibody MacBook Pro where I did an in-place upgrade to Lion somehow held onto Rosetta.


I guess that's perjury, because it cannot be true! Even Snow Leopard didn't include Rosetta 1 by default. But if it was deemed necessary, it would download and install it on-demand, similar to how the Java system worked.



That’s really interesting. You might enjoy reading about the VM embedded into the Busicom calculator that used the Intel 4004 [1]

They squeezed a virtual machine with 88 instructions into less than 1k of memory!

[1] https://thechipletter.substack.com/p/bytecode-and-the-busico...


In the mists of history, Steve Wozniak wrote the SWEET16 interpreter for the 6502: a VM with 29 instructions implemented in 300 bytes.

https://en.wikipedia.org/wiki/SWEET16


That is nifty! Sounds very similar to a Forth interpreter.


There's also OpenFirmware's platform independent Forth bytecode "FCode":

https://en.wikipedia.org/wiki/Open_Firmware

>Open Firmware Forth Code may be compiled into FCode, a bytecode which is independent of instruction set architecture. A PCI card may include a program, compiled to FCode, which runs on any Open Firmware system. In this way, it can provide boot-time diagnostics, configuration code, and device drivers. FCode is also very compact, so that a disk driver may require only one or two kilobytes. Therefore, many of the same I/O cards can be used on Sun systems and Macintoshes that used Open Firmware. FCode implements ANS Forth and a subset of the Open Firmware library.


And here I was feeling impressed with myself for implementing the Nand2Tetris VM translator in ~2k of python... wow. Respect for the elders!


From what I understand, they purchased a piece of software that already existed to translate PPC to x86 in some form or another and iterated on it. I believe the software may even have already been called 'Rosetta'.

My memory is very hazy, though. While I experienced this transition firsthand and was an early Intel adopter, that's about all I can remember about Rosetta or where it came from.

I remember before Adobe had released the Universal Binary CS3 that running Photoshop on my Intel Mac was a total nightmare. :( I learned to not be an early adopter from that whole debacle.



Transitive.


I don't know how they did it, but they did it very, very slowly. Anything "interactive" was unusable.


Assuming you're talking about PPC-to-x86, it was certainly usable, though noticeably slower. Heck, I used to play Tron 2.0 that way; the frame rate suffered but it was still quite playable.


Interactive 68K programs were usually fast: the 68K programs would still call native PPC QuickDraw code. It was processor-intensive code that was slow, especially with the first-generation 68K emulator.

Connectix SpeedDoubler was definitely faster.


Most of the Toolbox was still running emulated 68k code in early Power Mac systems. A few bits of performance-critical code (like QuickDraw, iirc) were translated, but most things weren't.


I remember years ago, when Java-adjacent research was all the rage, HP had a problem that was "Rosetta lite," if you will. They had a need to run old binaries on new hardware that wasn't exactly backward compatible. They made a transpiler that worked on binaries. It might even have been a JIT, but that part of my memory is fuzzy.

What made it interesting here was that, as a sanity check, they made an A->A mode where they took in one architecture and spit out machine code for the same architecture. The output was faster than the input. That means even native code has some room for improvement with JIT technology.

I have been wishing for years that we were in a better place with regard to compilers and NP-complete problems, where the compilers had a fast mode for code-build-test cycles and a very slow incremental mode for official builds. I recall someone telling me that the only thing they liked about the Rational IDE (C and C++?) was that it cached precompiled headers, one of the Amdahl's Law areas for compilers: if you changed a header, you paid the recompilation cost and everyone else got a copy. I love it whenever the person who cares about something gets to pay the consequence instead of externalizing it onto others.

And having some CI machines or CPUs that just sit around chewing on Hard Problems all day for that last 10% seems to me to be a really good use case in a world that's seeing 16-core consumer hardware. Also, caching hints from previous runs is a good thing.


Could it simply be because many binaries were produced by much older, outdated optimizers, or optimized for size?

Also, optimizers usually target the "lowest common denominator," so native binaries rarely use the full power of the current instruction set.

Jumping from that peculiar finding to praising runtime JIT feels like a long shot. To me it's more of an argument towards distributing software in intermediate form (like Apple Bitcode) and compiling on install, tailoring for the current processor.


> To me it’s more of an argument towards distributing software in intermediate form (like Apple Bitcode) and compiling on install, tailoring for the current processor.

This turns out to be quite difficult, especially if you're using bitcode as a compiler IL. You have to know what the right "intermediate" level is; if assumptions change too much under you then it's still too specific. And it means you can't use things like inline assembly.

That's why bitcode is dead now.

By the way, I don't know why this thread is about how JITs can optimize programs when this article is about how Rosetta is not a JIT and intentionally chose a design that can't optimize programs.


Bitcode is dead because Apple got fed up with maintaining their own fork with guarantees that regular LLVM doesn't offer.

Since 1961, plenty of systems have used bytecodes as executable formats; the most successful still in use are IBM and Unisys mainframes and microcomputers.


> That's why bitcode is dead now.

WebAssembly seems alive and well. I'm not sure how similar it is to Java bytecode, but it's the same core idea.

That said, WASM through V8 is ~3x slower than the same code compiled natively. (Some of this might be due to the lack of SIMD in WASM.)


"Bitcode" is the name of a specific Xcode feature in this case, not a general comment.


> This turns out to be quite difficult, especially if you're using bitcode as a compiler IL. You have to know what the right "intermediate" level is; if assumptions change too much under you then it's still too specific. And it means you can't use things like inline assembly.

> That's why bitcode is dead now.

Isn't this what Android does today? Applications are distributed in bytecode form and then optimized for the specific processor at install time.


The bitcode Apple used for their platforms was at a much, much lower level than bytecode used on Android.


Yeah, it was also a custom version; somehow people keep thinking they used LLVM bitcode straight out of the box.


I don't know what Android does… some kind of Java but not Java, right?

In that case it's much less expressive, so developers simply can't write the unsafe/specialized code in the first place, which means they can't write in C or assembly.

Bitcode was a specific Apple feature that used LLVM's compiler IL and might have promised extra portability, but it didn't really work out and was removed this year. ("LLVM" stands for "low level virtual machine" which is funny because it isn't low level and isn't a virtual machine.)


LLVM is lower level than Python or Java bytecode. LLVM is also virtual machine in the sense of an abstract machine, similar in idea to a process virtual machine. Most usages of "virtual machine" today are talking about system virtual machines, but it's important to note that "virtual machine" is an overloaded phrase.


There is no C or Assembly in IBM and Unisys mainframes and microcomputers.


> Or optimized for size.

Note that on gcc (I think) and clang (I'm sure), -Oz is a strict superset of -O2 (the "fast+safe" optimizations, compared to -O3 that can be a bit too aggressive, given C's minefield of Undefined Behavior that compilers can exploit).

I'd guess that, with cache fit considerations, -Oz can even be faster than -O2.


Interesting, I didn't know about -Oz (only -Os for size). -Oz is reportedly mac-specific https://stackoverflow.com/questions/1778538/how-many-gcc-opt...


All reasonable points, but examples where JIT has an advantage are well supported in research literature. The typical workload that shows this is something with a very large space of conditionals, but where at runtime there's a lot of locality, eg matching and classification engines.


> To me it’s more of an argument towards distributing software in intermediate form (like Apple Bitcode) and compiling on install, tailoring for the current processor.

Or distribute it in source form and make compilation part of the install process. Aka, the Gentoo model.



It was particularly poignant at the time because JITed languages were looked down on by the “static compilation makes us faster” crowd. So it was a sort of “wait a minute Watson!” moment in that particular tech debate.

No one cares as much nowadays; we've moved our overrated opinion battlegrounds to other portions of what we do.


I eventually changed my opinion to JIT being the only way to make dynamic languages faster, while strongly typed ones can benefit from having both AOT and JIT for different kinds of deployment scenarios and development workflows.


I think I landed in a place where it's basically "the compiler has insufficient information to achieve ideal optimization because some things can only be known at runtime."

Which is not exclusively an argument for runtime JIT— it can also be an argument for instrumenting your runtime environment, and feeding that profiling data back to the compiler to help it make smarter decisions the next time. But that's definitely a more involved process than just baking it into the same JavaScript interpreter used by everyone— likely well worth it in the case of things like game engines, though.


The problem with JIT is that not all information known at runtime is the correct information to optimize on.

In finance, the performance-critical code path is often the one run least often. That is, you have if(unlikely_condition) {run_time_sensitive_trade();}. In this case you want the CPU to take the pipeline stall from a branch misprediction most of the time, to ensure that on the rare occasion that counts, the pipeline doesn't stall.

The above is a rare corner case for sure, but it is one of those weird exceptions you always need to keep in mind when trying to make any blanket rule.


The other issue with JIT is that it is unreliable. It optimizes code by making assumptions, and if one of the assumptions is wrong, you pay a large latency penalty. In my field of finance, having reliably low latency is important. Being 15% faster on average, but every once in a while really slow, is not something customers will go for.


I'm not in finance. I just remember one talk by a finance guy; it was a mind-bender, so it stuck with me.


How does the compiler arrange for the CPU to mispredict the branch most of the time? I didn't think there were any knobs for the branch predictor other than static ones (e.g. backwards-jumps statically predicted as taken, or PowerPC branch hint bit).


C++20 is gaining the [[likely]] attribute for guiding the predictor in a portable way:

http://eel.is/c++draft/dcl.attr.likelihood

Prior to this it was also possible, but required compiler-specific intrinsics like GCC's __builtin_expect, which is used all over the kernel:

https://github.com/torvalds/linux/search?q=__builtin_expect
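
For concreteness, here's a minimal sketch of how these hints are typically wrapped and used (the function names are hypothetical); as discussed below, on mainstream x86 they mostly steer code layout rather than the hardware predictor itself:

    /* Minimal sketch of branch-weight hints using GCC/Clang's __builtin_expect,
     * wrapped in kernel-style likely()/unlikely() macros; C++20's
     * [[likely]]/[[unlikely]] attributes express the same intent. */
    #include <stdbool.h>
    #include <stdio.h>

    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    static void send_order(void)  { puts("order sent"); }   /* rare, latency-critical path */
    static void update_book(void) { /* common bookkeeping */ }

    static void on_market_data(bool should_trade) {
        if (unlikely(should_trade)) {
            send_order();     /* compiler moves this out of the fall-through path */
        } else {
            update_book();    /* hot path stays contiguous in the instruction cache */
        }
    }

    int main(void) {
        for (int i = 0; i < 1000; i++)
            on_market_data(i == 500);   /* hypothetical: one "trade" tick in a thousand */
        return 0;
    }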


To my knowledge this does not have any direct impact on the CPU branch prediction mechanism. If that were the case, we would at least have some x86(-64) instruction to manipulate the BP. If I write a quick example such as https://godbolt.org/z/j7Y81j5fe, I also see no such instruction in the generated output.

But what the likely/unlikely mechanism can do is serve as a hint that the compiler can use to generate a better code layout. For example, if we provide likely/unlikely hints in our code, the compiler will try to use them to stitch together the more probable code paths first. In theory this approach should result in better utilization of the CPU instruction cache and thus may lead to better performance.


There are no ways to hint the branch predictor on some common CPUs. I think Intel says it's fully dynamic and has no static predictions.

If you want to cause a pipeline stall, there's usually serializing instructions like cpuid/eieio.


I believe Intel used to predict forward branches as not taken and backward branches as taken if it didn't have any history.

Regardless, you can at least lay out code in a way that heavily favors one branch over the other, and potentially optimize for one branch (i.e., give one branch a very expensive prelude/exit to make the other branch very cheap). That last part is something I've seen compilers greatly struggle with, though…


20 years of "progress" and hey, C++20 is gaining the likely attribute.


CPUs have documentation for this. I forget which one, but the one I did read was as simple as the true case being assumed more common, and it is easy to arrange logic around that (I probably misremember, but it's close enough to that). The common case should also be near the if in memory (inline code), so it is likely on the same cache line (or the next one, which is prefetched), while the other case is farther away, so you can stall on the cache if the if goes that way.


The problem with static compilation is that not all information known at compile time is the correct information to optimize on.

Assuming either source of data is the single source of truth for all optimizations is a fallacy.

Use the right tool for the right job. Use all the tools if you can.


Static compilers usually don't have to make such a tradeoff, though. They are free to spend arbitrarily long amounts of time optimizing all branches. And they often do exactly that.

Static + LTO w/ PGO is pretty much the practical ideal. JITs don't offer much until you start adding dynamically loaded code where LTO just isn't possible anymore.


Tooling and real data sets.

Most PGO-for-AOT scenarios suffer from a poor tooling experience and from not using data sets similar to production workloads.


Perhaps, but that's orthogonal. JITs don't just come with ideal production set sampling either after all. They only capture snippets, and rarely re-optimize already compiled functions in the face of changing workloads or new information. You can do crowd-sourced profiles (like Android does), but that's a completely independent set of infrastructure from JIT vs. AOT. You can feed that same profile to PGO for AOT. In fact, that's how Android uses it. They don't feed the profile to the JIT, they feed it to their offline, install-time or idle-maintenance AOT compiler.


If your hardware is designed to allow very lightweight profiling and tracing, then static + LTO w/ PGO can still be improved by runtime re-optimization. If designed properly, the runtime overhead can be brought arbitrarily low by increasing the sampling period.


Are you speaking in theoreticals or can you point to any actual example of what you're describing? Usually for a JIT to work the source binary has to be unoptimized to begin with, otherwise information is lost that the JIT needs.

So what language/runtime out there ships unoptimized bytecode, an optimized precompiled static + LTO w/ PGO build, and can re-optimize with runtime-gathered information via a JIT?

Heck, what language/runtime is even designed around being performance-focused with a JIT to deliver even more performance in the first place? Pretty much every JIT'd language makes trade-offs that sacrifice up-front performance and later hopes the JIT can claw some of it back. Maybe this is kinda WASM-ish territory, although WASM then sacrifices up-front performance for security and hopes the JIT can claw it back.


The basic elements all exist.

As others have pointed out, HP's Dynamo managed to use runtime re-optimization to improve performance of many binaries without cooperation from the original compiler. Runtime optimization doesn't strictly require any of the information lost in optimized binary builds.

Last I checked, the Android Runtime would AoT-compile Dalvik bytecode at install time, and in the background re-optimize the binary based on profiling information. Though, I don't think it performs hot code replacement.

I'm not sure what the latest is with Oracle's Java AoT. Last I checked, Oracle's JVM wasn't able to inline or re-optimize AoT-compiled code through JIT'd code.

Some optimizations, such as loop unrolling, make JIT'ing harder. However, strength reduction, loop-invariant code hoisting, etc. make the JIT's life easier. Back around 2005-2006, my employer was getting good mileage out of a Java bytecode optimizer. If your AoT and JIT are cooperating, the AoT can stuff any helpful metadata (type information, aliasing analysis, serialized control flow graph, etc.) into an auxiliary section of the binary.

I'd like to eventually write a C compiler that essentially compiles to old-school threaded code: arrays of pointers to basic blocks (straight-line code with a single entry point and one or more exit points). Function entry would just pass the array of basic blocks to a trampoline function that calls the first basic block. Each basic block would return an index into the function's array of basic blocks for the trampoline to call next. Function return would be signaled by a basic block returning -1 to the trampoline loop. A static single assignment representation of each extended basic block would be stashed in an auxiliary ELF section.

On a regular system without the runtime optimizer, the only startup overhead would be due to bloated binary size. A wild guess at the performance overhead without the runtime optimizer would be in the 5% to 15% range. However, on a system with the ELF dynamic loader replaced with a runtime optimizer, the runtime would set up a perf signal handler that keeps counters for identifying hotspots to trace. If the tracing conditions were met, the perf signal handler would walk back up the call stack to find the last occurrence of the address of the trampoline, and replace it with a version of the trampoline that additionally stores the address of the next extended basic block to run. Once a trace loops back on itself or meets some other TBD criteria, the runtime would stitch together the SSA representations of the constituent extended basic blocks, generate a new optimized basic block that inlines the components of the trace, and finally place the address of the new extended basic block in the correct place in the array of extended basic blocks, thus performing runtime code replacement.

Anyway, that's my grand vision. I've taken the introductory Stanford compilers course, and am slowly working my way forward, but I have a job and a young kid, so I'm not holding my breath.

Among other things, this allows for inlining of hot paths across dynamic library boundaries. It also improves code locality and should increase the percentage of not-taken branches in the hot path, which should help reduce problems with aliasing in the branch predictor.
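
A minimal C sketch of the trampoline scheme described above (hypothetical names; no SSA side-table or runtime optimizer), where each basic block returns the index of the next block to run and -1 signals function return:

    /* Toy threaded-code dispatch: result = abs(x). */
    #include <stdio.h>

    typedef struct Frame { int x, result; } Frame;
    typedef int (*BasicBlock)(Frame *f);

    static int bb0(Frame *f) { return (f->x < 0) ? 1 : 2; }     /* entry: branch on sign */
    static int bb1(Frame *f) { f->result = -f->x; return -1; }  /* negative case, then return */
    static int bb2(Frame *f) { f->result =  f->x; return -1; }  /* non-negative case, then return */

    static BasicBlock abs_blocks[] = { bb0, bb1, bb2 };

    /* The trampoline: a runtime optimizer could swap entries in `blocks` with
     * freshly generated traces, which is the hot-code-replacement hook above. */
    static void trampoline(BasicBlock *blocks, Frame *f) {
        int next = 0;                 /* execution always starts at block 0 */
        while (next != -1)
            next = blocks[next](f);
    }

    int main(void) {
        Frame f = { .x = -42 };
        trampoline(abs_blocks, &f);
        printf("abs(-42) = %d\n", f.result);
        return 0;
    }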


> In finance

Isn't low-latency trading only a subset of finance?


Yes. And the part I mentioned is a subset of low latency. See the other reply.


It's also an argument for having much more expressive and precise type systems, so the compiler has better information.

Once you've managed to debug the codegen anyway (see: The Long and Arduous Story of Noalias).


Is it? I'd love to see a breakdown of what classes of information can be gleaned from profile data, and how much of an impact each one has in isolation in terms of optimization.

Naively, I would have assumed that branch information would be most valuable, in terms of being able to guide execution toward the hot path and maximize locality for the memory accesses occurring on the common branches. And that info is not something that would be assisted by more expressive types, I don't think.



That's a lesson Intel had to (re-?)learn with Itanium as well.


> […] "the compiler has insufficient information to achieve ideal optimization because some things can only be known at runtime."

This is where profile-guided optimisation comes in for statically compiled languages, with the caveat that it's not always straightforward to come up with a set of inputs that will trigger execution of all possible code paths. One solution is to provide coverage specifically for the performance-critical code paths and let the rest just be.


> it can also be an argument for instrumenting your runtime environment

Aren't JITs already self-instrumenting? What would you instrument that the JIT is not already keeping track of?


Darn it, replied too early. See sibling comment I just posted. The problem with dynamic languages is that you need to speculate and be ready to undo that speculation.


Before I talked myself out of writing my own programming language, I used to have lunch conversations with my mentor, who was also speed-obsessed, about how a JIT could meet Knuth in the middle by creating a collections API with feedback-guided optimization, using it for algorithm selection and tuning parameters by call site.

For object graphs in Java you can waste exorbitant amounts of memory by having a lot of "children" members that are sized for a default of 10 entries when the normal case is 0-2. I once had to de-optimize code where someone tried to do this by hand, and the number they picked was 6 (just over half of the default). So when the average jumped to 7, the data structure ended up being 20% larger than the default behavior instead of 30% smaller as intended.

For a server workflow, having data structures tuned to larger pools of objects with more complex comparison operations can also be valuable, but I don't want that kitchen-sink stuff on mobile or in an embedded app.

I still think this is viable, but only if you are clever about gathering data. For instance, the incremental increase in runtime for telemetry is quite high on the happy path, but corner cases are already expensive, so telemetry adds only a few percent there instead of double digits.

The nonstarter for this ended up being that most collections APIs violate Liskov, so you almost need to write your own language to pick a decomposition that doesn’t. Variance semantics help a ton but they don’t quite fix LSP.


That is, in a sense, what JIT caching with a PGO feedback loop does to a certain extent.

As for mobile, this isn't far off from what Android has done since version 7 (and refined since): a hand-written assembly interpreter, a JIT with PGO data caching, AOT compilation on idle using the collected PGO data, and sharing PGO metadata with other devices via the Play Store.


Swift and Objective-C have collections that change implementation as needed, but they can do it because they're well abstracted (enough) and usually immutable so there's less chance of making a bad assumption.

Most other languages only have mutable collections, and name collection types after their implementation details instead of what the user actually wants from them.


Dynamic languages need inline caches, type feedback, and fairly heavy inlining to be competitive. Some of that can be gotten offline, e.g. by doing PGO. But you can't, in general, adapt to a program that suddenly changes phases, or rebinds a global that was assumed a constant, etc. Speculative optimizations with deopt are what make dynamic languages fast.
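
As a rough illustration of that kind of speculation, here's a minimal monomorphic inline cache with a slow-path fallback, sketched in C over a hypothetical shape-based object model; real engines attach one cache per call site and deoptimize compiled code when the guard fails.

    #include <stdio.h>

    typedef struct Shape { const char *name; int x_offset; } Shape;
    typedef struct Object { const Shape *shape; double fields[4]; } Object;

    /* Slow path: a full lookup against the object's shape (trivial here). */
    static double get_x_slow(const Object *o) {
        return o->fields[o->shape->x_offset];
    }

    /* The inline cache for one call site: the last shape seen and the offset it
     * implied. If the guard matches, the property load is a single indexed read. */
    static const Shape *ic_shape = NULL;
    static int ic_offset = 0;

    static double get_x(const Object *o) {
        if (o->shape == ic_shape)           /* guard: speculation still holds */
            return o->fields[ic_offset];    /* fast path */
        ic_shape  = o->shape;               /* miss: fall back, then re-specialize */
        ic_offset = o->shape->x_offset;
        return get_x_slow(o);
    }

    int main(void) {
        Shape point = { "Point", 0 };
        Object a = { &point, { 1.5 } }, b = { &point, { 2.5 } };
        printf("%g %g\n", get_x(&a), get_x(&b));  /* second call hits the cache */
        return 0;
    }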


I think you're overestimating the impact or relevance of that anecdote, particularly since the "static compilation makes us faster" crowd turned out to be correct, and people use JITs for non-performance reasons and just pay the performance tax they so often come with. The time-constrained nature in which a JIT has to run largely kills its theoretical runtime-information-gathering advantages. Devirtualization remains about the most advanced trick in the JIT book, and that is generally not an issue statically compiled languages struggle with in the first place.


Sure, because everyone ships statically linked binaries.


I take it you are not very familiar with the website known as Hacker News.


People have mentioned the Dynamo project from HP. But I think you're actually thinking of the Aries project (I worked in a directly adjacent project) that allowed you to run PA-RISC binaries on IA-64.

https://nixdoc.net/man-pages/HP-UX/man5/Aries.5.html


Something that fascinates me about this kind of A -> A translation (which I associate with the original HP Dynamo project on HPPA CPUs) is that it was able to effectively yield the performance effect of one or two increased levels of -O optimization flag.

Right now it's fairly common in software development to have a debug build and a release build with potentially different optimisation levels. So that's two builds to manage - if we could build with lower optimisation and still effectively run at higher levels then that's a whole load of build/test simplification.

Moreover, debugging optimised binaries is fiddly due to information that's discarded. Having the original, unoptimised, version available at all times would give back the fidelity when required (e.g. debugging problems in the field).

Java effectively lives in this world already as it can use high optimisation and then fall back to interpreted mode when debugging is needed. I wish we could have this for C/C++ and other native languages.


It depends greatly on which optimization levels you're going through. -O0 to -O1 can easily be a 2-3x performance improvement, which is going to be hard to get otherwise. -O2 to -O3 might be 15% if you're lucky, in which case -O+LTO+PGO can absolutely get you wins that beat that.


-O2 to -O3 has in some benchmarks made things worse. In others it is a massive win, but in general going above -O2 should not be done without benchmarking the code. There are some optimizations that can make things worse or better for reasons the compiler cannot know.


Over-optimizing your "cold" code can also make things worse for the "hot" code, eg by growing code size so much that briefly entering the cold space kicks everything out of caches.


I have often lamented not being able to hint to the JIT when I've transitioned from startup code to normal operation. I don't need my Config file parsing optimized. But the code for interrogating the Config at runtime had better be.

Everything before listen() is probably run once. Except not every program calls listen().


And then there’s always the outlier where optimizing for size makes the working memory fit into cache and thus the whole thing substantially faster.


One of the engineers I was working with on a project was from Transitive (the company that made QuickTransit, which became Rosetta), and found that their JIT-based translator could not deliver significant performance increases for A->A outside of pathological cases, and it was very mature technology at the time.

I think it's a hypothetical. The Mill Computing lectures talk about a variant of this, which is sort of equivalent to an install-time specializer for intermediate code which might work, but that has many problems (for one thing, it breaks upgrades and is very, very problematic for VMs being run on different underlying hosts).


>"The Mill Computing lectures talk about a variant of this ..."

Might you or someone else have a link to those Mill Computing lectures?


Sure thing. I’m on mobile but the 2nd one was easy to find and is here -

https://youtu.be/QGw-cy0ylCc


If JIT-ing a statically compiled input makes it faster, does that mean that JIT-ing itself is superior, or does it mean that the static compiler isn't outputting optimal code? (Real question. Asked another way, does a JIT have optimizations it can make that a static compiler can't?)


It's more the case that the ahead-of-time compilation is suboptimal.

Modern compilers have a thing called PGO (Profile Guided Optimization) that lets you take a compiled application, run it and generate an execution profile for it, and then compile the application again using information from the profiling step. The reason why this works is that lots of optimization involves time-space tradeoffs that only make sense to do if the code is frequently called. JIT only runs on frequently-called code, so it has the advantage of runtime profiling information, while ahead-of-time (AOT) compilers have to make educated guesses about what loops are the most hot. PGO closes that gap.

Theoretically, a JIT could produce binary code hyper-tailored to a particular user's habits and their computer's specific hardware. However, I'm not sure if that has that much of a benefit versus PGO AOT.


> Theoretically, a JIT could produce binary code hyper-tailored to a particular user's habits and their computer's specific hardware. However, I'm not sure if that has that much of a benefit versus PGO AOT.

In theory a JIT can be a lot more efficient, optimizing not only for the exact instruction set but also doing per-CPU-architecture optimizations, accounting for things such as instruction length, pipeline depth, cache sizes, etc.

In reality I doubt most compiler or JIT development teams have the resources to write and test all those potential optimizations, especially as new CPUs are coming out all the time, and each set of optimizations is another set of tests that has to be maintained.


Like another commenter said, JIT compilers do this today.

The thing that makes this mostly theoretical is that the underlying assumption is only true when you neglect that an AOT compiler has zero run-time cost, while a JIT compiler has to execute the code it's optimizing, the code that decides whether it's worth optimizing, and the code that generates the new code.

So JIT compiler optimizations are a bit different than AOT optimizations, since they have to both generate faster/smaller code and execute the code that performs the optimization. The problem is that most optimizations beyond peephole are quite expensive.

There's another thing that AOT compilers don't need to deal with, which is being wrong. Production JITs have to implement dynamic de-optimization in the case that an optimization was built on a bad assumption.

That's why JITs are only faster in theory (today), since there are performance pitfalls in the JIT itself.


But JIT vs. AoT is a false dichotomy. Given lightweight enough profiling, utilizing cooperation between hardware designers and compiler writers, one could have AoT with feedback-guided optimization and link-time optimization, and still have just-in-time re-optimization.

Concretely, I think you'd want hardware that supported reservoir sampling of where CPU cycles are spent, sampling of which branches are mispredicted, and which code locations are causing cache misses. You'd also want lightweight hardware recording of execution traces.


Guess what Apple has? :)


Nearly all JS engines are doing concurrent JIT compilation now, so some of the compilation cost is moved off the main thread. Java JITs have had multiple compiler threads for more than a decade.


But they all still optimize their JITs to prioritize compilation speed & RAM usage (JIT'd code is dirty pages, after all) over maximum optimizations. This is why you see things like WebKit's multi-tier JIT strategy: https://webkit.org/blog/3362/introducing-the-webkit-ftl-jit/

They still want to swap in that JIT'd result ASAP, since by the time a function has been flagged for compilation it's already too late & it's a hot, hot, hot function.


Which is why in the Java world (including Android) and in .NET we now use JIT caches as well, so that this data isn't lost between runs.


The well-funded production JIT compilers (HotSpot, V8, etc.) absolutely do take advantage of these. The vector ISA can sometimes be unwieldy to work with, but things like replacing atomics, using unaligned loads, or taking advantage of differing pointer representations are common.


They do some auto-vectorization, but AFAIK they don't do micro-optimizations for different CPUs.


gcc and clang at least have options so you can optimize for specific CPUs. I'm not sure how good they are (most people want a generic optimization that runs well on all CPUs of the family, so there is likely lots of room for improvement with CPU-specific optimization), but they can do that. This does (or at least can; again, it probably isn't fully implemented) account for instruction length, pipeline depth, and cache size.

The JavaScript V8 engine and the JVM are both popular and supported enough that I expect the teams working on them to take advantage of every trick they can for specific CPUs; they have a lot of resources for this (at least for the major x86 and ARM chips - maybe they don't for MIPS or some uncommon variant of ARM...). Of course there are other JIT engines, and some uncommon ones don't have many resources and won't do this.


> take advantage of every trick they can for specific CPUs

Not to the extent clang and gcc do, no. V8 does, e.g. use AVX instructions and some others if they are indicated to be available by CPUID. TurboFan does global scheduling in moving out of the sea of nodes, but that is not machine-specific. There was an experimental local instruction scheduler for TurboFan but it never really helped big cores, while measurements showed it would have helped smaller cores. It didn't actually calculate latencies; it just used a greedy heuristic. I am not sure if it was ever turned on. TurboFan doesn't do software pipelining or unroll/jam, though it does loop peeling, which isn't CPU-specific.


> gcc and clang at least have options so you can optimize for specific CPUs. I'm not sure how good they are

They are not very good at it, and can't be. You can look inside them and see the models are pretty simple; the best you can do is optimize for the first step (decoder) of the CPU and avoid instructions called out in the optimization manual as being especially slow. But on an OoO CPU there's not much else you can do ahead of time, since branches and memory accesses are unpredictable and much slower than in-CPU resource stalls.


In addition to the sibling comments, one simple opportunity available to a JIT and not to an AOT compiler is 100% confidence about the target hardware and its capabilities.

For example AOT compilation often has to account for the possibility that the target machine might not have certain instructions - like SSE/AVX vector ops, and emit both SSE and non-SSE versions of a codepath with, say, a branch to pick the appropriate one dynamically.

Whereas a JIT knows what hardware it's running on - it doesn't have to worry about any other CPUs.


One great example of this was back in the P4 era, when Intel hit higher clock speeds at the expense of much higher latency. If you made a binary for just that processor, a smart compiler could use the usual tricks to hit very good performance, but that came at the expense of other processors and/or compatibility (one appeal of the AMD Athlon & especially Opteron was that you could just run the same binary faster without caring about any of that[1]). A smart JIT could smooth that considerably, but at the time the memory & time constraints were a challenge.

1. The usual caveats about benchmarking what you care about apply, of course. The mix of webish things I worked on and scientists I supported followed this pattern, YMMV.


AOT compilers support this through a technique called function multi-versioning. It's not free and only goes so far, but it isn't reserved to JITs.

The classical reason to use FMV is for SIMD optimizations, fwiw
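
For example, here's a minimal FMV sketch using the target_clones attribute supported by recent GCC and Clang on x86; the compiler emits one clone of the function per listed target plus a resolver that picks the best one via CPUID when the program loads.

    #include <stdio.h>
    #include <stddef.h>

    /* Each listed target gets its own clone of the loop body; an IFUNC
     * resolver selects among them based on the CPU's reported features. */
    __attribute__((target_clones("avx2", "sse4.2", "default")))
    void scale(float *v, float k, size_t n) {
        for (size_t i = 0; i < n; i++)   /* may be auto-vectorized differently per clone */
            v[i] *= k;
    }

    int main(void) {
        float v[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
        scale(v, 0.5f, 8);
        printf("%g %g\n", v[0], v[7]);
        return 0;
    }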


It means that in this case the static compiler emitted code that could be further optimised; that's all. It doesn't mean that that's always the case, or that static compilers can't produce optimal code, or that either technique is "better" than the other.

An easy example is code compiled for 386 running on a 586. The A->A compiler can use CPU features that weren't available to the 386. As with PGO you have branch prediction information that's not available to the static compiler. You can statically compile the dynamically linked dependencies, allowing inlining that wasn't previously available.

On the other hand you have to do all of that. That takes warmup time just like a JIT.

I think the road to enlightenment is letting go of phrasing like "is superior". There are lots of upsides and downsides to pretty much every technique.


It depends on what the JIT does exactly, but in general, yes, a JIT may be able to make optimisations that a static compiler won't be aware of, because a JIT can optimise for the specific data being processed.

That said, a sufficiently advanced CPU could also make those optimisations on "static" code. That was one of the things Transmeta had been aiming towards, I think.


A JIT can definitely make optimizations that a static compiler can't, simply by virtue of having concrete, dynamic, real-time information.


Yes, the JIT has more profile guided data as to what your program actually does at runtime, therefore it can optimize better.


On the other hand, some optimizations are so expensive that a JIT just doesn't have the execution budget to perform them.

Probably the optimal system is a hybrid iterative JIT/AOT compiler (which incidentally was the original objective of LLVM).


Post-build optimization of binaries without changing the target CPU is common. See BOLT https://github.com/facebookincubator/BOLT


I've run Ruby C extensions on a JIT faster than on native, due to things like inlining and profiling working more effectively at runtime.


>"I remember years ago when Java adjacent research was all the rage, ..."

What is meant by "Java adjacent research"? I'm not familiar with what that was.


> where the compilers had a fast mode for code-build-test cycles and a very slow incremental mode for official builds.

That already exists with C/C++: unity builds, aka single-translation-unit builds. Compiling and linking a ton of object files takes an inordinate amount of time, often the majority of the build time.


Outside of gaming, or hyper-CPU-critical workflows like video editing, I'm not really sure if people actually even care about that last 10% of performance.

I know that most of the time when I get frustrated by everyday software, it's doing something unnecessary in a long loop, and possibly forgetting to check for Windows messages too.


Performance also translates into better battery life and cheaper datacenters.


I’m likely misunderstanding what you said, but I thought pre-compiled headers were pretty much standard these days.


What on earth did I say to merit the downvotes?


Is this for Itanium?


> The output was faster than the input.

So if you ran the output back through as input multiple times, that means you could eventually get the runtime down to 0.


But unfortunately, the memory use goes to infinity.


Probably the output of the decade-old compiler that produced the original binary had no optimizations.


That too, but the eternal riddle of optimizer passes is which ones reveal structure and which obscure it. Do I loop-unroll or strength-reduce first? If there are heuristics about max complexity for unrolling or inlining, then it might be "both".

And then there’s processor family versus this exact model.


Does anyone know the names of the key people behind Rosetta 2?

In my experience, exceptionally well executed tech like this tends to have 1-2 very talented people leading. I'd like to follow their blog or Twitter.


I am the creator / main author of Rosetta 2. I don't have a blog or a Twitter (beyond lurking).


If you're feeling inclined, here's a slew of questions:

What was the most surprising thing you learned while working on Rosetta 2?

Is there anything (that you can share) that you would do differently?

Can you recommend any great starting places for someone interested in instruction translation?

Looking forward, did your work on Rosetta give you ideas for unfilled needs in the virtualization/emulation/translation space?

What's the biggest inefficiency you see today in the tech stacks you interact most with?

A lot of hard decisions must have been made while building Rosetta 2; can you shed light on some of those and how you navigated them?


Should you feel inspired to share your learnings, insights, or future ideas about the computing spaces you know, I and, I'm sure, many other people would be interested to listen!

My preferred way to learn about a new (to me) area of tech is to hear the insights of the people who have provably advanced that field. There's a lot of noise to signal in tech blogs.


Thanks for your amazing work!

May I ask – would it be possible to implement support for 32-bit VST and AU plugins?

This would be a major bonus, because it could e.g. enable producers like me to open up our music projects from earlier times, and still have the old plugins work.


Impressive work, Cameron! Hope you're doing well.


Are you able to speak at all to the known performance struggles with x87 translation? Curious to know if we're likely to see any updates or improvements there into the future.


There are two ways to approach x87: either saying to heck with it and just using doubles for everything (this is essentially what Qemu does) or creating a software fp80 implementation. Both approaches get burned by the giant amount of state, and state weirdness, that x87 brings to the table. It's also not possible to "fix" things by optimizing for the cases where the x87 unit's precision is set to the same as fp32 or fp64, as the precision flags don't impact the exponent range.

But even on native hardware using x87 is vastly slower than fp64, and it's just a shame that only win64 had the good sense to define long double as being fp64 instead of fp80 as every other x86_64 platform did :-/


> It's also not possible to "fix" things by optimizing for the cases where the x87 unit's precision is set to the same as fp32 or fp64, as the precision flags don't impact the exponent range.

I've been meaning to look into this. Certainly you can't blindly optimise all x87 code sequences to fp32 or fp64. But some sequences are safe.

For example, adding two numbers and saving back to memory is safe to optimise (at least for the infinity case, I haven't double-checked the subnormal behaviour). It's only when you need to add three or more numbers that you run into issues (though you can go further, if all N numbers have the same sign, you will get the correct result, you just might have saturated at infinity a few operations earlier than native x87)

Same goes for multiplication of two numbers (and N numbers that all provably >= 1.0)

The question is if such code sequences are common enough to bother trying to identify at compile time and optimise.


> For example, adding two numbers and saving back to memory is safe to optimise (at least for the infinity case, I haven't double-checked the subnormal behaviour). It's only when you need to add three or more numbers that you run into issues (though you can go further, if all N numbers have the same sign, you will get the correct result, you just might have saturated at infinity a few operations earlier than native x87)

No, you cannot. The operations you can optimise are negation, NaN checks, and infinity checks (ignoring pseudo-NaNs and pseudo-infinities, of course).

fp80 has a 15-bit exponent and, functionally, a 63-bit significand, vs fp64's 11-bit exponent and 52-bit significand. When setting x87 to a reduced-precision mode you aren't switching to fp64; you're getting a mix with a 15-bit exponent and a 53-bit significand. The effect is that you retain 53 bits of precision for values where fp64 has entered subnormals, and conversely you maintain 53 bits of precision after fp64 has overflowed. There were perf benefits to reducing precision in x87 (at least in the 90s), but the main advantage is consistent rounding with fp64 while in the range of normalized fp64 values.
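
For reference, a small sketch (assuming x86, little-endian, 80-bit long double) that dumps the fields described above: 1 sign bit, a 15-bit biased exponent, and a 64-bit significand with an explicit integer bit, which is where the "functionally 63" fraction bits come from.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    static void dump_fp80(long double x) {
        unsigned char raw[10];
        memcpy(raw, &x, 10);            /* low 10 bytes hold the fp80 value on x86 */

        uint64_t significand;
        uint16_t sign_exp;
        memcpy(&significand, raw, 8);   /* bytes 0..7: 64-bit significand */
        memcpy(&sign_exp, raw + 8, 2);  /* bytes 8..9: sign + 15-bit biased exponent */

        printf("sign=%u exp=0x%04x (bias 16383) significand=0x%016llx\n",
               (unsigned)(sign_exp >> 15), (unsigned)(sign_exp & 0x7fff),
               (unsigned long long)significand);
    }

    int main(void) {
        dump_fp80(1.0L);     /* exp 0x3fff, significand 0x8000000000000000 */
        dump_fp80(1e310L);   /* overflows fp64 entirely, but still finite in fp80 */
        return 0;
    }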


The key to this optimisation idea is the exponent gets truncated back to 8 bits when being written back to memory.

For the example of two fp32s adding to infinity.

With a 15-bit exponent: the add results in a non-infinite value with an exponent outside the -125 to 127 range. Then, when writing back to memory, the FPU notices the exponent is outside of the valid range, clamps it, and writes infinity to memory.

With an 8-bit exponent: the add immediately clamps to infinity in the register, and then it writes to memory.

In both cases you get the same result in memory, so the result is valid as long as the in-register version is killed. And the same should apply to the subnormal case (I have not double-checked the x87 spec). If you start with two subnormals that are valid f32s, add them, get a subnormal result, and then write back to memory as an f32, it should be guaranteed to produce the same result with both a 15-bit exponent and an 8-bit exponent. It doesn't matter whether the subnormal mantissa was truncated before writing back to the register or when writing back to memory; it was still truncated.

You only start getting accuracy issues when doing multiple additions in a row without truncating the exponent. If you add 3 floats, the result of A+B might be infinite, but A+B+C could result in a normal f32 if you had a 15-bit exponent (when A+B is positive infinity and C is negative).

This line of thought could potentially be pushed further. If you can prove (or guard) at compile time that all N floats in a sequence of N adds will be positive (and not subnormal), then you can't have a case where one of the intermediary exponents exceeds 127 but the final exponent is less than 128. If there is an infinity anywhere along the chain, it will saturate to infinity. With a 15-bit exponent, the saturation might not be applied until the f32 value is written to memory at the end, but because of the preconditions the optimiser can guarantee the same result in memory (either infinity or a normal) at the end while only using 8-bit exponent operations.

Most of the above should also apply to other operations like multiply. I've only done some preliminary thinking about this idea, enough to be sure that some operations could be optimised. I'm only fully confident about the clamping-to-infinity case, and I'm going to be really bummed if, when I get around to double-checking, there is something about how x87 deals with subnormals that I'm not aware of. Or some other x87 weirdness.


> The key to this optimisation idea is the exponent gets truncated back to 8 bits when being written back to memory.

That's incorrect - the clamping of the exponent only occurs if you were to use FST/m32 or FST/m64, but if you're using x87 you're presumably doing FSTP/m80fp, so there is no truncation or rounding on store, regardless of the precision flag in the control word.

It sounds like what you're trying to arrange is an optimization such that a Rosetta-like translator/emulator can optimize this highly awesome function to be performed entirely using hardware fp32 or fp64 support:

    fp32 f(fp32 *fs, size_t count) {
      // pseudo code obviously :D
      ControlWord cw = fstcw();
      cw.precision = Precision32;
      fldcw(cw);
      fp80 result = 0;
      for (unsigned j = 0; j < count; j++) {
        result += fs[j];
      }
      return (fp32)result;
      // pretend we restored state before returning :)
    }
The problem you run into, though, is that an optimization pass can't make decisions based on anything other than the code it is presented with. So your optimizer can't assume sign or magnitude here, so that += has to be able to under- or overflow the range that fp32 offers.

Things get really miserable once you go beyond +/-, because you can end up in a position where an optimization to do everything in the 32/64 bit units means that you won't get observable double rounding.

This is kind of moot in the rosetta case as I don't believe we ever implemented support for the precision control bits

More fun are the transcendentals - x87 specifies them as using range reduction and so they are incredibly inaccurate (in the context of maths functions), especially around multiples of pi/4, and if you go test it you'll find rosetta will produce the same degree of terrible output :D


I was thinking more about functions along the lines of this vertex transform function that you might theoretically find as hot code in a late-90s or early-2000s Windows game (before hardware transform and lighting).

    void transform_verts(fp32 *m, fp32 *verts, size_t vert_count) {
      // it's a game, decent chance it applies precision32 across the whole process
      // Especially since directx < 10 automatically sets it when a 3d context is created
      while (vert_count--) {
        verts[0] = verts[0] * m[0] + verts[0] * m[1] + verts[0] * m[2] + m[3];
        verts[1] = verts[1] * m[4] + verts[1] * m[5] + verts[1] * m[6] + m[7];
        verts[2] = verts[2] * m[8] + verts[2] * m[9] + verts[2] * m[10] + m[11];
        verts += 3;
      }
    }
Would be nice if we could optimise it all to pure hardware fp32 without any issues. But that's not really possible with those six-operation-long chains. And you are right, we can't really assume anything about the data.

But we can go for guards and fallbacks instead. Implement that loop body as something like

    loop: 
        // Attempt calculation with hardware fp32
        $x = verts[0]; $y = verts[1]; $z = verts[2]
        $1 = hwmul($x, m[0])
        $2 = hwmul($y, m[1])
        $3 = hwmul($z, m[2])
        $4 = hwadd($1, $2)
        $5 = hwadd($3, $4)
        $6 = hwadd($5, m[3]) // any infs from the above sub-expressions will saturate through to here
        if any(is_subnormal_or_zero([$1, $2, $3, $4, $5])) || $6 is inf: // guard
           // one of the above sub-calculations became either inf or subnormal, so our
           // hw result might not be accurate. recalculate with safe softfloat
           $8 = swadd(swmul($x, m[0]), swmul($y, m[1]))
           $6 = swadd(swadd($8, swmul($z, m[2])), m[3])
    
        verts[0] = $6
    
        // repeat above pattern for verts[1] and verts[2]
    
        goto loop
        
I think that produces bit-accurate results?

Sure, it might seem complicated to calculate twice. But the resulting code might end up faster than pure softfloat code across average data. Maybe this is the type of optimisation that you only attempt at the highest tier of a multi-tier JIT for really hot code. You could perhaps even instrument the function first to get an idea of what the common shape of the data is.

> This is kind of moot in the rosetta case as I don't believe we ever implemented support for the precision control bits

So it's already producing inaccurate results for code that sets precision control? Might as well just switch over to hardware fp32 and fp64 /s

I guess for the Rosetta use case, Intel Macs didn't exist until 2006, so most of the install base of x86 programs will be compiled with SSE2 support, and commonly 64-bit.

Probably the most common use case for x87 support in Rosetta will be 64-bit code that uses long doubles, since compilers/ABIs annoyingly implemented them with x87.


Sorry for delay (surgery funsies)

> So it's already producing inaccurate results for code that sets precision control? Might as well just switch over to hardware fp32 and fp64 /s

:D

But in practice the only reason for changing the x87 precision was performance, and the feature was then simply retained in hardware for backwards compatibility. Modern code (as in >= SSE era) simply uses fp32 or fp64, which is faster, more memory compact, has vector units, has a much more sane ISA, etc. Anyone who does try to toggle x87 mode is in for a world of hurt, because the system libraries all assume the unit is operating in its default state.

You are correct that the only reason x86_64 needs x87 is that the unix x86_64 ABI decided to specify the already clearly deprecated format as the implementation of long double. I often looked wistfully at win64, where long double == double.


Amazing work! It's nice to put a name to it :)


Huh, this is timely. Incredibly random but: do you know if there was anything that changed as of Ventura to where trying to mmap below the 2/4GB boundary would no longer work in Rosetta 2? I've an app where it's worked right up to Monterey yet inexplicably just bombs in Ventura.


This should work (Wine obviously needs it when running 32-bit apps). Are you explicitly specifying a small PAGEZERO when compiling?


Yup!


Pretty sure mmap goes almost directly to the kernel in Rosetta 2, and Apple silicon requires at least 4 GB.


This works fine in Big Sur and Monterey on Apple Silicon, hence my point about if something changed in Ventura.

Edit: Big Sur -> Ventura.


Not affiliated and don't know, but curious why you're doing that in the first place?


's not my doing, it's just an older project that's slowly migrating to a newer system but is held back by everyone having lives. I wouldn't do it normally, heh.


Isn't Rosetta 2 "done"? What are you working on now?


The original Rosetta was written by Transitive, which was formed by spinning a Manchester University research group out. See https://www.software.ac.uk/blog/2016-09-30-heroes-software-e...

I know a few of their devs went to ARM, some to Apple & a few to IBM (who bought Transitive). I do know a few of their ex staff (and their twitter handles), but I don’t feel comfortable linking them here.


IIRC the current VP of Core OS at Apple is ex-Manchester/Transitive.


> To see ahead-of-time translated Rosetta code, I believe I had to disable SIP, compile a new x86 binary, give it a unique name, run it, and then run otool -tv /var/db/oah///unique-name.aot (or use your tool of choice – it’s just a Mach-O binary). This was done on old version of macOS, so things may have changed and improved since then.

My aotool project uses a trick to extract the AOT binary without root or disabling SIP: https://github.com/lunixbochs/meta/tree/master/utils/aotool


> Rosetta 2 translates the entire text segment of the binary from x86 to ARM up-front.

Do I understand correctly that Rosetta is basically a transpiler from x86-64 machine code to ARM machine code, which is run prior to the binary's execution? If so, does it affect application startup times?


> If so, does it affect the application startup times?

It does, but only the very first time you run the application. The result of the transpilation is cached so it doesn't have to be computed again until the app is updated.


Similar to DEC's FX!32 in that regard. FX!32 allowed running x86 Windows NT apps on Alpha Windows NT.


There was also an FX!32 for Linux. But I think it may have only included the interpreter part and left out the transpiler part. My memory is vague on the details.

I do remember that I tried to use it to run the x86 Netscape binary for Linux on a surplus Alpha with RedHat Linux. It worked, but so slowly that a contemporary Python-based web browser had similar performance. In practice, I settled on running Netscape from a headless 486 based PC and displaying remotely on the Alpha's desktop over ethernet. That was much more usable.


Does that essentially mean each non-native app is doubled in disk use? Maybe not doubled but requires more space to be sure.


Yes... you can see the cache in /var/db/oah/

Though it's only the actual binary size that gets doubled. For large apps it’s usually not the binary that’s taking up most of the space.


Yes.


And deleting the cache is undocumented (it is not in the file system) so if you run Mac machines as CI runners they will trash and brick themselves running out of disk space over time.


Really? This Apple Stack Exchange question says it's stored in /var/db/oah/

https://apple.stackexchange.com/questions/427695/how-can-i-l...


You mean the cache is ever expanding?


What in the actual fuck. That is such an insane decision. Where is it stored then? Some dark corner of the file system inaccessible via normal means?


GP is incorrect - they’re stored in individual files inside /var/db/oah, and can be deleted without causing harm.


The first load is fairly slow, but once it's done, every load after that is pretty much identical to what it'd be running on an x86 Mac, due to the caching it does.


For me my M1 was fast enough that the first load didn't seem that different - and more importantly subsequent loads were lightning fast! It's astonishing how good Rosetta 2 is - utterly transparent and faster than my Intel Mac thanks to the M1.


If installed using a packaged installer, or the App Store, the translation is done during installation instead of at first run. So, slow 1st launch may be uncommon for a lot of apps or users.


Yes, it does. The delay of the first start of an app is quite noticeable. But the transpiled binary is apparently cached somewhere.


/var/db/oah.


"I believe there’s significant room for performance improvement in Rosetta 2... However, this would come at the cost of significantly increased complexity... Engineering is about making the right tradeoffs, and I’d say Rosetta 2 has done exactly that."


Would be a waste of effort when the tool is designed to be obsolete in a few years as everything gets natively compiled.


They could've amazed a few people a bit more by emulating x86 apps even faster (but M1+Rosetta can already run some stuff faster than an Intel Mac), but then the benefit of releasing native apps would be much decreased ("why bother, it's good enough ...").

It's a delicate political game that they, yet again, seem to be playing pretty well.


One thing that’s interesting to note is that the amount of effort expended here is not actually all that large. Yes, there are smart people working on this, but the performance of Rosetta 2 for the most part is probably the work of a handful of clever people. I wouldn’t be surprised if some of them have an interest in compilers but the actual implementation is fairly straightforward and there isn’t much of the stuff you’d typically see in an optimizing JIT: no complicated type theory or analysis passes. Aside from a handful of hardware bits and some convenient (perhaps intentionally selected) choices in where to make tradeoffs there’s nothing really specifically amazing here. What really makes it special is that anyone (well, any company with a bit of resources) could’ve done it but nobody really did. (But, again, Apple owning the stack and having past experience probably did help them get over the hurdle of actually putting effort into this.)


Yeah, agreed. I get the impression it's a small team.

But there is a long-tail of weird x86 features that are implemented, that give them amazing compatibility, that I regret not mentioning:

* 32-bit support for Wine

* full x87 emulation

* full SSE2 support (generally converting to efficient NEON equivalents) for performance on SIMD code

I consider all of these "compatibility", but that last one in particular should have been in the post, since that's very important to the performance of optimised SIMD routines (plenty of emulators also do SIMD->SIMD, but others just translate SIMD->scalar or SIMD->helper-runtime-call).


Guilty!


I think it's about the incentive and not about other companies not being able to do it. Apple decided to move to ARM, and the reason is probably their strong connection to the ARM ecosystem, which basically means they have an edge with their vertical-integration approach compared to the other competitors. Apple is one of the three _founding_ companies of ARM. The other two were VLSI Technology and Acorn.


Vertical integration. My understanding was it's because the Apple silicon ARM has special support to make it fast. Apple has had enough experience to know that some hardware support can go a long way to making the binary emulation situation better.


That’s not correct, the article goes into details why.


That is correct, the article goes into details why. See the "Apple's Secret Extension" section as well as the "Total Store Ordering" section.

The "Apple's Secret Extension" section talks about how the M1 has 4 flag bits and the x86 has 6 flag bits, and how emulating those 2 extra flags would make every add/sub/cmp instruction significantly slower. Apple has an undocumented extension that adds 2 more flag bits to make the M1's flag bits behave the same as x86.

The "Total Store Ordering" section talks about how Apple has added a non-standard store ordering to the M1 than makes the M1 order its stores in the same way x86 guarantees instead of the way ARM guarantees. Without this, there's no good way to translate instructions in code in and around an x86 memory fence; if you see a memory fence in x86 code it's safe to assume that it depends on x86 memory store semantics and if you don't have that you'll need to emulate it with many mostly unnecessary memory fences, which will be devastating for performance.


I’m aware of both of these extensions; they’re not actually necessary for most applications. Yes, you trade fidelity with performance, but it’s not that big of a deal. The majority of Rosetta’s performance is good software decisions and not hardware.


Yeah, these features exist, and they help, but I don't think they should be given all the credit. Both "Apple's Secret Extension" and "Total Store Ordering" are features that other emulators can choose to disable to get exactly the same performance.

"Apple's Secret Extension" isn't even used by Rosetta 2 on Linux (opting for, at least, explicit parity flag calculations rather than reduced fidelity). It's still fast.

TSO is only required for accuracy on multithreaded applications, and the PF and AF flags are basically never used, and, if they are, will usually be used immediately after being set, allowing emulators to achieve reasonable fidelity by only calculating them when used.
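
For reference, computing those two flags in software is cheap; a rough sketch of the textbook definitions (not Rosetta's actual code) looks like:

    // Rough sketch of the usual definitions, not Rosetta's implementation.
    // PF: set when the low 8 bits of the result have even parity.
    // AF: carry out of bit 3 during an addition.
    #include <stdbool.h>
    #include <stdint.h>

    static inline bool x86_pf(uint64_t result) {
        return (__builtin_popcount((unsigned)(result & 0xff)) & 1) == 0;
    }

    static inline bool x86_af_add(uint64_t a, uint64_t b) {
        return ((a ^ b ^ (a + b)) & 0x10) != 0;
    }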

There's perhaps a better argument for performance-via-vertical-integration with the flag-manipulation extensions, which I believe Apple created and standardised, but which now anyone can use.

But the reason I wrote this post is that I think most of the ideas are transferable and could help other emulators :)


> TSO is only required for accuracy on multithreaded applications

If by accuracy you mean not segfaulting, then yes. Every moderately complex x86-64 application will have memory fences in the generated machine code. The x86-64 design of store buffers and load buffers makes memory fences a necessity. In reality, it's enough just to use a mutex or atomics in your code to end up with a memory fence in your generated machine code. So I'd say that this particular part of the Rosetta/M1 design is quite important, if not the most important. Without it, applications wouldn't run.
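
As a sketch of the kind of thing that breaks, here's the classic message-passing litmus test, written with relaxed atomics so both x86 and ARM compilers emit plain loads and stores (my illustration, nothing to do with Rosetta's internals):

    // With plain MOVs, x86 (TSO) never lets the consumer see flag == 1 with
    // stale data; plain ARM str/ldr can be reordered, so a translator without
    // TSO hardware must insert barriers to preserve the x86 behaviour.
    #include <stdatomic.h>

    atomic_int data_word, flag_word;

    void producer(void) {
        atomic_store_explicit(&data_word, 42, memory_order_relaxed);
        atomic_store_explicit(&flag_word, 1, memory_order_relaxed);
    }

    int consumer_saw_reorder(void) {
        int f = atomic_load_explicit(&flag_word, memory_order_relaxed);
        int d = atomic_load_explicit(&data_word, memory_order_relaxed);
        return f == 1 && d == 0;   // forbidden under TSO, allowed on weakly ordered ARM
    }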


You can approximate it fairly well without much impact.


Not true. The required fencing has a huge impact. I led development of the CHPE compiler for Windows on ARM, and the fencing was a major source of our gains.


I don't think we disagree :) If you're going for full accuracy you morally need barriers all over the place. If you have TSO in your chips, that makes things far easier; alternatively you can do stuff with RCpc if your hardware supports it. Otherwise you get stuck with fences everywhere, or you force your hardware into TSO compliance mode (read: turn off all the other cores) and that sucks.

The other option is you relax on the "required fencing", with the assumption that most accesses do not actually exercise the full semantics that TSO guarantees. Obviously some synchronization does matter, so you need heuristics and those won't always work. My understanding was that XTA has some of these, with knobs to turn them off if they don't work? You probably know more about that than I do. In iSH we play it even more fast-and-loose, with all regular memory accesses being lowered to ARM loads and stores, and locked operations to whatever seemed the closest. It's definitely not production-grade but we have shockingly good compatibility for what it is.


Apple is doing some really interesting but really quiet work in the area of VMs. I feel like we don’t give them enough credit but maybe they’ve put themselves in that position by not bragging enough about what they do.

As a somewhat related aside, I have been watching Bun (low startup time Node-like on top of Safari’s JavaScript engine) with enough interest that I started trying to fix a bug, which is somewhat unusual for me. I mostly contribute small fixes to tools I use at work. I can’t quite grok Zig code yet so I got stuck fairly quickly. The “bug” turned out to be default behavior in a Zig stdlib, rather than in JavaScript code. The rest is fairly tangential but suffice it to say I prefer self hosted languages but this probably falls into the startup speed compromise.

Being low startup overhead makes their VM interesting, but the fact that it benchmarks better than Firefox a lot of the time and occasionally faster than v8 is quite a bit of quiet competence.


> feel like we don’t give them enough credit but maybe they’ve put themselves in that position by not bragging enough about what they do.

And maybe also by keeping the technology closed and Apple-specific. Many people who could be interested in using it don't have access to it.


WebKit B3 is open source: https://webkit.org/docs/b3/


Exactly. As someone who would be very interested in this, but don't use Apple products, it's just not exciting because it's not accessible to me (I can't even test it as a user). If they wanted to write a whitepaper about it to share knowledge, that might be interesting, but given that it's Apple I'm not gonna hold my breath.


Apple (mostly WebKit) writes a significant amount about how they designed their VMs.


> The instructions from FEAT_FlagM2 are AXFLAG and XAFLAG, which convert floating-point condition flags to/from a mysterious “external format”. By some strange coincidence, this format is x86, so these instruction are used when dealing with floating point flags.

This really made me chuckle. They probably don't want to mention Intel by name, but this just sounds funny.

https://developer.arm.com/documentation/100076/0100/A64-Inst...


I hope Rosetta is here to stay and continues development. And I hope what is learned from it can be used to make a RISC-V version of it. Translating native ARM to RISC-V should be much easier than x86 to ARM as I understand it, so one could conceivably do x86 -> ARM -> RISC-V.


> I hope Rosetta is here to stay and continues development.

It almost certainly is not. Odds are Apple will eventually remove Rosetta II, as they did Rosetta back in the days, once they consider the need for that bridge to be over (Rosetta was added in 2006 in 10.4, and removed in 2011 from 10.7).

> And I hope what is learned from it can be used to make a RISC-V version of it. translating native ARM to RISC-V should be much easier than x86 to ARM as I understand it, so one could conceivably do x86 -> ARM -> RISC-V.

That's not going to happen unless Apple decides to switch from ARM to RISC-V, and... why would they? They've got 15 years experience and essentially full control on ARM.


> Odds are Apple will eventually remove Rosetta II, as they did Rosetta back in the days, once they consider the need for that bridge to be over (Rosetta was added in 2006 in 10.4, and removed in 2011 from 10.7).

The difference is that Rosetta 1 was PPC → x86, so its purpose ended once PPC was a fond memory.

Today's Rosetta is a generalized x86 → ARM translation environment that isn't just for macOS apps. For example, it works with Apple's new virtualization framework to support running x86_64 Linux apps in ARM Linux VMs.

https://developer.apple.com/documentation/virtualization/run...


Rosetta 1 had a ticking time bomb. Apple was licensing it from a 3rd party. Rosetta 2 is all in house as far as we know.

Different CEO as well. Jobs was more opinionated on “principles” - Cook is more than happy to sell what people will buy. I think Rosetta 2 will last.


What important Intel-only macOS software is going to exist in five years?

It's basically only games and weird tiny niches, and Apple is pretty happy to abandon both those categories. The saving grace is that there's very few interesting Mac-exclusive games in the Intel era.


Starting with Ventura, Linux VMs can use Rosetta 2 to run x64 executables. I expect x64 Docker containers to remain relevant for quite a few years to come. Running those at reasonable speeds on Apple Silicon would be huge for developers.


Yeah, Apple killed all "legacy" 32-bit support, so one would think there's not much software which is both x86-64 and not being actively developed.


Rosetta 2 can run 32-bit software, there just isn't a 32-bit macOS anymore so the only client is WINE.


2006 Apple was very different from 2011 Apple, renewing that license in 2011 was probably considered cost prohibitive for the negligible benefit.


We’ll see, but even post-Cook Apple historically hasn’t liked the idea of third parties leaning on bridge technologies for too long. Things like Rosetta are offered as temporary affordances to allow time for devs to migrate, not as a permanent platform fixture.


They’ve also allowed Rosetta 2 in Linux VMs - if they are serious about supporting those use cases then I think it’ll stay.


> Rosetta 1 had a ticking time bomb. Apple was licensing it from a 3rd party.

Yes, I'm sure Apple had no way of extending the license.

> Cook is more than happy to sell what people will buy. I think Rosetta 2 will last.

There's no "buy" here.

Rosetta is complexity to maintain, and an easy cut. It's not even part of the base system.

And “what people will buy” certainly didn’t prevent essentially removing support for non-hidpi displays from MacOS. Which is a lot more impactful than Rosetta as far as I’m concerned.


What do you mean non HiDPI display support being removed from Mac OS? I’ve been using a pair of 1920x1080 monitors with my Mac Mini M1 just fine? Have they somehow broken something in Mac OS 13 / Ventura? (I haven’t clicked the upgrade button yet, I prefer to let others leap boldly first).


> removing support for non-hidpi displays from MacOS

Did that really reduce sales? Consider that the wide availability of crappy low end hardware gave Windows laptops a terrible reputation. Eg https://www.reddit.com/r/LinusTechTips/comments/yof7va/frien...


> Consider that the wide availability of crappy low end hardware gave Windows laptops a terrible reputation.

Standard DPI displays are not "crappy low-end hardware"?

I don't think there's a single widescreen display out there which qualifies as hiDPI; that segment more or less doesn't exist: a 5K 34" is around 160 DPI (to say nothing of the downright pedestrian 5K 49" like the G9 or the AOC Agon).


Ehh I had a 2013 MacBook pro back in 2013 with a 2560x1600 display. That's 227 dpi. A decade later, I think it's safe to say that anything smaller than that is extremely low-end in 2022.

I agree it's kinda sad how few desktop monitors are high dpi. It gets even worse if you limit yourself to low latency monitors.

Anyway I haven't used macos in a while so I'm not sure what you mean by Apple not supporting non-hidpi


The actual screen dimensions make a huge difference to whether or not a given DPI value is low or high end. My current monitor is 157 DPI and I can assure you it is not an extremely low end monitor at all. Unless your frame of reference is anything below $5k is low end or something.


Right, what I meant to say is that anything under 1600p is low end because it was already widely available in 2013.


dpi isn't the right metric. you don't keep a phone the same distance as a 49" monitor.


DPI is absolutely the right metric, the target goal just depends on the usage.

A hidpi computer display is north of 200 (iMacs are a bit below 220, MacBooks a bit above).

A phone is north of 300 (iPhones have ranged from 324 to 476).


DPI works fine as long as the type of device is in the context. The context here is desktop/laptop.


> Jobs was more opinionated on “principles” - Cook is more than happy to sell what people will buy.

Well, the current "principle" is "iOS is enough, we're going to run iOS apps on MacOS, and that's it".

Rosetta isn't needed for that.


It's strange to see people downvoting this when three days ago App Store on MacOS literally defaulted to searching iOS and iPad apps for me https://twitter.com/dmitriid/status/1589179351572312066


> That's not going to happen unless Apple decides to switch from ARM to RISC-V, and... why would they? They've got 15 years experience and essentially full control on ARM.

15? More than a quarter century. They were one of the original investors in ARM and have produced plenty of arm devices since then beyond the newton and the ipod.

I’d bet they use a bunch of risc v internally too if they just need a little cpu to manage something locally on some device and just want to avoid paying a tiny fee to ARM or just want some experience with it.

But RISC-V as the main CPU? Yes, that’s a long way away, if ever. But Apple is good at the long game. I wouldn’t be surprised to hear that Apple has iOS running on RISC-V, but even something like the Lightning-to-HDMI adapter runs iOS on ARM.


> 15? More than a quarter century. They were one of the original investors in ARM and have produced plenty of arm devices since then beyond the newton and the ipod.

They didn't design their own chips for most of that time.


At the same time as the ARM investment they had a Cray for...chip design.


Yes and?

Apple invested in ARM and worked with ARM/Acorn on what would become ARM6, in the early 90s. The newton uses it (specifically the ARM610), it is a commercial failure, later models use updated ARM CPUs to which AFAIK Apple didn't contribute (DEC's StrongARM, and ARM's ARM710).

<15 years pass>

Apple starts working on bespoke designs again around the time they start working on the iPhone, or possibly after they realise it's succeeding.

That doesn't mean they stopped using ARM in the meantime (they certainly didn't).

The iPod's SoC was not even designed internally (it was contracted out to PortalPlayer; later generations were provided by Samsung). 15 years and the revolution of Jobs' return (and his immediate killing of the Newton) is a long time for an internal team of silicon designers.


Back then, a Cray-1 could execute an infinite loop in 4.7 seconds.


It would be funny/not funny if in a few years Apple removes Rosetta 2 for Mac apps but keeps the Linux version forever so docker can run at reasonable speeds.


> They've got 15 years experience

Did you only start counting from 2007 when the iPhone was released? All the iPods prior to that were using ARM processors. The Apple Newton was using ARM processors.


> All the iPods prior to that were using ARM processors.

Most of the original device was outsourced and contracted out (for reasons of time constraints and lack of internal expertise). PortalPlayer built the SoC and OS, not Apple. Later SoCs were sourced from SigmaTel and Samsung, until the 3rd-gen Touch.

> The Apple Newton was using ARM processors.

The Apple Newton was a completely different Apple, and there were several years' gap between Jobs killing the Newton and the birth of iPod, not to mention the completely different purpose and capabilities. There would be no newton-type project until the iPhone.

Which is also when Apple started working with silicon themselves: they acquired PA in 2008, Intrinsity in 2010, and Passif in 2013, released their first partially in-house SoC in 2010 (A4), and their first in-house core in 2013 (Cyclone, in the A7).


iPods and Newton were entirely different chips and OS's. The first iPods weren't even on an OS that Apple created - they licensed it.


>That's not going to happen unless Apple decides to switch from ARM to RISC-V, and... why would they? They've got 15 years experience and essentially full control on ARM.

Two points here.

• First off, Apple developers are not bound to Apple. The knowledge gained can be used elsewhere. See Rivos and Nuvia for example.

• Second, Apple reportedly has already ported many of its secondary cores to RISC-V. It's not unreasonable that they will switch in 10 years or so.


For me, those two points make it clear that it would be possible for Apple to port to RISC-V. But it's still not clear what advantages they would gain from doing so, given that their ARM license appears to let them do whatever they want with CPUs that they design themselves.


The first point precludes Apple's gain from the discussion.


Apple reportedly has already ported many of its secondary cores to RISC-V

Really? In current hardware or is this speculation?


>Many dismiss RISC-V for its lack of software ecosystem as a significant roadblock for datacenter and client adoption, but RISC-V is quickly becoming the standard everywhere that isn’t exposed to the OS. For example, Apple’s A15 has more than a dozen Arm-based CPU cores distributed across the die for various non-user-facing functions. SemiAnalysis can confirm that these cores are actively being converted to RISC-V in future generations of hardware.[0]

So to answer your question, it is not currently in hardware, but it is more than just speculation.

[0]https://www.semianalysis.com/p/sifive-powers-google-tpu-nasa...


If you've got some management core somewhere in your silicon you can, with RISC-V, give it an MMU but no FPU and save area. You're going to be writing custom embedded code anyway, so you get to save silicon by only incorporating the features that you need instead of having to meet the full ARM spec. And you can add your own custom instructions for the job at hand pretty easily.

That would all be a terrible idea if you were doing it for a core intended to run user applications, but that's not what Apple, Western Digital, and NVidia are doing; they're embracing RISC-V for embedded cores. If I were ARM I'd honestly be much more worried about RISC-V's threat to my R and M series cores than to my A series cores.


Arm64 allows FPU-less designs. There are some around…


Sure. The FPU is optional on a Cortex M2, for instance. But those don't have MMUs. You'd certainly need an expensive architectural license to make something with an MMU but no FPU if you wanted to and given all the requirements ARM normally imposes for software compatibility[1] between cores I'd tend to doubt that they'd let you make something like that.

[1] Explicitly testing that you don't implement total store ordering by default is one requirement I've heard people talk about to get a custom core licensed.


Apple has an architecture license (otherwise they could not design their own cores, which they’ve been doing for close to a decade), and already had the ability to take liberties beyond what the average architecture licensee can, owing to being one of ARM’s founders.


There are arm64 designs with MMUs but not FPUs. See Cortex-R82.

That [1] is wrong too. Tegra Xavier shipped with sequential consistency out of the box :)


Don’t think any are shipping, but they’re hiring RISC-V engineers.


> it’s not unreasonable that they will switch in 10 years or so.

You’ve not provided any rationale at all for why they should switch their application cores let alone on this specific timetable.

Switching is an expensive business and there has to be a major business benefit for Apple in return.


> They've got 15 years experience and essentially full control on ARM.

Do they? ARM made it very clear that they consider all ARM cores their own[1]

[1]: https://www.theregister.com/2022/11/07/opinion_qualcomm_vs_a...


> Do they?

They do, yes. They were one of the founding 3 members of ARM itself, and the primary monetary contributor.

Through this they acquired privileges which remain extant: they can literally add custom instructions to the ISA (https://news.ycombinator.com/item?id=29798744), something there is no available license for.

> ARM made it very clear that they consider all ARM cores their own[1]

The Qualcomm situation is a breach of contract issue wrt Nuvia, it's a very different issue, and by an actor with very different privileges.


Is there a real source for this claim? It gets parroted a lot on HN and elsewhere, but I've also heard it's greatly exaggerated. I don't think Apple engineers get to read the licences, and even if they did, how do we know they understood them correctly and that it got repeated correctly? I've never seen a valid source for this claim.


For what claim? That they co-founded ARM? That’s historical record. That they extended the ISA? That’s literally observed from decompilations. That they can do so? They’ve been doing it for at least 2 years and ARM has yet to sue.

> I've never seen a valid source for this claim.

What is “a valid source”? The linked comment is from Hector Martin, the founder and lead of Asahi, who worked on and assisted with reversing various facets of Apple silicon, including the capabilities and extensions of the ISA.


>For what claim?

that they have "essentially full control on ARM"

Having an ALA + some extras doesn't mean "full control."

he also says:

>And apparently in Apple's case, they get to be a little bit incompatible

So he doesn't seem to actually know the full extent to which Apple has more rights, even using the phrase "a little bit" — far from your claim. And he (and certainly you) has not read the license. Perhaps they have to pay for each core they release on the market that breaks compatibility? Do you know? Of course not. A valid source would be a statement from someone who read the license, or from one of the companies. There is more to a core than just the ISA. If not, why is Apple porting cores to RISC-V, if they have so much control?


Why does it need a "real source"? ARM sells architecture licenses, Apple has a custom ARM architecture. 1 + 1 = 2.

https://www.cnet.com/tech/tech-industry/apple-seen-as-likely...

"ARM Chief Executive Warren East revealed on an earnings conference call on Wednesday that "a leading handset OEM," or original equipment manufacturer, has signed an architectural license with the company, forming ARM's most far-reaching license for its processor cores. East declined to elaborate on ARM's new partner, but EETimes' Peter Clarke could think of only one smartphone maker who would be that interested in shaping and controlling the direction of the silicon inside its phones: Apple."

https://en.wikipedia.org/wiki/Mac_transition_to_Apple_silico...

"In 2008, Apple bought processor company P.A. Semi for US$278 million.[28][29] At the time, it was reported that Apple bought P.A. Semi for its intellectual property and engineering talent.[30] CEO Steve Jobs later claimed that P.A. Semi would develop system-on-chips for Apple's iPods and iPhones.[6] Following the acquisition, Apple signed a rare "Architecture license" with ARM, allowing the company to design its own core, using the ARM instruction set.[31] The first Apple-designed chip was the A4, released in 2010, which debuted in the first-generation iPad, then in the iPhone 4. Apple subsequently released a number of products with its own processors."

https://www.anandtech.com/show/7112/the-arm-diaries-part-1-h...

"Finally at the top of the pyramid is an ARM architecture license. Marvell, Apple and Qualcomm are some examples of the 15 companies that have this license."


The common refrain is that "Since Apple helped found ARM they have a super special relationship that gives them more rights than anyone else."

That they had to specifically sign an architectural license in 2008 sounds like that is not at all true but that they are just another standard licensee (albeit one with very deep pockets).


I should have been more explicit. I am questioning the claim that Apple has "full control on ARM" with no restriction on the cores they make, grandfathered in from the 1980s. Nobody has ever substantiated that claim.


Apple is in a somewhat different position to Qualcomm in that they were a founding member of ARM. I've also heard rumours that aarch64 was designed by apple and donated to ARM (hence why apple was so early to release an aarch64 processor). So I somewhat doubt ARM will be a position to sue them any time soon.


The Qualcomm situation is based on breaches of a specific agreement that ARM had with Nuvia, which Qualcomm has now bought. It's not a generalizable "ARM thinks everything they license belongs to them fully in perpetuity" deal.


I hope not. Rosetta 2, as cool as it is, is a crutch to allow Apple to transition away from x86. If it keeps being needed, that's a massive failure for Apple and the ecosystem.


More likely it would be useful for RISC-V to ARM, so that Apple could support running virtual machines for another architecture on its machines.


Not having any particular domain experience here, I've idly wondered whether or not there's any role for neural net models in translating code for other architectures.

We have giant corpuses of source code, compiled x86_64 binaries, and compiled arm64 binaries. I assume the compiled binaries represent approximately our best compiler technology. It seems predicting an arm binary from an x86_64 binary would not be insane?

If someone who actually knows anything here wants to disabuse me of my showerthoughts, I'd appreciate being able to put the idea out of my head :-)


You would need a hybrid architecture with a NN generating guesses and a "watchdog" shutting down errors.

Neural models are basically universal approximators. Machine code needs to be obscenely precise to work.

Unless you're doing something else in the backend, it's just a turbo SIGILL generator.


This is all true - machine code needs to be "basically perfect" to work.

However, there are lots of problems in CS where it's easier to check a solution than to find one in the first place. It may turn out to be the case that a well-tuned model can quickly produce solutions to some code-generation problems, that those solutions have a high enough likelihood of being correct, that it's fast enough to check (and maybe try again), and that this entire process is faster than state-of-the-art classical algorithms.

However, if that were the case, I might also expect us to be able to extract better algorithms from the model - intuitively, machine code generation "feels" like something that's just better implemented through classical algorithms. Have you met a human that can do register allocation faster than LLVM?


> turbo SIGILL generator

This gave me the delightful mental image of a CPU smashing headlong into a brick wall, reversing itself, and doing it again. Which is pretty much what this would do.


I'm an ML dilettante and hope someone more knowledgeable chimes in, but one thing to consider is the statistics of how many instructions you're translating and the accuracy rate. Binary execution is very unforgiving of minor mistakes in translation. If 0.001% of instructions are translated incorrectly, that program just isn't going to work.
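
As a rough back-of-the-envelope (my numbers, assuming independent per-instruction errors):

    // If each instruction translates correctly with probability (1 - p) and
    // errors are independent, the chance an n-instruction program is fully
    // correct is (1 - p)^n. For p = 0.001% and n = 1,000,000 that's ~4.5e-5.
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double p = 0.00001;   // 0.001% per-instruction error rate
        double n = 1e6;       // a million translated instructions
        printf("P(whole program correct) = %g\n", pow(1.0 - p, n));
        return 0;
    }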


I think we are on the cusp of machine aided rules generation via example and counter example. It could be a very cool era of “Moore’s Law for software” (which I’m told software doubles in speed roughly every 18 years).

Property based testing is a bit of a baby step here, possibly in the same way that escape analysis in object allocation was the precursor to borrow checkers which are the precursor to…?

These are my inputs, these are my expectations, ask me some more questions to clarify boundary conditions, and then offer me human readable code that the engine thinks satisfies the criteria. If I say no, ask more questions and iterate.

If anything will ever allow machines to “replace” coders, it will be that, but the scare quotes are because that shifts us more toward information architecture from data munging, which I see as an improvement on the status quo. Many of my work problems can be blamed on structural issues of this sort. A filter that removes people who can’t think about the big picture doesn’t seem like a problem to me.


> It seems predicting an arm binary from an x86_64 binary would not be insane?

If you start with a couple of megabytes of x64 code, and predict a couple of megabytes of arm code from it, there will be errors even if your model is 99.999% accurate.

How do you find the error(s)?


People have tried doing this, but not typically at the instruction level. One way to go about this that I'm aware of is to use machine learning to derive high-level semantics about the code, then lower it to the new architecture.


Many branch predictors have traditionally used perceptrons, which are sort of NN-like. And I think there's a lot of research into incorporating deep learning models into chip routing.


Rosetta 2 is beautiful - I would love it if they kept it as a feature for the long term rather than deprecating it and removing it in the next release of macOS (basically what they did during previous architectural transitions.)

If Apple does drop it, maybe they could open source it so it could live on in Linux and BSD at least. ;-)

Adding a couple of features to ARM to drastically improve translated x86 code execution sounds like a decent idea - and one that could potentially enable better x86 app performance on ARM Windows as well. I don't know the silicon cost, but I'd hope it wasn't dropped in the future.

Thinking a bit larger, I'd also like to see Apple add something like CHERI support to Apple Silicon and macOS to enable efficient memory error checking in hardware. I'd be surprised if they weren't working on something like this already.


Back in the early days of Windows NT everywhere, the Alpha version had a similar JIT emulation.

https://en.m.wikipedia.org/wiki/FX!32

Or for a more technical deep dive,

https://www.usenix.org/publications/library/proceedings/usen...


OMG I forgot about FX!32. My first co-op was as a QA tester for the DEC Multia, which they moved from the Alpha processor to Intel midway through. I did a skunkworks project for the dev team attempting to run the newer versions of Multia's software (then Intel-based) on older Alpha Multias using FX!32. IIRC it was still internal use only/beta, but it worked quite well!


Rosetta 2 has become the poster child for "innovation without deprecation" where I work (not Apple).


Apple is the king of deprecation, just look at what happened to Rosetta 1 and 32-bit iOS apps.


Yes they are, and that makes Rosetta 2 even more special. Though Rosetta 1 got support for 5 years, which is pretty good.


(Apologies for the flame war quality to this comment, I’m genuinely just expressing an observation)

It’s ironic that Apple is often backhandedly complimented by hackers as having “good hardware” when their list of software accomplishments is amongst the most impressive in the industry and contrasts sharply with the best efforts of, say, Microsoft, purportedly a “software company.”


Both have pretty impressive engineering. The focus just appears to be different. Apple really cares about user experience (or, maybe, used to), while it seems like Microsoft is trying to (and according to what I hear mostly succeeding) nail the administration experience instead.


The transparent APFS migration was one heck of a feat.


Apple's historically been pretty good at making this stuff. Their first 68k -> PPC emulator (Davidian's) was so good that for some things the PPC Mac was the fastest 68k mac you could buy. The next-gen DR emulator (and SpeedDoubler etc) made things even faster.

I suspect the ppc->x86 stuff was slower because x86 just doesn't have the registers. There's only so much you can do.


> Apple's historically been pretty good at making this stuff. Their first 68k -> PPC emulator (Davidian's) was so good that for some things the PPC Mac was the fastest 68k mac you could buy.

Not arguing the facts here, but I'm curious—are these successes related? And if so, how has Apple done that?

I would imagine that very few of the engineers who programmed Apple's 68k emulator are still working at Apple today. So, why is Apple still so good at this? Strong internal documentation? Conducive management practices? Or were they just lucky both times?


I mean they are one of very few companies who have done arch changes like this, and they had already done it twice before Rosetta 2. The same engineers might not have been used for all 3, but I'm sure there was at least a tiny bit of overlap between 68k->PPC and PPC->Intel (and likewise overlap between PPC->Intel and Intel->ARM) that, coupled with passed-down knowledge within the company, gives them a leg up. They know the pitfalls, they've seen the issues/advantages of using certain approaches.

I think of it in same way that I've migrated from old->new versions of frameworks/languages in the past with breaking changes and each time I've done it I've gotten better at knowing what to expect, what to look for, places where it makes sense to "just get it working" or "upgrade the code to the new paradigm". The first time or two I did it was as a junior working under senior developers so I wasn't as involved but what did trickle down to me and/or my part in the refactor/upgrade taught me things. Later times when I was in charge (or on my own) I was able to draw on those past experiences.

Obviously my work is nowhere near as complicated as arch changes but if you squint and turn your head to the side I think you can see the similarities.

> Or were they just lucky to have success both times?

I think 2 times might be explained with "luck" but being successful 3 times points to a strong trend IMHO, especially since Rosetta 2 seems to have done even better than Rosetta 1 for the last transition.


FWIW, I know several current engineers at Apple who wrote ground-breaking stuff before the Mac even existed. Apple certainly doesn't have any problem with older engineers, and it turns out that transferring that expertise to new chips on demand isn't particularly hard for them.


> Apple certainly doesn't have any problem with older engineers

Just to be clear, I never meant to suggest that they did—I just didn't realize employees remained with the company for that long, instead of switching.


I suspect that there's a design document somewhere that has secrets and tips. Plus MacOS and NeXTstep have been ported so many times that by now any arch-level things are well-isolated. A lot of the pain of MacOS on x86 were related to kernel extensions and driver API changes. And a lot of stuff fell out during the 64-bit transition.

In general, MacOS/iOS/iPadOS owe everything to the excellent architecture laid down by the NeXT team back in the day. NeXTstep has now achieved the dream that MS tried and failed to realize: an OS that runs on every kind of device they make. Funny how that is.


> I suspect the ppc->x86 stuff was slower because x86 just doesn't have the registers.

My understanding is that part of the reason the G4/5 was sort of able to keep up with x86 at the time was due to the heavy use of SIMD in some apps. And I doubt that Rosetta would have been able to translate that stuff into SSE (or whatever the x86 version of SIMD was at the time) on the fly.


Apple had a library of SIMD subroutines (IIRC Accelerate.framework) and Rosetta was able to use the x86 implementation when translating PPC applications that called it.


Rosetta actually did support Altivec. It didn't support G5 input at all though (but likely because that was considered pretty niche, as Apple only released a G5 iMac, a PowerMac, and an XServe, due to the out-of-control power and thermals of the PowerPC 970).


> Their first 68k -> PPC emulator (Davidian's) was so good that for some things the PPC Mac was the fastest 68k mac you could buy.

This is not true. A 6100/60 running 68K code was about the speed of my unaccelerated Mac LCII 68030/16. Even when using SpeedDoubler, you only got speeds up to my LCII with a 68030/40Mhz accelerator.

Even the highest end 8100/80 was slower than a high end 68k Quadra.

The only time 68K code ran faster is when it made heavy use of the Mac APIs that were native.


>The only time 68K code ran faster is when it made heavy use of the Mac APIS that were native.

Yes, and that just confirms the original point. Mac apps often spend a lot of time in the OS APIs, and because those were native, the 68K code (the app) often ran faster on PPC than it did on 68K. The earlier post said "so good that for some things the PPC Mac was the fastest 68k mac." That is true.

In my own experience, I found most 68K apps felt as fast or faster. Your app mix might have been different, but many folks found the PPC faster.


Part of that was the greater clock speeds on the 601 and 603, though. Those started at 60MHz. Clock for clock 68K apps were generally poorer on PowerPC until PPC clock speeds made them competitive, and then the dynamic recompiling emulator knocked it out of the park.

Similarly, Rosetta was clock-for-clock worse than Power Macs at running Power Mac applications. The last generation G5s would routinely surpass Mac Pros of similar or even slightly greater clocks. On native apps, though, it was no contest, and by the next generation the sheer processor oomph put the problem completely away.

Rosetta 2 is notable in that it is so far Apple's only processor transition where the new architecture was unambiguously faster than the old one on the old one's own turf.


The first gen 68K emulator performed worse on the PPC 603 than the 601.


That's not true for the 6100 but for the 8100 it was totally true. And at some point my 8500 ran 68k faster than any 68040 ever made, which isn't really fair since the 8500 was rocking good.


It is quite astonishing how seamless Apple has managed to make the Intel to ARM transition, there are some seriously smart minds behind Rosetta. I honestly don't think I had a single software issue during the transition!


If that blows your mind, you should see how Microsoft did the emulation of the PowerPC-based Xenon chip to x86 so you can play Xbox 360 games on Xbox One.

There's an old pdf from Microsoft researchers with the details but I can't seem to find it right now.


They also bought a boat load of PPC Macs for much of the early Xbox dev work.


Any good videos on that?


I finally started seriously using a M1 work laptop yesterday, and I'm impressed. More than twice as fast on a compute-intensive job as my personal 2015 MBP, with a binary compiled for x86 and with hand-coded SIMD instructions.


Are you me lol? I'm on my third day on M1 Pro. Battery life is nuts. I can be on video calls and still do dev work without worrying about charging. And the thing runs cool!


It helps that there were almost 2 years between the release and your adoption. I had a very early M1 and it was not too bad, but there were issues. I knew that going in.


I had an M1 Air early on and I didn't run into any issues. Even the issues with apps like Homebrew were resolved within 3-4 months of the M1 debut. It's amazing just how seamless such a major architectural transition was and continues to be!


They've almost made it too good. I have to run software that ships an x86 version of CPython, and it just deeply offends me on a personal level, even though I can't actually detect any slowdown (probably because lol python in the first place)


It has been extremely smooth sailing. I moved my own Mac over to it about a year ago, swapping a beefed-up MBP for a budget-friendly M1 Air (which has massively smashed it out of the park performance wise, far better than I was expecting). Didn't have a single issue.

My work mac was upgraded to a MBP M1 Pro and again, very smooth. I had one minor issue with a docker container not being happy (it was an x86 instance) but one minor tweak to the docker compose file and I was done.

It does still amaze me how good these new machines are. It's almost enough to redeem Apple for the total pile of overheating, underperforming crap that came directly before the transition (aka any Mac with a Touch Bar).


There's an annoying dwarf fortress bug but other than that, same


It isn't their first rodeo: 68k->PPC->x86_64->ARM.


nitpick, they did PPC -> x86 (32-bit); the x86_64 transition came later (no translation layer needed though). They actually had 64-bit PPC systems with the G5 when they switched to 32-bit Intel, but Rosetta only does 32-bit PPC -> 32-bit x86; it would have been rare to have released 64-bit-PPC-only software.


They had a 64-bit Carbon translation layer, but spiked it to force Adobe and some other large publishers to go native Intel. There was a furious uproar at the time, but it turned out to be the right decision.


But they've been on x86_64 for a long time. How much of that knowledge is still around? Probably some traces of it have been institutionalized, but it isn't the same as if they just grabbed the same team and made them do it again a year after the last transition.


Rosetta 1 and the PPC -> x86 move wasn't anywhere near as smooth, I recall countless problems with that switch. Rosetta 2 is a totally different experience, and so much better in every way.


You gotta think there's been a lot of churn and lost knowledge at the company between PPC->x86_64 (2006) and now though.


I think the end of support for 32-bit applications in 2019 helped, slightly, with the run-up.

Assuming you weren’t already shipping 64-bit applications…which would be weird…updating the application probably required getting everything into a contemporary version of Xcode, cleaning out the cruft, and getting it compiling nice and cleanly. After that, the ARM transition was kind of a “it just works” scenario.

Now, I’m sure Adobe and other high-performance application developers had to do some architecture-specific tweaks, but, gotta think Apple clued them in ahead of time as to what was coming.


I have a single counter-example. Mailplane, a Gmail SSB. It's Intel including its JS engine, making the Gmail UI too sluggish to use.

I've fallen back to using Fluid, an ancient and also Intel-specific SSB, but its web content runs in a separate WebKit ARM process so it's plenty fast.

I've emailed the Mailplane author but they won't release a Universal version of the app since they've EOL'd Mailplane.

I have yet to find a Gmail SSB that I'm happy with under ARM. Fluid is a barely workable solution.


For what it's worth, I use Mailplane on an M1 MacBook Air (8GB) with 2 Gmail tabs and a calendar tab without noticeable issues.

Unfortunately the developers weren't able to get Google to work with them on a policy change that impacted the app [0] [1] and so gave up and have moved on to a new and completely different customer support service.

[0] https://developers.googleblog.com/2020/08/guidance-for-our-e... [1] https://mailplaneapp.com/blog/entry/mailplane_stopped_sellin...



I wonder what's different about my setup. I've tried deleting and re-installing Mailplane and tried it on two different M1s (my personal MBA and my work 16" MBP). On both, there is significant lag in the UI. Just using the j/k keys to move up/down the message list it takes like 500 ms for the selected row to change.

I use non-default Gmail UI settings. I'm still using the classic theme, dense settings, etc. I'll try again, because I'd really like to use Mailplane as long as it will survive.

I'm aware why Mailplane has been EOL'd. But the developer also claims he's trying to keep it alive as long as possible and he's released at least one version since that blog post, so I'd hoped he would be willing to release a universal build. I don't know anything about the internals of Mailplane but I guess it's a non-trivial amount of work to do so.


Since this is the company's third big arch transition, cross-compilation and compatibility is probably considered a core competency for Apple to maintain internally.


And NeXT was multi-platform as well.


having total control on the hardware and the software didn't hurt for sure


Qualcomm (and Broadcomm) has total control on the hardware and software side of a lot of stuff and their stuff is shit.

It's not about control, it's about good engineering.


So many parts across the stack need to work well for this to go well. Early support for popular software is a good example. This goes from partnerships all the way down to hardware designers.

I'd argue it's not about engineering more than it is about good organizational structure.


And having execs who design the organizational structure around those goals is part of what makes good engineering :)


It's about both control and engineering in Apple's case.


That's really not the case, if you're in Microsoft or Linux's position you can't really change the OS architecture or driver models for any particular vendor.

That generality and general knowledge separation between different stacks leaves quite a lot of efficiency on the table.


The first time I ran into this technology was in the early 90s on the DEC Alpha. They had a tool called "MX" that would translate MIPS Ultrix binaries to Alpha on DEC Unix:

https://www.linuxjournal.com/article/1044

Crazy stuff. Rosetta 2 is insanely good. Runs FPS video games even.


>The Apple M1 has an undocumented extension that, when enabled, ensures instructions like ADDS, SUBS and CMP compute PF and AF and store them as bits 26 and 27 of NZCV respectively, providing accurate emulation with no performance penalty.

If there is no performance penalty why is it implemented as an optional extension?
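
For reference, NZCV is an ordinary readable system register on AArch64, so per the quote the extra x86 flags would just show up in bits 26 and 27. Here's a minimal sketch of reading them (arm64-only, GCC/Clang inline asm; purely illustrative, since user code can't enable the extension itself, and on a stock system those bits simply read back as zero):

  #include <stdint.h>
  #include <stdio.h>
  int main(void) {
      uint64_t nzcv;
      __asm__ volatile(
          "mov  x8, #0x10\n\t"
          "subs x8, x8, #0x01\n\t"   /* flag-setting subtract, like an x86 CMP */
          "mrs  %0, nzcv\n\t"        /* read the NZCV system register */
          : "=r"(nzcv) : : "x8", "cc");
      /* Per the quoted description: PF in bit 26, AF in bit 27. */
      printf("PF=%llu AF=%llu\n",
             (unsigned long long)((nzcv >> 26) & 1),
             (unsigned long long)((nzcv >> 27) & 1));
      return 0;
  }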


I wonder how much hand-tuning there is in Rosetta 2 for known, critical routines. One of the tricks Transmeta used to get reasonable performance on their very slow Crusoe CPU was to recognize critical Windows functions and replace them with a library of hand-optimized native routines. Of course that's a little different: Rosetta 2 targets an architecture that is, generally speaking, at least as fast as the x86 hardware it emulates, which has also been true for most cross-architecture translators historically (like DEC's VEST, which ran VAX code on Alpha), whereas Transmeta's CMS was targeting a CPU that was slower.


Haven’t spotted any in particular.


Yeah, I haven't either, but then I haven't looked, so I'm not sure. I wouldn't expect any "function recognition" tricks, since there isn't really static linking, but I would expect, e.g., the memcpy and strcpy implementations in the ahead-of-time-translated shared cache to be written in arm64 assembly rather than translated.


Indeed, see the _platform_*$VARIANT$Rosetta implementations in libsystem_platform.dylib.


For historical context, this was a major milestone in x86 binary translation, Digital FX!32:

https://www.usenix.org/legacy/publications/library/proceedin...

Some apps ran faster than on the fastest available x86 at the time, sometimes significantly faster, like the BYTE benchmark cited above. Of course it helped that the Alpha was significantly faster than the leading x86 chips in the first place.


Apple Silicon will be Tim Cook's legacy.


A lot of the simplicity of this approach relies on the x86 registers being directly mappable to ARM registers. That seems to be possible for most x86 registers, even the SIMD registers, although I think it falls over for AVX-512, which is supported on the Intel Mac Pro. ARM NEON has 32 128-bit registers; AVX-512 has 32 512-bit registers plus dedicated predicate registers. What do they do? Fall back to JIT mode?



I wonder if such a direct translation from ARM to another architecture would even be possible, given that the instruction set can be changed at runtime (Thumb mode). Does anybody know how often typical ARM32 programs switch modes, or whether such sections can be recognized statically?


Shouldn't be hard under this scheme; it tracks indirect branches anyway. Just swap which table you do lookups in based on whether you're in Thumb mode or not.
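
Purely as a sketch of the idea (names, sizes, and the direct-mapped table are made up, not from any real translator): in 32-bit ARM interworking, bit 0 of an indirect branch target selects Thumb, so the mode is right there in the value you're looking up anyway.

  #include <stdint.h>
  /* One translation-lookup table per instruction set. */
  typedef struct { uint32_t guest_pc; void *host_code; } tl_entry;
  #define TL_SIZE 4096
  static tl_entry arm_table[TL_SIZE];    /* translations of ARM-mode blocks   */
  static tl_entry thumb_table[TL_SIZE];  /* translations of Thumb-mode blocks */
  static void *lookup_indirect_target(uint32_t target) {
      int thumb = target & 1;                        /* interworking bit */
      uint32_t pc = thumb ? (target & ~1u) : target;
      tl_entry *tab = thumb ? thumb_table : arm_table;
      tl_entry *e = &tab[(pc >> 1) % TL_SIZE];
      /* NULL means "not translated yet": fall back to the slow path. */
      return (e->guest_pc == pc && e->host_code) ? e->host_code : NULL;
  }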


If that's really an issue (most code probably sticks to one mode, but I have no data), just translate twice, once in each mode. ¯\_(ツ)_/¯


x86 is worse because you can jump into the middle of instructions. In that case you just fall back to JIT though.

And luckily AArch64 doesn't have Thumb.
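
For anyone wondering what "jump into the middle of instructions" looks like, here's a contrived example (not from the article) where one byte belongs to two different instructions depending on where decoding starts, which is exactly what a one-translation-per-instruction-boundary scheme can't represent:

  /* x86-64 bytes, annotated with both decodings. */
  static const unsigned char overlapping[] = {
      0xEB, 0xFF,  /* from offset 0: EB FF = jmp short -1, landing at offset 1 */
                   /* from offset 1: FF C0 = inc eax (the FF byte is reused)   */
      0xC0,
      0xC3,        /* ret */
  };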


Rosetta 2 is great, except it apparently can't run statically-linked (non-PIC) binaries. I am unsure why this limitation exists, but it's pretty annoying because Virgil x86-64 binaries cannot run under Rosetta 2, which means I resort to running on the JVM on my M1...


Statically linked binaries are officially unsupported on macOS in general, so there's no reason to support them in Rosetta either.

They're unsupported on macOS because static linking assumes binary compatibility at the kernel system call interface, which is not guaranteed.


Rosetta was introduced with the promise that it supports binaries that make raw system calls. (And it does indeed support these by hooking the syscall instruction.)


Rosetta can run statically linked binaries, but I don’t think anything supports binaries that aren’t relocatable.

  $ file a.out
  a.out: Mach-O 64-bit executable x86_64
  $ otool -L a.out
  a.out:
  $ ./a.out
  Hello, world!


> Rosetta 2 is great, except it apparently can't run statically-linked (non-PIC) binaries.

Interestingly, it supports statically-linked x86 binaries when used with Linux.

"Rosetta can run statically linked x86_64 binaries without additional configuration. Binaries that are dynamically linked and that depend on shared libraries require the installation of the shared libraries, or library hierarchies, in the Linux guest in paths that are accessible to both the user and to Rosetta."

https://developer.apple.com/documentation/virtualization/run...


Why are static binaries with PIC so rare? I'm surprised position-dependent code is ever used anymore in the age of ASLR.

But static binaries are still great for portability, so you'd think static binaries with PIC would be the default.


> But static binaries are still great for portability.

macOS has not officially supported static binaries in... ever? You can't statically link libSystem, and it absolutely does not care for kernel ABI stability.


> it absolutely does not care for kernel ABI stability

That may be true on the mach system call side, but the UNIX system calls don't appear to change. (Virgil actually does call the kernel directly).


> That may be true on the mach system call side, but the UNIX system calls don't appear to change.

They very much do, without warning, as the Go project discovered (after having been warned multiple times) during the Sierra betas: https://github.com/golang/go/issues/16272 https://github.com/golang/go/issues/16606

That doesn't mean Apple goes out of its way to break syscalls (unlike Microsoft), but there is no support for direct syscalls. That is why, again, you can't statically link libSystem.

> (Virgil actually does call the kernel directly).

That's completely unsupported ¯\_(ツ)_/¯



Nice, I had not seen these.


Virgil doesn't use ASLR. I'm not sure what value it adds to a memory-safe language.


Back in the early days, when Windows NT ran on everything, the Alpha version had similar JIT-based emulation.


I am interested in this domain but lack the knowledge to fully understand the post. Any recommendations for good books/courses/tutorials on low-level programming?


I’d recommend going through a compilers curriculum, then reading up on past binary translation efforts.


> Every one-byte x86 push becomes a four byte ARM instruction

Can someone explain this to me? I don't know ARM, but it seems to me that a push shouldn't be that expensive.


The general principle is that RISC-style instruction sets are typically fixed-length, with only a couple of different subformats. The prototypical RISC design has one format with an opcode and three register fields, and a second with an opcode and an immediate field. This simplicity and regularity makes the fastest possible decoding hardware much simpler and more efficient than for something like x86, which has a simply dumbfounding number of possible variable-length formats.

The basic bet of RISC was that larger instruction encodings would be worth it due to the microarchitectural advantages they enabled. This has more or less been proven out, though the distinction is less stark today, with x86 decoding into uops and recent ARM designs being quite complex beasts.
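
As a rough illustration of the decoding point (not how any real decoder is written): with a fixed 4-byte encoding, "find the next instruction" is trivial, whereas an x86 walker first needs a length decoder that understands prefixes, opcode maps, ModRM/SIB and immediates just to know where instruction N+1 begins.

  #include <stdint.h>
  #include <stddef.h>
  #include <string.h>
  /* Walk a buffer of AArch64 code: every instruction is exactly 4 bytes. */
  static void walk_aarch64(const uint8_t *code, size_t len,
                           void (*visit)(uint32_t insn)) {
      for (size_t off = 0; off + 4 <= len; off += 4) {
          uint32_t insn;
          memcpy(&insn, code + off, sizeof insn);  /* little-endian fetch */
          visit(insn);
      }
  }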


Bit late, and the other comments are right, but it's worth noting that pushes typically aren't that expensive. ARM has a 4-byte STP (store pair) instruction that pushes two values at a time, so usually a push effectively costs two bytes on ARM; it's only four bytes if you're translating instruction-for-instruction.

(I also forgot while writing the post that 2-byte push instructions are common in 64-bit x86 as well. Half the registers can be pushed with a single byte, and the other half require a REX prefix, giving a two-byte push instruction. So even though the quoted statement is true, the difference isn't that bad in general.)


x86 has variable-length instructions, so they can be anything from 1 to 15 bytes long. AArch64 instructions are always 4 bytes long.


Anybody know if Docker has plans to move from QEMU to Rosetta on M1/M2 Macs? I've found QEMU to be at least 100x slower than the native arch.


The main reason, the M1/M2 being incredibly fast, is listed last.


I don't think that's the main reason. The article lists a few things, but I think the main reason is that they made several parts of the CPU behave identically to x86. The M1 and M2 chips:

- can be told to do total store ordering, just as x86 does

- have a few of the status flags that x86 has but regular ARM doesn't

- can be told to make the FPU behave exactly like the x86 FPU

It also helps that ARM has many more registers than x86. Because of that, the emulator can map the x86 registers to ARM registers and still have registers to spare for its own use.


None of those custom features are used in Linux/in VMs, and it's still fairly fast.


Perhaps if you're comparing against Intel processors, but even on an Apple Silicon Mac, apps running under Rosetta 2 are no slouch compared to their native versions.

20% overhead for a non-native executable is very commendable.


That isn't the main reason.

If Rosetta ran x86 code at 10% of native speed, nobody would be calling it fast.


Great article!


TL;DR: One-to-one instruction translation ahead of time, instead of complex JIT translation, betting on the M1's performance and instruction-cache handling.


[flagged]


Thanks for your thoroughly objective insights. I especially appreciate the concrete examples.


Here you go for a concrete example: https://news.ycombinator.com/item?id=33493276


This has nothing to do with Rosetta being incomplete (it has pretty good fidelity).


It was direct corroboration of:

> Apple users not being able to use the same hardware peripherals or same software as other people is not a problem, it's a feature. There's no doubt the M1/M2 chips are fast. It's just a problem that they're only available in crappy computers that can't run a large amount of software or hardware.



