Why is Rosetta 2 fast?

iainmerrick · on Nov 9, 2022

This is a great writeup. What a clever design!

I remember Apple had a totally different but equally clever solution back in the days of the 68K-to-PowerPC migration. The 68K had 16-bit instruction words, usually with some 16-bit arguments. The emulator’s core loop would read the next instruction and branch directly into a big block of 64K x 8 bytes of PPC code. So each 68K instruction got 2 dedicated PPC instructions, typically one to set up a register and one to branch to common code.

What that solution and Rosetta 2 have in common is that they’re super pragmatic - fast to start up, with fairly regular and predictable performance across most workloads, even if the theoretical peak speed is much lower than a cutting-edge JIT.

Anyone know how they implemented PPC-to-x86 translation?

kijiki · on Nov 9, 2022

> Anyone know how they implemented PPC-to-x86 translation?

They licensed Transitive's retargettable binary translator, and renamed it Rosetta; very Apple.

It was originally a startup, but had been bought by IBM by the time Apple was interested.

GeekyBear · on Nov 9, 2022

> It was originally a startup, but had been bought by IBM by the time Apple was interested.

Rosetta shipped in 2005.

IBM bought Transitive in 2008.

The last version of OS X that supported Rosetta shipped in 2009.

I always wondered if the issue was that IBM tried to alter the terms of deal too much for Steve's taste.

spijdar · on Nov 9, 2022

A lesser known bit of trivia about this is that IBM would go on to use Transitive's technology for the exact opposite of Rosetta -- x86 to PowerPC translation, in the form of "PowerVM Lx86", released that year (2008).

It's very fascinating to me, since IBM appears to have extended the PowerPC spec with this application specifically in mind. Up until POWER10, the Power/PowerPC ISA specified an optional feature called "SAO", allowing individual pages of memory to be forced to use an x86-style strong memory model, comparable to the proprietary extension in Apple's CPUs, but much more granular (page level and enforced in L1/L2 cache as opposed to the entire core).

As far as I can tell, Transitive's technology was the only application to ever use this feature, though it's mainlined in the Linux kernel, and documented in the mprotect(2) man page. IBM ditched the extension for POWER10, which makes sense, since Lx86 only ever worked on big endian releases of RHEL and SLES which are long out of support now.

One mystery to me, though, is IBM added support for this page marker to the new radix-style MMU in POWER9. It's documented in the CPU manual, but Linux has no code to use it -- unless I've missed it, Linux only has code to set the appropriate bits in HPT mode, and no reference to the new method for marking radix pages SAO as the manual describes. I can't imagine there was any application on AIX which used this mode (it only decreases performance), and unless you backported a modern kernel to a RHEL 5 userland, you couldn't use Lx86 with the new radix mode. Much strangeness...

Luke-Jr · on Nov 10, 2022

In theory, PROT_SAO should be useful for qemu, and trivial to make patches implementing there. That's assuming the kernel actually sets it, though. The problem I encountered when I set out to do it a year or so ago, was that I couldn't find a good test case to fail without it...

spijdar · on Nov 10, 2022

The kernel definitely sets the WIMG bits at https://github.com/torvalds/linux/blob/master/arch/powerpc/m... (line 336, if HN removes it), though I've never been able to "make it work" either.

I used box64 as a test case, where I had a game that would run in emulation, but only if I pinned it to a single core. On ARM64, it also worked, as the JIT translator on box64 uses manually inserted memory fences to force strongly ordered access.

The game never worked correctly, even after I patched the kernel to mark every page on the system as SAO, and confirmed this worked by checking the set memory flags. This might be a mistake in my understanding of what SAO should do, though. (or another failure in box64 on ppc64le)

One thought I've had recently is perhaps it's like the recently discovered tagged memory extension and only worked in big endian? There's nothing in the docs to suggest this, but since the only test case was BE-only, maybe?

r00fus · on Nov 9, 2022

Apple is also not tied to reverse compatibility.

Their customers are not enterprise, and consequently they are probably the best company in the world at dictating well-managed, reasonable shifts in customer behavior at scale.

So they likely had no need for Rosetta as of 2009.

kijiki · on Nov 10, 2022

Right, thanks for correcting my faulty memory on the timing.

It is possible that IBM tried to squeeze Apple, but given that IBM's interest in Transitive was for enterprise server migration, I suspect it is more likely that Apple got tired of paying whatever small royalty they'd contracted for with Transitive, and decided enough people had fully migrated to native x86 apps that they wouldn't alienate too many customers.

bogomipz · on Nov 11, 2022

>"The last version of OS X that supported Rosetta shipped in 2009."

Interesting so was Rosetta 2 written from the ground up then? Did Apple manage to hire any of the former Transitive engineers after IBM acquired them? It seems like this would be a niche group of engineers that worked in this area no?

bogomipz · on Nov 10, 2022

>"The last version of OS X that supported Rosetta shipped in 2009."

Interesting, so was Rosetta 2 written from the ground up then? Did Apple manage to hire any of the former Transitive engineers after IBM acquired them? It seems like this would be a niche group of engineers that worked in this area no?

savoytruffle · on Nov 9, 2022

I agree it was a bit worryingly short-lived. However the first version of Mac OS X that shipped without Rosetta 1 support was 10.7 Lion in summer 2011 (and many people avoided it since it was problematic). So nearly-modern Mac OS X with Rosetta support was realistic for a while longer.

GeekyBear · on Nov 9, 2022

> However the first version of Mac OS X that shipped without Rosetta 1 support was 10.7 Lion

Yes, but I was pointing out when the last version of OS X that did support Rosetta shipped.

I have no concrete evidence that Apple dropped Rosetta because IBM wanted to alter the terms of the deal after they bought Transitive, but I've always found that timing interesting.

In comparison, the emulator used during the 68k to PPC transition was never removed from Classic MacOS, so the change stood out.

tambourine_man · on Nov 17, 2022

The Classic environment was removed from OS X and all the IP involved was Apple’s.

The timing is interesting, but I wouldn’t put beyond Apple to remove a feature simply to sediment a transition (and decrease support cost).

scarface74 · on Nov 10, 2022

> In comparison, the emulator used during the 68k to PPC transition was never removed from Classic MacOS, so the change stood out.

It was never removed because Classic MacOS itself was never fully native.

xattt · on Nov 10, 2022

> It was never removed because Classic MacOS itself was never fully native.

Are there any current OSes that have the same level of historical cruft that Mac OS Classic had?

scarface74 · on Nov 11, 2022

Windows.

Depending on which API you are calling, you have to represent a “string” differently. This is just one example

https://learn.microsoft.com/en-us/cpp/text/how-to-convert-be...

GeekyBear · on Nov 11, 2022

There are still dialog boxes that date back to Windows 3.1 that show up in Windows 10 and 11.

savoytruffle · on Nov 11, 2022

I agree. And I suppose since it was so intrinsic to the operating system, if a 68k app worked in Mac OS 9 (some would some might not), you could continue to run it in the Classic Environment (on a PPC Mac, not Intel Mac) Mac OS 10.4 Tiger in the mid 20 00's!

xattt · on Nov 10, 2022

I could have have sworn that a unibody MacBook Pro where I did an in-place upgrade to Lion somehow held onto Rosetta.

savoytruffle · on Nov 11, 2022

I guess that's perjury because it cannot be true! Even Snow Leopard didn't even include Rosetta 1. But if it was deemed necessary, it would download and install it on-demand, similar to how the Java system worked.

runjake · on Nov 9, 2022

Link: https://en.wikipedia.org/wiki/QuickTransit

klelatti · on Nov 9, 2022

That’s really interesting. You might enjoy reading about the VM embedded into the Busicom calculator that used the Intel 4004 [1]

They squeezed a virtual machine with 88 instructions into less than 1k of memory!

[1] https://thechipletter.substack.com/p/bytecode-and-the-busico...

wang_li · on Nov 9, 2022

In the mists of history S. Wozniak wrote the SWEET-16 interpreter for the 6502. A VM with 29 instructions implemented in 300 bytes.

https://en.wikipedia.org/wiki/SWEET16

iainmerrick · on Nov 9, 2022

That is nifty! Sounds very similar to a Forth interpreter.

DonHopkins · on Nov 10, 2022

There's also OpenFirmware's platform independent Forth bytecode "FCode":

https://en.wikipedia.org/wiki/Open_Firmware

>Open Firmware Forth Code may be compiled into FCode, a bytecode which is independent of instruction set architecture. A PCI card may include a program, compiled to FCode, which runs on any Open Firmware system. In this way, it can provide boot-time diagnostics, configuration code, and device drivers. FCode is also very compact, so that a disk driver may require only one or two kilobytes. Therefore, many of the same I/O cards can be used on Sun systems and Macintoshes that used Open Firmware. FCode implements ANS Forth and a subset of the Open Firmware library.

hoosieree · on Nov 10, 2022

And here I was feeling impressed with myself for implementing the Nand2Tetris VM translator in ~2k of python... wow. Respect for the elders!

lostgame · on Nov 9, 2022

From what I understand; they purchased a piece of software that already existed to translate PPC to x86 in some form or another and iterated on it. I believe the software may have already even been called ‘Rosetta’.

My memory is very hazy; though. While I experienced this transition firsthand and was an early Intel adopter, that’s about all I can remember about Rosetta or where it came from.

I remember before Adobe had released the Universal Binary CS3 that running Photoshop on my Intel Mac was a total nightmare. :( I learned to not be an early adopter from that whole debacle.

runjake · on Nov 9, 2022

Link: https://en.wikipedia.org/wiki/QuickTransit

saagarjha · on Nov 9, 2022

Transitive.

Asmod4n · on Nov 9, 2022

I don't know how they did it, but they did it very very slowly. Anything "interactive" was unuseable.

lilyball · on Nov 9, 2022

Assuming you're talking about PPC-to-x86, it was certainly usable, though noticeably slower. Heck, I used to play Tron 2.0 that way, the frame rate suffered but it was still quite playable.

scarface74 · on Nov 9, 2022

Interactive 68K programs were usually fast. The 68K programs would still call native PPC QuickDraw code. It was processor intensive code that was slow. Especially with the first generation 68K emulator.

Connectix SpeedDoubler was definitely faster.

duskwuff · on Nov 9, 2022

Most of the Toolbox was still running emulated 68k code in early Power Mac systems. A few bits of performance-critical code (like QuickDraw, iirc) were translated, but most things weren't.

hinkley · on Nov 9, 2022

I remember years ago when Java adjacent research was all the rage, HP had a problem that was “Rosetta lite” if you will. They had a need to run old binaries on new hardware that wasn’t exactly backward compatible. They made a transpiler that worked on binaries. It might have even been a JIT but that part of the memory is fuzzy.

What made it interesting here was that as a sanity check they made an A->A mode where they took in one architecture and spit out machine code for the same architecture. The output was faster than the input. Meaning that even native code has some room for improvement with JIT technology.

I have been wishing for years that we were in a better place with regard to compilers and NP complete problems where the compilers had a fast mode for code-build-test cycles and a very slow incremental mode for official builds. I recall someone telling me the only thing they liked about the Rational IDE (C and C++?) was that it cached precompiled headers, one of the Amdahl’s Law areas for compilers. If you changed a header, you paid the recompilation cost and everyone else got a copy. I love whenever the person that cares about something gets to pay the consequence instead of externalizing it on others.

And having some CI machines or CPUs that just sit around chewing on Hard Problems all day for that last 10% seems to be to be a really good use case in a world that’s seeing 16 core consumer hardware. Also caching hints from previous runs is a good thing.

hamstergene · on Nov 9, 2022

Could it be simply because many binaries were produced by much older, outdated optimizers. Or optimized for size.

Also, optimizers usually target “most common denominator” so native binaries rarely use full power of current instruction set.

Jumping from that peculiar finding to praising runtime JIT feels like a longshot. To me it’s more of an argument towards distributing software in intermediate form (like Apple Bitcode) and compiling on install, tailoring for the current processor.

astrange · on Nov 9, 2022

> To me it’s more of an argument towards distributing software in intermediate form (like Apple Bitcode) and compiling on install, tailoring for the current processor.

This turns out to be quite difficult, especially if you're using bitcode as a compiler IL. You have to know what the right "intermediate" level is; if assumptions change too much under you then it's still too specific. And it means you can't use things like inline assembly.

That's why bitcode is dead now.

By the way, I don't know why this thread is about how JITs can optimize programs when this article is about how Rosetta is not a JIT and intentionally chose a design that can't optimize programs.

pjmlp · on Nov 10, 2022

Bitcode is dead because Apple got fed up to keep maintaining their own fork with guarantees that regular LLVM doesn't offer.

Since 1961 plenty of systems have used bytecodes as executable formats, the most sucessfull still in use, IBM and Unisys mainframes and microcomputers.

josephg · on Nov 10, 2022

> That's why bitcode is dead now.

WebAssembly seems alive and well. I'm not sure how similar it is to java bitcode but its the same core idea.

That said, WASM through v8 is ~3x slower than the same code compiled natively. (Some of this might be due to the lack of SIMD in wasm).

astrange · on Nov 10, 2022

"Bitcode" is the name of a specific Xcode feature in this case, not a general comment.

lmm · on Nov 9, 2022

> This turns out to be quite difficult, especially if you're using bitcode as a compiler IL. You have to know what the right "intermediate" level is; if assumptions change too much under you then it's still too specific. And it means you can't use things like inline assembly.

> That's why bitcode is dead now.

Isn't this what Android does today? Applications are distributed in bytecode form and then optimized for the specific processor at install time.

monocularvision · on Nov 10, 2022

The bitcode Apple used for their platforms was at a much, much lower level than bytecode used on Android.

pjmlp · on Nov 10, 2022

Yeah, it was also a custom version, somehow people keep thinking they used LLVM bitcode straight out of the box.

astrange · on Nov 10, 2022

I don't know what Android does… some kind of Java but not Java, right?

In that case it's much less expressive, so developers simply can't do the unsafe/specialized code in the first place. Which means they can't write in C or assembly.

Bitcode was a specific Apple feature that used LLVM's compiler IL and might have promised extra portability, but it didn't really work out and was removed this year. ("LLVM" stands for "low level virtual machine" which is funny because it isn't low level and isn't a virtual machine.)

KMag · on Nov 10, 2022

LLVM is lower level than Python or Java bytecode. LLVM is also virtual machine in the sense of an abstract machine, similar in idea to a process virtual machine. Most usages of "virtual machine" today are talking about system virtual machines, but it's important to note that "virtual machine" is an overloaded phrase.

pjmlp · on Nov 10, 2022

There is no C or Assembly in IBM and Unisys mainframes and microcomputers.

AceJohnny2 · on Nov 9, 2022

> Or optimized for size.

Note that on gcc (I think) and clang (I'm sure), -Oz is a strict superset of -O2 (the "fast+safe" optimizations, compared to -O3 that can be a bit too aggressive, given C's minefield of Undefined Behavior that compilers can exploit).

I'd guess that, with cache fit considerations, -Oz can even be faster than -O2.

MayeulC · on Nov 10, 2022

Interesting, I didn't know about -Oz (only -Os for size). -Oz is reportedly mac-specific https://stackoverflow.com/questions/1778538/how-many-gcc-opt...

jasonwatkinspdx · on Nov 9, 2022

All reasonable points, but examples where JIT has an advantage are well supported in research literature. The typical workload that shows this is something with a very large space of conditionals, but where at runtime there's a lot of locality, eg matching and classification engines.

kllrnohj · on Nov 10, 2022

> To me it’s more of an argument towards distributing software in intermediate form (like Apple Bitcode) and compiling on install, tailoring for the current processor.

Or distribute it in source form and make compilation part of the install process. Aka, the Gentoo model.

wmf · on Nov 9, 2022

https://www.hpl.hp.com/techreports/1999/HPL-1999-78.html

travisgriggs · on Nov 9, 2022

It was particularly poignant at the time because JITed languages were looked down on by the “static compilation makes us faster” crowd. So it was a sort of “wait a minute Watson!” moment in that particular tech debate.

No one cares as much now days, we’ve moved our overrated opinion battlegrounds to other portions of what we do.

pjmlp · on Nov 9, 2022

I eventually changed my opinion into JIT being the only way to make dynamic languages faster, while strong typed ones can benefit from having both AOT/JIT for different kinds of deployment scenarios, and development workflows.

mikepurvis · on Nov 9, 2022

I think I landed in a place where it's basically "the compiler has insufficient information to achieve ideal optimization because some things can only be known at runtime."

Which is not exclusively an argument for runtime JIT— it can also be an argument for instrumenting your runtime environment, and feeding that profiling data back to the compiler to help it make smarter decisions the next time. But that's definitely a more involved process than just baking it into the same JavaScript interpreter used by everyone— likely well worth it in the case of things like game engines, though.

bluGill · on Nov 9, 2022

The problem with JIT is not all information known at runtime is the correct information to optimize one.

In finance the performance critical code path is often the one run least often. That is you have a if(unlikely_condition) {run_time_sensitive_trade();}. In this case you need to tell the compiler to ensure the CPU will have a pipeline stall because of a branch misprediction most of the time to ensure the time that counts the pipeline doesn't stall.

The above is a rare corner case for sure, but it is one of those weird exceptions you always need to keep in mind when trying to make any blanket rule.

dahfizz · on Nov 9, 2022

The other issue with JIT is that it is unreliable. It optimizes code by making assumptions. If one of the assumptions is wrong, you pay a large latency penalty. In my field of finance, having reliably low latency is important. Being 15% faster on average, but every once in a while you will be really slow, is not something customers will go for.

bluGill · on Nov 10, 2022

I'm not in finance. I just remember one talk by a finance guy and it was a mind bender so I remember it.

ridiculous_fish · on Nov 10, 2022

How does the compiler arrange for the CPU to mispredict the branch most of the time? I didn't think there were any knobs for the branch predictor other than static ones (e.g. backwards-jumps statically predicted as taken, or PowerPC branch hint bit).

mikepurvis · on Nov 10, 2022

C++20 is gaining the [[likely]] attribute for guiding the predictor in a portable way:

http://eel.is/c++draft/dcl.attr.likelihood

Prior to this it was also possible, but required compiler-specific macros like GCC's __builtin_expect, which is used all over the kernel:

https://github.com/torvalds/linux/search?q=__builtin_expect

menaerus · on Nov 15, 2022

To my knowledge this does not have any direct impact to the CPU branch prediction mechanism. If that had been the case then we would at least have some x86-(64) instruction to manipulate with the BP. If I write a quick example such as https://godbolt.org/z/j7Y81j5fe, I also see no such instruction in the generated output.

But what likely/unlikely mechanism can do is that it can serve as a hint to the compiler which the compiler can then use to generate more optimal code layout. For example, if we provide the compiler with likely/unlikely hints in our code, compiler will try to use that hint to stitch together the more probable code paths first. In theory this approach should result with better utilization of CPU instruction cache and thus it may lead to better performance.

astrange · on Nov 10, 2022

There are no ways to hint the branch predictor on some common CPUs. I think Intel says it's fully dynamic and has no static predictions.

If you want to cause a pipeline stall, there's usually serializing instructions like cpuid/eieio.

vgatherps · on Nov 10, 2022

I believe intel used to always predict forward branches not taken and backward branches always taken if it didn’t have any history.

Regardless, you can at least lay out code in a way that heavily favors one branch over the other, and potentially optimize for one branch (I.e. give one branch very expensive prelude/exit to make another branch very cheap). The last thing I’ve seen compilers greatly struggle with though…

wumpus · on Nov 10, 2022

20 years of "progress" and hey, C++20 is gaining the likely attribute.

bluGill · on Nov 10, 2022

CPUs have documentation, for this. I forget which one, but the one I did read it was as simple as the true case is assumed more common, and it is easy to arrange logic around that (I probably mis remember, but close enough to that). The common case also should be near the if in memory (inline code), so it is likely on the same cache line (or the next which is prefetched), while the other case is farther away and so you can stall the cache if the if goes that way.

travisgriggs · on Nov 10, 2022

The problem with static compilation is not all information known at compile time is the correct information to optimize on.

Assuming either source of data is the single source of truth for all optimizations is a fallacy.

Use the right tool for the right job. Use all the tools if you can.

kllrnohj · on Nov 10, 2022

Static compilers usually don't have to make such a tradeoff, though. They are free to spend arbitrarily long amounts of time optimizing all branches. And they often do exactly that.

Static + LTO w/ PGO is pretty much the practical ideal. JITs don't offer much until you start adding dynamically loaded code where LTO just isn't possible anymore.

pjmlp · on Nov 10, 2022

Tooling and real data sets.

Most PGO for AOT scenarios suffers from tooling experience and not using data sets similar to production workflows.

kllrnohj · on Nov 10, 2022

Perhaps, but that's orthogonal. JITs don't just come with ideal production set sampling either after all. They only capture snippets, and rarely re-optimize already compiled functions in the face of changing workloads or new information. You can do crowd-sourced profiles (like Android does), but that's a completely independent set of infrastructure from JIT vs. AOT. You can feed that same profile to PGO for AOT. In fact, that's how Android uses it. They don't feed the profile to the JIT, they feed it to their offline, install-time or idle-maintenance AOT compiler.

KMag · on Nov 10, 2022

If your hardware is designed to allow very lightweight profiling and tracing, then static + LTO w/ PGO can still be improved by runtime re-optimization. If designed properly, the runtime overhead can be brought arbitrarily low by increasing the sampling period.

kllrnohj · on Nov 10, 2022

Are you speaking in theoreticals or can you point to any actual example of what you're describing? Usually for a JIT to work the source binary has to be unoptimized to begin with, otherwise information is lost that the JIT needs.

So What language/runtime out there ships unoptimized bytecode, an optimized precompiled static + LTO w/ PGO, and can re-optimize with runtime gathered information via a JIT?

Heck, what language/runtime is even designed around being performance-focused with a JIT to deliver even more performance in the first place? Pretty much ever JIT'd language makes trade-offs that sacrifice up-front performance and later hopes the JIT can claw some of it back. maybe this is kinda WASM-ish territory, although it then sacrifices up-front performance for security and hopes the JIT can claw it back.

KMag · on Nov 10, 2022

The basic elements all exist.

As others have pointed out, HP's Dynamo managed to use runtime re-optimization to improve performance of many binaries without cooperation from the original compiler. Runtime optimization doesn't strictly require any of the information lost in optimized binary builds.

Last I checked, the Android Runtime would AoT-compile Dalvik bytecode at install time, and in the background re-optimize the binary based on profiling information. Though, I don't think it performs hot code replacement.

I'm not sure the latest with Oracle's Java AoT. Last I checked, Oracle's JVM wasn't able to inline or re-optimize AoT-compiled code through JIT'd code.

Some optimizations, such as loop unrolling make JIT'ing harder. However, strength reduction, loop-invariant code hoisting, etc. make the JIT's life easier. Back around 2005-2006, my employer was getting good mileage out of a Java bytecode optimizer. If your AoT and JIT are cooperating, the AoT can stuff any helpful metadata (type information, aliasing analysis, serialized control flow graph, etc.) into an auxiliary section of the binary.

I'd like to eventually write a C compiler that essentially compiles to old-school threaded code: arrays of pointers to basic blocks (strait-line code with a single entry point and one or more exit points). Function entry would just pass the array of basic blocks to a trampoline function that calls the first basic block. Each basic block would return and index into the function's array of basic blocks for the trampoline to call next. Function return would be signaled by a basic block returning -1 to the trampoline loop. A static single assignment representation of each extended basic block would be stashed in an auxiliary ELF section. On a regular system without the runtime optimizer, the only startup overhead would be due to bloated binary size. A wild guess at the performance overhead without the runtime optimizer would be in the 5% to 15% range. However, on a system with the ELF dynamic loader replaced with a runtime-optimizer, the runtime would set up a perf signal handler that would keep counters for identifying hotspots to trace. If the tracing conditions were met, the perf signal handler would walk back up the call stack to find the last occurrence of the address of the trampoline, and replace it with a version of the trampoline that in addition, stores the address of the next extended basic block to run. Once a trace loops back on itself or meets some other TBD criteria, the runtime would stitch together the SSA representations of the constituent extended basic blocks, and generate a new optimized basic block that's the inline of the components of the trace, and finally place the address of the new extended basic block in the correct place in the array of extended basic blocks, thus performing runtime code replacement. Anyway, that's my grand vision. I've taken the introductiory Stanford compilers course, and am slowly working my way forward, but I have a job and a young kid, so I'm not holding my breath.

Among other things, this allows for inlining of hot paths across dynamic library boundaries. It also improves code locality and should increase the percentage of not-taken branches in the hot path, which should help reduce problems with aliasing in the branch predictor.

wumpus · on Nov 10, 2022

> In finance

Isn't low-latency trading only a subset of finance?

bluGill · on Nov 10, 2022

Yes. And the part I mentioned is a subset of low latency. See the other reply.

masklinn · on Nov 9, 2022

It's also an argument for having much more expressive and precise type systems, so the compiler has better information.

Once you've managed to debug the codegen anyway (see: The Long and Arduous Story of Noalias).

mikepurvis · on Nov 9, 2022

Is it? I'd love to see a breakdown of what classes of information can be gleaned from profile data, and how much of an impact each one has in isolation in terms of optimization.

Naively, I would have assumed that branch information would be most valuable, in terms of being able to guide execution toward the hot path and maximize locality for the memory accesses occurring on the common branches. And that info is not something that would be assisted by more expressive types, I don't think.

notriddle · on Nov 9, 2022

https://tomaszs2.medium.com/how-rust-1-64-became-10-20-faste...

https://news.ycombinator.com/item?id=33306945

anyfoo · on Nov 9, 2022

That's a lesson Intel had to (re-?)learn with Itanium as well.

inkyoto · on Nov 10, 2022

> […] "the compiler has insufficient information to achieve ideal optimization because some things can only be known at runtime."

This is where the profile guided optimisation comes in – for statically compiled languages, with a caveat being not always straightforward to come up with a set of inputs that will trigger an execution of all possible code paths. One solution is to provide the coverage specifically for the performance critical code paths and let the rest just be.

funstuff007 · on Nov 10, 2022

> it can also be an argument for instrumenting your runtime environment

Aren't JITs already self-instrumenting? What would you instrument that the JIT is not already keeping track of?

titzer · on Nov 9, 2022

Darn it, replied too early. See sibling comment I just posted. The problem with dynamic languages is that you need to speculate and be ready to undo that speculation.

hinkley · on Nov 9, 2022

Before I talked myself out of writing my own programming language, I used to have lunch conversations with my mentor who was also speed obsessed about how JIT could meet Knuth in the middle by creating a collections API with feedback guided optimization, using it for algorithm selection and tuning parameters by call site.

For object graphs in Java you can waste exorbitant amounts of memory by having a lot of “children” members that are sized for a default of 10 entries but the normal case is 0-2. I once had to deoptimize code where someone tried to do this by hand and the number they picked was 6 (just over half of the default). So when the average jumped to 7, then the data structure ended up being 20% larger than the default behavior instead of 30% smaller as intended.

For a server workflow, having data structured tuned to larger pools of objects with more complex comparison operations can also be valuable, but I don’t want that kitchen sink stuff on mobile or in an embedded app.

I still think this is viable, but only if you are clever about gathering data. For instance the incremental increase in runtime for telemetry data is quite high on the happy path. But corner cases are already expensive, so telemetry adds only a few percent there instead of double digits.

The nonstarter for this ended up being that most collections APIs violate Liskov, so you almost need to write your own language to pick a decomposition that doesn’t. Variance semantics help a ton but they don’t quite fix LSP.

pjmlp · on Nov 10, 2022

That is in a sense what JIT caching with PGO feedback loop does to a certain extent.

As for mobile, this isn't far off from the whole set of hand written Assembly interpreter, JIT with PGO data caching, AOT compilation on idle with the PGO collected data, sharing PGO metadata with other devices via the PlayStore, that Android happens to do since version 7, and refined later.

astrange · on Nov 10, 2022

Swift and Objective-C have collections that change implementation as needed, but they can do it because they're well abstracted (enough) and usually immutable so there's less chance of making a bad assumption.

Most other languages only have mutable collections, and name collection types after their implementation details instead of what the user actually wants from them.

titzer · on Nov 9, 2022

Dynamic languages need inline caches, type feedback, and fairly heavy inlining to be competitive. Some of that can be gotten offline, e.g. by doing PGO. But you can't, in general, adapt to a program that suddenly changes phases, or rebinds a global that was assumed a constant, etc. Speculative optimizations with deopt are what make dynamic languages fast.

kllrnohj · on Nov 10, 2022

I think you're over estimating the impact or relevance of that anecdote. Particularly since the "static compilation makes us faster" crowd turned out to be correct, and people use JITs for non-performance reasons and just pay the performance tax they so often come with. The time-constrained nature in which a JIT has to run just largely kills it's theoretical runtime information gathering advantages. Devirtualization remains about the most advanced trick in the JIT book, which is generally not an issue static compiled languages struggle with in the first place.

pjmlp · on Nov 10, 2022

Sure, because everyone ships static linked binaries.

saagarjha · on Nov 9, 2022

I take it you are not very familiar with the website known as Hacker News.

hawflakes · on Nov 9, 2022

People have mentioned the Dynamo project from HP. But I think you're actually thinking of the Aries project (I worked in a directly adjacent project) that allowed you to run PA-RISC binaries on IA-64.

https://nixdoc.net/man-pages/HP-UX/man5/Aries.5.html

mark_undoio · on Nov 9, 2022

Something that fascinates me about this kind of A -> A translation (which I associate with the original HP Dynamo project on HPPA CPUs) is that it was able to effectively yield the performance effect of one or two increased levels of -O optimization flag.

Right now it's fairly common in software development to have a debug build and a release build with potentially different optimisation levels. So that's two builds to manage - if we could build with lower optimisation and still effectively run at higher levels then that's a whole load of build/test simplification.

Moreover, debugging optimised binaries is fiddly due to information that's discarded. Having the original, unoptimised, version available at all times would give back the fidelity when required (e.g. debugging problems in the field).

Java effectively lives in this world already as it can use high optimisation and then fall back to interpreted mode when debugging is needed. I wish we could have this for C/C++ and other native languages.

saagarjha · on Nov 9, 2022

It depends greatly on which optimization levels you’re going through. —O0 to -O1 can easily be a 2-3x performance improvement, which is going to be hard to get otherwise. -O2 to -O3 might be 15% if you’re lucky, in which case -O+LTO+PGO can absolutely get you wins that beat that.

bluGill · on Nov 9, 2022

-O2 to -O3 has in some benchmarks made things worse. In others it is a massive win, but in generally going above -O2 should not be done without bench marking code. There are some optimizations that can make things worse or better for reasons that compiler cannot know.

astrange · on Nov 9, 2022

Over-optimizing your "cold" code can also make things worse for the "hot" code, eg by growing code size so much that briefly entering the cold space kicks everything out of caches.

hinkley · on Nov 9, 2022

I have often lamented not being able to hint to the JIT when I’ve transitioned from startup code to normal operation. I don’t need my Config file parsing optimized. But the code for interrogating the Config at runtime better be.

Everything before listen() is probably run once. Except not ever program calls listen().

hinkley · on Nov 9, 2022

And then there’s always the outlier where optimizing for size makes the working memory fit into cache and thus the whole thing substantially faster.

foobiekr · on Nov 9, 2022

One of the engineers I was working with on a project was from Transitive (the company that made QuickTransit which became Rosetta) found that their JIT based translator could not deliver significant performance increases for A->A outside of pathological cases, and it was very mature technology at the time.

I think it's a hypothetical. The Mill Computing lectures talk about a variant of this, which is sort of equivalent to an install-time specializer for intermediate code which might work, but that has many problems (for one thing, it breaks upgrades and is very, very problematic for VMs being run on different underlying hosts).

bogomipz · on Nov 10, 2022

>"The Mill Computing lectures talk about a variant of this ..."

Might you or someone else have a link those Mill Computing lectures?

foobiekr · on Nov 10, 2022

Sure thing. I’m on mobile but the 2nd one was easy to find and is here -

https://youtu.be/QGw-cy0ylCc

freedomben · on Nov 9, 2022

If JIT-ing a statically compiled input makes it faster, does that mean that JIT-ing itself is superior or does it mean that the static compiler isn't outputting optimal code? (real question. asked another way, does JIT have optimizations it can make that a static compiler can't?)

kmeisthax · on Nov 9, 2022

It's more the case that the ahead-of-time compilation is suboptimal.

Modern compilers have a thing called PGO (Profile Guided Optimization) that lets you take a compiled application, run it and generate an execution profile for it, and then compile the application again using information from the profiling step. The reason why this works is that lots of optimization involves time-space tradeoffs that only make sense to do if the code is frequently called. JIT only runs on frequently-called code, so it has the advantage of runtime profiling information, while ahead-of-time (AOT) compilers have to make educated guesses about what loops are the most hot. PGO closes that gap.

Theoretically, a JIT could produce binary code hyper-tailored to a particular user's habits and their computer's specific hardware. However, I'm not sure if that has that much of a benefit versus PGO AOT.

com2kid · on Nov 9, 2022

> Theoretically, a JIT could produce binary code hyper-tailored to a particular user's habits and their computer's specific hardware. However, I'm not sure if that has that much of a benefit versus PGO AOT.

In theory JIT can be a lot more efficient, optimizing for not only the exact instruction set, and do per CPU architecture optimizations, such as instruction length, pipeline depth, cache sizes, etc.

In reality I doubt most compiler or JIT development teams have the resources to write and test all those potential optimizations, especially as new CPUs are coming out all the time, and each set of optimizations is another set of tests that has to be maintained.

duped · on Nov 9, 2022

Like another commented, JIT compilers do this today.

The thing that makes this mostly theoretical is that the underlying assumption is only true when you neglect that an AOT has zero run-time cost while a JIT compiler has to execute the code it's optimizing and the code to decide if it's worth optimizing and generate new code.

So JIT compiler optimizations are a bit different than AOT optimizations since they have to both generate faster/smaller code and the execute code that performs the optimization. The problem is that most optimizations beyond peephole are quite expensive.

There's another thing that AOT compilers don't need to deal with, which is being wrong.Production JITs have to implement dynamic de-optimization in the case that an optimization was built on a bad assumption.

That's why JITs are only faster in theory (today), since there are performance pitfalls in the JIT itself.

KMag · on Nov 9, 2022

But, JIT vs. AoT is a false dichotomy. Given light-weight enough profiling utilizing cooperation between hardware designers and compiler writers, one could have AoT with feedback-guided optimization and link-time optimization, and still have just-in-time re-optimization.

Concretely, I think you'd want hardware that supported reservoir sampling of where CPU cycles are spent, sampling of which branches are mispredicted, and which code locations are causing cache misses. You'd also want lightweight hardware recording of execution traces.

saagarjha · on Nov 10, 2022

Guess what Apple has? :)

titzer · on Nov 9, 2022

Nearly all JS engines are doing concurrent JIT compilation now, so some of the compilation cost is moved off the main thread. Java JITs have had multiple compiler threads for more than a decade.

kllrnohj · on Nov 10, 2022

But they all still optimize their JITs to prioritize compilation speed & RAM usage (JIT'd code is dirty pages after all) over maximum optimizations. This is why you see things like WebKits multi-tier JIT strategy: https://webkit.org/blog/3362/introducing-the-webkit-ftl-jit/

They still want to swap in that JIT'd result ASAP since after all by the time it's been flagged for compilation it's already too late & is a hot hot hot function.

pjmlp · on Nov 10, 2022

Which is why in the Java world (including Android), and .NET, now we use JIT caches as well, so that this data isn't lost between runs.

saagarjha · on Nov 9, 2022

The well funded production JIT compilers (HotSpot, V8, etc.) absolutely do take advantage of these. The vector ISA can sometimes be unwieldy to work with but things like replacing atomics, using unaligned loads, or taking advantage of differing pointer representations is common.

com2kid · on Nov 9, 2022

They do some auto-vectorization, but AFAIK they don't do micro-optimizations for different CPUs.

bluGill · on Nov 9, 2022

gcc and clang at least have options so you can optimize for specific CPUs. I'm not sure how good they are (most people want a generic optimization that runs well on all CPUs of the family, so there likely is lots of room for improvement with CPU specific optimization), but they can do that. This does (or at least can, again it probably isn't fully implemented), account for instruction length, pipeline depth, cache size.

The Javascript V8 engine, and the JVM both are popular and supported enough that I expect the teams working on them take advantage of every trick they can for specific CPUs, they have a lot of resources for this. (at least the major x86 and ARM chips - maybe they don't for MIPS or some uncommon variant of ARM...). Of courses there are other JIT engines, some uncommon ones don't have many resources and won't do this.

titzer · on Nov 9, 2022

> take advantage of every trick they can for specific CPUs

Not to the extent clang and gcc do, no. V8 does, e.g. use AVX instructions and some others if they are indicated to be available by CPUID. TurboFan does global scheduling in moving out of the sea of nodes, but that is not machine-specific. There was an experimental local instruction scheduler for TurboFan but it never really helped big cores, while measurements showed it would have helped smaller cores. It didn't actually calculate latencies; it just used a greedy heuristic. I am not sure if it was ever turned on. TurboFan doesn't do software pipelining or unroll/jam, though it does loop peeling, which isn't CPU-specific.

astrange · on Nov 9, 2022

> gcc and clang at least have options so you can optimize for specific CPUs. I'm not sure how good they are

They are not very good at it, and can't be. You can look inside them and see the models are pretty simple; the best you can do is optimize for the first step (decoder) of the CPU and avoid instructions called out in the optimization manual as being especially slow. But on an OoO CPU there's not much else you can do ahead of time, since branches and memory accesses are unpredictable and much slower than in-CPU resource stalls.

mockery · on Nov 9, 2022

In addition to the sibling comments, one simple opportunity available to a JIT and not AOT is 100% confidence about the target hardware and its capabilities.

For example AOT compilation often has to account for the possibility that the target machine might not have certain instructions - like SSE/AVX vector ops, and emit both SSE and non-SSE versions of a codepath with, say, a branch to pick the appropriate one dynamically.

Whereas a JIT knows what hardware it's running on - it doesn't have to worry about any other CPUs.

acdha · on Nov 9, 2022

One great example of this was back in the P4 era where Intel hit higher clock speeds at the expense of much higher latency. If you made a binary for just that processor a smart compiler could use the usual tricks to hit very good performance, but that came at the expense of other processors and/or compatibility (one appeal to the AMD Athlon & especially Opteron was that you could just run the same binary faster without caring about any of that[1]). A smart JIT could smooth that considerably but at the time the memory & time constraints were a challenge.

1. The usual caveats about benchmarking what you care about apply, of course. The mix of webish things I worked on and scientists I supported followed this pattern, YMMV.

duped · on Nov 9, 2022

AOT compilers support this through a technique called function multi-versioning. It's not free and only goes so far, but it isn't reserved to JITs.

The classical reason to use FMV is for SIMD optimizations, fwiw

ketralnis · on Nov 9, 2022

It means that in this case, the static compiler emitted code that could be further optimised, that's all. It doesn't mean that that's always the case, or that static compilers can't produce optimal code, or that either technique is "better" than the other.

An easy example is code compiled for 386 running on a 586. The A->A compiler can use CPU features that weren't available to the 386. As with PGO you have branch prediction information that's not available to the static compiler. You can statically compile the dynamically linked dependencies, allowing inlining that wasn't previously available.

On the other hand you have to do all of that. That takes warmup time just like a JIT.

I think the road to enlightenment is letting go of phrasing like "is superior". There are lots of upsides and downsides to pretty much every technique.

andrewaylett · on Nov 9, 2022

It depends on what the JIT does exactly, but in general yes a JIT may be able to make optimisations that a static compiler won't be aware of because a JIT can optimise for the specific data being processed.

That said, a sufficiently advanced CPU could also make those optimisations on "static" code. That was one of the things Transmeta had been aiming towards, I think.

rowanG077 · on Nov 9, 2022

A JIT can definitely make optimizations that a static compiler can't. Simply by virtue of it having concrete dynamic real-time information.

vips7L · on Nov 9, 2022

Yes, the JIT has more profile guided data as to what your program actually does at runtime, therefore it can optimize better.

gpderetta · on Nov 9, 2022

On the other hand some optimization are so expensive that a JIT just doesn't have the execution budget to perform them.

Probably the optimal system is an hybrid iterative JIT/AOT compiler (which incidentally was the original objective of LLVM).

jeffbee · on Nov 9, 2022

Post-build optimization of binaries without changing the target CPU is common. See BOLT https://github.com/facebookincubator/BOLT

chrisseaton · on Nov 9, 2022

I've run Ruby C extensions on a JIT faster than on native, due to things like inlining and profiling working more effectively at runtime.

bogomipz · on Nov 10, 2022

>"I remember years ago when Java adjacent research was all the rage, ..."

What is meant by "Java adjacent research"? I'm not familiar with what that was.

vinyl7 · on Nov 10, 2022

> where the compilers had a fast mode for code-build-test cycles and a very slow incremental mode for official builds.

That already exists with c/c++...unity builds aka single translation unit builds. Compiling and linking a ton of object files takes an inordinate amount of time, often the majority of the build time

AussieWog93 · on Nov 9, 2022

Outside of gaming, or hyper-CPU-critical workflows like video editing, I'm not really sure if people actually even care about that last 10% of performance.

I know most of the time I get frustrated by everyday software, its doing something unnecessary in a long loop, and possibly forgetting to check for Windows messages too.

koala_man · on Nov 9, 2022

Performance also translates into better battery life and cheaper datacenters.

tomcam · on Nov 9, 2022

I’m likely misunderstanding what you said, but I thought pre-compiled headers were pretty much standard these days.

tomcam · on Nov 10, 2022

What on earth did I say to merit the downvotes?

zaphirplane · on Nov 9, 2022

Is this for itanium

fuckstick · on Nov 9, 2022

> The output was faster than the input.

So if you ran the input back through the output multiple times then that means you could eventually get the runtime down to 0.

twic · on Nov 9, 2022

But unfortunately, the memory use goes to infinity.

avidiax · on Nov 9, 2022

Probably the output of the decade-old compiler that produced the original binary had no optimizations.

hinkley · on Nov 9, 2022

That too but the eternal riddle of optimizer passes is which ones reveal structure and which obscure it. Do I loop unroll or strength reduce first? If there are heuristics about max complexity for unrolling or inlining then it might be “both”.

And then there’s processor family versus this exact model.

darzu · on Nov 9, 2022

Does anyone know the names of the key people behind Rosetta 2?

In my experience, exceptionally well executed tech like this tends to have 1-2 very talented people leading. I'd like to follow their blog or Twitter.

cwzwarich · on Nov 9, 2022

I am the creator / main author of Rosetta 2. I don't have a blog or a Twitter (beyond lurking).

darzu · on Nov 9, 2022

If you're feeling inclined, here's a slew of questions:

What was the most surprising thing you learned while working on Rosetta 2?

Is there anything (that you can share) that you would do differently?

Can your recommend any great starting places for someone interested in instruction translation?

Looking forward, did your work on Rosetta give you ideas for unfilled needs in the virtualization/emulation/translation space?

What's the biggest inefficiency you see today in the tech stacks you interact most with?

A lot of hard decisions must have been made while building Rosetta 2; can you shed light on some of those and how you navigated them?

darzu · on Nov 9, 2022

Should you feel inspired to share your learnings, insights, or future ideas about the computing spaces you know, me and I'm sure many other people would be interested to listen!

My preferred way to learn about a new (to me) area of tech is to hear the insights of the people who have provably advanced that field. There's a lot of noise to signal in tech blogs.

pcf · on Nov 9, 2022

Thanks for your amazing work!

May I ask – would it be possible to implement support for 32-bit VST and AU plugins?

This would be a major bonus, because it could e.g. enable producers like me to open up our music projects from earlier times, and still have the old plugins work.

bdash · on Nov 9, 2022

Impressive work, Cameron! Hope you're doing well.

skrrtww · on Nov 9, 2022

Are you able to speak at all to the known performance struggles with x87 translation? Curious to know if we're likely to see any updates or improvements there into the future.

olliej · on Nov 10, 2022

There are two ways to approach x87: either saying to heck with it and just using doubles for everything (this is essentially what Qemu does) or creating a software fp80 implementation. Both approaches get burned by the giant amount of state, and state weirdness, that x87 brings to the table. It's also not possible to "fix" things by optimizing for the cases where the x87 unit's precision is set to the same as fp32 or fp64, as the precision flags don't impact the exponent range.

But even on native hardware using x87 is vastly slower than fp64, and it's just a shame that only win64 had the good sense to define long double as being fp64 instead of fp80 as every other x86_64 platform did :-/

phire · on Nov 10, 2022

> It's also not possible to "fix" things by optimizing for the cases where the x87 unit's precision is set to the same as fp32 or fp64, as the precision flags don't impact the exponent range.

I've been meaning to look into this. Certainly you can't blindly optimise all x87 code sequences to fp32 or fp64. But some sequences are safe.

For example, adding two numbers and saving back to memory is safe to optimise (at least for the infinity case, I haven't double-checked the subnormal behaviour). It's only when you need to add three or more numbers that you run into issues (though you can go further, if all N numbers have the same sign, you will get the correct result, you just might have saturated at infinity a few operations earlier than native x87)

Same goes for multiplication of two numbers (and N numbers that all provably >= 1.0)

The question is if such code sequences are common enough to bother trying to identify at compile time and optimise.

olliej · on Nov 10, 2022

> For example, adding two numbers and saving back to memory is safe to optimise (at least for the infinity case, I haven't double-checked the subnormal behaviour). It's only when you need to add three or more numbers that you run into issues (though you can go further, if all N numbers have the same sign, you will get the correct result, you just might have saturated at infinity a few operations earlier than native x87)

No, you cannot. Operations you can optimise are negation, nan, and infinity checks (ignoring pseudo nans and pseudo infinity checks of course).

fp80 has a 15bit exponent, and functionally a 63 significand, vs fp64's 11bit exponent and 52bit significand. When setting x87 to a reduced precision mode you aren't switching to fp64, you're getting a mix with a 15 bit exponent and a 53 significand. The effect is that you retain 53bits of precision for values where fp64 has entered subnormals, and conversely you maintain 53 bits of precision after fp64 has overflowed. There are perf benefits to reducing precision in x87 (at least in the 90s), but the main advantage is consistent rounding with fp64 while in the range of normalized fp64 values.

phire · on Nov 10, 2022

The key to this optimisation idea is the exponent gets truncated back to 8 bits when being written back to memory.

For the example of two fp32s adding to infinity.

With 15bit exponent: The add results in a non-infinite with exponent outside the -125 to 127 range. Then when writing back to memory, it the FPU notices the exponent is outside of the valid range, clamps it and writes infinity to memory.

With 8 bit exponent: The add immediately clamps to infinity in the register. and then it writes to memory.

In both cases you get the same result in memory, so the result is valid as long as the in-register version is killed. And the same should apply to the subnormal case (I have not double checked the x87 spec). If you start with two subnormals that are valid f32s, add them, get a subnormal result and then write back to memory as a f32, it should be guaranteed to produce the same result with both a 15bit exponent and a 8bit exponent. It doesn't matter if the subnormal mantissa was truncated before writing back to the register, or writing back to memory. It was still truncated.

You only start getting accuracy issues start doing multiple additions in a row without truncating the exponent. If you add 3 floats, the result of A+B might be infinite, but A+B+C could result in a normal f32 if you had a 15bit exponent (when A+B is positive infinity, and C is negative)

This line of thought could potentially be pushed further. If you can prove (or guard) at compile time that all N floats in a sequence of N adds will be positive (and not subnormal), then you can't have a case where one of intermediary exponent exceeds 127 but then the final exponent is less than 128. If there is an infinite anywhere along the chain, it will saturate to infinity. With 15bit exponent, the saturation might not be applied until the f32 value is written to memory at the end, but because of the preconditions the optimiser can guarantee the same result in memory (either infinity or a normal) at the end while only using 8 bit exponents operations.

Most of the above should also apply to other operations like multiply. I've only done some preliminary thinking about this idea, enough to be sure that some operations could be optimised. I'm only fully confident about clamping to infinity case, and I'm going to be really bummed if when I get around to double-checking, there is something about how the x87 deals with subnormals that I'm not aware of. Or some other x87 weirdity.

olliej · on Nov 11, 2022

> The key to this optimisation idea is the exponent gets truncated back to 8 bits when being written back to memory.

That's incorrect - the clamping of the exponent only occurs if you were to use FST/m32 or FST/m64, but if you're using x87 you're presumably doing FSTP/m80fp so there is no truncation or rounding on store, regardless of the prevision flag in the control word.

It sounds like what you're trying to arrange is an optimization such that a rosetta like translator/emulator can optimize this highly awesome function to be performed entirely using hardware fp32 or fp64 support:

    fp32 f(fp32 *fs, size_t count) {
      // pseudo code obviously :D
      ControlWord cw = fstcw();
      cw.precision = Precision32;
      fldcw(cw);
      fp80 result = 0;
      for (unsigned j = 0; j < count; j++) {
        result += fs[j];
      }
      return (fp32)result;
      // pretend we restored state before returning :)
    }

The problem you run into though, is that an optimization pass can make decisions based on anything other than the code it is presented with. So your optimizer can't assume sign or magnitude here, so that += has to be able to under or overflow the range that fp32 offers.

Things get really miserable once you go beyond +/-, because you can end up in a position where an optimization to do everything in the 32/64 bit units means that you won't get observable double rounding.

This is kind of moot in the rosetta case as I don't believe we ever implemented support for the precision control bits

More fun are the transcendentals - x87 specifies them as using range reduction and so they are incredibly inaccurate (in the context of maths functions), especially around multiples of pi/4, and if you go test it you'll find rosetta will produce the same degree of terrible output :D

phire · on Nov 11, 2022

I was thinking more about functions along the lines of this vertex transform function that you might theoretically find as hot code in a late 90s or early 2000s windows game (before hardware transform and lighting).

    void transform_verts(fp32 *m, fp32 *verts, size_t vert_count) {
      // it's a game, decent chance it applies percision32 across the whole process
      // Especially since directx < 10 automatically sets it when a 3d context is created
      while (vert_count--) {
        verts[0] = verts[0] * m[0] + verts[0] * m[1] + verts[0] * m[2] + m[3];
        verts[1] = verts[1] * m[4] + verts[1] * m[5] + verts[1] * m[6] + m[7];
        verts[2] = verts[2] * m[8] + verts[2] * m[9] + verts[2] * m[10] + m[11];
        verts += 3;
      }
    }

Would be nice if we could optimise it all to pure hardware fp32 without any issues. But not really possible with those six operation long chains. And you are right, we can't really assume anything about the data.

But we can go for guards and fallbacks instead. Implement that loop body as something like

    loop: 
        // Attempt calculation with hardware fp32
        $1 = hwmul(verts[0], m[0])
        $2 = hwmul(verts[0], m[1])
        $3 = hwmul(verts[0], m[2])
        $4 = hwadd($1, $2)
        $5 = hwadd($3, $4)
        $6 = hwadd($5, m[3]) // any infs from above sub-equations will saturate though to here
        if any(is_subnormal_or_zero([$1, $2, $3, $4, $5]) || $6 is inf: // guard
           // one of the above subcalulcations became either inf or subnormal, so our
           // hwresult might not be accurate. recalculate with safe softfloat
           $8 = swadd(swmul(verts[0], m[0]), swmul(verts[0], m[1]))
           $6 = swadd(swadd($8, swmul(vert[0], m[2])), m[3])
    
        verts[0] = $6
    
        // repeat above pattern for verts[1] and verts[2]
    
        goto loop

I think that produces bit-accurate results?

Sure, it might seem complicated to calculate twice. But the resulting code might end up faster than just pure softfloat code across average data. Maybe this is the type of optimisation that you only attempt at the highest level on a multi-tier JIT for really hot code. You could perhaps even instrument the function first to get an idea what the common shape of the data is.

> This is kind of moot in the rosetta case as I don't believe we ever implemented support for the precision control bits

So it's already producing inaccurate results for code that sets precision control? Might as well just switch over to hardware fp32 and fp64 /s

I guess for the rosetta usecase, Intel macs didn't until 2006 and so most of the install base of x86 programs will be compiled with SSE2 support, and commonly 64bit.

Probably the most common usecase for x87 support in rosetta will be 64bit code used long doubles and compilers/ABIs annoying implemented them as x87.

olliej · on Nov 13, 2022

Sorry for delay (surgery funsies)

> So it's already producing inaccurate results for code that sets precision control? Might as well just switch over to hardware fp32 and fp64 /s

:D

But in practice the only reason for changing the x87 precision is performance, which was then simply retained in hardware for backwards compatibility. Modern code (as in >= SSE era) simply uses fp32 or fp64 which is faster, more memory compact, has vector units, has a much more sane ISA, etc. Anyone who does try to toggle x87 mode in general is in for a world of hurt because the system libraries all assume the unit is operating in default state.

You are correct that the only reason x86_64 needs x87 is that the unix x86_64 ABI decided to specify the already clearly deprecated format the implementation of long double. I often looked wistfully at win64 where long double == double.

dougall · on Nov 10, 2022

Amazing work! It's nice to put a name to it :)

Klonoar · on Nov 9, 2022

Huh, this is timely. Incredibly random but: do you know if there was anything that changed as of Ventura to where trying to mmap below the 2/4GB boundary would no longer work in Rosetta 2? I've an app where it's worked right up to Monterey yet inexplicably just bombs in Ventura.

mrpippy · on Nov 10, 2022

This should work (Wine obviously needs it when running 32-bit apps). Are you explicitly specifying a small PAGEZERO when compiling?

Klonoar · on Nov 10, 2022

saagarjha · on Nov 10, 2022

Pretty sure mmap goes almost directly to the kernel in Rosetta 2, and Apple silicon requires at least 4 GB.

Klonoar · on Nov 10, 2022

This works fine in Big Sur and Monterey on Apple Silicon, hence my point about if something changed in Ventura.

Edit: Big Sur -> Ventura.

anyfoo · on Nov 9, 2022

Not affiliated and don't know, but curious why you're doing that in the first place?

Klonoar · on Nov 10, 2022

's not my doing, it's just an older project that's slowly migrating to a newer system but is held back by everyone having lives. I wouldn't do it normally, heh.

keepquestioning · on Nov 9, 2022

Isn't Rosetta 2 "done"? What are you working on now?

trollied · on Nov 9, 2022

The original Rosetta was written by Transitive, which was formed by spinning a Manchester University research group out. See https://www.software.ac.uk/blog/2016-09-30-heroes-software-e...

I know a few of their devs went to ARM, some to Apple & a few to IBM (who bought Transitive). I do know a few of their ex staff (and their twitter handles), but I don’t feel comfortable linking them here.

scrlk · on Nov 9, 2022

IIRC the current VP of Core OS at Apple is ex-Manchester/Transitive.

lunixbochs · on Nov 9, 2022

> To see ahead-of-time translated Rosetta code, I believe I had to disable SIP, compile a new x86 binary, give it a unique name, run it, and then run otool -tv /var/db/oah///unique-name.aot (or use your tool of choice – it’s just a Mach-O binary). This was done on old version of macOS, so things may have changed and improved since then.

My aotool project uses a trick to extract the AOT binary without root or disabling SIP: https://github.com/lunixbochs/meta/tree/master/utils/aotool

menaerus · on Nov 9, 2022

> Rosetta 2 translates the entire text segment of the binary from x86 to ARM up-front.

Do I understand correctly that the Rosetta is basically a transpiler from x86-64 machine code to ARM machine code which is run prior to the binary execution? If so, does it affect the application startup times?

nicoburns · on Nov 9, 2022

> If so, does it affect the application startup times?

It does, but only the very first time you run the application. The result of the transpilation is cached so it doesn't have to be computed again until the app is updated.

kijiki · on Nov 9, 2022

Similar to DEC's FX!32 in that regard. FX!32 allowed running x86 Windows NT apps on Alpha Windows NT.

saltcured · on Nov 9, 2022

There was also an FX!32 for Linux. But I think it may have only included the interpreter part and left out the transpiler part. My memory is vague on the details.

I do remember that I tried to use it to run the x86 Netscape binary for Linux on a surplus Alpha with RedHat Linux. It worked, but so slowly that a contemporary Python-based web browser had similar performance. In practice, I settled on running Netscape from a headless 486 based PC and displaying remotely on the Alpha's desktop over ethernet. That was much more usable.

dylan604 · on Nov 9, 2022

Does that essentially mean each non-native app is doubled in disk use? Maybe not doubled but requires more space to be sure.

varenc · on Nov 9, 2022

Yes... you can see the cache in /var/db/oah/

Though only the actual binary size that gets doubled. For large apps it’s usually not the binary that’s taking up most of the space.

saagarjha · on Nov 9, 2022

arianvanp · on Nov 9, 2022

And deleting the cache is undocumented (it is not in the file system) so if you run Mac machines as CI runners they will trash and brick themselves running out of disk space over time.

koala_man · on Nov 9, 2022

Really? This SO question says it's stored in /var/db/oah/

https://apple.stackexchange.com/questions/427695/how-can-i-l...

jonny_eh · on Nov 9, 2022

You mean the cache is ever expanding?

rowanG077 · on Nov 9, 2022

What in the actual fuck. That is such an insane decision. Where is it stored then? Some dark corner of the file system inaccessible via normal means?

happyopossum · on Nov 10, 2022

GP is incorrect - they’re stored in individual files inside /var/db/oah, and can be deleted without causing harm.

esskay · on Nov 9, 2022

The first load is fairly slow, but once it's done it every load after that is pretty much identical to what it'd be running on an x86 mac due to the caching it does.

EricE · on Nov 9, 2022

For me my M1 was fast enough that the first load didn't seem that different - and more importantly subsequent loads were lighting fast! It's astonishing how good Rosetta 2 is - utterly transparent and faster than my Intel Mac thanks to the M1.

savoytruffle · on Nov 9, 2022

If installed using a packaged installer, or the App Store, the translation is done during installation instead of at first run. So, slow 1st launch may be uncommon for a lot of apps or users.

nilsb · on Nov 9, 2022

Yes, it does. The delay of the first start of an app is quite noticeable. But the transpiled binary is apparently cached somewhere.

saagarjha · on Nov 9, 2022

/var/db/oah.

johnthuss · on Nov 9, 2022

"I believe there’s significant room for performance improvement in Rosetta 2... However, this would come at the cost of significantly increased complexity... Engineering is about making the right tradeoffs, and I’d say Rosetta 2 has done exactly that."

Gigachad · on Nov 9, 2022

Would be a waste of effort when the tool is designed to be obsolete in a few years as everything gets natively compiled.

solarkraft · on Nov 10, 2022

They could've amazed a few people a bit more by emulating x86 apps even faster (but M1+Rosetta can already run some stuff faster than an Intel Mac), but then the benefit of releasing native apps would be much decreased ("why bother, it's good enough ...").

It's a delicate political game that they, yet again, seem to be playing pretty well.

saagarjha · on Nov 9, 2022

One thing that’s interesting to note is that the amount of effort expended here is not actually all that large. Yes, there are smart people working on this, but the performance of Rosetta 2 for the most part is probably the work of a handful of clever people. I wouldn’t be surprised if some of them have an interest in compilers but the actual implementation is fairly straightforward and there isn’t much of the stuff you’d typically see in an optimizing JIT: no complicated type theory or analysis passes. Aside from a handful of hardware bits and some convenient (perhaps intentionally selected) choices in where to make tradeoffs there’s nothing really specifically amazing here. What really makes it special is that anyone (well, any company with a bit of resources) could’ve done it but nobody really did. (But, again, Apple owning the stack and having past experience probably did help them get over the hurdle of actually putting effort into this.)

dougall · on Nov 9, 2022

Yeah, agreed. I get the impression it's a small team.

But there is a long-tail of weird x86 features that are implemented, that give them amazing compatibility, that I regret not mentioning:

* 32-bit support for Wine

* full x87 emulation

* full SSE2 support (generally converting to efficient NEON equivalents) for performance on SIMD code

I consider all of these "compatibility", but that last one in particular should have been in the post, since that's very important to the performance of optimised SIMD routines (plenty of emulators also do SIMD->SIMD, but others just translate SIMD->scalar or SIMD->helper-runtime-call).

saagarjha · on Nov 10, 2022

Guilty!

menaerus · on Nov 10, 2022

I think it's about the incentive and not about other companies not doing it. Apple decided to move to ARM and the reason is probably in their strong connection to the ARM ecosystem which basically means that they have an edge with their vertical-integration approach when compared to the other competitors. Apple is one of the three _founding_ companies of ARM. Other two were VLSI Technology and Acorn.

karmakaze · on Nov 9, 2022

Vertical integration. My understanding was it's because the Apple silicon ARM has special support to make it fast. Apple has had enough experience to know that some hardware support can go a long way to making the binary emulation situation better.

saagarjha · on Nov 9, 2022

That’s not correct, the article goes into details why.

nwallin · on Nov 9, 2022

That is correct, the article goes into details why. See the "Apple's Secret Extension" section as well as the "Total Store Ordering" section.

The "Apple's Secret Extension" section talks about how the M1 has 4 flag bits and the x86 has 6 flag bits, and how emulating those 2 extra flags would make every add/sub/cmp instruction significantly slower. Apple has an undocumented extension that adds 2 more flag bits to make the M1's flag bits behave the same as x86.

The "Total Store Ordering" section talks about how Apple has added a non-standard store ordering to the M1 than makes the M1 order its stores in the same way x86 guarantees instead of the way ARM guarantees. Without this, there's no good way to translate instructions in code in and around an x86 memory fence; if you see a memory fence in x86 code it's safe to assume that it depends on x86 memory store semantics and if you don't have that you'll need to emulate it with many mostly unnecessary memory fences, which will be devastating for performance.

saagarjha · on Nov 9, 2022

I’m aware of both of these extensions; they’re not actually necessary for most applications. Yes, you trade fidelity with performance, but it’s not that big of a deal. The majority of Rosetta’s performance is good software decisions and not hardware.

dougall · on Nov 10, 2022

Yeah, these features exist, and they help, but I don't think they should be given all the credit. Both "Apple's Secret Extension" and "Total Store Ordering" are features that other emulators can choose to disable to get exactly the same performance.

"Apple's Secret Extension" isn't even used by Rosetta 2 on Linux (opting for, at least, explicit parity flag calculations rather than reduced fidelity). It's still fast.

TSO is only required for accuracy on multithreaded applications, and the PF and AF flags are basically never used, and, if they are, will usually be used immediately after being set, allowing emulators to achieve reasonable fidelity by only calculating them when used.

There's perhaps a better argument for performance-via-vertical-integration with the flag-manipulation extensions, which I believe Apple created and standardised, but which now anyone can use.

But the reason I wrote this post is that I think most of the ideas are transferable and could help other emulators :)

menaerus · on Nov 10, 2022

> TSO is only required for accuracy on multithreaded applications

If by accuracy you mean not segfaulting then yes. Every moderately complex x86-64 application will have memory fences in the generated machine code. x86-64 design of store-buffers and load-buffers are making the memory fences a necessity. In reality it's enough just to use the mutex or atomics in your code to end up with the memory fence in your generated machine code. So, I'd say that this particular part of Rosetta/M1 design is quite important, if not the most important. Without it applications wouldn't run.

saagarjha · on Nov 11, 2022

You can approximate it fairly well without much impact.

neerajsi · on Nov 13, 2022

Not true. The required fencing has huge impact. I led development of the chpe compiler for windows on arm, and the fencing was major source of our gains.

saagarjha · on Nov 13, 2022

I don't think we disagree :) If you're going for full accuracy you morally need barriers all over the place. If have TSO in your chips that makes things far easier, alternatively you can do stuff with RCpc if your hardware supports it. Otherwise you get stuck with fences everywhere, or you force your hardware into TSO compliance mode (read: turn off all the other cores) and that sucks.

The other option is you relax on the "required fencing", with the assumption that most accesses do not actually exercise the full semantics that TSO guarantees. Obviously some synchronization does matter, so you need heuristics and those won't always work. My understanding was that XTA has some of these, with knobs to turn them off if they don't work? You probably know more about that than I do. In iSH we play it even more fast-and-loose, with all regular memory accesses being lowered to ARM loads and stores, and locked operations to whatever seemed the closest. It's definitely not production-grade but we have shockingly good compatibility for what it is.

hinkley · on Nov 9, 2022

Apple is doing some really interesting but really quiet work in the area of VMs. I feel like we don’t give them enough credit but maybe they’ve put themselves in that position by not bragging enough about what they do.

As a somewhat related aside, I have been watching Bun (low startup time Node-like on top of Safari’s JavaScript engine) with enough interest that I started trying to fix a bug, which is somewhat unusual for me. I mostly contribute small fixes to tools I use at work. I can’t quite grok Zig code yet so I got stuck fairly quickly. The “bug” turned out to be default behavior in a Zig stdlib, rather than in JavaScript code. The rest is fairly tangential but suffice it to say I prefer self hosted languages but this probably falls into the startup speed compromise.

Being low startup overhead makes their VM interesting, but the fact that it benchmarks better than Firefox a lot of the time and occasionally faster than v8 is quite a bit of quiet competence.

jraph · on Nov 9, 2022

> feel like we don’t give them enough credit but maybe they’ve put themselves in that position by not bragging enough about what they do.

And maybe also by keeping the technology closed and Apple-specific. Many people who could be interested in using it don't have access to it.

jolux · on Nov 9, 2022

WebKit B3 is open source: https://webkit.org/docs/b3/

freedomben · on Nov 9, 2022

Exactly. As someone who would be very interested in this, but don't use Apple products, it's just not exciting because it's not accessible to me (I can't even test it as a user). If they wanted to write a whitepaper about it to share knowledge, that might be interesting, but given that it's Apple I'm not gonna hold my breath.

saagarjha · on Nov 9, 2022

Apple (mostly WebKit) writes a significant amount about how they designed their VMs.

kccqzy · on Nov 9, 2022

> The instructions from FEAT_FlagM2 are AXFLAG and XAFLAG, which convert floating-point condition flags to/from a mysterious “external format”. By some strange coincidence, this format is x86, so these instruction are used when dealing with floating point flags.

This really made me chuckle. They probably don't want to mention Intel by name, but this just sounds funny.

https://developer.arm.com/documentation/100076/0100/A64-Inst...

Vt71fcAqt7 · on Nov 9, 2022

I hope Rosetta is here to stay and continues developement. And I hope what is learned from it can be used to make a RISC-V version of it. translating native ARM to RISC-V should be much easier than x86 to ARM as I understand it, so one could conceivably do x86 -> ARM -> RISC-V.

masklinn · on Nov 9, 2022

> I hope Rosetta is here to stay and continues developement.

It almost certainly is not. Odds are Apple will eventually remove Rosetta II, as they did Rosetta back in the days, once they consider the need for that bridge to be over (Rosetta was added in 2006 in 10.4, and removed in 2011 from 10.7).

> And I hope what is learned from it can be used to make a RISC-V version of it. translating native ARM to RISC-V should be much easier than x86 to ARM as I understand it, so one could conceivably do x86 -> ARM -> RISC-V.

That's not going to happen unless Apple decides to switch from ARM to RISC-V, and... why would they? They've got 15 years experience and essentially full control on ARM.

CharlesW · on Nov 9, 2022

> Odds are Apple will eventually remove Rosetta II, as they did Rosetta back in the days, once they consider the need for that bridge to be over (Rosetta was added in 2006 in 10.4, and removed in 2011 from 10.7).

The difference is that Rosetta 1 was PPC → x86, so its purpose ended once PPC was a fond memory.

Today's Rosetta is a generalized x86 → ARM translation environment that isn't just for macOS apps. For example, it works with Apple's new virtualization framework to support running x86_64 Linux apps in ARM Linux VMs.

https://developer.apple.com/documentation/virtualization/run...

stu2b50 · on Nov 9, 2022

Rosetta 1 had a ticking time bomb. Apple was licensing it from a 3rd party. Rosetta 2 is all in house as far as we know.

Different CEO as well. Jobs was more opinionated on “principles” - Cook is more than happy to sell what people will buy. I think Rosetta 2 will last.