Apple has already done this to an extent. M1 has undocumented instructions and a semi-closed toolchain (Assuming they have some, M1 tuning models for LLVM and GCC are nowhere to be seen afaik)
Intel publish a very thick optimization manual, which is a good help.
Compilers aren't great at using the real parameters of the chip (e.g. LLVM knows how wide the reorder buffer is, but I'm not sure it can actually use that information), but knowing latencies for ISel and things like that is very helpful. To get those details you do need to rely on people like (the god amongst men) Agner Fog.
Apple also contributes plenty to LLVM, more than Intel actually, naively based on commit counts of @apple.com and @intel.com git committer email addresses. This isn't very surprising given that Chris Lattner worked at Apple for over a decade.
LLVM is Apple's baby, beyond its genesis under Lattner. They hate the GPL, that's it.
The thing about their contributions is that they upstream stuff they want other people to standardize on, but they aren't doing it out of love, as far as I can tell. E.g. Valve have a direct business interest in having Linux kicking ass, whereas Apple actively loses out (psychologically at least) if a non-Apple toolchain is as good as theirs.
Apple M1 also has that x86 emulation mode where memory accesses have the same ordering semantics as on x86. It's probably one of the main things giving Rosetta almost 1:1 performance with x86.
TSO support at the hardware level is a cool feature but it's a bit oversold here. Most emulated x86 code doesn't require it, and when it does, usually not at every memory instruction. For instance, the default settings in Windows' translation implementation do not do anything to guarantee TSO.
Rosetta is also a long way from 1:1 performance; even your own link says ~70% of the speed. That's closer to half speed than it is to full speed.
The M1's main trick to being so good at running x86 code is that it's just so god damn fast for the power budget that it doesn't matter if there is overhead for emulated code; it's still going to be fast. This is why running Windows for ARM in Parallels is fast too: it knows basically none of the "tricks" available, but the emulation speed isn't much slower than the Rosetta 2 emulation ratio even though it's all happening in a VM.
In a fun twist of fate, 32-bit x86 apps also work under Windows on the M1 even though the M1 doesn't support 32-bit code.
> The M1's main trick to being so good at running x86 code is it's just so god damn fast for the power budget it doesn't matter if there is overhead for emulated code it's still going to be fast.
M1 is fast and efficient, but Rosetta 2 does not emulate x64 in real time. Rosetta 2 is a static binary translation layer where the x86 code is analysed, translated and stashed away in a disk cache (for future invocations) before the application starts up. Static code analysis allows multiple heuristics to be applied at binary translation time, when there is plenty of time to do so. The translated code then runs natively at near-native ARM speed. There is no need to appeal to varying deities or invoke black magic and tricks – it is that straightforward and relatively simple. There have been mentions of the translated code being further JIT'd at runtime, but I have not seen proof of that claim.
Achieving even 70% of native CPU speed whilst emulating a foreign ISA _dynamically (in real time)_ is impossible on von Neumann architectures due to the unpredictability of memory access paths, even if the host ISA provides hardware assistance. This is further compounded by the complexity of the x86 instruction encoding, which is where most of the benefits of hardware-assisted emulation would be lost (it was already true for 32-bit x86, and is more complex for amd64 and the SIMD extensions).
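For what it's worth, the "translate once, stash on disk, reuse on later launches" idea can be sketched in a few lines of Python. This is only an illustration of the caching pattern: `translate`, `load_translated`, the `.aot` suffix and keying the cache on a SHA-256 of the binary are all my own assumptions, not a description of how Rosetta 2 actually stores or looks up its translations.

```python
import hashlib
import tempfile
from pathlib import Path


def translate(x86_code: bytes) -> bytes:
    """Hypothetical stand-in for the expensive ahead-of-time translator."""
    return b"ARM64:" + x86_code


def load_translated(x86_code: bytes, cache_dir: Path) -> tuple[bytes, bool]:
    """Return (translated code, cache_hit); translate only on a cache miss."""
    # Assumption: the cache is keyed by a hash of the input binary.
    key = hashlib.sha256(x86_code).hexdigest()
    cached = cache_dir / f"{key}.aot"
    if cached.exists():
        return cached.read_bytes(), True
    translated = translate(x86_code)
    cached.write_bytes(translated)  # stash for future invocations
    return translated, False


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    first = load_translated(b"\x90\xc3", cache)   # first launch: translate
    second = load_translated(b"\x90\xc3", cache)  # later launch: reuse
```

The first call pays the translation cost; the second one only reads the cached result, which is the whole point of doing the work before the application starts.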
> This is why running Windows for ARM in parallels is fast too, it knows basically none of the "tricks" available but the emulation speed isn't much slower than the Rosetta 2 emulation ratio even though it's all happening in a VM.
Windows for ARM is compiled for the ARM memory model, is executed natively and runs at near-native M1 speed. There is [some] hypervisor overhead, but there is no emulation involved.
x86 apps with JITs can run [1]. For instance, I remember that initially Chrome didn't have a native version, and the performance was poor because the JITted JavaScript had to be translated at runtime.
Windows decided to go the "always JIT and just cache frequent code blocks" method though. In the end whichever you choose it doesn't seem to make a big difference.
> Windows for ARM is compiled for the ARM memory model, is executed natively and runs at the near native M1 speed. There is [some] hypervisor overhead, but there is no emulation involved.
This section was referring to the emulation performance, not native code performance:
"it knows basically none of the "tricks" available but the _emulation speed_ isn't much slower than the Rosetta 2 emulation ratio "
Though I'll take native apps any day I can find them :).
> Windows decided to go the "always JIT and just cache frequent code blocks" method though. In the end whichever you choose it doesn't seem to make a big difference.
AOT (or static binary translation before the application launch) vs JIT does make a big difference. JIT always carries the pressure of the «time spent JIT'ting vs performance» tradeoff, which AOT does not. The AOT translation layer has to be fast, but it is a one-off step, so it can afford to spend more time analysing the incoming x86 binary and applying more heuristics and optimisations, yielding a faster native binary. A JIT engine has to do the same on the fly, under tight time constraints and under a constant threat of unnecessarily screwing up CPU cache lines and TLB lookups (the worst case being a freshly JIT'd instruction sequence spilling over into a new memory page).
> "it knows basically none of the "tricks" available but the _emulation speed_ isn't much slower than the Rosetta 2 emulation ratio "
I still fail to comprehend which tricks you are referring to, and I would also be very keen to see actual figures substantiating the AOT vs JIT emulation speed statement.
Breaking memory ordering will break software. If a program requires it (which is already hard to know), how would you know which memory is accessed by multiple threads?
It's not just a question of "is this memory accessed by multiple threads" and then calling it a day and mandating full TSO support. It's a question of "is the way this memory is accessed by multiple threads actually dependent on memory barriers for accuracy and, if so, how tight do those memory barriers need to be". For most apps the answer is actually "it doesn't matter at all". For the ones where it does matter, heuristics and loose barriers are usually good enough. Only in the worst-case scenario, where strict barriers are needed, does the performance impact show up, and even then it's still not the end of the world in terms of emulation performance.
As far as applying it goes, the default assumption for apps is that they don't need it, and heuristics can try to catch the ones that do. For well-known apps that do need TSO, it's part of the compatibility profile to increase the barriers to the level needed for reliable operation. For unknown apps that do need TSO, you'll get a crash and a recommendation to try running in stricter emulation compatibility, but this is exceedingly rare given the above two things have to fail first.
Yes, it absolutely can. Shameless but super relevant plug. I'm (slowly) writing a series of blog posts where I simulate the implications of memory models by fuzzing timing and ordering: https://www.reitzen.com/
I think the main reason why it hasn't been disastrous is that most programs rely on locks, and they're going to be translating that to the equivalent ARM instructions with a full memory barrier.
Not too many consumer apps are going to be doing lockless algorithms, but where they're used all bets are off. You can easily imagine a queue from which two threads grab the same item, for instance.
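To show the kind of thing that goes wrong, here is a toy Python enumeration of the classic "message passing" litmus test: one thread writes `data` and then sets `flag`, the other reads `flag` and then `data`. It is a deliberately crude model of my own, not a faithful simulation of x86 or ARM: "TSO-like" just means each thread's accesses happen in program order (TSO only lets a store be delayed past a later load, which this test never exercises), and "weak" just means the two independent accesses in each thread may happen in either order.

```python
from itertools import permutations

# Message-passing litmus test; memory starts as data = 0, flag = 0.
WRITER = [("store", "data", 1), ("store", "flag", 1)]
READER = [("load", "flag", "r1"), ("load", "data", "r2")]


def interleavings(a, b):
    """Yield every merge of a and b that keeps each list's own order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    """Execute one global schedule and return the reader's (r1, r2)."""
    memory = {"data": 0, "flag": 0}
    regs = {}
    for op, addr, arg in schedule:
        if op == "store":
            memory[addr] = arg
        else:
            regs[arg] = memory[addr]
    return regs["r1"], regs["r2"]


def outcomes(allow_reordering):
    """Collect every (r1, r2) reachable under the toy model."""
    writer_orders = permutations(WRITER) if allow_reordering else [WRITER]
    reader_orders = list(permutations(READER)) if allow_reordering else [READER]
    seen = set()
    for w in writer_orders:
        for r in reader_orders:
            for schedule in interleavings(list(w), list(r)):
                seen.add(run(schedule))
    return seen


tso_like = outcomes(allow_reordering=False)
weak = outcomes(allow_reordering=True)
```

The outcome to look for is `(1, 0)`: the reader saw the flag but not the data. It is absent from `tso_like` and present in `weak`, which is exactly the failure a lockless queue translated without barriers could hit.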
Heuristics are used. For example, memory accesses relative to the stack pointer will be assumed to be thread-local, as the stack isn’t shared between threads. And that’s just one of the tricks in the toolbox. :-)
The result of those is that the expensive atomics aren’t applied to every access on hardware that doesn’t expose a TSO memory model.
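A rough Python sketch of the stack-pointer heuristic described above, purely as an illustration. `MemAccess`, `needs_ordering`, `translate` and the `.ordered` / `.plain` pseudo-ops are names I made up; a real translator works on decoded instructions and has many more rules than this one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemAccess:
    kind: str    # "load" or "store"
    base: str    # base register of the x86 addressing mode, e.g. "rsp"
    offset: int  # displacement from the base register


# Assumption: only rsp-relative accesses are treated as thread-local.
STACK_REGISTERS = {"rsp"}


def needs_ordering(access: MemAccess) -> bool:
    """Heuristic: stack-relative accesses are assumed not to be shared."""
    return access.base not in STACK_REGISTERS


def translate(accesses):
    """Emit hypothetical pseudo-ops, ordered only where the heuristic asks."""
    out = []
    for a in accesses:
        suffix = "ordered" if needs_ordering(a) else "plain"
        out.append(f"{a.kind}.{suffix} [{a.base}{a.offset:+d}]")
    return out


example = translate([
    MemAccess("load", "rsp", 8),    # stack slot: cheap plain access
    MemAccess("store", "rdi", 0),   # unknown pointer: keep it ordered
])
```

Only the second access pays for ordering; the stack access is emitted as a plain load.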
Nitpick: relative speed differences do not add up; they multiply. A speed of 70 is 40% faster than a speed of 50, and a speed of 100 is 42.8571…% faster than a speed of 70 (corrected from 50. Thanks, mkl!). Conversely, a speed of 70 is 30% slower than a speed of 100, and a speed of 50 is 28.57142…% slower than one of 70.
=> when comparing speed, 70% is almost exactly halfway between 50% and 100% (the midpoint being 100%/√2 ≈ 70.7%)
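The arithmetic above is easy to check with a couple of lines of Python (the function names are mine): the "halfway" point for ratios is the geometric mean, not the arithmetic one.

```python
import math


def percent_faster(new: float, old: float) -> float:
    """How much faster `new` is than `old`, in percent."""
    return (new / old - 1) * 100


def geometric_midpoint(a: float, b: float) -> float:
    """The speed that is the same *ratio* away from both a and b."""
    return math.sqrt(a * b)


step_50_to_70 = percent_faster(70, 50)     # about 40
step_70_to_100 = percent_faster(100, 70)   # about 42.857
midpoint = geometric_midpoint(50, 100)     # about 70.71
```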
Not sure what the nit is supposed to be, 70% is indeed less than sqrt(1/2) hence me mentioning it. And yes, it's closer to 3/4 than half or full, but the thing being pointed out wasn't "what's the closest fraction" rather "it's really not that close to 1:1".
Meh, in the context of emulation, which ran at 5% before JITs, 70% is pretty close to 1:1 performance. Given that the M1 is also a faster CPU than the one in any x86 MacBook, it's really a wash (yes, recompiling for ARM is faster...)
No, it's not. Switching to TSO allows for fast, accurate emulation, but if they didn't have that they would just go the Windows route and drop barriers based on heuristics, which would work for almost all software. The primary driver behind Rosetta's speed is extremely good single-threaded performance and a good emulator design.
My guess: if the situation is similar to Windows laptops, they just use a subset of OEM features and provide a sub-par experience (like lack of battery usage optimizations, flaky suspend/hibernate, second-tier graphics support, etc).
Now, I'm typing this on a GNU/Linux machine, but let's face it, all of the nuisances I mentioned are legit and a constant source of problems in tech support forums.
If the extra instructions also operate on extra state (e.g. extra registers), a kernel needs to know about their existence so it can correctly save and restore that state on context switches.
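A toy Python model of why that matters. Everything here is hypothetical: `ext0` stands in for some extra register the kernel may or may not know about, and `ToyKernel.switch` is a cartoon of a context switch, not how any real kernel is structured.

```python
class ToyCPU:
    def __init__(self):
        # "ext0" is a made-up register representing undocumented extra state.
        self.regs = {"r0": 0, "ext0": 0}


class ToyKernel:
    def __init__(self, cpu, known_regs):
        self.cpu = cpu
        self.known_regs = known_regs  # registers the kernel saves/restores
        self.saved = {}

    def switch(self, old_task, new_task):
        """Save the old task's known registers, then restore the new task's."""
        self.saved[old_task] = {r: self.cpu.regs[r] for r in self.known_regs}
        fresh = {r: 0 for r in self.known_regs}
        self.cpu.regs.update(self.saved.get(new_task, fresh))


def run_demo(known_regs):
    cpu = ToyCPU()
    kernel = ToyKernel(cpu, known_regs)
    cpu.regs["ext0"] = 0xDEAD        # task A puts a value in the extra register
    kernel.switch("A", "B")
    seen_by_b = cpu.regs["ext0"]     # what task B finds there
    cpu.regs["ext0"] = 7             # task B uses the register itself
    kernel.switch("B", "A")
    seen_by_a = cpu.regs["ext0"]     # what task A finds when it resumes
    return seen_by_b, seen_by_a


unaware = run_demo(["r0"])           # kernel does not know about ext0
aware = run_demo(["r0", "ext0"])     # kernel saves and restores ext0
```

With the unaware kernel, task B can read task A's value and task A comes back to find its register clobbered; with the aware kernel, both tasks see only their own state.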
It doesn’t need to use them, but it must be aware of them, insofar as they may introduce security problems.
As an example, if the kernel doesn’t know of DMA channels, and it requires setup code to prevent user-level code from using them to copy across process boundaries, the kernel will run fine, but have glaring security problems.
What DMA channels don't require mapping registers into the user-space process to work? There aren't usually magic instructions you have to opt into disabling, as far as I know.
And I would like to emphasize that even when Apple used Intel, it was not commercially viable to use their platform. Bringing in ARM changed less than one would think at first.