For Skylake, they probably optimized partial register
writes to these four partial "high" registers (AH, BH,
CH, DH), but the optimization was buggy in some hard-to-
hit corner case.
They did not do this.
The high registers (AH/BH/DH/CH) are nearly written out of existence with the REX Prefix in 64bit mode. Within the manual(s) it is called out effectively not to use them as they're now emulated and not support directly in hardware.
The 16bit registers (AX/BX/DX/CX) are in worse situation, but it ends up costs additional cycles to even decode these instructions as the main encoder can't handle these instructions and you have to swap to the legacy encoder, and you'll end up losing alignment. This costs ~4-6 cycles, also the perf registers to track were only added in Haswell (and require Ring0 to use [2]).
High Register and 16bit registers are huge wart that it seems Intel is trying desperately hard to get us to stop using.
That corner case probably can only be reached when some
part of the out-of-order pipeline is completely full,
which is why it needs a short loop (so the decoder is not
the bottleneck, AFAIK there's a small u-op cache after the decoder)
There is a 64uOP cache between the decoder and L1i cache that is called loop stream detector. Normally this exists to do batched writes to the L1i cache.
But in _some_ scenarios when a loop can fit completely within this cache it'll be given extremely priority. This is a way to max out the 5uOP per cycle Intel gives you [1]. It'll flush its register file to L1 cache piece meal as it continues to predict further and further and further ahead speculatively executing EVERYPART OF IT in parallel. [3]
In short this scenario is extremely rare. uOPs have stupidly weird alignment rules. Which you can boil down to:
Intel x64 Processor are effectively 16byte VLIW RISC processors
that can pretend to be 1-15byte AMD64 CISC processors at a minor performance
cost.
---
The real issue here is when Loop Stream mode ends it is properly reloading the register file, and OoO state.
This is likely just a small micro-code fix. The 8low/8high/16bit/32bit/64bit weirdness is likely somebody wasn't doing alignment checks when flushing the register file.
---
[1] On Skylake/KabyLake. IvyBridge, SandyBridge, Haswell, and Boardwell limited this to 4.
[2] Volume 3 performance counting registers I think we're up to 12 now on Boardwell.
> The high registers (AH/BH/DH/CH) are nearly written out of existence with the RAX flag in 64bit mode. Within the manual(s) it is called out effectively not to use them as they're now emulated and not support directly in hardware.
I think you meant REX prefix but even that doesn't make any sense.
High registers are a first class element of the Intel 64 and IA-32 architectures. They aren't going anywhere. Microarchitectural implementations are an entirely different thing.
That aside, where in the manuals does Intel say not to use the high registers? They're pretty clear about such warnings and usually state them in Assembly/Compiler Coding Rules.
From the parent:
> For Skylake, they probably optimized partial register writes to these four partial "high" registers (AH, BH, CH, DH), but the optimization was buggy in some hard-to-hit corner case.
That is about right. I don't agree with the preceding slap at x86 but this is a good summary.
BTW writing to the low registers is in principle also a partial register hazard but then Intel sees fit to optimize that as a more common case.
In particular, mov AH,BH is not emulated from the MS-ROM which is just hella slow. It uses two μops for Sandy Bridge and above. This is covered in 3.5.2.4 Partial Register Stalls.
Lastly, there is no section 3.4.1.7 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual which is 3 volumes. You must be talking about the Intel® 64 and IA-32 Architectures Optimization Reference Manual which is a single volume. And it isn't clear how that section furthers your argument.
> High Register and 16bit registers are huge wart that it seems Intel is trying desperately hard to get us to stop using.
Someone really ought to tell clang and gcc this; they both happily use 16-bit registers for 16-bit arithmetic.
Anyway, Intel obviously already has special optimizations for many partial register accesses, dating back to Sandy Bridge. While it's quite possible that they left out the high registers initially (no clue, don't care), if they did they could have decided to include them in Skylake. Who knows though...
What are you even talking about with the LSD? The LSD is entirely before any register renaming and the entire out-of-order bits of the CPU. It's likely the LSD is involved only because that (plus hyperthreading) might be the only way to get enough in-flight µops to trigger whatever is going wrong, whether or not it's due to optimizations for partial register accesses.
Indeed, I've worked with CPUs that didn't have the register split of x86 and they are far less friendly to implementing certain algorithms, which would otherwise require many additional registers and lots of shift/add instructions to achieve the same effect. ("MOV AH, AL" being one simple example -- try doing that on a MIPS, for instance.)
How often have you really needed to do "a = (a & 0xffff00ff) | ((a & 0xff) << 8)"? I don't think I've ever needed to do it, and I wouldn't be surprised if compilers don't even generate "mov ah,al" for that, due to the fact that AH/BH/CH/DH only exist for 4 registers.
Anyway, since you asked: In AArch64 that would be written "bfi w0,w0,#8,#8". "bfi" is an alias of "bfm", an instruction far more flexible and useful than any of the baroque x86 register names. BFM can move an arbitrary number of bits from any position in any register to any other register, and it has optional zero-extension and sign-extension modes.
If you're talking to any memory-mapped registers, you'll be doing it all the time. Granted, you're much more likely to be using something like ARM to do that. x86 is a bit large/expensive/power hungry for embedded programming.
---
They did not do this.The high registers (AH/BH/DH/CH) are nearly written out of existence with the REX Prefix in 64bit mode. Within the manual(s) it is called out effectively not to use them as they're now emulated and not support directly in hardware.
The 16bit registers (AX/BX/DX/CX) are in worse situation, but it ends up costs additional cycles to even decode these instructions as the main encoder can't handle these instructions and you have to swap to the legacy encoder, and you'll end up losing alignment. This costs ~4-6 cycles, also the perf registers to track were only added in Haswell (and require Ring0 to use [2]).
High Register and 16bit registers are huge wart that it seems Intel is trying desperately hard to get us to stop using.
There is a 64uOP cache between the decoder and L1i cache that is called loop stream detector. Normally this exists to do batched writes to the L1i cache.But in _some_ scenarios when a loop can fit completely within this cache it'll be given extremely priority. This is a way to max out the 5uOP per cycle Intel gives you [1]. It'll flush its register file to L1 cache piece meal as it continues to predict further and further and further ahead speculatively executing EVERYPART OF IT in parallel. [3]
In short this scenario is extremely rare. uOPs have stupidly weird alignment rules. Which you can boil down to:
---The real issue here is when Loop Stream mode ends it is properly reloading the register file, and OoO state.
This is likely just a small micro-code fix. The 8low/8high/16bit/32bit/64bit weirdness is likely somebody wasn't doing alignment checks when flushing the register file.
---
[1] On Skylake/KabyLake. IvyBridge, SandyBridge, Haswell, and Boardwell limited this to 4.
[2] Volume 3 performance counting registers I think we're up to 12 now on Boardwell.
[3] Volume 3 Chapter 3.4.1.7 (Page 107)