You really don't know what you're talking about. --- For Skylake, they probably ...

CalChris · on June 25, 2017

> The high registers (AH/BH/DH/CH) are nearly written out of existence with the RAX flag in 64bit mode. Within the manual(s) it is called out effectively not to use them as they're now emulated and not support directly in hardware.

I think you meant REX prefix but even that doesn't make any sense.

High registers are a first class element of the Intel 64 and IA-32 architectures. They aren't going anywhere. Microarchitectural implementations are an entirely different thing.

That aside, where in the manuals does Intel say not to use the high registers? They're pretty clear about such warnings and usually state them in Assembly/Compiler Coding Rules.

From the parent:

> For Skylake, they probably optimized partial register writes to these four partial "high" registers (AH, BH, CH, DH), but the optimization was buggy in some hard-to-hit corner case.

That is about right. I don't agree with the preceding slap at x86 but this is a good summary.

BTW writing to the low registers is in principle also a partial register hazard but then Intel sees fit to optimize that as a more common case.

In particular, mov AH,BH is not emulated from the MS-ROM which is just hella slow. It uses two μops for Sandy Bridge and above. This is covered in 3.5.2.4 Partial Register Stalls.

Lastly, there is no section 3.4.1.7 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual which is 3 volumes. You must be talking about the Intel® 64 and IA-32 Architectures Optimization Reference Manual which is a single volume. And it isn't clear how that section furthers your argument.

brigade · on June 25, 2017

> High Register and 16bit registers are huge wart that it seems Intel is trying desperately hard to get us to stop using.

Someone really ought to tell clang and gcc this; they both happily use 16-bit registers for 16-bit arithmetic.

Anyway, Intel obviously already has special optimizations for many partial register accesses, dating back to Sandy Bridge. While it's quite possible that they left out the high registers initially (no clue, don't care), if they did they could have decided to include them in Skylake. Who knows though...

What are you even talking about with the LSD? The LSD is entirely before any register renaming and the entire out-of-order bits of the CPU. It's likely the LSD is involved only because that (plus hyperthreading) might be the only way to get enough in-flight µops to trigger whatever is going wrong, whether or not it's due to optimizations for partial register accesses.

userbinator · on June 25, 2017

Indeed, I've worked with CPUs that didn't have the register split of x86 and they are far less friendly to implementing certain algorithms, which would otherwise require many additional registers and lots of shift/add instructions to achieve the same effect. ("MOV AH, AL" being one simple example -- try doing that on a MIPS, for instance.)

pcwalton · on June 26, 2017

How often have you really needed to do "a = (a & 0xffff00ff) | ((a & 0xff) << 8)"? I don't think I've ever needed to do it, and I wouldn't be surprised if compilers don't even generate "mov ah,al" for that, due to the fact that AH/BH/CH/DH only exist for 4 registers.

Anyway, since you asked: In AArch64 that would be written "bfi w0,w0,#8,#8". "bfi" is an alias of "bfm", an instruction far more flexible and useful than any of the baroque x86 register names. BFM can move an arbitrary number of bits from any position in any register to any other register, and it has optional zero-extension and sign-extension modes.

Espressosaurus · on June 26, 2017

If you're talking to any memory-mapped registers, you'll be doing it all the time. Granted, you're much more likely to be using something like ARM to do that. x86 is a bit large/expensive/power hungry for embedded programming.