I find myself nodding in assent at the last comment... I also believe that the amount of code out there which would benefit from having an extra 24 32-bit registers is far more than that which would benefit from having 16 64/32-bit ones instead.
The last comment is very ignorant of CPU design. Clearing the highest bits of a register is not a stupid thing to do; it's what you have to do if you don't want false dependencies on the previous contents of the register. Partially updating registers in an OoO system is a much more expensive operation than it seems to a software guy, and the naive way of using them that the comment suggested would have killed any attempt to extract workable ILP from the system.
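A minimal sketch of the false-dependency problem, in Intel-syntax assembly (the label `counter` is made up for illustration). A partial write has to merge with the old register contents, so the renamer can't treat the result as a fresh value; a full-width, zero-extending write cuts the chain:

    ; partial write: the new al merges into the old rax, so this
    ; instruction depends on whatever last wrote rax, even though
    ; the program never reads those upper bits
    mov   al, [counter]

    ; full-width write: zero-extends into rax, no dependency on
    ; the previous contents, so the renamer can allocate a fresh
    ; physical register
    movzx eax, byte [counter]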
At the time, they evaluated how many registers they needed and ran simulations for 16, 32 and 64. Assuming a modern implementation with register renaming and good OoO, they found only a few percent gain from 32 and practically none from 64. They decided to go with 16 to save encoding bits. If they had found that x86 needed more registers, they would have added more registers, not made partial accesses more common. Remember that spilling to the stack on x86 does not have a latency hit under normal conditions, because all modern x86 CPUs have a separate stack engine that allows the loads to retire immediately after they enter the instruction window, typically tens of cycles before any instructions immediately following them.
Not likely. Modern out-of-order microarchitectures like Bulldozer and Broadwell have far more physical registers than the ISA specifies. Haswell has 168 physical integer registers, for example.
http://www.realworldtech.com/haswell-cpu/3/
The lack of registers in the ISA moves the burden of data dependency checking from compiler to core, but it doesn't increase the number of stalls.
Well, it means that if code cannot fit everything into the ISA registers, it has to spill them to the stack, and AFAIK stack slots are not renamed onto physical registers on current x86, so you still pay an extra penalty to access the spilled data in the L1 cache.
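For concreteness, a spill/fill pair as a compiler might emit it (Intel syntax; the stack offset is arbitrary). Even when the slot is hot in L1, the fill is a real load with a few cycles of latency, where a renamed register would have cost nothing:

    mov  [rsp+24], rbx    ; spill: out of ISA registers, park rbx
    ...                   ; intervening work clobbers rbx
    mov  rbx, [rsp+24]    ; fill: an L1 hit, still ~4-5 cycles latency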
But stack accesses are generally perfectly predicted, meaning that your loads from the stack get executed and the data loaded into those extra registers long before your code needs to use those values.
The prefetcher does a great job, but it's not remotely enough to make stack access penalties disappear.
It's trivial to write microbenchmarks where spilling hot read/write variables to the stack destroys performance; it's much harder to find special cases where the difference isn't noticeable.
The hot region of the stack is, for practical purposes, always resident in D$, so there's no prefetching to be done. I think that you're really talking about out-of-order and speculative loads. You're absolutely correct that spill/fill can be catastrophic for performance, however.
There are two common scenarios where spill/fill has catastrophic performance implications:
- Workloads that are LSU-bound, like simple per-pixel image operations and level 1 and level 2 BLAS[1]. Here every spilled value costs two LSU dispatch slots (a store and a load) that would otherwise be used for "real work".
- Spilled loop-carried dependencies in tight loops; here you're simply adding latency to the critical path (see the sketch below).
Both are fairly rare in "generic" code, but very real concerns for compute kernels.
[1] The BLAS operations have few enough buffers that spill/fill never actually happens in practice, but more complex operations on multiple buffers do run into this.
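A minimal sketch of the second case, in Intel-syntax assembly (labels and the stack offset are made up for illustration). With the accumulator in a register, the loop-carried dependency is a single add; spilling it routes every iteration through a store-forwarded round trip, multiplying the latency of the critical path:

    ; accumulator in a register: the carried dependency is just
    ; the add, so iterations cost about one cycle each
    loop_reg:
        add  rax, [rsi]
        add  rsi, 8
        dec  rcx
        jnz  loop_reg

    ; accumulator spilled to a stack slot: the carried dependency
    ; becomes load -> add -> store, several cycles per iteration
    loop_spill:
        mov  rax, [rsp+8]
        add  rax, [rsi]
        mov  [rsp+8], rax
        add  rsi, 8
        dec  rcx
        jnz  loop_spill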
As… the article even explained further down: sometimes. That's where the whole discussion comes from.
• In 32 bit mode, it doesn't matter. Some (dumb) CPUs treated it as xchg ax,ax; pipelined CPUs optimized it away.
• In 64 bit mode, it matters: is "xchg eax, eax" a valid way to clear the upper 32 bits of rax? Or will it always be optimized away as a legacy NOP?
AMD decided on the latter. They could also have introduced a new opcode for it instead (there are already multiple NOP instructions, like nopl/nopw, so it wouldn't have been too far off) – as this only affects 64 bit mode, backwards compatibility didn't really matter, and both would have been possible.
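Sketched in Intel syntax, with the encodings from the one-byte opcode map:

    ; 0x90, nominally xchg eax, eax: executes as NOP in 64-bit
    ; mode, so rax bits 63:32 are left unchanged
    nop

    ; 0x89 0xC0, mov eax, eax: a genuine 32-bit write, so it
    ; zero-extends and clears rax bits 63:32
    mov  eax, eax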
Opcode 0x90 only means xchg eax,eax on paper. If no documentation ever called it that then it would be a non-issue. It would always have been nop and still be nop. Somebody could just as well have called it xchg ebx,ebx and it would have been just the same in 32-bit mode.
> If no documentation ever called it that then it would be a non-issue.
XCHG EAX,target is defined as opcode (0x90 + offset of target register), with EAX having offset 0.
So, it was xchg eax,eax originally, and documented as such, before it was turned into NOP because it happened to be safe for it.
It's still documented as "alias for the XCHG (E)AX, (E)AX instruction" in Intel's instruction set reference, and pre-486 embedded x86s still treat it as xchg.
Sure. That "If" kind of makes it moot. That's all I was trying to point out - that it's a documentation thing rather than something in the design of the chip and how it works.
Edit: What do you mean "still treat it as xchg"? On those chips, isn't there no distinction between xchg and what we might retrospectively call "nop"? Perhaps this is something I'm missing.
> Edit: What do you mean "still treat it as xchg"? On those chips, isn't there no distinction between xchg and what we might retrospectively call "nop"? Perhaps this is something I'm missing.
XCHG EAX, EAX in its dumbest interpretation loads EAX into a temporary register, replaces it with the contents of EAX and restores the temporary data to… EAX. So, it is an operation that does nothing, but it does nothing in an elaborate way. You can skip it instead of executing it, but only if your other code doesn't depend on 0x90 taking exactly three clock cycles.
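Spelled out as micro-steps (a sketch of the naive semantics; real hardware sequences differ):

    ; xchg eax, eax, executed literally:
    ;   temp <- eax    ; save the destination
    ;   eax  <- eax    ; copy the source over it
    ;   eax  <- temp   ; write the saved value back to the source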
The "treat 0x90 as NOP and skip it instead of wasting three cycles" optimization was only done with the 486 and up, and wasn't retroactively applied to the 386 embedded versions. Doing so would have messed up their timings, and would have needed a small design change, both not interesting to that customer base.
(The 386 and derivatives were still produced for embedded use for a long, long time, past 2001 – and thus after the introduction of AMD64. When its instruction set was drafted, new 386-based embedded devices were still being designed.)
It makes more sense if you look at the neighbors on the opcode map. It seems like the opcodes 0x90-0x97 are all variants of xchg; the first just happens to have its destination the same as its source.
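Reading that row off the standard one-byte opcode map (32-bit operand size):

    0x90    xchg eax, eax    (executes as nop)
    0x91    xchg ecx, eax
    0x92    xchg edx, eax
    0x93    xchg ebx, eax
    0x94    xchg esp, eax
    0x95    xchg ebp, eax
    0x96    xchg esi, eax
    0x97    xchg edi, eax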