The article says as much. > Although this may seem like a significant drawback, ...

maxwell86 · on July 11, 2022

If you only write to the lower 32-bit of the v0 register, which could be 1024 bit wide, that claims that the hardware somehow has to allocate a 1024-bit wide register to back those up, and then makes some "locality" arguments.

The hardware can back up the 1024-bit register with a pool of 32-bit registers, and if you only wrote to the first 32-bits, and all others are zero, it can use a single 32-bit register to back it up, making this "as good" as the single mask register solution, which the author thinks is good.

ncmncm · on July 11, 2022

Determining that you only wrote to the bottom 32 bits of the register being copied from is hard for hardware to see; and if the compiler can see, it has no way to tell the hardware.