> Intel recommends against running it because it will damage the CPU
Citation? A cursory search suggests this is a fable.
And your argument seems to boil down to: there is always some imperfection, so we might as well not try to care about the others. Which I kind of understand, but I don't think it's a good mindset. If we didn't have to waste cycles on stuff like "&255" because of language or other self-inflicted limitations, then no matter how fast those cycles complete, we'd have faster software.
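Concretely, a minimal sketch in C of the kind of mask I mean (function names are just illustrative): when the data can be declared unsigned char the mask never has to exist, while a language without unsigned bytes forces the "& 0xFF" on every use.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Java-style: bytes are signed, so turning one into a 0..255 value
       requires masking on every use. */
    static uint32_t sum_masked(const int8_t *p, size_t n) {
        uint32_t s = 0;
        for (size_t i = 0; i < n; i++)
            s += (uint32_t)p[i] & 0xFFu;   /* the "&255" in question */
        return s;
    }

    /* C-style: declare the data unsigned and the mask never exists;
       the load is already zero-extending. */
    static uint32_t sum_unsigned(const unsigned char *p, size_t n) {
        uint32_t s = 0;
        for (size_t i = 0; i < n; i++)
            s += p[i];
        return s;
    }

    int main(void) {
        unsigned char data[4] = {0x80, 0xFF, 0x01, 0x7F};
        printf("%u %u\n",
               sum_masked((const int8_t *)data, 4),
               sum_unsigned(data, 4));
        return 0;
    }

(Both print 511; with optimizations on, compilers often turn the masked version into the same zero-extending loads anyway, which is part of the question of what those cycles really cost.)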
The description only confirms a "hang", not permanent damage. And the SKL082 erratum says:
> Under complex microarchitecture conditions, processor may hang with an internal timeout error (MCACOD 0400H) logged into IA32_MCi_STATUS or cause unpredictable system behavior
so it sounds like heavily optimized AVX code may have caused some internal component to violate its clock-SLA, because the processor can't actually run those instructions that quickly. So "CPU damage" is overstating it, though I don't keep up to date with CPU bugs enough to say whether there have been persistent problems in this area.
> no matter how fast those cycles are completed, we'd have faster software.
I think "speed" is not a good way to think about CPU instructions. I usually think of "resource pressure":
If a function keeps many values live at once, it puts more pressure on the 16 architecturally named general-purpose registers available to the code.
If there are lots of random memory fetches everywhere, they pressure the data dependency graph: it has to be broader (more independent work in flight) to make the same progress while waiting on those fetches.
If there are lots of arithmetic operations, those consume ALUs.
Extra instructions consume memory bandwidth, decoder queue slots, and µop cache space.
etc.
and it only impacts speed if you hit a threshold and something has to block waiting to obtain one of these. So while it's good to remove pressure if there's no drawback, it won't necessarily translate into a direct speed improvement unless you were blocked on that specific resource.
It sounds pedantic, but CPUs nowadays have so many resources available that you can be quite wasteful and still not see any slowdown. It's not good for software developers to use the word "faster" when what they mean is "releases pressure on resource X"; the latter is the better approximation.
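To make the threshold point concrete, a rough sketch in C (not a rigorous benchmark; names and sizes are arbitrary, and it assumes a POSIX clock_gettime): two summation loops over a large array, one with an extra "& 0xFF" per element. With a working set this big, both loops are typically bound on memory bandwidth, so the extra ALU op tends to be absorbed by spare execution resources rather than showing up in the timing.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (32u * 1024 * 1024)   /* ~128 MB of uint32_t: far too big for cache */

    static uint64_t sum_plain(const uint32_t *a, size_t n) {
        uint64_t s = 0;
        for (size_t i = 0; i < n; i++) s += a[i];
        return s;
    }

    static uint64_t sum_with_mask(const uint32_t *a, size_t n) {
        uint64_t s = 0;
        for (size_t i = 0; i < n; i++) s += a[i] & 0xFFu;  /* one extra ALU op per element */
        return s;
    }

    static double now(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        uint32_t *a = malloc((size_t)N * sizeof *a);
        if (!a) return 1;
        for (size_t i = 0; i < N; i++) a[i] = (uint32_t)i;

        /* Both loops stream the whole array from memory; if the bottleneck
           is memory bandwidth rather than the ALUs, the extra mask should
           barely show up in the timings. */
        double t0 = now();
        uint64_t s1 = sum_plain(a, N);
        double t1 = now();
        uint64_t s2 = sum_with_mask(a, N);
        double t2 = now();

        printf("plain: %.3fs (sum=%llu)\n", t1 - t0, (unsigned long long)s1);
        printf("mask:  %.3fs (sum=%llu)\n", t2 - t1, (unsigned long long)s2);
        free(a);
        return 0;
    }

Shrink N until the array fits in L1 and the picture can change, because then the execution resources become what you're actually pressing on.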
In the specific case of "& 0xFF", it's a clear win. But Java has many other benefits, like a profile-guided JIT and a GC whose memory-pressure profile differs from malloc()'s. If you only think in terms of "faster" or "slower", you won't know how to aggregate all the things Java and C each do into a conclusion.
C is well known for losing performance to pointer aliasing ruining optimizations. "Optimizer pressure" can be another idea: making the optimizer work harder to prove some code can be simplified, until eventually it hits a threshold and can't. Java doesn't have unrestricted pointers, so references are easier to track and prove things about.
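The textbook illustration of that, as a sketch (function names are mine): because dst might alias *scale, the compiler has to reload *scale on every iteration of the first version; restrict is the promise that they don't overlap, which lets it hoist the load and optimize the loop freely.

    #include <stddef.h>

    /* The compiler must assume dst might alias *scale, so *scale has to
       be reloaded on every iteration: any store to dst[i] could have
       changed it. */
    void scale_in_place(float *dst, const float *scale, size_t n) {
        for (size_t i = 0; i < n; i++)
            dst[i] *= *scale;
    }

    /* With restrict the programmer promises the regions don't overlap,
       so the load of *scale can be hoisted out of the loop and the loop
       vectorized more freely. Comparing the two at -O2 shows the
       difference in the generated code. */
    void scale_in_place_restrict(float *restrict dst,
                                 const float *restrict scale, size_t n) {
        for (size_t i = 0; i < n; i++)
            dst[i] *= *scale;
    }

Roughly, the Java analogue is that a reference can't point into the middle of another object, which makes this kind of proof easier for the JIT.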
Some interesting points, thank you. I guess if I wanted a final answer I'd need to look into the specification sheet for a specific CPU to see to what extent it can parallelize and/or reorder instructions, or perhaps even break them down into "micro-ops". And then a new CPU comes out and I'd have to do it again. Maybe, for a higher-level programmer (I'd call myself that), this is indeed a waste of time for very little gain.