Hacker News

CPUs today have more instructions-per-clock, so things like the extra "& 0xFF" might not affect performance at all. Similar for conversions to/from types: those compute on data already in cache/registers, so they incur extra instructions but no extra memory fetches, and might be free (depending on the µop cache).
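For context, the "& 0xFF" comes up because Java's byte type is signed, so reading a raw byte as an unsigned value needs a mask. A minimal sketch of the idiom (class name is mine):

```java
public class UnsignedByte {
    public static void main(String[] args) {
        byte b = (byte) 0xC8;   // Java bytes are signed, so this stores -56
        int naive = b;          // plain widening sign-extends: still -56
        int masked = b & 0xFF;  // the extra AND recovers the unsigned value 200
        System.out.println(naive + " " + masked);
    }
}
```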

Boxing is still bad, because it often means more uncached memory fetches. Container types that hold references instead of values are also still bad. I notice many higher-level languages have been putting more emphasis on value types and unboxed representations recently, because memory indirection is a comparatively worse problem for them than it used to be.
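To illustrate the indirection being described, here's a toy Java sketch (names are mine) contrasting a primitive array with a boxed list; the boxed path goes through an extra reference per element, which is where the potentially-uncached fetches come from:

```java
import java.util.ArrayList;
import java.util.List;

public class BoxingDemo {
    public static void main(String[] args) {
        // Primitive array: the values sit contiguously in one allocation.
        int[] values = {10, 20, 30};

        // Boxed list: the backing array holds references, and each Integer
        // is (potentially) a separate heap object, so each read can cost an
        // extra, possibly-uncached, pointer dereference.
        List<Integer> boxed = new ArrayList<>();
        for (int v : values) boxed.add(v);  // autoboxing allocates/interns Integers

        long flat = 0, indirect = 0;
        for (int v : values) flat += v;        // straight array reads
        for (Integer v : boxed) indirect += v; // dereference, then read
        System.out.println(flat + " " + indirect);
    }
}
```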

Memory is faster and caches are larger, so things that compute on bulk bytestream reads (initial loading of program code, reading large packed .git directories, etc.) should be more consistent across languages than they were in 2009. Back then, it was easier for language-runtime code (GC, JIT, etc.) to push data out of cache and trigger more memory fetches.

I haven't benchmarked JGit or git, these are just my prior assumptions.




> CPUs today have more instructions-per-clock, so things like the extra "& 0xFF" might not affect performance at all

Huh? Yes it does affect performance. That could have been an actually useful instruction; instead we just do "& 0xFF" faster on modern hardware.


> That could have been an actually useful instruction

Only if the data dependency graph is broad enough. There are certain code patterns that are clock-cycle-bound and run with similar performance on older CPUs and newer ones, but they're not common.
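As a toy illustration of "broad enough": a single-accumulator sum is one serial dependency chain (each add waits on the previous one), while splitting it into two independent accumulators gives a superscalar core parallel work to fill its ports with. A Java sketch, names mine; the actual speedup would need a benchmark to show:

```java
public class DependencyWidth {
    // One accumulator: each add depends on the previous result, so the adds
    // form a serial chain and spare execution ports sit idle.
    static long sumNarrow(long[] a) {
        long s = 0;
        for (long v : a) s += v;
        return s;
    }

    // Two independent accumulators: the adds interleave, so a superscalar
    // core can work on both chains at once. Same result, broader graph.
    static long sumWide(long[] a) {
        long s0 = 0, s1 = 0;
        for (int i = 0; i + 1 < a.length; i += 2) {
            s0 += a[i];
            s1 += a[i + 1];
        }
        if ((a.length & 1) == 1) s0 += a[a.length - 1]; // odd-length tail
        return s0 + s1;
    }

    public static void main(String[] args) {
        long[] a = new long[1000];
        for (int i = 0; i < a.length; i++) a[i] = i;
        System.out.println(sumNarrow(a) + " " + sumWide(a));
    }
}
```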

Something like git is used broadly enough that there's probably work for some other process to do (like rendering ads in your browser), but looking strictly at git, I would guess it's difficult to fill an entire CPU pipeline with no gaps.

I think Prime95 is optimized enough to do that in one process, and Intel recommends against running it because it will damage the CPU. Usually code gets blocked on a memory fetch or something before it gets to that point - most code is never that well optimized.


> Intel recommends against running it because it will damage the CPU

Citation? A cursory search suggests this is a fable.

And your argument boils down (or seems to) to the idea that there is always some imperfection, so we might as well not try to care about the others. Which I kind of understand, but I don't think it's a good mindset. If we didn't have to waste cycles on stuff like "& 255" because of language or other self-inflicted limitations, then no matter how fast those cycles complete, we'd have faster software.


There's a thread here on the Intel forums that I remember:

https://community.intel.com/t5/Processors/Simple-instruction...

The description only confirms a "hang", not permanent damage. And erratum SKL082 says

    Under complex microarchitecture conditions, processor may hang with an internal timeout error (MCACOD 0400H) logged into IA32_MCi_STATUS or cause unpredictable system behavior

so it sounds like very optimized code with AVX instructions may have caused some internal component to miss its timing window, because the chip can't actually run them that quickly. So "CPU damage" is overstating it, and I don't keep up to date with CPU errata enough to know whether there have been persistent problems in this area.

> no matter how fast those cycles are completed, we'd have faster software.

I think "speed" is not a good way to think about CPU instructions. I usually think of "resource pressure":

    If a function uses many CPU registers at once, it puts more pressure on the 16 architecturally named registers.
    If there are lots of random memory fetches everywhere, the data dependency graph has to be broader to make the same progress.
    If there are lots of arithmetic operations, those consume ALUs.
    Extra instructions consume memory bandwidth, decoder-queue slots, and µop cache.
    etc.
and it only impacts speed if you hit a threshold and something has to block waiting to obtain one of these. So while it's good to remove pressure if there's no drawback, it won't necessarily translate into a direct speed improvement unless you were blocked on that specific resource.

It sounds pedantic, but CPUs nowadays have so many resources available that you can be quite wasteful and still not see any slowdown. Software developers shouldn't reach for the word "faster" when "releases pressure on resource X" is a better approximation of what's actually happening.

In the specific case of "& 0xFF", it's a clear win. But Java has many other benefits like profile-guided JIT and a GC with different memory pressure than malloc(). If you only think in terms of "faster" or "slower", you won't know how to aggregate all the things that Java and C do to come to a conclusion.

C is well-known for losing performance to pointer aliasing ruining optimizations. "Optimizer pressure" can be another useful idea: making the optimizer work harder and harder to prove some code can be simplified, until eventually it hits a threshold and can't. Java doesn't have unrestricted pointers, so its references are easier to track and reason about.


Some interesting points, thank you. I guess if I wanted a final answer I'd need to look into the specification sheet for a specific CPU to see to what extent it can parallelize and/or reorder instructions, or perhaps even break them down into "micro-ops". And then a new CPU comes out and I'd have to do it again. Maybe, for a higher-level programmer (I'd call myself that), this is indeed a waste of time for very little gain.


HotSpot usually optimizes the signed-to-unsigned conversion away. This is quite evident in the machine code generated for byte-array accesses. Although I still find it quite silly that the original post even considers this an important point, when the biggest bottleneck for git is the file system.


Well, if JGit really is 2x slower than C git as claimed, then there's a bottleneck somewhere and it's not the filesystem.


Most likely it's due to the way file I/O is handled. There are extra memory copies going on that are hard to eliminate.
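One common way to avoid one of those copies in Java is memory-mapping: a plain read() into a heap byte[] copies out of the OS page cache, while NIO's FileChannel.map exposes the page cache directly. A minimal, self-contained sketch (file name and contents are made up):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapRead {
    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("demo", ".bin");
        Files.write(p, new byte[] {1, 2, (byte) 0xC8});
        try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
            // map() views the OS page cache directly, skipping the extra
            // copy into a heap byte[] that a plain read() would make.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            long sum = 0;
            while (buf.hasRemaining()) sum += buf.get() & 0xFF; // unsigned byte read
            System.out.println(sum); // 1 + 2 + 200 = 203
        } finally {
            Files.deleteIfExists(p);
        }
    }
}
```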


Actually, the "& 0xFF" depends on the previous computation, so it won't execute in parallel; while very cheap, it increases the latency of an operation that might sit on a critical chain.


Are you claiming that because modern CPUs have a higher instructions-per-clock count, you can insert useless instructions without performance impact? That's ridiculous.



