Hacker News

CPUs today have more instructions-per-clock, so things like the extra "& 0xFF" might not affect performance at all. Similar for conversions to/from types: those compute on data already in cache/registers, so they incur extra instructions but no extra memory fetches, and might be free (depending on the µop cache).
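For context, the "& 0xFF" comes up because Java's byte type is signed, so reading a raw byte as an unsigned value needs a mask. A minimal sketch of the idiom (class name is mine):

```java
public class UnsignedByte {
    public static void main(String[] args) {
        byte b = (byte) 0xC8;   // Java bytes are signed, so this stores -56
        int naive = b;          // plain widening sign-extends: still -56
        int masked = b & 0xFF;  // the extra AND recovers the unsigned value 200
        System.out.println(naive + " " + masked);
    }
}
```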

Boxing is still bad, because it often means more uncached memory fetches. Container types that hold references instead of values are also still bad. I notice many higher-level languages have been putting more emphasis on value types and unboxed representations recently, because memory indirection is a comparatively worse problem for them than it used to be.
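To illustrate the indirection being described, here's a toy Java sketch (names are mine) contrasting a primitive array with a boxed list; the boxed path goes through an extra reference per element, which is where the potentially-uncached fetches come from:

```java
import java.util.ArrayList;
import java.util.List;

public class BoxingDemo {
    public static void main(String[] args) {
        // Primitive array: the values sit contiguously in one allocation.
        int[] values = {10, 20, 30};

        // Boxed list: the backing array holds references, and each Integer
        // is (potentially) a separate heap object, so each read can cost an
        // extra, possibly-uncached, pointer dereference.
        List<Integer> boxed = new ArrayList<>();
        for (int v : values) boxed.add(v);  // autoboxing allocates/interns Integers

        long flat = 0, indirect = 0;
        for (int v : values) flat += v;        // straight array reads
        for (Integer v : boxed) indirect += v; // dereference, then read
        System.out.println(flat + " " + indirect);
    }
}
```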

Memory is faster and caches are larger, so things that compute on bulk bytestream reads (initial loading of program code, reading large packed .git directories, etc.) should be more consistent across languages than they were in 2009. Back then, it was easier for language-runtime code (GC, JIT, etc.) to push data out of cache and trigger more memory fetches.

I haven't benchmarked JGit or git, these are just my prior assumptions.




> CPUs today have more instructions-per-clock, so things like the extra "& 0xFF" might not affect performance at all

Huh? Yes it does affect performance. That could have been an actually useful instruction; instead we just do "& 0xFF" faster on modern hardware.


> That could have been an actually useful instruction

Only if the data dependency graph is broad enough. There are certain code patterns that are clock-cycle-bound and run with similar performance on older CPUs and newer ones, but they're not common.
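As a toy illustration of "broad enough": a single-accumulator sum is one serial dependency chain (each add waits on the previous one), while splitting it into two independent accumulators gives a superscalar core parallel work to fill its ports with. A Java sketch, names mine; the actual speedup would need a benchmark to show:

```java
public class DependencyWidth {
    // One accumulator: each add depends on the previous result, so the adds
    // form a serial chain and spare execution ports sit idle.
    static long sumNarrow(long[] a) {
        long s = 0;
        for (long v : a) s += v;
        return s;
    }

    // Two independent accumulators: the adds interleave, so a superscalar
    // core can work on both chains at once. Same result, broader graph.
    static long sumWide(long[] a) {
        long s0 = 0, s1 = 0;
        for (int i = 0; i + 1 < a.length; i += 2) {
            s0 += a[i];
            s1 += a[i + 1];
        }
        if ((a.length & 1) == 1) s0 += a[a.length - 1]; // odd-length tail
        return s0 + s1;
    }

    public static void main(String[] args) {
        long[] a = new long[1000];
        for (int i = 0; i < a.length; i++) a[i] = i;
        System.out.println(sumNarrow(a) + " " + sumWide(a));
    }
}
```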

Something like git is used broadly enough that there's probably work for some other process to do (like rendering ads in your browser), but looking strictly at git, I would guess it's difficult to fill an entire CPU pipeline with no gaps.

I think Prime95 is optimized enough to do that in one process, and Intel recommends against running it because it will damage the CPU. Usually code gets blocked on a memory fetch or something before it gets to that point - most code is never that well optimized.


> Intel recommends against running it because it will damage the CPU

Citation? A cursory search suggests this is a fable.

And your argument boils down (or seems to) to the idea that there is always some imperfection, so we might as well not try to care about the others. Which I kind of understand, but I don't think it's a good mindset. If we didn't have to waste cycles on stuff like "& 255" because of language or other self-inflicted limitations, then no matter how fast those cycles complete, we'd have faster software.


There's a thread here on the Intel forums that I remember:

https://community.intel.com/t5/Processors/Simple-instruction...

The description only confirms a "hang", not permanent damage. And erratum SKL082 says

    Under complex microarchitecture conditions, processor may hang with an internal timeout error (MCACOD 0400H) logged into IA32_MCi_STATUS or cause unpredictable system behavior

so it sounds like very optimized code with AVX instructions may have caused some internal component to miss its timing window, because the chip can't actually run them that quickly. So "CPU damage" is overstating it, and I don't keep up to date with CPU errata enough to know whether there have been persistent problems in this area.

> no matter how fast those cycles are completed, we'd have faster software.

I think "speed" is not a good way to think about CPU instructions. I usually think of "resource pressure":

    If a function uses many CPU registers at once, it puts more pressure on the 16 architecturally named registers.
    If there are lots of random memory fetches everywhere, the data dependency graph has to be broader to make the same progress.
    If there are lots of arithmetic operations, those consume ALUs.
    Extra instructions consume memory bandwidth, decoder-queue slots, and µop cache.
    etc.
and it only impacts speed if you hit a threshold and something has to block waiting to obtain one of these. So while it's good to remove pressure if there's no drawback, it won't necessarily translate into a direct speed improvement unless you were blocked on that specific resource.

It sounds pedantic, but CPUs nowadays have so many resources available that you can be quite wasteful and still not see any slowdown. Software developers shouldn't reach for the word "faster" when "releases pressure on resource X" is a better approximation of what's actually happening.

In the specific case of "& 0xFF", it's a clear win. But Java has many other benefits like profile-guided JIT and a GC with different memory pressure than malloc(). If you only think in terms of "faster" or "slower", you won't know how to aggregate all the things that Java and C do to come to a conclusion.

C is well-known for losing performance to pointer aliasing ruining optimizations. "Optimizer pressure" can be another useful idea: making the optimizer work harder and harder to prove some code can be simplified, until eventually it hits a threshold and can't. Java doesn't have unrestricted pointers, so its references are easier to track and reason about.


Some interesting points, thank you. I guess if I wanted a final answer I'd need to look into the specification sheet for a specific CPU to see to what extent it can parallelize and/or reorder instructions, or perhaps even break them down into "micro-ops". And then a new CPU comes out and I'd have to do it again. Maybe, for a higher-level programmer (I'd call myself that), this is indeed a waste of time for very little gain.


HotSpot usually optimizes the signed-to-unsigned conversion away. This is quite evident in the machine code generated for byte-array accesses. Although I still find it quite silly that the original post even considers this an important point, when the biggest bottleneck for git is the file system.


Well, if JGit really is 2x slower than C git as claimed, then there's a bottleneck somewhere and it's not the filesystem.


Most likely it's due to the way file I/O is handled. There are extra memory copies going on that are hard to eliminate.
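One common way to avoid one of those copies in Java is memory-mapping: a plain read() into a heap byte[] copies out of the OS page cache, while NIO's FileChannel.map exposes the page cache directly. A minimal, self-contained sketch (file name and contents are made up):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapRead {
    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("demo", ".bin");
        Files.write(p, new byte[] {1, 2, (byte) 0xC8});
        try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
            // map() views the OS page cache directly, skipping the extra
            // copy into a heap byte[] that a plain read() would make.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            long sum = 0;
            while (buf.hasRemaining()) sum += buf.get() & 0xFF; // unsigned byte read
            System.out.println(sum); // 1 + 2 + 200 = 203
        } finally {
            Files.deleteIfExists(p);
        }
    }
}
```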


Actually, the "& 0xFF" depends on the previous computation, so it won't execute in parallel; while very cheap, it increases the latency of an operation that might sit on a critical chain.


Are you claiming that because modern CPUs have a higher instructions-per-clock count, you can insert useless instructions without performance impact? That's ridiculous.



