
I think I landed in a place where it's basically "the compiler has insufficient information to achieve ideal optimization because some things can only be known at runtime."

Which is not exclusively an argument for runtime JIT— it can also be an argument for instrumenting your runtime environment, and feeding that profiling data back to the compiler to help it make smarter decisions the next time. But that's definitely a more involved process than just baking it into the same JavaScript interpreter used by everyone— likely well worth it in the case of things like game engines, though.




The problem with JIT is that not all information known at runtime is the correct information to optimize on.

In finance the performance-critical code path is often the one run least often. That is, you have an if (unlikely_condition) { run_time_sensitive_trade(); }. In this case you need to tell the compiler to arrange the code so that the CPU takes the branch-misprediction pipeline stall most of the time, so that on the rare occasion that counts the pipeline doesn't stall.
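A minimal sketch of that inversion, assuming C++20 (the function names are just the ones from the example above, declared here for illustration): the rare trade path gets the likely hint so it becomes the fall-through path, and the common do-nothing path absorbs the mispredictions.

    bool unlikely_condition();        // assumed: rarely true (market event)
    void run_time_sensitive_trade();  // assumed: the latency-critical path

    void poll_market() {
        // Hint inverted on purpose: optimize for the rare path, because that
        // is the only execution that matters for latency.
        if (unlikely_condition()) [[likely]] {
            run_time_sensitive_trade();   // kept as the hot, fall-through code
        }
        // Common case: nothing to do on this tick; a mispredict here is fine.
    }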

The above is a rare corner case for sure, but it is one of those weird exceptions you always need to keep in mind when trying to make any blanket rule.


The other issue with JIT is that it is unreliable. It optimizes code by making assumptions. If one of the assumptions is wrong, you pay a large latency penalty. In my field of finance, having reliably low latency is important. Being 15% faster on average but occasionally really slow is not something customers will go for.


I'm not in finance. I just remember one talk by a finance guy and it was a mind bender so I remember it.


How does the compiler arrange for the CPU to mispredict the branch most of the time? I didn't think there were any knobs for the branch predictor other than static ones (e.g. backwards-jumps statically predicted as taken, or PowerPC branch hint bit).


C++20 is gaining the [[likely]] attribute for guiding the predictor in a portable way:

http://eel.is/c++draft/dcl.attr.likelihood

Prior to this it was also possible, but required compiler-specific macros like GCC's __builtin_expect, which is used all over the kernel:

https://github.com/torvalds/linux/search?q=__builtin_expect
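For reference, a minimal sketch of the kernel-style wrappers (GCC/Clang only; the actual kernel definitions differ slightly):

    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    int parse(const char *p) {
        if (unlikely(p == nullptr))   // hint: the error path is cold
            return -1;
        return *p;                    // hot path is the straight-line continuation
    }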


To my knowledge this does not have any direct impact on the CPU branch prediction mechanism. If that were the case, we would expect at least some x86(-64) instruction for manipulating the branch predictor. If I write a quick example such as https://godbolt.org/z/j7Y81j5fe, I also see no such instruction in the generated output.

What the likely/unlikely mechanism can do is serve as a hint that the compiler uses to generate a better code layout. For example, if we provide likely/unlikely hints in our code, the compiler will try to stitch together the more probable code paths first. In theory this should result in better utilization of the CPU instruction cache and thus better performance.
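As a rough illustration of that layout effect (GCC/Clang assumed; report_error is a hypothetical handler): marking the error path cold lets the compiler move it out of the hot instruction stream, e.g. into a .text.unlikely section, so the hot path packs more densely into the instruction cache.

    [[gnu::cold]] void report_error(int code);  // hypothetical slow path

    int process(int value) {
        if (value < 0) [[unlikely]] {   // rare input
            report_error(value);        // cold code can be placed far away
            return -1;
        }
        return value * 2;               // likely path remains the fall-through
    }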


There are no ways to hint the branch predictor on some common CPUs. I think Intel says it's fully dynamic and has no static predictions.

If you want to cause a pipeline stall, there are usually serializing instructions like cpuid/eieio.
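For example, a common x86-64 idiom (GCC/Clang inline asm assumed) deliberately uses cpuid as a serializing fence before reading the time-stamp counter:

    #include <cstdint>

    static inline uint64_t serialized_rdtsc() {
        unsigned eax = 0, ebx, ecx, edx;
        // cpuid is architecturally serializing: it drains the pipeline.
        asm volatile("cpuid" : "+a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx));
        uint32_t lo, hi;
        asm volatile("rdtsc" : "=a"(lo), "=d"(hi));  // read TSC after the fence
        return (static_cast<uint64_t>(hi) << 32) | lo;
    }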


I believe Intel used to predict forward branches as not taken and backward branches as taken if it didn't have any history.

Regardless, you can at least lay out code in a way that heavily favors one branch over the other, and potentially optimize for one branch (i.e. give one branch a very expensive prelude/exit to make the other branch very cheap). That last technique is something I've seen compilers greatly struggle with, though…


20 years of "progress" and hey, C++20 is gaining the likely attribute.


CPUs have documentation for this. I forget which one, but the one I did read was as simple as: the true case is assumed to be more common, and it is easy to arrange your logic around that (I probably misremember, but it was close to that). The common case should also be near the if in memory (inline code), so it is likely on the same cache line (or the next one, which is prefetched), while the other case is farther away, so you can take a cache miss if the if goes that way.


The problem with static compilation is that not all information known at compile time is the correct information to optimize on.

Assuming either source of data is the single source of truth for all optimizations is a fallacy.

Use the right tool for the right job. Use all the tools if you can.


Static compilers usually don't have to make such a tradeoff, though. They are free to spend arbitrarily long amounts of time optimizing all branches. And they often do exactly that.

Static + LTO w/ PGO is pretty much the practical ideal. JITs don't offer much until you start adding dynamically loaded code where LTO just isn't possible anymore.


Tooling and real data sets.

Most PGO-for-AOT scenarios suffer from a poor tooling experience and from not using data sets similar to production workloads.


Perhaps, but that's orthogonal. JITs don't just come with ideal production set sampling either after all. They only capture snippets, and rarely re-optimize already compiled functions in the face of changing workloads or new information. You can do crowd-sourced profiles (like Android does), but that's a completely independent set of infrastructure from JIT vs. AOT. You can feed that same profile to PGO for AOT. In fact, that's how Android uses it. They don't feed the profile to the JIT, they feed it to their offline, install-time or idle-maintenance AOT compiler.


If your hardware is designed to allow very lightweight profiling and tracing, then static + LTO w/ PGO can still be improved by runtime re-optimization. If designed properly, the runtime overhead can be brought arbitrarily low by increasing the sampling period.


Are you speaking in theoreticals or can you point to any actual example of what you're describing? Usually for a JIT to work the source binary has to be unoptimized to begin with, otherwise information is lost that the JIT needs.

So what language/runtime out there ships unoptimized bytecode, an optimized precompiled static + LTO w/ PGO binary, and can re-optimize with runtime-gathered information via a JIT?

Heck, what language/runtime is even designed around being performance-focused in the first place, with a JIT to deliver even more performance? Pretty much every JIT'd language makes trade-offs that sacrifice up-front performance and later hopes the JIT can claw some of it back. Maybe this is kinda WASM-ish territory, although WASM then sacrifices up-front performance for security and hopes the JIT can claw it back.


The basic elements all exist.

As others have pointed out, HP's Dynamo managed to use runtime re-optimization to improve performance of many binaries without cooperation from the original compiler. Runtime optimization doesn't strictly require any of the information lost in optimized binary builds.

Last I checked, the Android Runtime would AoT-compile Dalvik bytecode at install time, and in the background re-optimize the binary based on profiling information. Though, I don't think it performs hot code replacement.

I'm not sure what the latest is with Oracle's Java AoT. Last I checked, Oracle's JVM wasn't able to inline or re-optimize AoT-compiled code through JIT'd code.

Some optimizations, such as loop unrolling, make JIT'ing harder. However, strength reduction, loop-invariant code hoisting, etc. make the JIT's life easier. Back around 2005-2006, my employer was getting good mileage out of a Java bytecode optimizer. If your AoT and JIT are cooperating, the AoT can stuff any helpful metadata (type information, aliasing analysis, serialized control flow graph, etc.) into an auxiliary section of the binary.

I'd like to eventually write a C compiler that essentially compiles to old-school threaded code: arrays of pointers to basic blocks (straight-line code with a single entry point and one or more exit points). Function entry would just pass the array of basic blocks to a trampoline function that calls the first basic block. Each basic block would return an index into the function's array of basic blocks for the trampoline to call next. Function return would be signaled by a basic block returning -1 to the trampoline loop. A static single assignment representation of each extended basic block would be stashed in an auxiliary ELF section.

On a regular system without the runtime optimizer, the only startup overhead would be due to bloated binary size. A wild guess at the performance overhead without the runtime optimizer would be in the 5% to 15% range. However, on a system with the ELF dynamic loader replaced with a runtime optimizer, the runtime would set up a perf signal handler that keeps counters for identifying hotspots to trace. If the tracing conditions were met, the perf signal handler would walk back up the call stack to find the last occurrence of the address of the trampoline, and replace it with a version of the trampoline that additionally stores the address of the next extended basic block to run. Once a trace loops back on itself or meets some other TBD criteria, the runtime would stitch together the SSA representations of the constituent extended basic blocks, generate a new optimized basic block that is the inline of the components of the trace, and finally place the address of the new extended basic block in the correct place in the array of extended basic blocks, thus performing runtime code replacement.

Anyway, that's my grand vision. I've taken the introductory Stanford compilers course, and am slowly working my way forward, but I have a job and a young kid, so I'm not holding my breath.
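A minimal sketch of that dispatch scheme (all names hypothetical, hand-written here where a real compiler would emit it): each basic block is a function doing straight-line work that returns the index of the next block, or -1 to signal function return.

    #include <cstdio>

    struct Ctx { int x; int result; };
    using Block = int (*)(Ctx &);          // returns index of next block, or -1

    static int bb0(Ctx &c) { return c.x < 0 ? 1 : 2; }         // the branch
    static int bb1(Ctx &c) { c.result = -c.x;    return 3; }   // rare path
    static int bb2(Ctx &c) { c.result = c.x * 2; return 3; }   // common path
    static int bb3(Ctx &)  { return -1; }                      // function exit

    static const Block blocks[] = { bb0, bb1, bb2, bb3 };

    // The trampoline a runtime optimizer could later swap for a tracing one.
    static int run(Ctx &c) {
        for (int next = 0; next != -1; )
            next = blocks[next](c);
        return c.result;
    }

    int main() {
        Ctx c{5, 0};
        std::printf("%d\n", run(c));       // prints 10
    }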

Among other things, this allows for inlining of hot paths across dynamic library boundaries. It also improves code locality and should increase the percentage of not-taken branches in the hot path, which should help reduce problems with aliasing in the branch predictor.


> In finance

Isn't low-latency trading only a subset of finance?


Yes. And the part I mentioned is a subset of low latency. See the other reply.


It's also an argument for having much more expressive and precise type systems, so the compiler has better information.

Once you've managed to debug the codegen anyway (see: The Long and Arduous Story of Noalias).


Is it? I'd love to see a breakdown of what classes of information can be gleaned from profile data, and how much of an impact each one has in isolation in terms of optimization.

Naively, I would have assumed that branch information would be most valuable, in terms of being able to guide execution toward the hot path and maximize locality for the memory accesses occurring on the common branches. And that info is not something that would be assisted by more expressive types, I don't think.



That's a lesson Intel had to (re-?)learn with Itanium as well.


> […] "the compiler has insufficient information to achieve ideal optimization because some things can only be known at runtime."

This is where profile-guided optimisation comes in for statically compiled languages, with the caveat that it's not always straightforward to come up with a set of inputs that will trigger execution of all possible code paths. One solution is to provide coverage specifically for the performance-critical code paths and let the rest just be.


> it can also be an argument for instrumenting your runtime environment

Aren't JITs already self-instrumenting? What would you instrument that the JIT is not already keeping track of?


Darn it, replied too early. See sibling comment I just posted. The problem with dynamic languages is that you need to speculate and be ready to undo that speculation.



