Exploring PGO for the Rust Compiler

Ar-Curunir · on Nov 11, 2020

I think it would be a massive boost to the ecosystem to have at least stable and beta use PGO; 16% speed ups are amazing at this point in Rust’s lifetime!

kibwen · on Nov 11, 2020

Experiments with a post-link optimization pass (for rearranging symbols in the binary for optimal access patterns, e.g. BOLT) have suggested another 5% speedup is possible. If LLD ever gets enabled by default that could be another 25-50% off when compared to the system linker (e.g. https://github.com/rust-lang/rust/issues/39915#issuecomment-... ). And so far Cranelift ( https://github.com/rust-lang/rust/pull/77975 ) shows about 20-30% improvement on debug builds, keeping in mind that what it's gaining from doing less optimizations is being partially offset by giving the linker more to chew on; using Cranelift in conjunction with LLD can give much better results ( https://github.com/bjorn3/rustc_codegen_cranelift/issues/878... ).

pjmlp · on Nov 12, 2020

I look forward to when Rust toolchains can compete with Delphi/Eiffel/.NET/Java in what concerns debug builds, I can let the CI/CD take care of release ones.

On a positive note, congratulations to everyone involved doing this work.

Ar-Curunir · on Nov 12, 2020

I'm very excited for my rust builds to get 2x faster over the next year =D

chrismorgan · on Nov 11, 2020

If stable and beta used PGO and nightly didn’t and was consequently 10–20% slower, then there would suddenly be an actual reason to not use nightly! I’m not sure whether that would be good or bad.

est31 · on Nov 11, 2020

There are multiple reasons to avoid nightly. One of them is that you don't have to constantly rebuild the entire graph when you want to stay up to date because stable is "only" updated every 6 weeks. Another being that you are less subject to regression bugs hitting you.

gpm · on Nov 11, 2020

You can "only"update nightly every 6 weeks as well, or every 6 months. Nothing says you have to stay up to date.

In practice I update when there is something I want in a more recent release. This happens more often with nightly because there are more bugs to be fixed and feature changes, but it isn't that often.

bluejekyll · on Nov 11, 2020

The problem I’ve found with working on nightly is that APIs and certain impls of features you might rely on are constantly in flux. This ends up requiring you to pin to a specific nightly build date. And then the ecosystem that might also be relying on night starts to shift, and all the sudden you don’t have a stable build anymore.

I never experience that on stable, and it’s been my primary reason for not recommending Rocket for so long, even though that it’s an excellent library in all other regards. (I believe it’s working with stable in the master branch at this point, and just waiting on a release of 0.5).

gpm · on Nov 11, 2020

Yes, this is an issue, I avoid it by making it a rule for that either I nightly features and none of my dependencies do, or at most one of my dependencies do and I do not.

est31 · on Nov 11, 2020

And if you find a bug? If it's in a recent nightly or a stable release, you can rightfully file a bug. It will be either fixed on master or a point release will be made. If your nightly is weeks old, you don't really know whether master has fixed it weeks ago or a bug report would be welcome. So you lose this advantage.

Anyways, that's not an exhaustive list. Not wanting to deal with rustup might be another reason so you use your distro rustc packages instead.

gpm · on Nov 11, 2020

Finding a compiler bug is rare... but when I do being on an old nightly tends to speed up the reporting process. That's because this is the process

1. Search the bug tracker for the message/symptoms. If you find the bug (more likely on older nightly's) stop here.

2. Create a minimal test case by creating a copy of the code and then deleting large portions of it until you have a dozen or so lines and no dependencies that reproduce it and are all needed to reproduce it.

2.5. If using an old version of rustc, `rustup override set nightly` to use a recent nightly on this directory and see if the bug is gone/changed. This takes about 30 seconds (compilation is instantaneous since I've already deleted all but a dozen lines of code).

3. Search the bug tracker again now that we have a better idea what is going on.

4. If there really is nothing, report it.

The time savings by bailing out early at 1. greatly exceed the time cost introduced by 2.5.

If you're using nightly you should be using rustup... but since rustup is so easy to use that's a very very low cost.

fluffything · on Nov 11, 2020

I would just use beta instead and "turn it" into a nightly compiler with RUSTC_BOOTSTRAP=1.

dodobirdlord · on Nov 11, 2020

That flag does more than just enable unstable features in the stable compiler, and it should be avoided unless you are actually compiling rustc.

ncmncm · on Nov 11, 2020

My experiments with profile-guided optimization have been thus far disappointing.

Formulations I have encountered fail to respond to density of branch prediction failures, which can easily make a 2x difference in performance, as it does in one program I have. If the profile were to log branch prediction failures, it might be able to favor conditional moves over branches in cases where it is found that the result of the condition is not predictable, or the reverse.

I have another (in this case, Rust) program that demonstrates 2x performance difference based on the order in which exactly two instructions are emitted. Profile-guided optimization always chose the slower version. In that case, putting the instructions in the correct order enabled fusing two pairs of instructions and looping in one cycle. In the chosen order, the instructions did not fuse, and the loop took two cycles. That loop dominated execution time, for an almost 2x overall result. To choose correctly depends on the compiler knowing a great deal more about the target microarchitecture than can reasonably be expected of a compiler. (I know that it chose wrongly only because I had an exact C++ translation of the program, and the C++ compiler was sometimes persuaded to choose correctly.)

The point I make here is that the choices that optimizers make in response to profile information typically amount to much less than 10% of runtime, so are easily overwhelmed by effects of random microarchitectural details that are not practical for a production compiler to act on. What seems to the optimizer like a better choice may be, instead, radically worse. That they achieve any positive result, on average, with or without profile guidance, is impressive: a dancing bear.

The fact also means that we generally don't know how fast a program ought to run, or whether 2x gains from trivial choices remain to be discovered. This is fallout from the deal we have made with the Devil by relying so heavily on very complex runtime optimization mechanisms whose consequences we poorly understand.

Our programs run much faster with such hardware acceleration gimcrackery, but we can no longer reason about the effect of a prospective change, and must resort to measuring. Measuring, we don't know what effects are or are not in play in different runs. We don't even know all the choices available, as microarchitectures are always very sketchily documented. They will only ever become less predictable: what is faster on skylake may be persistently slower on mudlake.

scottlamb · on Nov 12, 2020

> I have another (in this case, Rust) program that demonstrates 2x performance difference based on the order in which exactly two instructions are emitted. ... That loop dominated execution time, for an almost 2x overall result.

I think a program like this—where two instructions in one loop is almost the entire execution time—isn't the right one to expect profile-guided optimization to greatly improve. I think instead it's complex programs where there's too much hot stuff to reasonably put human attention into it all. On programs like this, I really have gotten 15% improvements from an AutoFDO pipeline. Honestly I don't know exactly how it achieves that. I can make some guesses but will probably be wrong. Eg I was surprised when the "machine function splitter" was introduced recently that it wasn't already doing that. But generally speaking, I'd expect skilled human optimization to make the most difference at the smallest scales: less hot code, changing infrequently, targeting fewer critical (micro)architectures. I'd expect compiler optimization to make the most difference elsewhere. I don't think profile-guided optimization is an exception to this general rule.

> To choose correctly depends on the compiler knowing a great deal more about the target microarchitecture than can reasonably be expected of a compiler.

fwiw, it doesn't seem crazy to me to expect the compiler to know what operations can be fused.

ncmncm · on Nov 12, 2020

I explained the point in the very next paragraph, starting "The point I make...".

Sometimes the compiler can guess what operations can be fused, if it knows what hardware the program will run on, and the vendor has revealed those details, and the compiler has incorporated all of them. But almost all programs are built for an ancient target, using a compiler version older than the actual target.

When you really care about performance of a program, you compile it yourself, if you can, and add as much "-march=", "-lto", etc. hackery as you trust, and measure what difference all the different combinations (which interact!) make, and make sure they didn't introduce bugs (which they often do).

But, as I noted, even after that there are lots of 2x effects the compiler doesn't understand. Many programs don't spend all their time in one loop, but almost all spend all their time in loops.

scottlamb · on Nov 12, 2020

Right, everything you've said is consistent with my "I'd expect skilled human optimization to make the most difference at the smallest scales: less hot code, changing infrequently, targeting fewer critical (micro)architectures".

In other situations, many folks (including me) get 15% improvements from PGO, and it's hard to take seriously any argument that suggests that can't happen.

ncmncm · on Nov 13, 2020

In other words, you didn't read what you were replying to: "That [optimizers] achieve any positive result, on average, with or without profile guidance, is impressive".

What you call "skilled human optimization" appears to refer to the noted more or less accidental optimizations cited. We may congratulate ourselves when we find one that turns out can be applied elsewhere, and even program it into our compiler, but these are benefits more often of luck than skill, and are often much less durable than we hoped, washed away in the next microarchitectural generation, and sometimes even come to degrade performance.

In the second example cited, recent Gcc versions deliberately intersperse other instructions between those that could be fused, to fill in for pipeline delays in a previous generation; but the benchmarks referenced for current chips were compiled with a previous compiler that didn't. So, current compilers improve performance for obsolete hardware and pessimize current hardware.

Potential improvements in the next microarchitectural generation are often held hostage by optimizations performed in old compilers, used in reference benchmarks, that would be pessimizations in the better microarchitecture (therefore) not shipped.

scottlamb · on Nov 13, 2020

You led with "my experiments with profile-guided optimization have been thus far disappointing". My point is that in many cases people get 15% improvements. Do you consider that a disappointing result? Are you disregarding other people's results? or are you disappointed that they don't apply in your specific situations? I originally gave you the benefit of the doubt by thinking it was the last of these, and speculated about how your situation differs. Now I don't know what you want, except to show off how smart you are vs how poor a reader I am, regardless of accuracy. I'm not interested. Bye.

ncmncm · on Nov 13, 2020

Yes: 15% is rarely achieved, yet still disappointing.

As I actually came out and said, microarchitectures are now so complicated that we cannot program optimizers to do better. Instead we resign ourselves to (more usually) 10%, and congratulate ourselves on getting that far.

todd8 · on Nov 12, 2020

It's not clear from the article, but PGO isn't a new technique. In 1989, my Decstation 3100 running a Berkley kernel based Unix on a MIPS processor had a compiler option that used a profile to guide its C compiler optimizer. I don't recall how well it worked. At that level of optimization I tended to use it very little since I was spending most of my time trying to architect, develop and debug a new distributed system.

StreamBright · on Nov 11, 2020

Is this about https://en.wikipedia.org/wiki/Profile-guided_optimization?

ibraheemdev · on Nov 11, 2020

The_rationalist · on Nov 11, 2020

I hope that the rust compiler team is considering to integrate BOLT by default too https://github.com/facebookincubator/BOLT (I'm aware it's being integrated into llvm but rust could integrate the external BOLT in it's toolchain in the meantime, free performance gain for everyone is nice to have right now)

ibraheemdev · on Nov 11, 2020

This was discussed here [1]. The conversation died off around a year ago. Not sure why... Although, it is pretty clear that BOLT's optimizations are not "free". For example:

> BOLT uses enormous amount of memory (about 6GB).

> Perf needs to be ran with LBR support; this is almost always unsupported with VMs (which means you don't want to run the measurement inside CI).

[1] https://github.com/rust-lang/rust/issues/50655

crb002 · on Nov 11, 2020

BOLT just permutes link order? The object permutation from a BOLT run should be good for months until the underlying objects have substantially drifted?

The_rationalist · on Nov 11, 2020

Interesting, would be nice for rustc to have a special flag that enable "maximum performance" at any compile time/ram price

the8472 · on Nov 11, 2020

At any price? You could throw something like souper at the problem[0]. It makes compilation about 20 times slower in some benchmarks in their paper. This is on the IR level, iirc similar optimizers exist for assembly generation, so you can throw that in as another pass.

[0] https://github.com/google/souper

MaxBarraclough · on Nov 12, 2020

> At any price?

Seems to me that raising the ceiling on optimisation is good to have, even if the computational cost is very high. If it works reliably, it wouldn't be necessary to do optimised release builds very often, and they could be offloaded onto a heavyweight build server, perhaps on the cloud.

glandium · on Nov 12, 2020

Propeller is probably better (and should eventually be integrated in LLVM itself). http://lists.llvm.org/pipermail/llvm-dev/2019-September/1353...

It's still Linux only, though.

ibraheemdev · on Nov 11, 2020

TLDR -- PGO reduces build times by 10-16% but is not straightforward to realize in CI.

The_rationalist · on Nov 11, 2020

Why isn't it straightforward, isn't autofdo automating the process?

ibraheemdev · on Nov 11, 2020

> Unfortunately, bringing these improvements to end users is not as simple as adding a few compiler flags to our dist builds. PGO is different from most other optimizations in that it:

- requires a different, extended build workflow due to the additional instrumentation and data collection phases, and

- it incurs a sustained build time cost (a trait it shares with other automated optimizations like LTO).

The_rationalist · on Nov 11, 2020

Autofdo if automatically called by the rust toolchain should alleviate the first point https://github.com/google/autofdo (though it only work on Linux)

How much percent slower is the build time with PGO on average?

oscardssmith · on Nov 11, 2020

It depends. The thing that makes pgo good is that it bases it's compile off of analysis of runtime behavior. For that to work, you have to run the problem long enough to get a representative sample. For some programs, that can be a big speed decrease.

pabs3 · on Nov 12, 2020

Does PGO mean the build products are non-deterministic? ie if you build the Rust compiler twice you get different bits.

connicpu · on Nov 12, 2020

It would be deterministic for a given set of input profile data, but would likely change every time you instrumented a new profile

mlindner · on Nov 12, 2020

That assumes any random number generator that a branch is based off of in the code flow was also deterministic, which is unlikely.

hansjorg · on Nov 12, 2020

Is the profile data or the resulting PGOed binary useful for identifying spots that could be improved manually?

Arcuru · on Nov 12, 2020

Instead of wall time, the cycle counts should be a little less noisy as they will actually account for the threads getting switched out. In this case they show a relatively similar reduction from a quick glance.

> One plausible explanation would be that PGO improves instruction cache utilization, something which makes a difference for execution time but would not be reflected in the amount of instructions executed.

Yes, this is a big reason why PGO is useful. It places the hot code together for improved ICache performance, while keeping the colder basic blocks separate (though there are likely some strategies here I don't know about). You could show the ICache miss counter and prove that this is the case.

> I also don't know how branch mispredictions factor into instruction counts -- branch prediction being another aspect explicitly targeted by PGO.

Branch mispredictions won't show up directly in the instruction counts, but they will show up indirectly through lower CPI or you can explicitly look at the counters for branch misses and compare between the two runs. If you were asking about whether speculative instructions from the missed branch are counted, they are not. Instructions are only counted once they are retired. (Intels counter for this is specifically named "Instructions Retired" to clarify that distinction).

> In order to make the collected data as useful as possible, we should try to exercise all the common code paths within the compiler. I typically use the "standard" rustc-perf benchmark suite for this purpose

This is not correct, though maybe just misstated. You need to exercise the code with inputs that are representative of the code you want to optimize for. Mozilla does this with the Firefox build (presumably) by using the Firefox build itself as an input since it is small enough. That's a perfect 1:1, so their performance gains are as good as possible.

This post does the same thing, using the same Benchmark suite as training data and for testing, but the goal is to use this optimized rustc to compile other Rust programs. It would probably be better to show the results on a separate set of benchmarks to see how well PGO does for the general use case of "compiling Rust code" instead of "compiling the training data".

Source: Did a lot of performance analysis (with a heavy use of counters) at my last job at Microsoft. They were very useful when we needed to do detailed analysis between x86 and ARM, among other things. I didn't work on PGO but a sibling team did, it was used extensively for the Windows builds. It was an interesting job, but unfortunately moving further up the stack to be a low level code monkey working on boring backend development at a colorful cloud company pays significantly better.

runiq · on Nov 12, 2020

> This post does the same thing, using the same Benchmark suite as training data and for testing, but the goal is to use this optimized rustc to compile other Rust programs.

So, in my pipedream future, what's stopping crate authors from uploading the PGO profile for the build along with a new crate release? Even better, let CI handle it. Sounds like a logical next step.

crb002 · on Nov 11, 2020

Makes sense at this point to work with Sylvestre Ledru who maintains the Debian llvm-toolchain and rust packages.

ibraheemdev · on Nov 11, 2020

How do Rust build times compare to Scala?

nindalf · on Nov 11, 2020

It’s hard to compare across languages. Too many independent variables to consider. Take the example of grep vs Ripgrep. Is it fair to compare the compile times of grep written in C with ripgrep? I can see arguments both ways.

What I can say with confidence is that Rust build times have improved continuously in the last 2 years. It’s now 30-50% faster on common programs. The future is bright too, a new debug back end out in 2021 promises a further 50% improvement.

ibraheemdev · on Nov 11, 2020

Do you have a reference for that "new debug back end"?

estebank · on Nov 11, 2020

https://hacks.mozilla.org/2020/10/a-new-backend-for-cranelif...

https://github.com/bjorn3/rustc_codegen_cranelift

ComputerGuru · on Nov 11, 2020

Project Cranelift

dmit · on Nov 11, 2020

One of the reasons why it's hard to compare the two is that the standard metric is time/loc, and it varies hugely depending on which language features are used. Both languages have features that can blow up compile times by a factor of 1000 or more.

lalaithion · on Nov 11, 2020

In my experience, cargo build (as of recently) is significantly faster than sbt compile (as of 3 years ago), but only a little faster than compile (as of 3 years ago) called from within the sbt shell.

This comparison is my gut feeling, on pretty small programs.