They trained (and report) two optimization strategies:
- inline-for-size
We trained the inlining-for-size policy on a large internal software package containing 30k modules. The trained policy is generalizable when applied to compile other software and achieves a 3% ~ 7% size reduction.
- regalloc
The regalloc policy achieves 0.3% ~ 1.5% improvements in queries per second (QPS) on a set of internal large-scale datacenter applications.
Try it Yourself
Check out the open-sourced end-to-end data collection and training solution on GitHub, and a demo that uses policy gradient to train an inlining-for-size policy.
https://github.com/google/ml-compiler-opt
https://github.com/google/ml-compiler-opt/blob/main/docs/demo/demo.md
Compilers have soooo many heuristics. And a lot of them look like a chess or Go game: you have a list of hundreds of optimization passes (possible moves) that preserve the semantics of the program, but you have limited compute with which to run iterations of these passes. What order and combination of them should you use?
This is known as the "phase ordering problem". Despite the super-exponential (even unbounded, because phases can be repeated) space, I think this is more tractable for certain subsets. For example, in sea-of-nodes compilers like HotSpot, Graal, and TurboFan, forward dataflow analyses that do reductions (i.e. always make the graph smaller and/or cheaper) can be combined into one mega-pass where they positively interact with each other. The problem is that some problems are not forward dataflow problems and some transformations are not simple reductions. Also, some phases make use of type information which can be destroyed or obscured by lowering.
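To make the "mega-pass" idea concrete, here is a toy Python sketch (invented names, simple tree rewrites rather than a real sea-of-nodes dataflow framework, nothing taken from HotSpot/Graal/TurboFan) of why rules that only ever shrink the IR can share a single fixed-point traversal:

    # Toy sketch: fusing independent reduction rules into one fixed-point pass.
    # All names here are invented for illustration; this is not real compiler code.

    def fold_constants(node):
        # ("+", 1, 2) -> 3
        if node[0] == "+" and isinstance(node[1], int) and isinstance(node[2], int):
            return node[1] + node[2]
        return node

    def strength_reduce(node):
        # ("*", x, 2) -> ("+", x, x)
        if node[0] == "*" and node[2] == 2:
            return ("+", node[1], node[1])
        return node

    def eliminate_identity(node):
        # ("+", x, 0) -> x
        if node[0] == "+" and node[2] == 0:
            return node[1]
        return node

    REDUCTIONS = [fold_constants, strength_reduce, eliminate_identity]

    def reduce_expr(node):
        # Reduce children first, then apply every rule at this node until nothing
        # changes. Because each rule only ever makes the tree smaller or cheaper,
        # the rules can share one traversal and still terminate.
        if not isinstance(node, tuple):
            return node
        node = tuple(reduce_expr(c) for c in node)
        changed = True
        while changed:
            changed = False
            for rule in REDUCTIONS:
                new = rule(node) if isinstance(node, tuple) else node
                if new != node:
                    node, changed = new, True
        return node

    print(reduce_expr(("*", ("+", 1, 2), 2)))    # 6: fold feeds strength reduction feeds another fold
    print(reduce_expr(("+", "x", ("*", 0, 2))))  # 'x': the rules cooperate without a fixed pass order

The point is that no pass ordering decision is needed among the three rules: each one's output is immediately eligible as another's input.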
Today, learned heuristics have a couple of pitfalls that make them hard to add to such systems:
1. They are usually hard to run efficiently.
2. They are usually hard to explain.
The former is definitely changing with low-precision formats like fp16 and useful coprocessors that can do matrix multiplications efficiently (M1, Intel). The latter hasn't been developed much, and unless you're just training a model to memorize the entire space the heuristic operates in, it can be scary to trust it on unseen data.
Most interesting cases don't really look like this. The heuristic is applied to the user's code; it's not a one-time knob in the compiler. If it were, you would likely be able to afford an exhaustive search to pick it and wouldn't need ML.
The analogy I'm making doesn't especially apply to compilers, which have human-defined defaults in the first place.
What I've found is that it's important not to just run the ML model and use its output as the default state. Have a human-chosen heuristic as the default state.
Also, they are not stable. I.e., if you have a fast program and you change a tiny detail, it is not guaranteed that the program remains fast. The same applies between versions of the compiler.
Which heuristic you should use is often unknowable, because it's often about how the code is driven by the data: you need to know the characteristics of the data to know how the code will behave, and only then can you meaningfully talk about picking a 'better' heuristic.
Analogy: otherwise you're just optimising the design of a car. But optimising it for what? Speed, efficiency, reliability, price, weight, carrying capacity... You first need to know how it's expected to be used.
I guess local inlining might sometimes be an unconditional win, but even then only under specific circumstances.
(disclaimer: I know something but am not an expert)
A quick Google search on the subject is not so clear-cut (try "ML compiler" and "compiler ML").
In either case, this is a discussion of compilers where the audience skews toward an interest in programming languages. In this context the abbreviation is ambiguous.
I think a rename would be helpful in clearly outlining the subject.
Something like 'Google Deep Learning based Compiler Achieves 3-7% reduction in size' would get the point across clearly.
Eh, the OCaml crowd is surely reasonably large, and they would all be familiar with ML.
I had the same question as the parent commenter and had to wait until the page loaded before determining whether the article was about machine learning or the programming language.
...and other things HNers tell themselves. More at 11.
Seriously tho, I’m unable to find market share numbers because most survey results only list the top 35 or so languages. The last one is COBOL at 0.5%. For comparison, the Lizardman Constant is 4%: https://slatestarcodex.com/2013/04/12/noisy-poll-results-and...
There is also the much larger confusion over whether it's an ML compiler in the sense of a specialized compiler for ML code (one that compiles a description of a network into efficient ML-specific low-level IR to enable further optimization and code generation), or an ML compiler in the sense of a compiler (or part of one) that uses ML.
The first meaning is extremely common; the top 2 results for "ML Compiler" on Google return [1][2], both using it in the first sense. It's not that ML or ML-like techniques in compiler writing are new, but the first sense is definitely at least as popular an interpretation.
I wonder how well it performs on Rust code, which is probably different in terms of patterns from C++ code. They mention the Fuchsia project but they seem to have focused on its C++ components. There is actually more Rust inside Fuchsia now than C++ [0].
I also wonder what this would look like for mainlining. Should the LLVM project depend on TensorFlow now? IIRC TensorFlow itself depends on LLVM, so to avoid circular dependencies, does there have to be an ML-free version of LLVM that TensorFlow depends on, which is then used by the proper LLVM? Or can inference be converted into simple C like it was for lpcnet? Lastly, there is the general question of integration of ML models into open source projects. Say it is merged and the old manual heuristic is deleted. What if Google one day decides they don't want to maintain the component any more? Can LLVM maintainers do any refactors of the code around it? Unless Google also shares their training infrastructure, LLVM maintainers can't re-train the model on the post-refactor data.
I remember listening to one of the Lex Fridman interviews with Jim Keller where he said that modern branch prediction in CPUs was now done by “neural nets” in silicon. Does anyone here have any insight into this?
If you search "AMD Zen neural branch prediction" you'll find several references to this, but very few details. I dunno if perceptrons really count as "neural nets" these days, since a perceptron is just a weighted sum with a threshold. Lots of computations are technically neural nets if we use such broad definitions.
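For anyone curious, here is a minimal sketch in the style of the Jimenez & Lin perceptron predictor papers; it is illustrative only (table size, threshold and indexing are simplified) and not a description of any shipping AMD design:

    # Minimal perceptron branch predictor sketch (Jimenez & Lin style), purely
    # illustrative; table sizes, threshold and indexing are simplified.
    HISTORY_LEN = 16
    THRESHOLD = 44                # roughly 1.93 * HISTORY_LEN + 14, per the paper
    NUM_PERCEPTRONS = 1024

    weights = [[0] * (HISTORY_LEN + 1) for _ in range(NUM_PERCEPTRONS)]  # [bias, w1..wh]
    history = [1] * HISTORY_LEN   # +1 = taken, -1 = not taken

    def predict(pc):
        w = weights[pc % NUM_PERCEPTRONS]
        y = w[0] + sum(wi * hi for wi, hi in zip(w[1:], history))
        return y, y >= 0          # raw score ("confidence") and taken/not-taken

    def update(pc, taken):
        y, predicted_taken = predict(pc)
        outcome = 1 if taken else -1
        w = weights[pc % NUM_PERCEPTRONS]
        # Train only on a misprediction or when the score isn't confident enough.
        if predicted_taken != taken or abs(y) <= THRESHOLD:
            w[0] += outcome
            for i in range(HISTORY_LEN):
                w[i + 1] += outcome * history[i]
        history.pop(0)
        history.append(outcome)

    # A branch that strictly alternates becomes predictable: the weight tied to
    # the most recent outcome learns a strong (negative) correlation.
    for i in range(200):
        update(0x400, taken=(i % 2 == 0))
    print(predict(0x400))

The "neural net" here is really one dot product per prediction plus an increment/decrement update, which is why it's cheap enough to consider putting in silicon.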
It’d be pretty cool to see an x86 language model in future CPUs. I have no doubt that compute will continue to scale faster than memory access and the relative cost of pipeline stalls will never go down.
Across-the-board improvements from changing register allocators are surprisingly difficult to get. A couple of percent sounds great.
It's usually done heuristically in a fashion that scales badly with how many instructions you're considering at a time, so it's really easy to overfit to today's benchmark.
I'm very familiar with both inlining and register allocation in GCC (I even helped write one of the register allocation rewrite attempts, until vlad went even further down the rabbit hole of writing register allocators than we did)
RA itself has historically been more advanced in GCC in the amount of optimization it does, but it was still mostly a wash, because it's much easier to work with LLVM's form and better optimizers were written.
LLVM also nowadays has a fairly advanced register allocator if you want it (PBQP) - it produces near optimal register allocations. It's not used though. It hasn't been worth it.
For inlining, they are not usefully comparable - they are very different approaches and both are good. During the last inliner rewrite in LLVM, large numbers of comparisons were done, and there isn't anything to suggest that GCC's inliner is really better (unlike RA, where the default GCC algorithm is certainly more advanced than the default LLVM RA, despite that being a deliberate decision).
We spent a lot of time optimizing inlining and RA/etc. code when we transitioned to LLVM at Google (many years ago now).
When we made that transition it was a net positive for the fleet, and it was on a very large codebase.
The truth about compilers is there are no silver bullets.
Things get faster mostly through careful tuning (which ML will hopefully replace), and much more rarely due to better algorithms.
As a result, I would expect GCC to not do any better here than LLVM.
Honestly, no, because it's been decades at this point.
While not as formally written down, this is the most common way that profiling info was ever used in RA.
Using both the execution frequency and use distance to calculate which registers should be spilled and where to spill them was done even in toy research compilers when I was heavy into this part of the world.
I remember seeing implementations in IBM's XLC, at least two of the RA's we implemented in GCC did this, etc.
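For a flavour of that kind of heuristic, here is a hedged toy sketch; the formula and all data are invented for illustration, and real allocators (GCC's IRA, LLVM's greedy allocator) use their own more refined variants:

    # Sketch of a classic spill-cost heuristic: prefer to spill live ranges that are
    # used rarely (low execution frequency) and whose next use is far away.
    from dataclasses import dataclass

    @dataclass
    class LiveRange:
        name: str
        use_freqs: list[float]   # estimated execution frequency of each use/def
        next_use_distance: int   # instructions until the next use at the spill point

    def spill_cost(lr: LiveRange) -> float:
        # Cost of spilling = roughly the reload/store traffic we'd add, discounted
        # by how long the register would otherwise sit occupied before its next use.
        reload_cost = sum(lr.use_freqs)
        return reload_cost / max(lr.next_use_distance, 1)

    def choose_spill_candidate(live: list[LiveRange]) -> LiveRange:
        # Spill the live range that is cheapest to spill (lowest cost).
        return min(live, key=spill_cost)

    live = [
        LiveRange("a", use_freqs=[100.0, 100.0], next_use_distance=2),   # hot, used soon
        LiveRange("b", use_freqs=[1.0],          next_use_distance=40),  # cold, far next use
        LiveRange("c", use_freqs=[10.0, 5.0],    next_use_distance=8),
    ]
    print(choose_spill_candidate(live).name)  # "b"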
Stallman hasn't been relevant to GCC in 10-15 years. If people aren't doing research projects with it, it's because they don't like the code style (which is still weird but not monolithic) or their funders aren't using it in production.
> This code is licensed under a BSD-like license [8], and LLVM itself will not initially be assigned to the FSF. If people are seriously in favor of LLVM being a long-term part of GCC, I personally believe that the LLVM community would agree to assign the copyright of LLVM itself to the FSF and we can work through these details.
The leads of the project offered LLVM to the FSF and the offer was ignored because of Stallman's incompetent approach to email.
What about reproducibility of the compiler binaries? I got the impression that model training itself isn't deterministic across training hardware? Since the model is embedded in the binaries, then the binaries aren't really reproducible if the model isn't. How big is the data used to train the model? How costly is it to train the model?
> The TensorFlow model is embedded with XLA AOT, which converts the model into executable code.
Taking the TF model to executable code should be deterministic.
Generating the TF model might not be (depends on implementation and hardware).
But it's unclear from this whether it's a problem in practice - it's not uncommon for non-deterministic models to end up producing the same output, because you perform thresholding/quantizing or some analogous process to convert it into a classification-style output.
I.e., here you are generating operations, which you can set up as a classification problem: given this input and this history, what is the next operation?
And of course you can always go back to your saved model and generate the code from that.
While the exact scores for the next operation might be non-deterministic, you still end up with the highest-scoring one being the same.
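A tiny made-up illustration of that last point:

    import numpy as np

    # Two copies of "the same" model whose raw scores differ by tiny float noise,
    # e.g. from non-deterministic training or different hardware. Numbers invented.
    scores_run_a = np.array([0.12, 2.71, -0.40, 1.05])
    scores_run_b = scores_run_a + np.array([1e-6, -3e-6, 2e-6, -1e-6])

    # The compiler only consumes the discrete decision (which action to take),
    # so the tiny numeric differences vanish at the argmax/threshold step.
    assert np.argmax(scores_run_a) == np.argmax(scores_run_b)
    print("chosen action:", int(np.argmax(scores_run_a)))  # 1 in both runs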
Provocative title. Function inlining mixes up call overhead elision and specialisation to the call site, especially if the compiler hasn't implemented the specialisation as its own thing.
Inlining by itself only decreases code size if there's provably a single call site. What's doing the work here is branch folding after the inline.
The alternative is specialising the function w.r.t. the predicate and rewriting call sites to the specialisation when the predicate is known.
That's harder than inlining branchy things and hoping for the best, but it's tractable and bounds the code size increase.
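A source-level sketch of the difference, hand-written Python used as pseudocode rather than compiler output:

    # Hand-written sketch of specialising a function w.r.t. a predicate instead of
    # inlining its whole branchy body at every call site. All names are invented.

    def heavy_compress(data):
        return b"compressed:" + bytes(data)   # stands in for a large code path

    def frame(data):
        return b"raw:" + bytes(data)          # stands in for a small code path

    # Original function: branches on the predicate internally.
    def encode(data, compress):
        if compress:
            return heavy_compress(data)
        return frame(data)

    # Specialisations w.r.t. the predicate: two small clones, no branch.
    def encode_compressed(data):
        return heavy_compress(data)

    def encode_plain(data):
        return frame(data)

    # Call sites where the predicate is a known constant get rewritten:
    #   encode(x, True)  -> encode_compressed(x)
    #   encode(x, False) -> encode_plain(x)
    # Each call site then only pulls in the path it actually uses, which is what
    # bounds the code-size growth compared with inlining the full body everywhere.
    print(encode_plain([1, 2, 3]))
    print(encode_compressed([1, 2, 3]))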
I wonder how tied the heuristic is to a particular compiler IR. E.g. can the decision procedure be boiled down to a form that could be plugged into other compilers that need an inlining heuristic?
I remember a while back reading a paper about using superoptimizers to auto-generate peephole optimizations; anyone know if that ever got deployed anywhere?
It seems like it's sort of built into LLVM right now, but not usable unless you build LLVM yourself with pretrained models as part of the build process?
The code will by default auto-download it during the build process. It's about 800 kbytes, which seems very reasonable for something that will reduce the generated code size by gigabytes for a large codebase.
Note that the open-source-ness is dubious... They say it was trained using an optimizer which isn't open source, and that the results are significantly better ('more generalizable') than using the open-source one. In my view, if the code that makes a binary blob isn't open source, then the project isn't open source...
Yes - the default model really needs to be trained with an open-source optimizer on a corpus of open-source code (i.e. with a license at least as permissive as LLVM itself).
A blob trained with proprietary Google tech on a proprietary Google codebase isn't open source. Even if it were, Google C++ differs in style quite widely from typical C++, so the model probably isn't as good as a model trained on all of GitHub.
Isn't this always going to be less efficient than profile-guided optimization?
Of course that assumes you can profile your program with realistic "production" workloads, and it'll need two compilation passes, so having sensible ML defaults makes sense.
But it's odd not to compare this to PGO in terms of the resulting performance.
It's not odd because the result is on top of PGO. You would never adopt this approach if you did not already have PGO. The gains from PGO will be much larger.
As someone who doesn't follow compilers too much, is 3-7% considered good? Throwing a whole ML model into the compiler seems like a pretty large boost to complexity for a small gain in performance.
If you can fit 5% more functionality into a fixed size, that is nice. Especially if you are only using this optimization level for the final production build that needs to be squeezed onto a small (and therefore cheaper) ROM.
Although I do agree that these days most projects will have media that takes up the majority of the space.
The stat that seems clearly impressive for me is that their register allocator claims "0.3% ~1.5% improvements in queries per second" which is a huge cost savings for operations at the scale of Google. If you have 100 datacenters of software running you can conceivably turn one of them off. (Or more likely slow down future building and expansion plans). Of course for most people compute costs aren't a significant expense.
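Back-of-the-envelope on that claim, with made-up fleet numbers rather than anything from Google:

    # A fleet-wide 1% QPS gain means the same load fits in ~1/1.01 of the capacity.
    datacenters = 100
    qps_gain = 0.01
    freed = datacenters - datacenters / (1 + qps_gain)
    print(f"~{freed:.1f} datacenters' worth of capacity freed")  # ~1.0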
Is this the start of a future where we can write high-level code (Idris, Agda, Coq) and the resulting code will run as fast (and as safely) as Rust? Interesting.
- inline-for-size
- regalloc
With code, that's awesome - what I like to see.