Great to see this open sourced, and more work in the FDO/PGO space. I also like the access visualization heatmap: I'd built something similar before but never had a compelling use for it, and showing how it improves access locality is killer.
It seems unfortunate that they developed their own data format for the input. Why can't it be in the same format that SamplePGO ingests? It also seems unfortunate to add yet another stage to the toolchain. We already have either build, run+profile, rebuild with PGO, link with LTO; or build with SamplePGO, link with ThinLTO; and this adds a second or third rebuild. It already takes a phenomenally long time to compile a large C++ application, and another pass isn't going to make it shorter.
There's no need for another build, as BOLT runs directly on a compiled binary and could be integrated into an existing build system. Operating directly on machine code allows BOLT to boost performance on top of AutoFDO/PGO and LTO. Processing the binary directly requires a profile in a format different from what a compiler expects, since the compiler operates on source code and needs to attribute the profile to high-level constructs.
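For context, here is roughly how BOLT slots into an existing build, based on the commands in its README at the time it was open sourced (`myapp` is a placeholder, and exact flag spellings vary between BOLT versions, so check `llvm-bolt --help`):

```sh
# 1. Build and deploy the binary as usual (ideally linked with --emit-relocs).

# 2. Profile it under a representative load using hardware LBR samples:
perf record -e cycles:u -j any,u -o perf.data -- ./myapp

# 3. Convert the perf profile into BOLT's own format:
perf2bolt -p perf.data -o perf.fdata ./myapp

# 4. Rewrite the binary in place -- no recompilation of source involved:
llvm-bolt ./myapp -o myapp.bolt -data=perf.fdata \
    -reorder-blocks=cache+ -reorder-functions=hfsort+ \
    -split-functions=2 -split-all-cold -dyno-stats
```

Step 4 is the whole "extra stage": it is a relink-speed binary rewrite, not another compile of the source tree.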
So what exactly would BOLT help with? I'm guessing mainly junky code or things a compiler cannot possibly optimize. Could you run BOLT on something that is known to be well optimized?
If BOLT can work directly on binaries, is there anything stopping it from being integrated as a kernel module into the OS, so that binaries are continually being profiled and optimized?
It seems to me that an optimized OS image could also be created.
That doesn't seem like it would be profitable. Profiling the running process, processing the profile, and then relinking the binary on a single host wouldn't pay off compared to profiling in the large, relinking the program once and redeploying it at scale. Peak optimization is expensive.
My phone runs approximately zero new binaries every day. I'd happily let my phone optimize itself while charging so that netflix, youtube, or other heavy CPU users use less battery.
I'm headed to work now, but I'll try Opera 12 and the Oracle JVM later. As for the latter, I'm curious how much the running application will influence the final optimizations applied. Since you used it on HHVM, did you see anything interesting regarding this, or does it always optimize towards the same optimal layout?
AutoFDO is exactly that. Facebook's excuse to develop this is that AutoFDO didn't work well with HHVM, but that sounds like an AutoFDO bugfix, not a whole new project. I agree with you that it is better to work on AutoFDO, because it will see wider usage.
In many cases BOLT complements AutoFDO. AutoFDO affects several optimizations, and code layout is just one of them. Another critical optimization influenced by AutoFDO/PGO is function inlining. After inlining, a callee's code profile is often different from the "pre-inlined" profile seen by the compiler, which prevents the compiler from making optimal layout decisions. Since BOLT observes the code after all compiler optimizations, its decisions are not affected by context-sensitive inlined function behavior.
Their reasoning is interesting: they use exceptions. Google notoriously does not. I wonder if Google (and the other AutoFDO contributors, of course) have overlooked something crucial that the exception-using people need.
Great to see this open sourced. Front-end bottlenecks from I$ misses are a big problem for several managed runtimes (Node, Java, .NET, etc.). This works on the executable; the JITed code will still need to do its own layout to optimize.
bzip2 isn't very large, and they seem to believe they speed that up too. I don't see why this wouldn't speed up small things. Whether it's a big program or a small one, it probably has hot code and cold code, and organizing the code in such a way that the hot stuff is all together is likely to help. Same goes for making sure you don't have to take branches, etc.
On modern x86_64 implementations, tiny changes in code layout can have large effects. There are plenty of stories of "load-bearing NOPs" that mysteriously make programs several percent faster. Even experts do not fully understand these effects. I'd like to hear more about the effect on various programs, not just bzip2.
x86_64, and some extensions, yes. However, as you may be aware, most interpreted code is hosted by a VM that runs compiled code. :) FB's HHVM is one such VM.
Could this conceivably be used to speed binaries like nginx? I tend to compile nginx on my systems as over time there are always custom modules I want to add/modify anyway.
This helps if and only if you suffer from instruction starvation. Given that the nginx binary is around 1 MB, which fits comfortably inside the cache, it is very unlikely BOLT would help with nginx.
I don't think that is true. Even without memory bandwidth constraints, jumping to instructions that aren't in cache is going to incur memory latency. If the instructions are all packed together, the next instructions could be prefetched and already be in cache.
Also, lower cache levels have lower latency yet are much smaller; the L1 instruction cache is typically 32 KB. Any linear access of memory will prefetch and minimize the latency of memory access.
AMD's Zen architecture uses 32 KB for data and 64 KB for instructions (I was curious whether there are differences between AMD and Intel designs regarding L1 cache).
This is an interesting solution to avoid microservices.
Although it doesn't apply to many, one of the advantages of microservices is avoiding exactly the problem they are solving -- that the monolith application is too big for memory, too big for CPU cache, etc.
I like this alternate solution -- it involves a lot of up-front work and will surely require a lot of maintenance, but so does a complex microservices architecture.
It'll be interesting to see how this holds up in a few years.
Huh? Having a big monolith service binary doesn't mean that the whole binary gets paged in from disk all the time for every request. If you're specializing and dedicating particular instances to particular tasks, you'll get cache-mediated locality even if you have a big binary that can, in principle, do lots of different things.
> doesn't mean that this whole binary gets paged in from disk all the time for every request
No, but unless I'm reading this wrong, what this optimizes is what does get paged in.
Using microservices would naturally limit what could get paged in because the services would most likely be smaller, whereas with this, the developer doesn't have to worry about those things and instead this tool optimizes for them.
It's not really much to do with what gets paged in, more to do with instruction cache locality and all sorts of tiny micro-optimisations, especially relating to branches. If re-ordering code or adding branch-expected markers causes the branch predictor to guess correctly more often, it can have a huge benefit.
In my experience, if you take one big service and break it into two, each binary will be pretty much just as big as the original. All of that library code is still in there; all that static data is still reachable. Nobody knows why, but it is.
It doesn't feel like microservices address the same problem. For one thing, if you have a CPU bound task, splitting it out into two different servers is often going to worsen the problem, as now you have network communication overhead to contend with on top of everything else.
Having worked at large companies like these, it's clear that many of their own applications never use these technologies. Why then is FB investing in such technology? Where do they plan to use it?
"Many" isn't "all": in many places this stuff wouldn't move the needle, but it might be crucial somewhere else. Also, the expertise to do these things doesn't evaporate once they're done.
The bigger problem I have with a lot of the stuff the Facebooks of the world put out is that most of us simply don’t need it now and maybe never.
I hear too much "but big company Foo uses it!" Well, yeah, but Foo can hire the creator and half a dozen engineers to maintain it; you have two developers.