Great to see this open sourced, and more work in the FDO/PGO space. I also like the access visualization heatmap: I'd built something similar before but never had a compelling use for it, and showing how it improves access locality is killer.
It seems unfortunate that they developed their own data format for the input. Why can't it be in the same format that SamplePGO ingests? It also seems unfortunate to add yet another stage to the toolchain. We already have either build, run+profile, rebuild with PGO, link with LTO; or build with SamplePGO, link with ThinLTO; and this adds a second or third rebuild. It already takes a phenomenally long time to compile a large C++ application, and another pass isn't going to make it shorter.
There's no need for another build, as BOLT runs directly on a compiled binary and could be integrated into an existing build system. Operating directly on machine code allows BOLT to boost performance on top of AutoFDO/PGO and LTO. Processing the binary directly requires a profile in a format different from what a compiler expects, since the compiler operates on source code and needs to attribute the profile to high-level constructs.
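For context, here is roughly how BOLT slots into an existing build, based on the commands in its README at the time it was open sourced (`myapp` is a placeholder, and exact flag spellings vary between BOLT versions, so check `llvm-bolt --help`):

```sh
# 1. Build and deploy the binary as usual (ideally linked with --emit-relocs).

# 2. Profile it under a representative load using hardware LBR samples:
perf record -e cycles:u -j any,u -o perf.data -- ./myapp

# 3. Convert the perf profile into BOLT's own format:
perf2bolt -p perf.data -o perf.fdata ./myapp

# 4. Rewrite the binary in place -- no recompilation of source involved:
llvm-bolt ./myapp -o myapp.bolt -data=perf.fdata \
    -reorder-blocks=cache+ -reorder-functions=hfsort+ \
    -split-functions=2 -split-all-cold -dyno-stats
```

Step 4 is the whole "extra stage": it is a relink-speed binary rewrite, not another compile of the source tree.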
So what exactly would BOLT help with? I'm guessing mainly junky code or things a compiler cannot possibly optimize. Could you run BOLT on something that is known to be well optimized?
If BOLT can work directly on binaries, is there anything stopping it from being integrated as a kernel module into the OS, so that binaries are continually being profiled and optimized?
It seems to me that an optimized OS image could also be created.
That doesn't seem like it would be profitable. Profiling the running process, processing the profile, and then relinking the binary on a single host wouldn't pay off compared to profiling in the large, relinking the program once and redeploying it at scale. Peak optimization is expensive.
My phone runs approximately zero new binaries every day. I'd happily let my phone optimize itself while charging so that netflix, youtube, or other heavy CPU users use less battery.
I'm headed to work now, but I'll try Opera 12 and the Oracle JVM later. As for the latter, I'm curious how much the running application will influence the final optimizations applied. Since you used it on HHVM, did you see anything interesting regarding this, or does it always optimize towards the same optimal layout?
AutoFDO is exactly that. Facebook's excuse to develop this is that AutoFDO didn't work well with HHVM, but that sounds like an AutoFDO bugfix, not a whole new project. I agree with you that it is better to work on AutoFDO, because it will see wider usage.
In many cases BOLT complements AutoFDO. AutoFDO affects several optimizations, and code layout is just one of them. Another critical optimization influenced by AutoFDO/PGO is function inlining. After inlining, a callee's code profile is often different from the "pre-inlined" profile seen by the compiler, which prevents the compiler from making optimal layout decisions. Since BOLT observes the code after all compiler optimizations, its decisions are not affected by context-sensitive inlined function behavior.
Their reasoning is interesting: they use exceptions. Google notoriously does not. I wonder if Google (and the other AutoFDO contributors, of course) have overlooked something crucial that the exception-using people need.
Great to see this open sourced. Front-end bottlenecks from I$ misses are a big problem for several managed runtimes (Node, Java, .NET, etc.). This works on the executable; the JITed code will still need to do its own layout to optimize.
bzip2 isn't very large, and they seem to believe they speed that up too. I don't see why this wouldn't speed up small things. Whether it's a big program or a small one, it probably has hot code and cold code, and organizing the code in such a way that the hot stuff is all together is likely to help. Same goes for making sure you don't have to take branches, etc.
On modern x86_64 implementations, tiny changes in code layout can have large effects. There are plenty of stories of "load-bearing NOPs" that mysteriously make programs several percent faster. Even experts do not fully understand these effects. I'd like to hear more about the effect on various programs, not just bzip2.
x86_64, and some extensions, yes. However, as you may be aware, most interpreted code is hosted by a VM that runs compiled code. :) FB's HHVM is one such VM.
Could this conceivably be used to speed binaries like nginx? I tend to compile nginx on my systems as over time there are always custom modules I want to add/modify anyway.
This helps if and only if you suffer from instruction starvation. Given that the nginx binary is around 1 MB, which fits comfortably inside the cache, it is very unlikely BOLT would help with nginx.
I don't think that is true. Even without memory bandwidth constraints, jumping to instructions that aren't in cache is going to incur memory latency. If the instructions are all packed together, the next instructions could be prefetched and already be in cache.
Also, lower cache levels have lower latency yet are much smaller; the L1 instruction cache is typically 32 KB. Any linear access of memory will prefetch and minimize the latency of memory access.
AMD's Zen architecture uses 32 KB for data and 64 KB for instructions (I was curious whether there are differences between AMD and Intel designs regarding L1 cache).
This is an interesting solution to avoid microservices.
Although it doesn't apply to many, one of the advantages of microservices is avoiding exactly the problem they are solving -- that the monolith application is too big for memory, too big for CPU cache, etc.
I like this alternate solution -- it involves a lot of up-front work and will surely require a lot of maintenance, but so does a complex microservices architecture.
It'll be interesting to see how this holds up in a few years.
Huh? Having a big monolith service binary doesn't mean that the whole binary gets paged in from disk all the time for every request. If you're specializing and dedicating particular instances to particular tasks, you'll get cache-mediated locality even if you have a big binary that can, in principle, do lots of different things.
> doesn't mean that this whole binary gets paged in from disk all the time for every request
No, but unless I'm reading this wrong, what this optimizes is what does get paged in.
Using microservices would naturally limit what could get paged in because the services would most likely be smaller, whereas with this, the developer doesn't have to worry about those things and instead this tool optimizes for them.
It's not really much to do with what gets paged in, more to do with instruction cache locality and all sorts of tiny micro-optimisations, especially relating to branches. If re-ordering code or adding branch-expected markers causes the branch predictor to guess correctly more often, it can have a huge benefit.
In my experience, if you take one big service and break it into two, each binary will be pretty much just as big as the original. All of that library code is still in there; all that static data is still reachable. Nobody knows why, but it is.
It doesn't feel like microservices address the same problem. For one thing, if you have a CPU bound task, splitting it out into two different servers is often going to worsen the problem, as now you have network communication overhead to contend with on top of everything else.
Having worked at large companies like these, it's clear that many of their own applications never use these technologies. Why then is FB investing in such technology? Where do they plan to use it?
"Many" isn't "all": in many places this stuff wouldn't move the needle, but it might be crucial somewhere else. Also, the expertise to do these things doesn't evaporate once they're done.
The bigger problem I have with a lot of the stuff the Facebooks of the world put out is that most of us simply don’t need it now and maybe never.
I hear too much "but big company Foo uses it!" Well, yeah, but Foo can hire the creator and half a dozen engineers to maintain it; you have two developers.