Delete an inline function, save 794 kB

ot · on Jan 23, 2017

Another (obvious) optimization is to make functions that have to be inline smaller. For example, std::string is used ubiquitously in C++ programs; this change shaved a few MBs of .text from a bunch of large services at FB: https://github.com/facebook/folly/commit/be4c6d6b3e21914df8a... .

minimax · on Jan 23, 2017

That is a smart optimization but it does make you wonder why were they using a size_t to hold the category (only 3 possible values) in the first place...

ot · on Jan 23, 2017

You should check out how SSO (small string optimization) is implemented. size_t is actually the natural choice, because it is shared with the capacity (or with part of the string in the small case).

minimax · on Jan 23, 2017

Ah I see. In size() and c_str(), strictly speaking it's not the category constants that become 8 byte immediates but rather the category mask. Since the category constants and masks are all compile time constants I wonder if a smarter compiler could have performed this optimization for you.

ot · on Jan 23, 2017

Exactly. And yes, a "sufficiently smart compiler" [1] could have discovered the optimization, the old and new methods are semantically equivalent. It would be interesting to test a superoptimizer on this, but I suspect it wouldn't go very far because superoptimizers usually operate at the basic block level (this has a branch) and optimize for number of instructions, not bytes.

[1] http://wiki.c2.com/?SufficientlySmartCompiler

brucedawson · on Jan 23, 2017

Nice. That 64-bit 0x1700000000000000 constant is pretty painful to see.

dzdt · on Jan 23, 2017

With all the work on compilers and programming languages, doesn't it seem like linker and loader technology is ripe for a major overhaul?

This kind of "chaos theory" linker crap is ridiculous.

rwmj · on Jan 23, 2017

For loaders definitely yes! On my local Fedora laptop, qemu takes about 60ms to simply run 'qemu-system-x86_64 -version'. Almost all of the time is taken up in glibc's loader, resolving the ~170 shared libraries.

While 60ms may not sound like a lot, I've been trying to get qemu boot times down to the hundreds of milliseconds (for lightweight VMs and sandboxing). It's been quite successful, but 60ms is now a significant chunk of the total boot time.

Edit: Sure I'm aware I could compile my own custom qemu, or statically link, but that's really not the point. I want qemu to boot very fast for everyone, and each of those shared libraries is a feature that someone, somewhere wants.

geofft · on Jan 23, 2017

The problem is that the glibc loader has to look through each of those 170 libraries for every single symbol.

On Windows, OS X, Solaris, and probably lots of other OSes, the executable format supports saying "This function needs g_string_append from libglib-2.0.so.0" and "That function needs SDL_Init from libsdl-1.2.so.0". The GNU dynamic linker (and the BSD dynamic linker, incidentally) only supports saying "Hi, I'd like libglib-2.0.so.0 and libsdl-1.2.so.0 at some point; okay, now I'd like g_string_append and SDL_Init". This means the linker has to go looking in SDL_Init in GLib, despite everyone knowing that SDL_Init is going to be in SDL and nowhere else.

Over a decade ago, the OpenOffice.org developers found that it started faster on Windows than on GNU/Linux, and also started faster via WINE than natively on GNU/Linux, and that implementing direct binding would cut startup time on GNU/Linux from about four seconds to two: https://sourceware.org/ml/binutils/2005-10/msg00436.html

It was rejected out-of-hand by Ulrich Drepper because prelink exists, even though prelink didn't actually solve the problems.

Now that Ulrich hasn't been involved with glibc for a few years, and prelink seems to be dead, it's probably time for someone to revive that patchset.

acqq · on Jan 23, 2017

> It was rejected out-of-hand by Ulrich Drepper

That explains quite a bit.

Even without him I still consider a lot of Linux development attitude to be "har, har, the noob doesn't know that he has to hexagonize the fongebangler in the WERTR_FREW file before he can use the Backspace key, no we aren't going to change the defaults to match the 99.9999% of the keyboards of the world."

I never understood how they enjoy hexagonizing all the fongebanglers every time they install all their systems. Because they surely have to. And don't make me started about these who develop Linux GUIs.

geofft · on Jan 23, 2017

It's a poisonous culture, where it's more important to look smart than to be smart. Note that Ulrich's answer was not just completely self-assured but also technically incorrect, that he didn't respond to the polite questions about his answer, and that prelink eventually turned out to be a big misfeature.

I suspect they're quietly pentagonizing all the fongebanglers every time they install their systems, confused why there are only five points instead of six, and too worried about their reputation to suggest that maybe someone should make fongebanglers automatically figure out the right polygon, because then they'd they'll admit they don't know actually how to fix it themselves.

acqq · on Jan 24, 2017

Yes, thanks, only his answer to the post you linked to is more than enough to demonstrate how wrong it was allowing him to be responsible for anything. How many years did he manage to block the improvements?

To get back to the topic we discuss: the patch sent there (and rejected by UD) speeds up loading everywhere glibc is used! And that's the pure CPU speedup, no matter how fast SSD you have!

Reading the question of Michael Meeks, it was just a proof of concept, maybe the whole mechanism can be improved even more, he comments his own work:

"Is there a better way to achieve what I do in dl-deps.c? the umpteen-string compares are clearly highly evil"

Anybody knows the status of uclibc? How do they do it? If nobody knows, anybody willing to measure how much the same version of OO needs on the same hardware with glibc and uclibc?

JoeAltmaier · on Jan 23, 2017

Isn't it a feature that you can replace an implementation at link time? E.g. implement your own g_string_append or whatever. If all symbols were bound to libraries, that would be impossible.

geofft · on Jan 23, 2017

Yes, that's a feature, with LD_PRELOAD or similar. But you don't need to support SDL being able to arbitrarily replace GLib symbols, without opting in to such a replacement. Two options:

1. Distinguish preloaded libraries from normal libraries (IIRC, glibc doesn't really do this). Conduct searches in the list of preloaded libraries, followed by the specific library where the program thinks it is. In the normal case, where LD_PRELOAD is empty, this adds no overhead and you still get to look through 1 library instead of 170. In the most common abnormal case, you still only look at 2 libraries instead of 171.

2. Have libraries explicitly say "I am replacing this symbol from this library", instead of just happening to have a name collision and be loaded first. OS X does this via the interpose section: the format is an array of structures of two pointers, the new function and the old one. The old "pointer" is usually a dynamic relocation, which means all the information in a dynamic relocation is present, including info about which library you're interposing on. The linker looks at all interpose sections of libraries at startup (again, in the common case, this is empty) and applies those overrides when it's about to resolve something to the old function.

Mac OS X Internals has an example of writing an interposing library: https://books.google.com/books?id=K8vUkpOXhN4C&lpg=PA73&ots=...

the_why_of_y · on Jan 23, 2017

It is a feature that is useful for many standard C library functions, a few functions in other commonly used libraries, and next to no functions in application libraries in cases like the parent's OOo example.

A reasonable approach to designing shared libraries would be to allow interposition as a non-default feature for explicitly marked symbols, and use fast direct binding by default.

acqq · on Jan 23, 2017

For the case you describe, I can imagine that it could be possible to implement on the system level some caches of the necessary module info for any application that is started, which would be valid at least until something new is installed on the system and in case of somebody developing something, he'd have to disable it for the application he's testing.

rwmj · on Jan 23, 2017

That's what "prelink" did, but IMHO the cure was worse than the disease. prelink would modify all your executables in /usr/bin, breaking them in some cases (for example if they contained any non-standard ELF section). It also required invasive changes in RPM and SELinux. Prelink is dead upstream and was dropped from Fedora a few releases back.

comex · on Jan 23, 2017

For reference, macOS has something like prelink in the form of the dyld shared cache, where a daemon just links together every shared library in the standard system directories and stores the result in a separate cache file. Then that file is essentially mmapped into every process on the system, so the dynamic linker has very little work to do at process startup. (For whatever reason, executables aren't included, so it still has to link those, but that's just an implementation quirk.) This works quite well, though it wastes a ton of disk space (essentially duplicating each library on disk).

Too bad macOS is really slow to start processes anyway, compared to Linux, because the kernel sucks.

acqq · on Jan 23, 2017

> the kernel sucks

I can imagine that there are surely some weak points, still I'd like to know if you can or will provide some specifics about macOS' kernel suckiness which are technical enough. I expect that you can as you wrote about dyld shared cache. Thanks.

delinka · on Jan 23, 2017

It seems we know the problem and its solution:

> Ideally the linker would notice the ODR violation, but doing so ends up being expensive – you have to look into every object file to see if any of functions are defined in different ways, and the definition of ‘different’ is not obvious.

For this overhaul you mention, we just need to eliminate "undefined behavior" from our C and C++ implementations. Just. Or permit that [time-]expensive operation to run. Or create some new languages that specify everything so that there is no "undefined behavior."

For the first option (removing undefined behavior), every attempt is met with code that relies on a particular compiler's implementation of "undefined" for a particular operation. For the second (run the expensive operation), the option needs to be available in the linker (I have no idea whether it is) and programmers need to have the patience to let the tools run. For the third, see all the new languages upon which much work is done, especially in the LLVM community (Julia, Rust, Swift).

I believe the renaissance you want is underway. Just not in C/C++ land.

dfox · on Jan 23, 2017

The problem with generating ODR violation diagnostics is that it requires the linker to compare all the duplicate definitions to each other not on bit-by-bit basis, but somehow discerning whether they are generated from same input source or functionally equivalent, without embedding some hairy metadata in the object file this seems like something that reduces to halting problem.

This is also one of the many problems that simply ceases to exists once you move away from the C/C++ model of "preprocessor as module system".

gpderetta · on Jan 23, 2017

I believe that GCC, when running on LTO mode, embeds exactly this kind of 'hairy metadata' (IIRC the gimple code for the function) and is capable in principle of identifying ODR violations. I have to try this capability myself though, so I don't know how effective it is.

edit: reading the docs again, it seems that it only detects ODR violations of the layout of types (and vtables).

brucedawson · on Jan 23, 2017

I seem to remember that VC++ has a (possibly undocumented?) option to detect ODR violations. It might only work on LTCG builds where functions are stored in an intermediate representation, prior to optimization.

But, I can't find any references to it, so maybe I am imagining.

BudVVeezer · on Jan 23, 2017

Is it the /d1reportSingleClassLayoutXXX flag, described here?

https://blogs.msdn.microsoft.com/vcblog/2007/05/17/diagnosin...

dfox · on Jan 23, 2017

This detects differing structure layouts between compilation units which is simple to do given some metadata in the object files. On the other hand it also shows why detecting ODR violations for executable code is non-trivial, because same source code compiled with different compiler flags can lead to different output that might or might not be compatible (enforcing same compiler configuration for all compilation units is certainly non-starter, think JS runtime vs. FFmpeg, 3D renderer vs. physics simulation...).

valarauca1 · on Jan 23, 2017

    With all the work on compilers and programming languages, 
    doesn't it seem like linker and loader technology is ripe 
    for a major overhaul?

Loaders?

Not really that just puts _text_ (raw executable machine code, and binary data) at certain locations in memory and updates some other _text_ to ensure branches/functions point to the right location. This isn't rocket science it works well. All in all loaders are very fast.

Linkers?

I would like to think so. It is just a damn complex problem. The real issue here is legacy support. Object files were a _hack_ to make compilers faster, use less resources, and give old school incremental-recompilation.

To make linkers faster, better, etc. one really needs to move away from opaque object files. I like what the LLVM is doing with LTO but this _kind of_ assumes you have all the rs/c/cpp files your compiling so you can see everything in llvm-ir format.

The real issue is build tools.

Modern C/C++ have a high level tool orchestrate compiling/moving object files/linking object files. In some total you need half a dozen different programs to do all these task (cc, cp, rm, make, sed/awk, ln). Each tool has no clue what everything else is doing.

You need to have that context for the linker to do it's job better. But now the linker has to do the job of all those tools :|

dfox · on Jan 23, 2017

What strikes me as weird is that I know of exactly one compiler system that got the incrementally compiled modules problem (almost) right and at the same time is AOT compiler without huge runtime: Turbo Pascal. The intermediate files contained both compiler readable definitions as well as object code, the linker was able to link only the required parts without resorting to hacks such as having each function in separate source file (see glibc) and the toolchain was able to generate missing intermediate files even without having to explicitly name them on command line. FPC implements the same mechanism by splitting the compiler metadata and object file into separate on-disk files, which seems like good compromise.

The problem with such approach is that the compiler frontends have to know about each other and produce/consume compatible metadata. For C (and Fortran and...) you an get away without sharing any metadata between separate compilation units and when you cannot (constants, structure layouts) CPP is workable and simple solution, this ignores the problem of inline functions defined in headers, but there is not that much C code that does that.

rwmj · on Jan 23, 2017

OCaml does much the same, but it embeds the meta-information in a .cmx file which sits alongside the .o file. It allows cross-module inlining of single functions for example.

OTOH OCaml has another problem which is that the format of the .cmx file is just a dump of internal OCaml compiler structures so it changes on every compiler version. (You do get an error message, not a crash, but you still end up having to compile everything with the same compiler version)

valarauca1 · on Jan 23, 2017

     What strikes me as weird is that I know of exactly one 
     compiler system that got it right (...) Turbo Pascal.

Long Live Turbo Pascal.

     when you cannot (...) CPP

I really see no way forward to be able to modernize CPP build systems. CPP has invented it's own paradigm for how a build system should work. Unless square 1 is "support CPP" making something support cpp is almost impossible.

dfox · on Jan 23, 2017

cpp(1) is not where the paradigm stems from, the same paradigm was there before and was there even before C itself. The whole idea of preprocessor stems from assembly programming and the idea that having one file with lines like "PORTB .equ 0x17" and reusing it is useful, in the same spirit the way how ld(1) works was/is useful for assembly programming (you can jump arbitrarily between notionally different functions, and in fact GCC is perfectly capable of generating object files that do that from C/C++ code). On the other hand, cpp(1) is useful for more things than just C/C++, there are even programs which read their configuration by means of popen("cpp /etc/foo").

And as for supporting ld(1) and unix toolchain in general, see my remark above about how FPC works, IIRC MSVC compiller does something similar for C++, albeit in more limited manner (stdaux.h and such stuff).

rwmj · on Jan 23, 2017

> Not really that just puts _text_ (raw executable machine code, and binary data) at certain locations in memory and updates some other _text_ to ensure branches/functions point to the right location. This isn't rocket science it works well. All in all loaders are very fast.

Really, they are not. In fact resolving symbols in ELF shared libraries is inherently complex and slow. Pick any program which is linked to lots of shared libraries and try running 'time program --version'. 'qemu-system-x86_64 -version' takes 60ms on my laptop.

valarauca1 · on Jan 23, 2017

I have in fact written my own ELF parser. I was speaking more to static libraries/binaries which once parsed are only a handful of pointer jumps/memset calls.

Dynamic segments are very complex :|

The load time for Dynamic libraries isn't really on the _loader_. You need to do disk searches/file IO to load the libraries parsing MANY additional files.

Starting a program that has 20 dynamic links isn't starting a program. It is starting 20. The bottle neck is not parsing, it is File IO.

vetinari · on Jan 23, 2017

There is another thing that slows down the symbol resolution: the symbols are not scoped per library or soname, they are global and so every symbol is searched in every linked library. Link few C++ libraries, the symbol count explodes and the runtime linking slows downs. Large C++ apps, like OpenOffice.org, used to have a problem with this.

The I/O may or may not be a bottleneck. The libraries are being mmaped and if some other process is already using them, chances are, that the interesting parts are already paged in.

brucedawson · on Jan 23, 2017

Whenever I hear about how loaders resolve symbols on Linux I am amazed that it works at all, and that it runs reasonably fast.

The loader on Windows has a much simpler job - it imports specific symbols from specific DLLs. In many cases the importing is done by ordinal instead of by name, to make it even faster.

I'm used to the Windows system so I don't miss the extra functionality that the Linux loader provides - in fact I find it scary.

geofft · on Jan 23, 2017

You don't need to. That's what tools like prelink are for: you calculate it once, when the program is installed, and you don't need to scan 20 files at program start every time, since the previous results of dynamic linking are stuffed inside the main program's binary.

Rethink what the role of a linker and a loader is, and you can avoid a lot of these problems. These systems were designed in a very different era of computing.

valarauca1 · on Jan 23, 2017

Good point!

barrkel · on Jan 23, 2017

There's not much difference between a linker and a loader; they do almost the same thing, one persisting the result to disk and the other to memory. I do notice that your definition of a loader does not support shared libraries, which is most of what a loader does outside of a toy OS.

Linkers are also similar to garbage collectors; smart linkers effectively eliminate bloat by garbage collecting the code.

oso2k · on Jan 23, 2017

Google's made their attempt a few years ago. It's called GoLD, or, gold [0]. However, its key benefit is in lower link time processing of object files and libraries. Important if you're building apps that take minutes or hours to build every day in a CI/CD system. Another example is browser source branches (10s of 1000s object files, libraries, and objects). Key innovation was to not use GNU's BFD [1]. BFD is a little like llvm's IR but for linkers that support multiple architectures, multiple executable formats (raw like DOS COM binaries images, ELF, COFF, DWARF, etc), and multiple executable scenarios (static vs. shared libs). Instead, Google focused exclusively on ELF (which is almost 100%, universally the standard on the major *nixes like Linux & BSD), and the few most popular architectures (i386, amd64, arm, ppc).

gold was later contributed to and can now be found as a part of GNU binutils.

[0] https://en.wikipedia.org/wiki/Gold_(linker)

[1] https://en.wikipedia.org/wiki/Binary_File_Descriptor_library

geofft · on Jan 23, 2017

gold is a good linker for the existing formats, but there are a number of systematic improvements that can be made, ranging from -Bdirect / two-level namespace support (see my other comment) to just rethinking how programs get compiled and linked in the first place.

Rust, for instance, bypasses a lot of this complexity by not supporting things like .o files: an entire set of Rust source files is compiled at once, so optimizations can be applied at source level across everything. The downside is higher compile times and memory usage, especially when making one change to one file. There are almost certainly better solutions here.

oso2k · on Jan 26, 2017

I really wish we had TP-style modules [0] information in our object file formats. What seems so simple conceptually is nearly impossible to replicate?

Can't figure out why C, C++, Java or C# never got there and why JS is puking all over the concept with half-baked, crappy syntactic look-alike (but not work work-alike) modules. Ughh...

[0] http://prog21.dadgum.com/47.html

JoeAltmaier · on Jan 23, 2017

tl;dr: modules using inline fns (even library fns like log2f) generate a static version too; such modules can be linked in if those fns are used anywhere else even if those modules are not needed themselves.

brucedawson · on Jan 23, 2017

Prior to VC++ 2013 the VC++ compiler would generate non-inline versions of all inline functions it saw, even if they weren't used. Even now the option to strip out these unreferenced inline functions (/Zc:inline) is off by default, because it breaks a few non-conforming programs.

So, at least VC++ isn't generating these non-inline versions as aggressively as it used to.

sklegg · on Jan 23, 2017

> 2017

> still using compiled languages

/s