> Static linking gives you better instruction cache utilization as you are executing local code linearly rather than going through indirection with more boilerplate.
No, it does not, it worsens it.
For example, «strlen», if it comes from a dynamic library, will be loaded into physical memory once and only once, and it will be mapped into each process's address space as many times as there are processes. Since «strlen» is a very frequently used function, there is a very high chance that the page will remain resident in memory for a very long time, and, since the physical page is resident, there is also a very good chance that it will remain resident at least in the L2 cache – and, depending on circumstances, in the L1 cache, too. In specific circumstances a TLB flush might not even be necessary, which is a big performance win. It is a 1:N scenario.
With static linking, on the other hand, if there are 10k processes in the system, there will be 10k distinct pages containing «strlen» loaded into memory at 10k random addresses. It is an M:N scenario. Since the physical memory pages are now distinct, context switching will nearly always require the TLB to be flushed, which is costly or very costly, and there will be more frequent L1/L2 cache evictions because «strlen» now resides at 10k distinct physical memory addresses.
P.S. I am aware that C compilers now inline «strlen» so there is no actual function call, but let's pretend that it is not inlined for the sake of the conversation.
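For concreteness, here is a minimal sketch of what that P.S. refers to (the file name and flags are purely illustrative): with a default GCC build the constant-string call disappears entirely, and -fno-builtin-strlen brings the real call back.

```c
/* strlen_builtin.c — a minimal sketch of the inlining the P.S. refers to.
 * With a default build, GCC treats strlen() as __builtin_strlen and folds
 * the constant-string case at compile time, so no call is emitted at all:
 *
 *   gcc -O2 -S strlen_builtin.c                      # look for "call strlen"
 *   gcc -O2 -fno-builtin-strlen -S strlen_builtin.c  # now the call is back
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Constant argument: the builtin folds this to the literal 13. */
    size_t a = strlen("hello, world!");

    /* Non-constant argument: whether this stays a library call or gets
     * expanded inline depends on the compiler, target and options. */
    char buf[32];
    if (fgets(buf, sizeof buf, stdin))
        printf("%zu %zu\n", a, strlen(buf));
    return 0;
}
```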
The compiler has built-ins for parts of libc exactly because dynamic linkage is ridiculous for performance, but it cannot statically link your program against a dynamic libc. The built-ins are a hack to make a dynamically linked libc have at least somewhat acceptable performance.
If your libc was statically linked, you would not need the built-in - the strlen impl from your libc would get inlined by LTO.
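A rough sketch of that setup, with the caveat that it only works if the static libc itself carries LTO bitcode – glibc's normally doesn't, so the musl-gcc invocation below is an assumption about your toolchain, not a recipe.

```c
/* lto_strlen.c — a rough sketch of the "static + LTO" setup described above.
 * This only helps if your libc ships LTO-capable static archives; a musl
 * toolchain rebuilt with -flto is the usual way to try it:
 *
 *   musl-gcc -O2 -flto -static -fno-builtin lto_strlen.c -o lto_strlen
 *
 * With -fno-builtin the compiler's own expansion is out of the picture,
 * so any cross-module inlining of strlen has to come from LTO itself.
 */
#include <stdio.h>
#include <string.h>

size_t count(const char *s)
{
    return strlen(s);   /* candidate for cross-module inlining under LTO */
}

int main(int argc, char **argv)
{
    printf("%zu\n", count(argc > 1 ? argv[1] : ""));
    return 0;
}
```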
The chances of a particular routine being in L1 are absolutely minuscule - it's hard enough to keep a single process and its data in L1 and L2. What might happen is that you find it in L3, but:
1. The code you're loading is now much larger (fitting less well in L1, so you'll get more L1 misses) and slower (cache aside, it has indirection overhead and has not been LTO'd for this use).
2. The inlined version would probably also be found in L3 - either resident, or prefetched along with that section of the executable, which obviously had to be loaded to switch to the process.
3. Unless the system is idle, the cache will be thrashed between process switches by the loads from other processes.
So while you could technically have a case where the shared lib is in cache, I do not think a realistic scenario exists where that setup wins out. There are more distinct pages, but the pages didn't fit in the first place: by having each process access fewer pages overall it can miss less while it is running.
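As a very rough illustration of the indirection point (not of the cache argument), here is a toy micro-benchmark sketch comparing a direct strlen call with a call through a pointer the compiler cannot see through - a crude stand-in for the extra bounce a dynamic call takes. The numbers it prints are only indicative; everything here is illustrative.

```c
/* call_overhead.c — crude sketch: direct strlen call vs. a call through a
 * volatile function pointer (a rough stand-in for PLT-style indirection).
 *
 *   gcc -O2 call_overhead.c -o call_overhead && ./call_overhead
 */
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>
#include <time.h>

static size_t (*volatile indirect_strlen)(const char *) = strlen;

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    char buf[64] = "a moderately short test string";
    volatile size_t sink = 0;
    const long iters = 50 * 1000 * 1000;

    double t0 = now();
    for (long i = 0; i < iters; i++) {
        buf[0] = 'a' + (char)(i & 7);   /* defeat hoisting/constant folding */
        sink += strlen(buf);            /* direct call, possibly inlined */
    }
    double t1 = now();
    for (long i = 0; i < iters; i++) {
        buf[0] = 'a' + (char)(i & 7);
        sink += indirect_strlen(buf);   /* forced call through a pointer */
    }
    double t2 = now();

    printf("direct:   %.3f s\nindirect: %.3f s\n(ignore: %zu)\n",
           t1 - t0, t2 - t1, (size_t)sink);
    return 0;
}
```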
> The compiler has built-ins for parts of libc exactly because dynamic linkage is ridiculous for performance […]
The argument is entirely contrived and has no basis in fact. Compiler built-ins appeared in GNU C/C++ compilers as an attempt to replace non-portable inline assembly with portable primitives – portable across compilers and across different architectures as well. The rationale is well documented in the GNU C/C++ compiler documentation from circa v2.3, and it has nothing to do with dynamic linking.
The use of compiler built-ins increased once C/C++ compilers gained interprocedural, in-file, holistic optimisation capabilities – to improve the quality of the generated code. Moreover, compiler built-ins had been present in some form even in the 32-bit Watcom C compiler for MS-DOS, and MS-DOS had no shared libraries whatsoever.
> The chances of a particular routine being in L1 are absolutely minuscule - it's hard enough to keep a single process and its data in L1 and L2 […]
CPU caches work at the level of the addresses being accessed, not at the process level. The CPU knows nothing about processes – the CPU is a code interpreter.
One copy of «strlen» in a single memory page at a single physical memory address shared across all processes has a much better chance of staying in the cache for a longer time than 10k copies of the same «strlen» implementation in 10k memory pages strewn across 10k distinct physical addresses. A single page that is accessed frequently has a higher hit rate and, thus, less chance of getting evicted from the cache – these are the basics one can't go against. CPUs other than Intel's have large or larger I-caches, too, therefore very frequently used code has higher chances of survival in the CPU cache. Most importantly, however, the CPU cache (L1/L2) size is not the bottleneck, the TLB size is – a single frequently accessed address is better from the TLB perspective than 10k distinct addresses, as it will result in fewer TLB flushes.
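If anyone wants to see the sharing for themselves, here is a small Linux-only sketch that sums the Rss and Pss counters of this process's libc mappings from /proc/self/smaps – Pss divides each resident page by the number of processes mapping it, so Pss well below Rss means the pages really are shared. The "/libc" path match is an assumption about how your libc file is named.

```c
/* libc_sharing.c — sums Rss and Pss of this process's libc mappings. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/self/smaps", "r");
    if (!f) { perror("smaps"); return 1; }

    char line[512];
    int in_libc = 0;
    long rss = 0, pss = 0, kb;
    unsigned long lo, hi;

    while (fgets(line, sizeof line, f)) {
        if (sscanf(line, "%lx-%lx", &lo, &hi) == 2)   /* mapping header */
            in_libc = strstr(line, "/libc") != NULL;  /* adjust for musl etc. */
        else if (in_libc && sscanf(line, "Rss: %ld", &kb) == 1)
            rss += kb;
        else if (in_libc && sscanf(line, "Pss: %ld", &kb) == 1)
            pss += kb;
    }
    fclose(f);
    printf("libc mappings: Rss %ld kB, Pss %ld kB\n", rss, pss);
    return 0;
}
```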
Lastly, the shared library cache I was referring to has nothing to do with the CPU execution time. It is the cache where the shared libraries are «pre-linked» to reduce the startup, the GOT fixup and the dynamic library initialisation times – to improve the user experience, not performance.
> The argument is entirely contrived and has no basis in fact. Compiler built-ins appeared in GNU C/C++ compilers as an attempt to replace non-portable inline assembly with portable primitives
This is missing the point entirely.
GCC needs to emit e.g. memory copies. Before, this was inline assembly replicated over and over. Now, it's a call to __builtin_memcpy.
The point being missed is that GCC has always considered the idea of actually calling memcpy entirely unacceptable, as the performance would be horrible compared to an inline implementation.
The proof of this intent lies in later optimizations: not only would GCC never want to emit such slow calls itself, it replaces your explicit libc calls with builtins, because obviously you wouldn't want to do something as slow as a dynamic-linkage call.
With static linking and LTO, the libc implementation becomes as good as the builtin, rendering the latter pointless. GCC just cannot assume this to be the case.
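A minimal sketch of that substitution (file name and flags are illustrative): the fixed-size copy below gets expanded inline by default, and only -fno-builtin-memcpy forces a real call into libc.

```c
/* memcpy_builtin.c — a minimal sketch of the builtin substitution above.
 * For a small, fixed-size copy GCC expands the call inline (typically to a
 * couple of moves); disabling the builtin brings back a real libc call.
 *
 *   gcc -O2 -S memcpy_builtin.c                      # no "call memcpy"
 *   gcc -O2 -fno-builtin-memcpy -S memcpy_builtin.c  # "call memcpy" is back
 */
#include <string.h>

struct pair { long a, b; };

void copy_pair(struct pair *dst, const struct pair *src)
{
    /* Fixed 16-byte size: a prime candidate for inline expansion. */
    memcpy(dst, src, sizeof *dst);
}
```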
> CPU caches work at the level of the addresses being accessed, not at the process level.
No, CPU caches do not work on addresses, they work on tags, to be pedantic. Either way, I never said that caches are process level. I said that they do not survive across multiple processes - not because of flushing, but because of thrashing. I.e., if you have three processes, A, B and C, where A and C run shared code while B runs something else, and you switch A -> kernel -> B -> kernel -> C, then by the time you have made it from A to C your cache has been thrashed by both B and the kernel.
Now, instead of 3 processes and one routine, make it thousands of threads and gigabytes of shared libraries.
> One copy of «strlen» in a single memory page at a single physical memory address shared across all processes
Again, strlen is a terrible example: 10k copies of strlen, each a handful of bytes sitting in the current instruction stream, prefetched and branch-predicted, will outperform that shared page to an outright ridiculous extent and might even be smaller in total: 10k copies of a handful of bytes vs. 10k calls and PLT indirections + the un-inlined function. Because it is literally less memory, it also thrashes the TLB less.
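To make the indirection concrete, here is a sketch of the two paths; the commented assembly is only representative, and exact codegen depends on compiler, options and target.

```c
/* plt_call.c — sketch of the two code paths being compared.  In a typical
 * -O2 dynamically linked x86-64 build, via_plt() becomes roughly:
 *
 *     call  strlen@plt                        ; in the caller
 *     ...
 *     strlen@plt:
 *         jmp  *<GOT slot for strlen>(%rip)   ; linker-generated stub
 *
 * i.e. one extra bounce through the PLT/GOT, while folded() compiles to
 * "return 10" with no call at all.
 */
#include <stdio.h>
#include <string.h>

size_t via_plt(const char *s)
{
    return strlen(s);             /* unknown string: a real call */
}

size_t folded(void)
{
    return strlen("ten bytes!");  /* constant-folded by the builtin */
}

int main(int argc, char **argv)
{
    (void)argc;
    printf("%zu %zu\n", via_plt(argv[0] ? argv[0] : ""), folded());
    return 0;
}
```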
Even in more realistic cases, remember the TLB cost of the PLT and GOT pages in each application, not to mention the many more pages consumed by the bulkier implementation. In fact, let's focus a bit on the TLB. The most basic Gtk app links at least 80 libraries worth over 90 megabytes on my system. An L1 TLB has about 64 entries and the L2 a few thousand - with 4 KiB pages that is on the order of 16MB of reach. In other words, even the L2 TLB is about 6 times too small to keep the libraries of the simplest possible Gtk app covered.
Heck, take just libicudata at 30MB. Of course, I wouldn't suggest statically linking that, but just pointing out that a single dependency of a Gtk app is enough to fill up the TLB twice, nullifying the idea of any cache benefit to using these libraries.
"Yes but at least they can have libicudata in L3!" - yeah, no - not only would it compete with other dynamic dependencies (for this and other processes), but more importantly the applications also need to process data. A single Gtk app on a 4k monitor will, for example, be managing at least two 32MB framebuffers (3840x2160x4, x2 for double buffering), so that's most of your cache gone during draw before you even consider the input to the draw or any actual functionality of the app!
The best case for dynamic linkage performance is when call cost is irrelevant, e.g. when calling compute routines. There is no point whatsoever in considering CPU caches outside the scope of the currently running process.