The obvious alternative is to not use all the layers, but that is anything but straightforward.

One place the slowness of libraries becomes obvious is during optimisation. As a simple example, memory copying can be optimised if you are doing large copies. The C library memcpy has to work in the general case, and afaik a simple implementation just loops over bytes and copies them one by one. That is probably optimal if you are copying a small number of bytes like 2 or 3, which is probably a common case. On modern CPUs, though, you can get substantial speed-ups by writing your own partially unrolled loop that copies 4 bytes at a time, or even more if you are willing to write assembler, where you can copy 16 bytes at a time and use non-temporal cache hints. (Many current C libraries already ship SIMD-tuned versions of memcpy, so measure before replacing it.) Think about how many routines copy memory around using this library call... and this is just one example. In an actual use case, software rendering, I used this to copy a 320x240 framebuffer, and my final assembler-optimised version was a good 15% faster than using memcpy.
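As a rough illustration of the partially unrolled copy described above, here is a minimal C sketch. The name copy_words_unrolled is made up for this example, and it assumes both buffers are 4-byte aligned, do not overlap, and hold whole 32-bit words (for instance a 320x240 framebuffer with 32-bit pixels, which is my assumption; the pixel format isn't stated above). It is not the assembler version mentioned above, and whether it beats your memcpy is something you have to measure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copies n_words 32-bit words from src to dst.
   Assumes both buffers are 4-byte aligned and do not overlap. */
static void copy_words_unrolled(uint32_t *dst, const uint32_t *src,
                                size_t n_words)
{
    assert(dst != NULL && src != NULL);

    size_t i = 0;

    /* Main loop: four words (16 bytes) per iteration. */
    for (; i + 4 <= n_words; i += 4) {
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }

    /* Tail: the remaining 0 to 3 words. */
    for (; i < n_words; ++i)
        dst[i] = src[i];
}
```

For the framebuffer case you would call it as copy_words_unrolled(dst, src, 320 * 240), again assuming one 32-bit word per pixel.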

The problem is that libraries are convenient, but they have to work in a large number of cases, which may prevent them from using the algorithm that is optimal for your problem. Even just being in a library costs a little, because calls into it usually cannot be inlined. For example, the C standard library math functions can be beaten just by writing equivalents that can be inlined; the gain per call is small, but it still exists.
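To make the inlining point concrete, here is a small C sketch of an inlinable stand-in for a library math call. The names min_no_nan and clamp_upper are hypothetical, and the stand-in deliberately drops the general-case NaN handling that fmin() has to provide. Some compilers already expand fmin() inline at higher optimisation levels, so treat this as an illustration of the idea rather than a guaranteed win.

```c
#include <assert.h>

/* Hypothetical inlinable stand-in for fmin(). Unlike the library function
   it does not treat NaN specially: if either argument is NaN the
   comparison is false and the result is simply b. */
static inline double min_no_nan(double a, double b)
{
    return (a < b) ? a : b;
}

/* Example hot loop: cap every value at limit. The helper above can be
   inlined here, so there is no call overhead per element. */
static void clamp_upper(double *values, int count, double limit)
{
    assert(count >= 0);

    for (int i = 0; i < count; ++i)
        values[i] = min_no_nan(values[i], limit);
}
```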

I'm not 100% sure, but the C library math functions may also do extra work to paper over limitations of the FPU. For example, the x87 fsin instruction only accepts arguments with magnitude below 2^63 iirc (outside that range it leaves the operand unchanged), and the library function might do expensive operations to get around this. In that case the gain from using a single fsin instruction will be significant, perhaps more than twice as fast as the equivalent C library function.
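For what it's worth, issuing a bare fsin from C might look like the sketch below. It assumes GCC or Clang extended inline assembly on an x86 target with x87 support; raw_fsin and HAVE_X87_FSIN are names invented for this example. It does no range reduction at all, so the caller is responsible for keeping the argument inside the range the instruction accepts.

```c
#include <assert.h>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#define HAVE_X87_FSIN 1

/* Hypothetical wrapper around the x87 fsin instruction.
   Precondition: |x| < 2^63. No range reduction is attempted. */
static inline double raw_fsin(double x)
{
    assert(x > -9223372036854775808.0 && x < 9223372036854775808.0);

    double result;
    /* "t" names the top of the x87 register stack; fsin replaces
       st(0) with its sine. */
    __asm__ ("fsin" : "=t"(result) : "0"(x));
    return result;
}
#endif
```

On other compilers or architectures the function is simply not defined, so any caller would need its own fallback.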

Some of this is the rationale behind my FridgeScript language (which tries to be fast at floating point ops). It is measurably faster than the MS C++ compiler provided that the code is clean: to a very good approximation FridgeScript does no optimisation, so things like foo+1+bar+1 mostly end up as three additions instead of two.
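To spell out the foo+1+bar+1 case, here it is written in C rather than FridgeScript, with hypothetical function names. The first version does three additions when evaluated left to right; the second combines the constants by hand and does two. Note that for floating point a C compiler generally won't make this rewrite for you under strict IEEE rules either, since reassociating can change rounding, so "clean" here means doing the folding yourself.

```c
#include <assert.h>

/* As written: three additions when evaluated left to right. */
static double sum_as_written(double foo, double bar)
{
    return foo + 1 + bar + 1;
}

/* Constants combined by hand: two additions. */
static double sum_hand_folded(double foo, double bar)
{
    return foo + bar + 2;
}

/* Quick check on small values where every intermediate result is exact.
   For arbitrary inputs the two forms may round differently. */
static void check_folding(void)
{
    assert(sum_as_written(2.0, 3.0) == sum_hand_folded(2.0, 3.0));
}
```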