I was directly addressing the «saturate» part of the statement, not memory becoming the bottleneck. Since builds are inherently parallel nowadays, saturating the memory bandwidth is very easy: each CPU core runs a scheduled compiler process (in the 1:1 core-to-process mapping scenario), and all CPU cores suddenly start competing for memory access. This is true for all architectures and designs where memory is shared. The same reasoning does not apply to NUMA architectures, but those are nearly entirely non-existent apart from certain fringe cases.
Linking, in fact, whilst benefitting from faster/wider memory, is less likely to result in saturation unless the linker is heavily and efficiently multithreaded. For instance, GNU ld is single-threaded; gold is multi-threaded, but Ian Taylor has reported very small performance gains from the use of multithreading in gold; and mold takes full advantage of concurrent processing. LLVM's lld is somewhere in between.
In M1/M2/M3 Max/Ultra, the math is a bit different. Each performance core is practically capped at ~100 GB/sec of memory transfer speed. The cores are then organised into core clusters of n P and m E cores, and each core cluster is capped at ~240 GB/sec. The aggregate memory transfer speed is ~400 GB/sec (800 GB/sec for the Ultra setup) for the entire SoC, but that is also shared with the GPU, ANE and other compute acceleration cores/engines. Since each core cluster has multiple cores, a large parallel compilation process can saturate the memory bandwidth easily.
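To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It assumes a deliberately naive model in which bandwidth is simply clipped at each level (core, cluster, SoC), using the approximate caps quoted above. The two-cluster layout and the 30 GB/sec per-process demand are hypothetical values picked for illustration, not measurements.

```python
import math

# Approximate caps quoted in the comment above, in GB/sec.
PER_CORE_CAP = 100
PER_CLUSTER_CAP = 240
SOC_CAP = 400


def cores_to_saturate(cap, per_core_demand):
    """Number of busy cores needed to reach `cap`, given each core's demand."""
    per_core = min(per_core_demand, PER_CORE_CAP)
    return math.ceil(cap / per_core)


def achievable_bandwidth(active_cores_per_cluster, per_core_demand):
    """Aggregate bandwidth under a naive 'clip at every level' model (an assumption)."""
    per_core = min(per_core_demand, PER_CORE_CAP)
    total = 0.0
    for cores in active_cores_per_cluster:
        # A cluster cannot move more than its own cap, however many cores are busy.
        total += min(cores * per_core, PER_CLUSTER_CAP)
    # The SoC-wide fabric caps the sum of all clusters.
    return min(total, SOC_CAP)


if __name__ == "__main__":
    # Cores streaming flat out: how many does it take to fill one cluster?
    print(cores_to_saturate(PER_CLUSTER_CAP, PER_CORE_CAP))
    # Hypothetical compiler processes pulling 30 GB/sec each.
    print(cores_to_saturate(PER_CLUSTER_CAP, 30))
    # Hypothetical layout: two clusters with four busy cores each.
    print(achievable_bandwidth([4, 4], PER_CORE_CAP))
```

Under this model, a handful of cores running at the per-core cap is already enough to fill a cluster, and two full clusters exceed what the SoC can deliver, before the GPU or ANE take their share.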
Code optimisation and type inference in strongly statically typed languages with polymorphic types (Haskell, Rust, ML and others) are very memory intensive, esp. at scale. There are multiple types of optimisation, and most of them are either constraint-solving problems or NP-complete, but code inlining coupled with inter-procedural optimisations requires very large amounts of memory on large codebases, and there are other memory-bound optimisation techniques as well. Type inference for polymorphic types in the Hindley–Milner type system is also memory intensive, as it has to maintain a large depth (== memory) in order to successfully deduce the type. So it is not entirely unfathomable that «~8 bytes per cycle/core between L2 and the RAM on average» is rather modest for a highly optimising modern compiler.
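For scale, the quoted «~8 bytes per cycle/core» figure can be converted into GB/sec with a short Python sketch. The 3.2 GHz clock is an assumption made for illustration (roughly a P-core peak clock), and the ~100 GB/sec per-core cap is the approximate figure mentioned earlier in this thread.

```python
# Assumed sustained core clock in GHz; an illustrative value, not a measurement.
ASSUMED_CLOCK_GHZ = 3.2
# Approximate per-core cap in GB/sec, as quoted earlier in this thread.
PER_CORE_CAP_GB_S = 100


def bytes_per_cycle_to_gb_per_sec(bytes_per_cycle, clock_ghz):
    """bytes/cycle * (clock_ghz * 1e9 cycles/sec) / (1e9 bytes/GB)."""
    return bytes_per_cycle * clock_ghz


def share_of_cap(bandwidth_gb_s, cap_gb_s):
    """Fraction of the cap that a given bandwidth represents."""
    return bandwidth_gb_s / cap_gb_s


if __name__ == "__main__":
    per_core = bytes_per_cycle_to_gb_per_sec(8, ASSUMED_CLOCK_GHZ)
    print(f"{per_core:.1f} GB/sec per core")
    print(f"{share_of_cap(per_core, PER_CORE_CAP_GB_S):.0%} of the per-core cap")
```

At the assumed clock, 8 bytes per cycle works out to roughly a quarter of what a single performance core can pull, which is why the figure reads as modest.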
In fact, I am of the opinion that inadequate computing hardware, coupled with severe memory bandwidth and capacity constraints, was a major technical contributing factor in the demise of the Itanium ISA (along with the less advanced code optimisers of the day).