Why would pipelining, caching, and branch prediction make increasing the frequency difficult? Why would heat be less of a problem for a PCIe controller than for a CPU?
I see you don't have any responses, so I'll give it a shot. I don't remember the actual math from my coursework, and have no experience with real-world custom ASIC design (let alone anything at all in the "three-digit GHz" range), but the major variables look something like this (in no particular order):
1) transistor size (smaller = faster, but process is more expensive)
2) gate dielectric thickness (smaller = faster, but easier to damage the gate; there's a corollary of this with electric fields across various PCB layer counts, microstrips, etc. that I'm nowhere near prepared to articulate)
3) logic voltage swing (smaller = faster, but less noise immunity)
4) number of logic/buffer gates traversed on the critical path (fewer = faster, but requires wider internal buses or simpler logic)
5) output drive strength (higher = faster, but usually larger, less efficient, and more prone to generating EMI)
6) fanout (how many inputs must be driven by a single output) on the critical path (lower = faster)
Most of these have significant ties to the manufacturing process and gate library, and so aren't necessarily different between a link like PCIe and a CPU or GPU. Some things can be tweaked for I/O pads, but broadly speaking a lot of these vary together across a chip. The biggest exceptions are points #4 and #6. Doing general-purpose math, having flexible program control flow constructs, and being able to optimize legacy instruction sequences on-the-fly (but I repeat myself) unavoidably requires somewhat complex logic with state that needs to be observed in multiple places. Modern processors mitigate this with pipelining, which splits the processing into smaller stages separated by registers that hold on to the intermediate state (pipeline registers). This increases the maximum frequency of the circuit at the cost of requiring multiple clock cycles for an operation to proceed from initiation to completion (but allowing multiple operations to be "in flight" at once).
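Here's a back-of-the-envelope Python sketch of that tradeoff. Every delay number in it is made up purely for illustration (not taken from any real process or gate library); the point is just that the clock period is set by the slowest register-to-register path, and pipeline registers shorten that path:

```python
# Max clock frequency is limited by the slowest path between registers:
# clock-to-Q delay + logic delay on the critical path + setup time.
# Splitting the logic with pipeline registers shortens each stage, so the
# clock can tick faster, at the cost of more cycles of latency.

def f_max_ghz(logic_delay_ps, t_clk_to_q_ps=25.0, t_setup_ps=25.0):
    """Max frequency for one register-to-register stage, in GHz."""
    period_ps = t_clk_to_q_ps + logic_delay_ps + t_setup_ps
    return 1000.0 / period_ps  # 1000 ps per ns -> GHz

# Hypothetical unpipelined datapath: 20 gate delays of ~15 ps each on the
# critical path (point #4 above).
unpipelined = f_max_ghz(20 * 15.0)

# The same logic split into 4 pipeline stages of ~5 gate delays each.
pipelined = f_max_ghz(5 * 15.0)

print(f"unpipelined:      {unpipelined:.2f} GHz")  # ~2.9 GHz
print(f"4-stage pipeline: {pipelined:.2f} GHz")    # ~8 GHz, but 4 cycles of latency
```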
That being said, what's the simplest possible example of a pipeline stage? Conceptually, it's just a pair of 1-bit registers with no logic between the output of one register and the input of the next one. When the clock ticks, a bit is moved from one register to the next. Chain a bunch of these stages together, and you have something called a shift register. Add some extra wires to read or write the stages in parallel, and a shift register lets you convert between serial and parallel connections.
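Here's a toy model of that in Python (purely illustrative; a real shift register is a handful of flip-flops, not software, and the width and bit ordering here are arbitrary choices):

```python
# Minimal sketch of a serial-in, parallel-out shift register: the "stages"
# are just a list of 1-bit values, and each clock tick shifts them along.

class ShiftRegister:
    def __init__(self, width):
        self.bits = [0] * width  # one 1-bit "register" per stage

    def tick(self, serial_in):
        """One clock edge: every bit moves to the next stage."""
        self.bits = [serial_in] + self.bits[:-1]

    def parallel_out(self):
        """The extra wires that read all the stages at once."""
        return list(self.bits)

# Deserialize 8 bits arriving one per clock into a parallel word.
sr = ShiftRegister(8)
for bit in [1, 0, 1, 1, 0, 0, 1, 0]:
    sr.tick(bit)
print(sr.parallel_out())  # [0, 1, 0, 0, 1, 1, 0, 1] -- the first bit in ends up last
```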
The big advantage that PCIe (and SATA/SAS, HDMI, DisplayPort, etc.) has over CPUs is that the only part hooked up to the differential pairs that needs to run at the link rate is "just" a pair of big shift registers plus some analog voodoo to get synchronization between two ends that are probably not running from the same physical oscillator (aka a SERDES block). In some sense it's the limiting case of the "split the critical path into tiny pipeline stages" strategy: there's no logic at all between the stages. Actually designing one of these to reliably run at multi-gigabit rates is a considerable task, but any given foundry will generally have a validated block that can be licensed and pasted into your chip for a given process node.
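To put rough numbers on why only the SERDES has to run at the line rate (the lane rate and parallel width below are example values, not a claim about any specific PCIe generation or vendor block):

```python
# The digital logic behind the SERDES only sees one parallel word every
# `parallel_width_bits` bit times, so it can run at a far lower clock than
# the wire. Example numbers only.

line_rate_gbps = 32.0      # bits per second on the wire, in Gb/s
parallel_width_bits = 32   # width of the shift register's parallel side

parallel_clock_ghz = line_rate_gbps / parallel_width_bits
print(f"internal parallel clock: {parallel_clock_ghz:.1f} GHz")  # 1.0 GHz
```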
It makes sense that links can reach higher speeds than CPU cores, but that's not enough to explain how symbol frequencies got 25x faster while CPU frequencies only got 2x faster.
The simple answer to this is that signaling rates were far behind the Pareto frontier of where they could be, and CPU clock rates are pretty much there. CPU clocks are also heat-limited far more than I/O data rates. CPUs are big and burn a lot of power when you clock them faster, while I/O circuits are comparatively small.
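A crude illustration of that power argument, using the standard dynamic-power relation P ≈ αCV²f (every number below is invented just to show the scaling, not a measurement of any real chip):

```python
# Dynamic power is roughly activity * switched capacitance * V^2 * frequency.
# A big CPU core switches orders of magnitude more capacitance per cycle than
# a single SERDES lane, so raising its clock costs far more heat.

def dynamic_power_watts(c_farads, v_volts, f_hz, activity=0.2):
    return activity * c_farads * v_volts**2 * f_hz

cpu_core    = dynamic_power_watts(c_farads=5e-9,  v_volts=1.0, f_hz=4e9)   # ~4 W
serdes_lane = dynamic_power_watts(c_farads=5e-11, v_volts=1.0, f_hz=16e9)  # ~0.16 W
print(f"CPU core: {cpu_core:.2f} W, SERDES lane: {serdes_lane:.2f} W")
```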
Transmitting and receiving high-speed data is actually mostly an analog circuit design problem, and the circuits involved are very different from the digital logic you would have to speed up to double a CPU's clock.