The System Bottleneck Shifts to PCI-Express (nextplatform.com)
131 points by rbanffy on July 18, 2017 | 62 comments



The bottleneck is not PCI-Express, not even the 3.0 version. The bottleneck is shipping processors with just 16 PCI-E 3.0 lanes (!). Things are changing with the new high-end desktop processors, with 64 PCI-E 3.0 lanes (e.g. AMD Threadripper [1]), which is massive: roughly 128 GB/s in aggregate (quick check below), and hardly a bottleneck. The 4.0/5.0 versions will allow reducing costs by requiring fewer lanes, and thus fewer pins, making devices cheaper to produce.

[1] http://www.tomshardware.com/news/amd-threadripper-vega-pcie-...
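
For reference, a quick back-of-the-envelope check of those figures (a sketch in Python; the per-lane rate is the PCIe 3.0 spec value of 8 GT/s with 128b/130b encoding):

  # PCIe 3.0: 8 GT/s per lane, 128b/130b encoding
  per_lane = 8 * 128 / 130 / 8          # ~0.985 GB/s per lane, per direction
  for lanes in (16, 64):
      one_way = lanes * per_lane
      print(f"x{lanes}: ~{one_way:.0f} GB/s per direction, ~{2 * one_way:.0f} GB/s bidirectional")
  # x16: ~16 GB/s per direction; x64: ~63 GB/s per direction (~126 GB/s both ways,
  # which is where the ~128 GB/s figure comes from)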


The article is directed at the HPC/server market. Xeons have had 40 PCIe lanes each for a long time. PCIe bandwidth on desktops is only a problem for a very small number of people.


They could make 32-lane PCI-E slots, then.


Taking a more skeptical look at the issue, we might have Intel to blame for the stagnant system interconnect.

As it gets harder each year to improve CPU performance, the HPC community has shifted to more workload-specific accelerators, most notably GPUs for high-throughput parallel data processing. These accelerators still rely on the host CPU to dispatch commands, but in recent years we've seen workloads really focus on and target the accelerator device (e.g. neural networks on the GPU).

If one wants to build a multi-GPU cluster, then the CPU quite plainly "gets in the way" - performance can very quickly be bottlenecked by weak inter-device bandwidth (PCIe 3.0 x16 = 16 GB/s, vs. the GTX 1080 Ti onboard DRAM's ~500 GB/s; rough numbers sketched below). Not to mention the fact that the PCIe controller is on the CPU die, meaning inter-node bandwidth and latency also strictly favor the CPU.

For large-scale systems, the value proposition of multi-GPU is greatly neutered by reliance on the PCIe bus, and so Intel stays relevant for many applications. And for the last 5 years, Intel's utter dominance of the HPC/server market meant that they could limit PCIe lanes without much pressure from their customers. With Ryzen/EPYC (128 PCIe 3.0 lanes!!) that old order looks set to change.
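
To put the x16 vs. on-board DRAM comparison above in concrete terms, a rough sketch with round, assumed numbers:

  # Assumed round figures: PCIe 3.0 x16 ~16 GB/s one way, 1080 Ti GDDR5X ~500 GB/s
  pcie_x16 = 16.0
  gpu_dram = 500.0
  data_gb = 10.0                                                # hypothetical working set
  print(f"over PCIe:     {data_gb / pcie_x16 * 1000:.0f} ms")   # ~625 ms
  print(f"from GPU DRAM: {data_gb / gpu_dram * 1000:.0f} ms")   # ~20 ms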


A friend who works on NVLink would have a very similar response if prompted. And they were (probably still are) trying some more interesting ways of working around PCIe that were hamstrung by Intel et al, given the control they have over the controller market as well. It's a shame but we'll see if they pay for it after all!

NVIDIA is also in a tough spot in that they can't cooperate too closely with either Intel or AMD now that the race for accelerators is heating up. Understandably they're pretty psyched about Tegra but I'm not sure I'd want to rely on that and I bet they don't either.


I don't think that's the whole story, though. With some advances in storage, PCI-e is not far from being a bandwidth bottleneck in even consumer PCs (thinking about NVMe as an example.)


> NVMe

Even there the CPU can be the bottleneck. I have a Samsung 960 Pro that's theoretically capable of 3GB/s reads but when you use disk encryption even with AES-NI the processor can only do ~2GB/s.


What about when using SIMD?


That would probably help. Not sure what Linux LUKS supports specifically, but it's testable via "cryptsetup benchmark".
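
A rough way to see why the CPU becomes the limit (a sketch using the figures quoted above; "cryptsetup benchmark" would give the real cipher numbers for a given machine):

  # When reads are decrypted inline, throughput is capped by the slower
  # of the SSD and the cipher (figures from the comment above).
  ssd_read = 3.0      # GB/s, theoretical 960 Pro sequential read
  aes_ni   = 2.0      # GB/s, observed AES throughput on that CPU
  print(f"effective encrypted read: ~{min(ssd_read, aes_ni)} GB/s")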


What are the current maximum bandwidth speeds used by PCIe solid state memory? I'm not aware of any that go over 4GB/s, let alone the 16GB/s of PCIe 3.0.


NVIDIA has their own bus to interconnect cards called NVLink: https://en.wikipedia.org/wiki/NVLink Seems like an attempt to get around this.

Memory is also really slow, but AMD's 6+ memory controllers will help.


NVLink is called out specifically in the article as a workaround for the delayed creation of PCIe 4.0.


Memory is too slow? Recent processors have around 75GB/s or more of bandwidth. That is rarely the bottleneck, and it only shows up as one in well-written programs.


I certainly don't have much to add on any of the technical discussion here about the host side of things, but speaking as someone who in a previous life had to run miles and miles of IB cable under the floors, rip out mysteriously broken cables and run their replacements back in, these high-capacity ports are a godsend for switch interconnects. I can use more ports on my edge switches and run way fewer shitty $1500 Mellanox ISL cables if I just want my hosts running FDR IB.


Specifically, a PCIe 3.0 x16 slot can't even support a 200 Gbps NIC. Apparently the Mellanox ConnectX-6 NIC takes up two slots.


Network adapters have outpaced the various system buses off and on since the start of computing.

What's interesting in this article is that the bus is becoming a local bottleneck between various system resources. (cpu/gpu/accelerators)


PCI-E 3.0 x16 has 16 GB/s of bandwidth, while 200 Gbps is 25 GB/s, so yes, it is not enough.

Even with enough bandwidth, handling 25 GB/s of data (short of hardware-accelerated network processing) is very hard when current processors have about 60 GB/s of RAM bandwidth (DDR4, dual channel), and even with zero-copy in a modern OS it is going to be hard for the CPU to process the packets under continuous full load. That's mind-blowing, and amazing :-)
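
A quick unit check on those figures (a sketch; PCIe 3.0 spec rate with 128b/130b encoding):

  nic_gbit = 200
  nic_gbyte = nic_gbit / 8                 # 25 GB/s
  pcie3_x16 = 16 * 8 * 128 / 130 / 8       # ~15.8 GB/s per direction
  print(nic_gbyte, round(pcie3_x16, 1))    # 25.0 15.8 -> a single x16 3.0 slot can't keep up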


DDR has a much higher bandwidth than this. The old v4 Xeons could get close to 100 GB/s, and the new Skylakes out now have a 50% increase by going to six channels. On top of that, Intel has a feature where the packets coming off the wire are stored in cache first, so you could see much higher packet processing performance, depending on the type of application.


Have you tested the numbers?



The way that networking at high rates is done in the HPC/supercomputing world is with user-level networking and OS-bypass, not "zero-copy modern OS".


Sure, using hardware-accelerated packet routing for accelerating network protocols (e.g. Cavium SoCs).


He's referring to dpdk and the like, not hardware offloads such as Cavium or other special chips.


Not really dpdk either, though I suppose there are similarities (Disclaimer: I know very little about dpdk). Think Infiniband (IB), not ethernet. Demanding applications such as MPI libraries or Lustre are written directly against the IB verbs interface, not TCP/IP with the sockets API.

And yes, IB is designed such that the NIC HW can offload quite a lot, and the rest is indeed done in userspace without kernel involvement in the hot paths.

That being said, it's possible to run more or less the IB protocol stack on ethernet hardware; it's called RoCE (RDMA over Converged Ethernet). Somewhat amazingly, latency is actually quite competitive with IB.


He mentioned hardware-accelerated packet routing. RoCE is not a routing protocol -- it's a UDP packet using IB verbs to DMA directly to hardware on the PCIe bus and bypass the processor. But yes, RoCE is a really interesting protocol and seems to have won the RoCE versus iWARP war.


Ok, thank you.


200 Gbps is still quite a bit relative to everything else right now. DisplayPort 1.4 is still only 32.4 Gbps.


No mention of power requirements/allowances? I recall some GPUs a while back being a big deal because they just squeezed into the power envelope allowed by PCIe. I don't know if this governing body also controls the power specs or if they're even related.


High-end GPUs have a second or even third power connector, direct to the PSU. The power supplies to support them have dedicated 12V rails for each. The case you're remembering was due to trying to build cards down to a lower price by not including this connector and running very near the maximum allowed power that can be drawn from the PCIe slot connector. IIRC, when combined with cheap motherboards and power supplies it resulted in failures.

The 6-pin PCIE power connectors can handle 75W, the 8-pins 150W, according to the specs. Some video cards have multiples of these connectors, to allow for very high power consumption. If more power is needed they'll just add more and require a bigger power supply.


The PCIE spec also says that the maximum power supplied via external connectors is 225W. This is so that system makers know how big to size their power supplies to so that they can claim to support a maximally spec'd PCIE board. Obviously a board can require more external power, but board makers do generally try to stay below 300W.
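
Putting together the figures quoted in this thread (a sketch; slot and connector limits as mentioned above):

  slot_w      = 75    # PCIe x16 slot
  six_pin_w   = 75    # 6-pin auxiliary connector
  eight_pin_w = 150   # 8-pin auxiliary connector
  external_max = six_pin_w + eight_pin_w    # 225 W, the external-connector cap
  board_total  = slot_w + external_max      # 300 W, the ceiling board makers try to stay under
  print(external_max, board_total)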


It seems like system integrators all use either 300-400W PSUs (for consumer systems that come with a single GPU that's seldom top of the line), or they use 1200W or dual-900W PSUs for the high-end workstations to ensure they can handle 4 top of the line GPUs. Meanwhile, the retail PSU market for enthusiasts building their own PCs is dominated by 500-850W models that are way oversized for almost all ordinary desktop usage, but guarantee safe headroom for the few users who do extreme overclocking or actually bother with multiple GPUs.


Something most people ignore is that many of these 500-850W enthusiast PSUs have higher efficiency, and often near-silent or even fanless operation, in the typical 50-200W power range.


GPU servers nowadays use proprietary form factors for GPUs anyway (like SXM2).


I'd be curious to hear who is using servers with the proprietary form factor GPUs. The prices for datacenter vs. gamer versions of the same GPUs have gotten really out of whack, where a 1080ti is $700, and a P100 is $7000, and at least on FP32, they have the same performance.

From what I hear, most all of the major DL research labs are using 4U servers with 8 of either the 1080ti or the Titan Xp rather than the SXM2 Teslas. On the other hand, all of the cloud providers seem to be sticking with the Tesla products, and pricing GPU instances commensurately. I really wish there were a cloud provider that offered instances with the gamer cards in them so that companies in the DL field weren't effectively forced to buy and manage their own hardware.


A P100 has a gp100 with nvlink, hbm2, and a 610mm² die, while a 1080ti or Titan Xp has a gp102 with gddr5x and a 471mm² die. It is not like Quadro vs. GeForce, where they both run on a gp102 with different drivers; the gp100 is considerably physically larger and has a much smaller userbase. I would be very surprised if they have the same fp32 performance in real-world situations.


OK, so let's compare eight GP102s against one GP100. Is there any metric where GP100 wins?


Power consumption and space consumption.


Right, well, people who need FP64 aren't going to be satisfied with the 1080ti so that is the answer.

See this comment of mine: https://news.ycombinator.com/item?id=14597486


I actually think FP16 performance is more of a factor for those choosing the Tesla platform.


You mean for the Volta architecture, surely? I'm talking about currently available hardware.


The Pascal Teslas (P100), at least, support FP16.


You can use the AWS g2 instances.


That's not really true. The first card to introduce this form factor was the Pascal P100. It is still priced way out of most people's price range compared to the older cards. They also have PCIe versions of that card, which perform close to the same.


Ah, thank you, that was it! I mistakenly believed there was a connection between the power input to the card and the amount it could send out over the lanes.


There have even been graphics cards that had their own power plug that you plugged directly into the wall!


Seems kind of silly, especially since PCI-Express is not cache coherent.

Why not just connect high performance devices directly to hypertransport, Infinity fabric, QPI, or whatever the fast cache coherent serial interface of the day is?


The first generation of Infinipath did exactly that. We had to create a slot standard for Hypertransport (HTX), convince motherboard makers to build boards with that slot, and then at the end of the day we just ended up convincing everyone that they never wanted to go that route ever again.

These days Intel's putting Omnipath on-package for Xeon Phi and Skylake. Likely it's still a PCI-Express connection, but it doesn't count against the total available lanes for external cards.


> and then at the end of the day we just ended up convincing everyone that they never wanted to go that route ever again.

Hmm, why? Intel executed well with Nehalem while AMD stumbled, leaving HTX even more niche than it already was, or do you mean there was some fundamental (technical) problem with it?

> These days Intel's putting Omnipath on-package for Xeon Phi and Skylake. Likely it's still a PCI-Express connection, but it doesn't count against the total available lanes for external cards.

Assuming the parts with integrated omni-path use the same socket, some pins will be needed for the OPA connection, no? Presumably pins that were reserved for PCIe in the normal chips, I'd guess...


Look at photos -- Omnipath Skylakes have an extra connector.


Because those aren't standard and nobody can agree on what the standard should be. Do you want to be the GPU manufacturer that has to make a half-dozen different kinds of the same card for different interfaces? What about the performance over distance? Most of that kind of bus is designed to work well over short distances and much stricter tolerances.

These things would make cards cost many times more than they currently do which isn't feasible. Better a "slow" standard that works.


What exactly do you mean when you say PCI-Express isn't cache coherent?

I would think that's dependent on the specific platform more than anything. For example, on a cache coherent architecture like the Freescale T2080, if a PCI device writes to some RAM location that's currently cached, the line automatically gets invalidated. There’s this whole “coherency fabric” thing which handles all of the snooping and interaction between the peripherals.


My problem with PCIe has always been latency issues. While more lanes will help with throughput, it still doesn't help me get data on and off my NIC any quicker.


Do busses matter in devices with a singular SoC, like modern phones and tablets? If not, then perhaps laptops/desktops could follow suit?


They still matter there. Some of the high end ARMv8 stuff has been shipping with PCI-express for talking to storage peripherals and such.

https://developer.arm.com/products/system-design/development...


Yes, storage for example isn't on the SoC, so bus speed will limit storage speed.


Is PCIe really the bottleneck for storage devices? I haven't measured experimentally to see, but just off the top of my head, it seems like an x4 PCIe 3.0 slot should be able to keep up with everything but the latest SSDs (and for those, I'm not really sure... depends on the workload as well, probably).
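
For a rough sense of that x4 ceiling (a sketch; PCIe 3.0 spec rate with 128b/130b encoding):

  x4_bw = 4 * 8 * 128 / 130 / 8     # ~3.94 GB/s per direction
  print(round(x4_bw, 2))            # roughly in line with the fastest NVMe drives of the day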


It's a bit of a self-fulfilling prophecy -- the more flash chips you've got the more bandwidth is possible, but there's not much incentive to go faster than x4 can if you know the device is going to plug into x4.


If you're building a storage server using PCIe SSDs, you really don't want to have to pay Avago/Broadcom's prices for PCIe switches. That's one of the things that makes AMD's new server platform with 128 PCIe lanes so appealing: you can put together a 2U 24x NVMe SSD server plus NICs without the expense of a PLX chip (lane math sketched below).

It seems like we'll be sticking with PCIe x4 interfaces for SSDs for quite a while; client/consumer use cases don't allow for higher pin counts, and at the moment there's not much reason to prefer using half as many SSDs with twice the lane count for enterprise storage, especially when it means none of the parts can be shared with the client/consumer ecosystem.
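
Rough lane math for the 2U box mentioned above (a sketch; the NIC allocation is a hypothetical assumption):

  ssds = 24
  lanes_per_ssd = 4                 # standard x4 NVMe
  nic_lanes = 2 * 16                # e.g. two x16 NICs (assumed)
  print(ssds * lanes_per_ssd + nic_lanes, "of 128 EPYC lanes used")   # 128 -- no PCIe switch needed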


The topic is too overgeneralized. For general computing cases, PCI Express is not the bottleneck, nor will it be.


This blog "offers in-depth coverage of high-end computing at large enterprises, supercomputing centers, hyperscale data centers, and public clouds."

It's not about general purpose computing.


Yeah, they even link directly to PDF datasheets of the PCB laminates and prepreg material (the core layers that the copper adheres to on a printed circuit board) that were used, in an effort to show why they are having issues.

Pretty specific. Good article.


This article compares PCIe with eth/IB multiple times.

I wonder, does the author know how network cards are attached to the CPU? There is no magic, apparently. And on most desktops you cannot even have a full-size video card and a 10GbE adapter working simultaneously at full speed. And this bottleneck is in the CPU (its interfaces).



