The System Bottleneck Shifts to PCI-Express (nextplatform.com)
131 points by rbanffy on July 18, 2017 | 62 comments



The bottleneck is not PCI-Express, not even the 3.0 version. The bottleneck is shipping processors with just 16 PCI-E 3.0 lanes (!). Things are changing with the new high-end desktop processors, with 64 PCI-E 3.0 lanes (e.g. AMD Threadripper [1]), which is massive: roughly 128 GB/s in aggregate (quick check below), and hardly a bottleneck. The 4.0/5.0 versions will allow reducing costs by requiring fewer lanes, and thus fewer pins, making devices cheaper to produce.

[1] http://www.tomshardware.com/news/amd-threadripper-vega-pcie-...
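
For reference, a quick back-of-the-envelope check of those figures (a sketch in Python; the per-lane rate is the PCIe 3.0 spec value of 8 GT/s with 128b/130b encoding):

  # PCIe 3.0: 8 GT/s per lane, 128b/130b encoding
  per_lane = 8 * 128 / 130 / 8          # ~0.985 GB/s per lane, per direction
  for lanes in (16, 64):
      one_way = lanes * per_lane
      print(f"x{lanes}: ~{one_way:.0f} GB/s per direction, ~{2 * one_way:.0f} GB/s bidirectional")
  # x16: ~16 GB/s per direction; x64: ~63 GB/s per direction (~126 GB/s both ways,
  # which is where the ~128 GB/s figure comes from)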


The article is directed at the HPC/server market. Xeons have had 40 PCIe lanes each for a long time. PCIe bandwidth on desktops is only a problem for a very small number of people.


They could make 32-lane PCI-E slots, then.


Taking a more skeptical look at the issue, we might have Intel to blame for the stagnant system interconnect.

As it gets harder each year to improve CPU performance, the HPC community has shifted to more workload-specific accelerators, most notably GPUs for high-throughput parallel data processing. These accelerators still rely on the host CPU to dispatch commands, but in recent years we've seen workloads really focus on and target the accelerator device (e.g. neural networks on the GPU).

If one wants to build a multi-GPU cluster, then the CPU quite plainly "gets in the way" - performance can very quickly be bottlenecked by weak inter-device bandwidth (PCIe 3.0 x16 = 16 GB/s, vs. the GTX 1080 Ti onboard DRAM's ~500 GB/s; rough numbers sketched below). Not to mention the fact that the PCIe controller is on the CPU die, meaning inter-node bandwidth and latency also strictly favor the CPU.

For large-scale systems, the value proposition of multi-GPU is greatly neutered by reliance on the PCIe bus, and so Intel stays relevant for many applications. And for the last 5 years, Intel's utter dominance of the HPC/server market meant that they could limit PCIe lanes without much pressure from their customers. With Ryzen/EPYC (128 PCIe 3.0 lanes!!) that old order looks set to change.
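
To put the x16 vs. on-board DRAM comparison above in concrete terms, a rough sketch with round, assumed numbers:

  # Assumed round figures: PCIe 3.0 x16 ~16 GB/s one way, 1080 Ti GDDR5X ~500 GB/s
  pcie_x16 = 16.0
  gpu_dram = 500.0
  data_gb = 10.0                                                # hypothetical working set
  print(f"over PCIe:     {data_gb / pcie_x16 * 1000:.0f} ms")   # ~625 ms
  print(f"from GPU DRAM: {data_gb / gpu_dram * 1000:.0f} ms")   # ~20 ms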


A friend who works on NVLink would have a very similar response if prompted. And they were (probably still are) trying some more interesting ways of working around PCIe that were hamstrung by Intel et al, given the control they have over the controller market as well. It's a shame but we'll see if they pay for it after all!

NVIDIA is also in a tough spot in that they can't cooperate too closely with either Intel or AMD now that the race for accelerators is heating up. Understandably they're pretty psyched about Tegra but I'm not sure I'd want to rely on that and I bet they don't either.


I don't think that's the whole story, though. With some advances in storage, PCI-e is not far from being a bandwidth bottleneck in even consumer PCs (thinking about NVMe as an example.)


> NVMe

Even there the CPU can be the bottleneck. I have a Samsung 960 Pro that's theoretically capable of 3GB/s reads but when you use disk encryption even with AES-NI the processor can only do ~2GB/s.


What about when using SIMD?


That would probably help. Not sure what Linux LUKS supports specifically, but it's testable via "cryptsetup benchmark".
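
A rough way to see why the CPU becomes the limit (a sketch using the figures quoted above; "cryptsetup benchmark" would give the real cipher numbers for a given machine):

  # When reads are decrypted inline, throughput is capped by the slower
  # of the SSD and the cipher (figures from the comment above).
  ssd_read = 3.0      # GB/s, theoretical 960 Pro sequential read
  aes_ni   = 2.0      # GB/s, observed AES throughput on that CPU
  print(f"effective encrypted read: ~{min(ssd_read, aes_ni)} GB/s")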


What are the current maximum bandwidth speeds used by PCIe solid state memory? I'm not aware of any that go over 4GB/s, let alone the 16GB/s of PCIe 3.0.


NVIDIA has their own bus to interconnect cards called NVLink: https://en.wikipedia.org/wiki/NVLink Seems like an attempt to get around this.

Memory is also really slow, but AMD's 6+ memory controllers will help.


NVLink is called out specifically in the article as a workaround for the delayed creation of PCIe 4.0.


Memory is too slow? Recent processors have around 75GB/s or more of bandwidth. That is rarely the bottleneck, and it only shows up as one in well-written programs.


I certainly don't have much to add on any of the technical discussion here about the host side of things, but speaking as someone who in a previous life had to run miles and miles of IB cable under the floors, rip out mysteriously broken cables and run their replacements back in, these high-capacity ports are a godsend for switch interconnects. I can use more ports on my edge switches and run way fewer shitty $1500 Mellanox ISL cables if I just want my hosts running FDR IB.


Specifically, a PCIe 3.0 x16 slot can't even support a 200 Gbps NIC. Apparently the Mellanox ConnectX-6 NIC takes up two slots.


Network adapters have outpaced the various system buses off and on since the start of computing.

What's interesting in this article is that the bus is becoming a local bottleneck between various system resources. (cpu/gpu/accelerators)


PCI-E 3.0 x16 has 16 GB/s of bandwidth, while 200 Gbps is 25 GB/s, so yes, it is not enough.

Even with enough bandwidth, handling 25 GB/s of data (short of hardware-accelerated network processing) is very hard when current processors have about 60 GB/s of RAM bandwidth (DDR4, dual channel), and even with zero-copy in a modern OS it is going to be hard for the CPU to process the packets under continuous full load. That's mind-blowing, and amazing :-)
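
A quick unit check on those figures (a sketch; PCIe 3.0 spec rate with 128b/130b encoding):

  nic_gbit = 200
  nic_gbyte = nic_gbit / 8                 # 25 GB/s
  pcie3_x16 = 16 * 8 * 128 / 130 / 8       # ~15.8 GB/s per direction
  print(nic_gbyte, round(pcie3_x16, 1))    # 25.0 15.8 -> a single x16 3.0 slot can't keep up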


DDR has a much higher bandwidth than this. The old v4 Xeons could get close to 100 GB/s, and the new Skylakes out now have a 50% increase by going to six channels. On top of that, Intel has a feature where the packets coming off the wire are stored in cache first, so you could see much higher packet processing performance, depending on the type of application.


Have you tested the numbers?



The way that networking at high rates is done in the HPC/supercomputing world is with user-level networking and OS-bypass, not "zero-copy modern OS".


Sure, using hardware-accelerated packet routing for accelerating network protocols (e.g. Cavium SoCs).


He's referring to dpdk and the like, not hardware offloads such as Cavium or other special chips.


Not really dpdk either, though I suppose there are similarities (Disclaimer: I know very little about dpdk). Think Infiniband (IB), not ethernet. Demanding applications such as MPI libraries or Lustre are written directly against the IB verbs interface, not TCP/IP with the sockets API.

And yes, IB is designed such that the NIC HW can offload quite a lot, and the rest is indeed done in userspace without kernel involvement in the hot paths.

That being said, it's possible to run more or less the IB protocol stack on ethernet hardware; it's called RoCE (RDMA over Converged Ethernet). Somewhat amazingly, latency is actually quite competitive with IB.


He mentioned hardware-accelerated packet routing. RoCE is not a routing protocol -- it's a UDP packet using IB verbs to DMA directly to hardware on the PCIe bus and bypass the processor. But yes, RoCE is a really interesting protocol and seems to have won the RoCE versus iWARP war.


Ok, thank you.


200 Gbps is still quite a bit relative to everything else right now. DisplayPort 1.4 is still only 32.4 Gbps.


No mention of power requirements/allowances? I recall some GPUs a while back being a big deal because they just squeezed into the power envelope allowed by PCIe. I don't know if this governing body also controls the power specs or if they're even related.


High-end GPUs have a second or even third power connector, direct to the PSU. The power supplies to support them have dedicated 12V rails for each. The case you're remembering was due to trying to build cards down to a lower price by not including this connector and running very near the maximum allowed power that can be drawn from the PCIe slot connector. IIRC, when combined with cheap motherboards and power supplies it resulted in failures.

The 6-pin PCIE power connectors can handle 75W, the 8-pins 150W, according to the specs. Some video cards have multiples of these connectors, to allow for very high power consumption. If more power is needed they'll just add more and require a bigger power supply.


The PCIE spec also says that the maximum power supplied via external connectors is 225W. This is so that system makers know how big to size their power supplies to so that they can claim to support a maximally spec'd PCIE board. Obviously a board can require more external power, but board makers do generally try to stay below 300W.
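
Putting together the figures quoted in this thread (a sketch; slot and connector limits as mentioned above):

  slot_w      = 75    # PCIe x16 slot
  six_pin_w   = 75    # 6-pin auxiliary connector
  eight_pin_w = 150   # 8-pin auxiliary connector
  external_max = six_pin_w + eight_pin_w    # 225 W, the external-connector cap
  board_total  = slot_w + external_max      # 300 W, the ceiling board makers try to stay under
  print(external_max, board_total)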


It seems like system integrators all use either 300-400W PSUs (for consumer systems that come with a single GPU that's seldom top of the line), or they use 1200W or dual-900W PSUs for the high-end workstations to ensure they can handle 4 top of the line GPUs. Meanwhile, the retail PSU market for enthusiasts building their own PCs is dominated by 500-850W models that are way oversized for almost all ordinary desktop usage, but guarantee safe headroom for the few users who do extreme overclocking or actually bother with multiple GPUs.


Something most people ignore is that many of these 500-850W enthusiast PSUs have higher efficiency, and often near-silent or even fanless operation, in the typical 50-200W power range.


GPU servers nowadays use proprietary form factors for GPUs anyway (like SXM2).


I'd be curious to hear who is using servers with the proprietary form factor GPUs. The prices for datacenter vs. gamer versions of the same GPUs have gotten really out of whack, where a 1080ti is $700, and a P100 is $7000, and at least on FP32, they have the same performance.

From what I hear, most all of the major DL research labs are using 4U servers with 8 of either the 1080ti or the Titan Xp rather than the SXM2 Teslas. On the other hand, all of the cloud providers seem to be sticking with the Tesla products, and pricing GPU instances commensurately. I really wish there were a cloud provider that offered instances with the gamer cards in them so that companies in the DL field weren't effectively forced to buy and manage their own hardware.


A P100 has a gp100 with nvlink, hbm2, and a 610mm² die, while a 1080ti or Titan Xp has a gp102 with gddr5x and a 471mm² die. It is not like Quadro vs. GeForce, where they both run on a gp102 with different drivers; the gp100 is considerably physically larger and has a much smaller userbase. I would be very surprised if they have the same fp32 performance in real-world situations.


OK, so let's compare eight GP102s against one GP100. Is there any metric where GP100 wins?


Power consumption and space consumption.


Right, well, people who need FP64 aren't going to be satisfied with the 1080ti so that is the answer.

See this comment of mine: https://news.ycombinator.com/item?id=14597486


I actually think FP16 performance is more of a factor for those choosing the Tesla platform.


You mean for the Volta architecture, surely? I'm talking about currently available hardware.


The Pascal Teslas (P100), at least, support FP16.


You can use the AWS g2 instances.


That's not really true. The first card to introduce this form factor was the Pascal P100. It is still priced way out of most people's price range compared to the older cards. They also have PCIe versions of that card, which perform close to the same.


Ah, thank you, that was it! I mistakenly believed there was a connection between the power input to the card and the amount it could send out over the lanes.


There have even been graphics cards that had their own power plug that you plugged directly into the wall!


Seems kind of silly, especially since PCI-Express is not cache coherent.

Why not just connect high performance devices directly to hypertransport, Infinity fabric, QPI, or whatever the fast cache coherent serial interface of the day is?


The first generation of Infinipath did exactly that. We had to create a slot standard for Hypertransport (HTX), convince motherboard makers to build boards with that slot, and then at the end of the day we just ended up convincing everyone that they never wanted to go that route ever again.

These days Intel's putting Omnipath on-package for Xeon Phi and Skylake. Likely it's still a PCI-Express connection, but it doesn't count against the total available lanes for external cards.


> and then at the end of the day we just ended up convincing everyone that they never wanted to go that route ever again.

Hmm, why? Intel executed well with Nehalem while AMD stumbled, leaving HTX even more niche than it already was, or do you mean there was some fundamental (technical) problem with it?

> These days Intel's putting Omnipath on-package for Xeon Phi and Skylake. Likely it's still a PCI-Express connection, but it doesn't count against the total available lanes for external cards.

Assuming the parts with integrated omni-path use the same socket, some pins will be needed for the OPA connection, no? Presumably pins that were reserved for PCIe in the normal chips, I'd guess...


Look at photos -- Omnipath Skylakes have an extra connector.


Because those aren't standard and nobody can agree on what the standard should be. Do you want to be the GPU manufacturer that has to make a half-dozen different kinds of the same card for different interfaces? What about the performance over distance? Most of that kind of bus is designed to work well over short distances and much stricter tolerances.

These things would make cards cost many times more than they currently do which isn't feasible. Better a "slow" standard that works.


What exactly do you mean when you say PCI-Express isn't cache coherent?

I would think that's dependent on the specific platform more than anything. For example, on a cache coherent architecture like the Freescale T2080, if a PCI device writes to some RAM location that's currently cached, the line automatically gets invalidated. There’s this whole “coherency fabric” thing which handles all of the snooping and interaction between the peripherals.


My problem with PCIe has always been latency issues. While more lanes will help with throughput, it still doesn't help me get data on and off my NIC any quicker.


Do busses matter in devices with a singular SoC, like modern phones and tablets? If not, then perhaps laptops/desktops could follow suit?


They still matter there. Some of the high end ARMv8 stuff has been shipping with PCI-express for talking to storage peripherals and such.

https://developer.arm.com/products/system-design/development...


Yes, storage for example isn't on the SoC, so bus speed will limit storage speed.


Is PCIe really the bottleneck for storage devices? I haven't measured experimentally to see, but just off the top of my head, it seems like an x4 PCIe 3.0 slot should be able to keep up with everything but the latest SSDs (and for those, I'm not really sure... depends on the workload as well, probably).
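
For a rough sense of that x4 ceiling (a sketch; PCIe 3.0 spec rate with 128b/130b encoding):

  x4_bw = 4 * 8 * 128 / 130 / 8     # ~3.94 GB/s per direction
  print(round(x4_bw, 2))            # roughly in line with the fastest NVMe drives of the day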


It's a bit of a self-fulfilling prophecy -- the more flash chips you've got the more bandwidth is possible, but there's not much incentive to go faster than x4 can if you know the device is going to plug into x4.


If you're building a storage server using PCIe SSDs, you really don't want to have to pay Avago/Broadcom's prices for PCIe switches. That's one of the things that makes AMD's new server platform with 128 PCIe lanes so appealing: you can put together a 2U 24x NVMe SSD server plus NICs without the expense of a PLX chip (lane math sketched below).

It seems like we'll be sticking with PCIe x4 interfaces for SSDs for quite a while; client/consumer use cases don't allow for higher pin counts, and at the moment there's not much reason to prefer using half as many SSDs with twice the lane count for enterprise storage, especially when it means none of the parts can be shared with the client/consumer ecosystem.
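
Rough lane math for the 2U box mentioned above (a sketch; the NIC allocation is a hypothetical assumption):

  ssds = 24
  lanes_per_ssd = 4                 # standard x4 NVMe
  nic_lanes = 2 * 16                # e.g. two x16 NICs (assumed)
  print(ssds * lanes_per_ssd + nic_lanes, "of 128 EPYC lanes used")   # 128 -- no PCIe switch needed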


The topic is too overgeneralized. For general computing cases, PCI Express is not the bottleneck, nor will it be.


This blog "offers in-depth coverage of high-end computing at large enterprises, supercomputing centers, hyperscale data centers, and public clouds."

It's not about general purpose computing.


Yeah, they even link directly to PDF datasheets of the PCB laminates and prepreg material (the core layers that the copper adheres to on a printed circuit board) that were used, in an effort to show why they are having issues.

Pretty specific. Good article.


This article compares PCIe with eth/IB multiple times.

I wonder, does the author know how network cards are attached to the CPU? There is no magic, apparently. And on most desktops you cannot even have a full-size video card and a 10GbE adapter working simultaneously at full speed. And this bottleneck is in the CPU (its interfaces).



