How PCI-Express works (2020) (ovh.com)
202 points by superjan on Sept 11, 2021 | 89 comments



One neat thing about PCIe is that every PCIe compliant device must support using fewer lanes than it can. So your GPU, which would like to run at x16, must also work at x1, etc.

So in theory one could install a GPU or HBA in one of those x1 slots on the motherboard if one isn't terribly concerned about bandwidth. Except most motherboards use x1 connectors which don't have an "open back" for larger cards... so much for that[1].

Another thing I didn't see mentioned was this. Say you have an x16 slot on your motherboard, but you really want to install 4 NVMe M.2 drives. Well, each of those needs an x4, but you have an x16, so per the above you should be fine, right? Just plop in a "dumb" four-socket M.2 PCIe card?

Well, no, not unless the motherboard supports bifurcation[2]. If not, you'll need a card with a PCIe switch which can turn those four x4's into one x16.

[1]: yes I know about risers

[2]: https://www.10gtek.com/new-1414


The fail0verflow team that got Linux to run on the PS4 ran a single, very slow PCIe lane over RS232 as I recall so that they could intercept the traffic. It’s very tolerant.


https://fail0verflow.com/blog/2016/console-hacking-2016-post... I think this is the talk that goes through their PCIe intercept stuff.

Absolutely hilarious to me; I believe they were running one direction at 115.2kbps and everything was just a-ok about it.


Yesssss, one of my favorite hardware-hacking talks ever:

https://media.ccc.de/v/33c3-7946-console_hacking_2016#t=375


"x1 connectors which doesn't have an "open back" for larger cards..."

Because they bothered to read the PCIe spec.

Open back slots are against the PCIe specification ("up-plugging" is the term, IIRC). That is for two (three?) reasons. First, it's a mechanical thing: as you may have noticed, plugging a big heavy x16 GPU into an x1 slot is an exercise in getting it just right. Any bump will move the card to a bad angle, even if it's screwed into the back panel. Second, in older PCIe specs the amount of current a slot was required to supply depended on its size, so a big x16 card might need the full slot rating while a smaller slot might only provide a smaller amount of current (IIRC the limits were something like 75W for an x16 and only 25W for an x1). Thirdly, down-plugging _IS_ part of the spec and the correct way to handle this situation: the MB vendor provides a larger, say x16, mechanical slot which is wired with only a fraction of the signal lanes, say x4.

So, the vendors which were providing cut-back slots were either ignorant of the spec, too cheap to pay the couple extra cents for the larger connector, or had some fundamental constraint on providing it and were willing to ship a non-conformant part (generally unlikely). I've seen these cut-back slots a few times in parts of the ARM ecosystem where the vendor can't be bothered to read the spec, much less implement it correctly. This is also how you get the pile of problems frequently seen in the SBC market, where boards won't actually work with some PCIe card, USB device, or whatever because the HW/SW only implements the convenient parts of the relevant standard.


For point [1] it's also trivial to file or Dremel the slot to be open-backed, if you're careful. I've done this successfully a handful of times, unsuccessfully a few more than that...


Let me get this straight: you say that the process is trivial, yet you admit to having a worse than 50% failure rate? What does “unsuccessfully” mean in this case? Is the part completely ruined? At what failure rate would you consider a process non-trivial?


Admittedly it was more of a tongue-in-cheek joke with some fact behind it - I've done it successfully 4 or 5 times on systems meant for "production", but only after testing on EoL/scheduled-for-the-bin hardware with mixed results: on at least two I hit data pins and rendered the port unusable, on one the Dremel slipped and ended up cutting a slot through the board itself, and another you would have sworn was perfect but just didn't work (probably something shorted that I couldn't physically see). I did three good ones after those 5 or 6 failures, at which point a friend wanted me to do all the x1 ports on his board to fit x4 NICs he had, not caring whether it was successful or not (he had backup hardware just in case, which thankfully wasn't needed), followed by another friend with a server he wanted to put a GPU in. If someone handed me a board today I'd feel fairly comfortable I could do it without issue; I still wouldn't exactly recommend it to the faint-of-heart or those without steady hands, but once you kinda 'figure out' the right way to do it with the right tools, it's pretty straightforward.


It's trivial to take a rotary tool to a piece of plastic, but now you have plastic dust and shavings all over the place, like on the PCIe contacts.

You could also use a razor blade or other such manual cutting devices...


There is a useful lesson here: most people don’t use words to mean anything other than what they believe they are describing. Trivial to him means it doesn’t take a lot of time, which is how he uses the word, in the context of time. Now, he’s unaware that he’s wrong and he is also unaware that his misuse of the word means his explanation is irrelevant to anyone who doesn’t already understand him! In effect, he says nothing and you respond with confusion asking why you needed to understand what he didn’t!

(People are very very good with pretending at intelligence and bad at the practice of it)


True, I'm just really annoyed an open-back socket isn't the default choice for the manufacturers. Not like I enjoy using a Dremel on my motherboard...


You’ll find that in many cases a lack of open-back slots also means there will be components behind the slot that will interfere with a full/half-length card.

Also, don’t use a Dremel or even a file; a hobby knife or flush cutters work just fine. If you have a steady hand then a hot-knife attachment for a soldering iron is fine too. The cutting wheel on a Dremel tool can run off, and the dust it produces can be conductive and short something.


> So in theory one could install a GPU or HBA in one of those x1 slots on the motherboard if one isn't terribly concerned about bandwidth. Except most of the motherboards use x1 connectors which doesn't have an "open back" for larger cards... so much for that[1].

It's not so simple, there's another factor besides the number of lanes: power. According to Wikipedia (https://en.wikipedia.org/wiki/PCI_Express#Power), an x1 slot is limited to 25W on the 12V pins, while an x16 slot has a higher limit of 66W; the power pins are the same, but the maximum current is higher (2.1A vs 5.5A).


Fair point, I didn't think of that since my primary wants for an x1 port have been low-power stuff.


> PCIe compliant device must support using fewer lanes than it can

Is that true in practice or is it like USB where manufacturers can wildly violate specs and just expect consumers to deal with it?


When the PCIe device is initialized, it starts up in x1 mode. The controller then negotiates the number of lanes with the device. So yeah it pretty much has to work in x1.
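
One quick way to see the result of that negotiation on Linux is to read the link attributes the kernel exposes in sysfs ("lspci -vv" shows the same information under LnkCap/LnkSta). A minimal sketch in C; the device address below is just a placeholder, substitute one from "lspci -D":

  /* Minimal sketch: print a device's advertised vs. negotiated PCIe link
     width, using the attributes the Linux kernel exposes in sysfs.
     The device address is a placeholder. */
  #include <stdio.h>

  static void print_attr(const char *dev, const char *attr)
  {
      char path[256], value[32];
      snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/%s", dev, attr);
      FILE *f = fopen(path, "r");
      if (f && fgets(value, sizeof(value), f))
          printf("%-20s %s", attr, value);    /* value already ends in '\n' */
      if (f)
          fclose(f);
  }

  int main(void)
  {
      const char *dev = "0000:01:00.0";       /* placeholder BDF address */
      print_attr(dev, "max_link_width");      /* what the card advertises */
      print_attr(dev, "current_link_width");  /* what link training settled on */
      print_attr(dev, "current_link_speed");  /* negotiated rate, e.g. "8.0 GT/s" */
      return 0;
  }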


It generally always works in practice. A common real-world example is "external GPUs" connecting via Thunderbolt: these are actually just generic PCI Express 4x slots on a cable (4x is the max throughput Thunderbolt 3 achieves; can't remember what the cap is on the new Thunderbolt 4…).

Although graphics cards should generally go in a 16x slot for best performance, they all work in the external 4x, even the latest, most high-end, power/bandwidth-hungry GPU parts. Performance may suffer in some workloads, though, due to the reduced bandwidth of the 4x slot.

You can stick any PCI Express device in an external GPU enclosure and it will pretty much "just work" despite being a 4x slot - these things are really external PCIe enclosures despite the GPU focus.


AFAIK Thunderbolt adds overhead on top of plain old PCIe, making it slower in practice. I kind of wish it were just plain PCIe; you could then directly connect cheap-ish PCIe adapters to it.


In terms of PCIe 3 lanes, the max on thunderbolt 3 is 2.75x, and the max on thunderbolt 4 is 4x. Out of a roughly-5x total bandwidth.

Why was 3 capped? Nobody seems to know.


It is true in practice. The older style of mining rig for certain types of cryptocurrency relies on a ton of GPUs (6-18) connected via risers to every available PCIe slot on a motherboard, most of them being 1x slots. There are special motherboards for this that don't even come with 16x slots, instead breaking that out into 16 1x slots to allow more GPUs to be connected.


There’s a nice photo of an open back PCIe x1 connector here:

File:PCIe J1900 SoC ITX Mainboard IMG 1820.JPG - Wikipedia https://en.wikipedia.org/wiki/File:PCIe_J1900_SoC_ITX_Mainbo...


You can also literally just file down the plastic on any “closed” socket to make it open.


Only works if your motherboard doesn't have other components (sockets, capacitors, etc.) in that area.


Open back and closed back PCIe connectors are electrically exactly the same. There are plenty of instructions on the net for grinding out the back of the connector to allow physically larger cards[1].

1 - https://youtu.be/fZed5r9tXHQ


> One neat thing about PCIe is that every PCIe compliant device must support using fewer lanes than it can. So your GPU, which would like to run at x16, must also work at x1, etc.

I tried this with a SAS controller. They make 16-port SAS PCIe x8 cards, so in an x1 slot I should get full speed for 1-2 SAS devices, and the SAS drive(s) will only slow down if I add more... right?

Wrong. Even with a single device, you get about 1/8th the speed you should. So, yeah, PCI-e cards properly downgrade to fewer lanes, but not necessarily in a very useful way.


I have a wifi6 card that only works in the x1 slot, for some really annoying reason


I have come across devices which would only work on specific slots. There’s clearly a gap between theory and practice…


Lots of BIOS/EFI implementations just really suck.

On the enterprise side of things, loading up a board with bootable RAID and 10G cards will lead to situations where the RAID card needs to go topologically first, or at least early, so the RAID BIOS has memory to load into.


Some dev boards have a PCIe slot with an "open back".


Minor correction to the article. It mentions that each PCIe lane has two wires. One for inbound and one for outbound.

I think this should really read that each lane has two differential pairs (interestingly the diagram shows twisted pairs). One for inbound, one for outbound. Which sums to four wires (or traces) in total for each lane.

I’m also a little confused about the whole

> A lane is composed of 2 wires: one for inbound communications and one, which has double the traffic bandwidth, for outbound.

Not really sure what the author is trying to say, but the idea that outbound has double the bandwidth doesn't really make much sense, as "outbound" differs depending on which device's perspective you take. So I read this statement as saying that each device transmits at twice the speed the receiving device can receive data, which is clearly nonsensical.


Yes, I too was confused on the same points you mentioned. Thanks for the clarification.


Maybe by outbound they mean root-to-peripheral.


I would guess they mean it's full duplex


I meant full duplex ;-) Sorry guys for the confusion, I'll correct it!


This is a very low quality article with a lot of errors, completely missing key concepts of PCIe like the root complex.


It's a fine article if it gets you excited to learn more about PCIe (which is pretty dang cool). The MindShare book on it is quite good, very detailed and comprehensive, and doubles as a doorstop when not in use.

But yeah, I spotted the "two wires" stuff and a bunch of other problems. Credit for enthusiasm, perhaps, but it's not terribly accurate.


It reads like marketing branding fluff…

Based on other articles by the author https://www.ovh.com/blog/understanding-the-anatomy-of-gpus-u... it seems all of them are quite terrible word salads with no substance.


> A lane is composed of 2 wires: one for inbound communications and one, which has double the traffic bandwidth, for outbound.

I might be wrong, but I believe PCIe lanes are differential, meaning you have 2 pairs of wires (1 pair for each direction), not 1, totaling 4 physical wires per lane.

Also, both pairs in the lane have the same bandwidth, but since they form a "full duplex" system, you might say it doubles the overall bandwidth of the lane; it does not mean that one of them is twice as fast.


Yes. Every lane has 4 data pins. TX+, TX-, RX+, RX-

The data is transmitted and received as differential pairs because this helps in signal integrity issues.

You are also correct that you can transmit and receive at the same time, but whether this helps or not depends on what you are running.

The Serdes used for PCI-E are also used for other protocols like SATA and USB3. They are extremely similar at the physical layer but the higher layer protocols are very different.

USB 1/2 has a single differential pair so you can only transmit or receive but not at the same time.

USB 3 has differential pairs for RX and TX so you can do both at the same time.


> USB 1/2 has a single differential pair so you can only transmit or receive but not at the same time
>
> USB 3 has differential pairs for RX and TX so you can do both at the same time

Just in case this leaves some readers thinking it's only possible to transmit or receive on a differential pair, not so:

Copper gigabit ethernet (1000Base-T) transmits and receives at the same time over all four differential pairs, using adaptive equalisation and echo cancellation to separate the outgoing and incoming signals on each pair.

The signal processing is quite power hungry, and doesn't work so well at higher speeds. So it makes sense that PCIe wouldn't do this.


You are not wrong. I can't tell if this article is some machine-learning-written garbage or a Dunning–Kruger effect thesis.

It literally starts with a discussion of add-in card form factors, as if that even matters. Then it has tons of blunders, like the full/half duplex one, or calling transfers per second a frequency, which it isn't: how many transfers happen per second has little to do with the clock frequency, because it depends on other factors such as the width of the channel and how many times per clock cycle you can transfer data. It doesn't even begin to go over basic concepts like the PCIe root complex and the structure of a switched fabric.

Then it has some gems like “having a nice GPU with 16 PCIe Lanes and having a CPU with 8 PCIe Bus lanes will be as efficient as throwing away half your money because it doesn’t fit in your wallet.”, which really isn't true…



I've always been interested in understanding the hardware side of PCI-E better. By that I mean, how is it just so much faster than every other type of mobo/CPU interface (e.g. SATA, DVI, USB)? Why not just use PCI-E _everywhere_ if it's so much faster?


In 2003 SATA did 150MB/s and PCIe x1 did 250MB/s. In 2004 SATA did 300MB/s. In 2007 PCIe x1 did 500MB/s. In 2008 SATA did 600MB/s. In 2010 PCIe did 1GB/s.

The only reason SATA fell behind on a per-lane basis is because they stopped updating it.

PCIe has no special sauce for running faster, just more lanes. It's taking over drives, but mostly for other reasons.

DVI was replaced by HDMI. HDMI 2.1, which came out in 2017, is about 2/3 as fast per lane as PCIe gen4, which also came out in 2017. For an external passive cable that's really good.

USB 3.1 jumped to 1.2GB/s in 2013, which means it was faster than PCIe for four years. And USB4 is twice that speed per lane, making it faster than PCIe gen4.

So the constraining factor is not the type of signalling, it's the number of wires and transceivers. You can make anything go faster with more wires, but that's expensive.
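
For reference, the per-lane numbers above follow directly from the signalling rate and the line encoding (8b/10b up through gen2, 128b/130b from gen3 on). A back-of-the-envelope sketch in C using the commonly quoted spec figures, ignoring protocol overhead above the physical layer:

  /* Back-of-the-envelope PCIe per-lane throughput from signalling rate and
     line-code efficiency; protocol overhead above the physical layer is
     ignored, so real-world figures come in a bit lower. */
  #include <stdio.h>

  struct gen {
      const char *name;
      double gigatransfers;    /* GT/s per lane */
      double efficiency;       /* line-code efficiency */
  };

  int main(void)
  {
      const struct gen gens[] = {
          { "PCIe 1.x",  2.5,   8.0 / 10.0  },   /* 8b/10b    */
          { "PCIe 2.x",  5.0,   8.0 / 10.0  },
          { "PCIe 3.x",  8.0, 128.0 / 130.0 },   /* 128b/130b */
          { "PCIe 4.0", 16.0, 128.0 / 130.0 },
          { "PCIe 5.0", 32.0, 128.0 / 130.0 },
      };
      for (size_t i = 0; i < sizeof(gens) / sizeof(gens[0]); i++) {
          /* GT/s * efficiency = Gb/s of payload; divide by 8 for GB/s */
          double gbytes = gens[i].gigatransfers * gens[i].efficiency / 8.0;
          printf("%s  ~%.2f GB/s per lane, ~%.1f GB/s at x16\n",
                 gens[i].name, gbytes, gbytes * 16.0);
      }
      return 0;
  }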


Mostly because it's more channels in parallel.

These days all of these interconnects (and high-speed Ethernet) are in the process of a bit of a convergence into what are essentially just versions of raw SerDes interfaces on chips - the same wires coming off the die can be 10Gb Ethernet, PCIe, etc. - they are all just self-clocked differential pairs.


Thunderbolt is basically PCIe in a cable.

But the big issue you face is that PCIe is fast because the physical layer is so tightly controlled. Lanes to the same port are all approximately the same length. All the lanes are carefully shielded with a complex mix of grounded traces and large ground planes above and below. The length of the traces is limited to make sure the speed of light doesn’t introduce too much delay.

So PCIe speed doesn’t come from fancy algorithms, but from fancy electronics, and a very tightly controlled physical layer. Extending that layer outside of a computer is tricky.

Just like trains are much faster than cars, PCIe is much faster than DisplayPort. But trains require tracks that are long and straight, and cars don’t. Equally, PCIe requires electrical connections that are short and straight, and DisplayPort doesn’t.

Thunderbolt kinda splits the difference. It’s basically PCIe, but heavily downrated so it can handle being outside of a motherboard.


Recent versions of PCIe have short reach within a system precisely because the physical layer is not all that tightly controlled. Cheap PCB materials are a much worse physical medium for high speed signals than twinaxial cables, which are used in a lot of rackmount servers to bring PCIe from the back of the system where the CPUs and expansion slots are up to a SSD backplane in the front of the system.

Thunderbolt goes for even longer cable runs by using active cables, while long PCIe runs use redrivers or retimers that aren't integrated into the cables. Thunderbolt is also not really a downrated PCIe in any way; Thunderbolt transmits data faster than PCIe 4.0 lanes, but doesn't scale to as many lanes as internal PCIe links.


Because PCIe allows direct access to everything else.

A USB3 port is already pretty dangerous: you can plug in something that will generate keystrokes or mouse movements and also present storage, so a malicious device can mount itself, copy over a payload, run it, and then pretend to be a cup-warmer again.

Plug in a PCIe device and it gets to control your system.



From your link:

System compatibility

Kernel DMA Protection requires new UEFI firmware support. This support is anticipated only on newly-introduced, Intel-based systems shipping with Windows 10 version 1803 (not all systems).

So, it's CPU specific, motherboard specific, firmware specific, and OS version specific.

That's not really solved, is it?


It's CPU specific in the sense that the CPU needs an IOMMU (almost everything made in the past 10 years has one) and OS version specific in that Windows and Linux have both supported it for about 3 years.

The problematic part is that the UEFI needs to support it, it seems most systems with Thunderbolt have enabled it since 2018 and systems without Thunderbolt still don't bother.


Most hardware features need that kind of adoption to work.

It's mandatory for thunderbolt 4.


Should I say "mitigated"? DMA itself is a core part of improving transfer performance, and is itself a security hole.

Anyway, the mitigation is considered good enough for TB/USB4 to be adopted in new devices.


It's susceptible to interference so there's probably no need to use something like this where something simpler (and more robust) will fit the bill. But PCI-E is used externally - that's what Thunderbolt is.


Now it's going everywhere, isn't it?

* NVMe 2.0 is going to support HDDs

* Thunderbolt and USB 4.0 support PCIe-based connections

* CFexpress and SD Express are based on PCIe


Price and range, in addition to the security issues related to DMA others have brought up. Max cable length for the very high signaling rates is very short: https://superuser.com/questions/885232/what-is-the-maximum-l...


Max cable length is very short, if you insist on using cheap ribbon cables with the relatively ancient PCIe CEM slot as a connector. Or you can do PCIe 4.0 x8 over a 2M cable, if you use quality twinaxial cabling with more modern connectors, eg. https://www.serialcables.com/product-category/pcie4-oculink-...


I mean, there's the length covered by the standard, which is short, and then these longer cables. If they work, great, but I don't believe it's required by PCIe. 2m is also a fairly short cable, if not as short as a few inches.


I don't believe I've ever run across a section of the PCIe standard dictating cable or PCB trace length requirements. All I've ever seen are signal integrity requirements (ie. dB loss) and timing requirements. The timing requirements are not tight enough to preclude long cables, and while the signal loss requirements are chosen based on what's practical and affordable to achieve with a PCB, they don't directly concern themselves with setting minimum or maximum reach lengths.


So one thing I still wonder about is why PCIe peripherals often present themselves as a region of memory to the main processor. If the chipset/CPU is buffering writes for fast transmission, does it guess when you are done writing? Are any of the normal RAM caches used? If so, what happens when you write only a partial cache line? It seems so backward to me.


This enables DMA (direct memory access) which is a much faster way of moving data around.

It allows the CPU (or any other device in the PCI bus) to write/read data to/from the device when it’s convenient for the sender, without having to interrupt whatever the receiving device is currently doing.

You still need to coordinate with the device to tell it that you’ve written to its memory, or read from it. But that’s a pretty cheap operation.

The alternative is that both the CPU and the GPU would have to stop what they're doing and manage the data copy while doing nothing else.

So it’s basically the difference between sending someone an email vs giving them a call.

With DMA you're emailing a big document, then later calling to make sure they received it. Without DMA it would be like calling them and reading the document out over the phone. One is clearly better for everyone's productivity.


Thanks!

I could still read this two ways however: one where the memory is on the peripheral, and one where the memory is main memory, where the peripheral is copying to/from using DMA. Which one is it?


Typically with devices like network cards (that also operate over PCI-E)

You send the device a circular list of descriptors (pointers) to a region of main memory.

In order to send data to the device, you write your network packet to the memory region associated with the pointer of the current ‘head’ of the descriptor list.

So far, you have a ring of pointers, one of those pointers points to a location you just wrote to in ram.

You then tell the device that the head of the list has changed (as you just wrote some data to the region that the head of the list is pointing to - so it can consume that pointer), the device then goes ahead and copies the data from ram into an internal buffer on the card. Once the data is consumed, the tail pointer of the ring buffer is updated to indicate that the card is finished with that memory region.
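
A toy version of that transmit ring in C, just to make the shape concrete - the field and function names are invented for illustration, and a real driver would put the ring and buffers in DMA-coherent memory and ring the "doorbell" with an MMIO register write:

  /* Toy transmit-descriptor ring in the style described above. */
  #include <stdint.h>

  #define RING_SIZE 256                /* power of two keeps the wrap cheap */

  struct tx_desc {
      uint64_t buf_addr;               /* DMA address of the packet buffer */
      uint16_t length;                 /* bytes to send */
      uint16_t flags;                  /* e.g. an "owned by NIC" bit */
  };

  struct tx_ring {
      struct tx_desc desc[RING_SIZE];
      uint32_t head;                   /* next slot the driver will fill */
      uint32_t tail;                   /* last slot the NIC has consumed */
  };

  /* Hypothetical doorbell: tell the device the head pointer moved. */
  static void nic_write_doorbell(uint32_t new_head)
  {
      (void)new_head;                  /* real driver: write an MMIO register */
  }

  static int ring_send(struct tx_ring *r, uint64_t dma_addr, uint16_t len)
  {
      uint32_t next = (r->head + 1) % RING_SIZE;
      if (next == r->tail)
          return -1;                   /* ring full: NIC hasn't caught up yet */

      struct tx_desc *d = &r->desc[r->head];
      d->buf_addr = dma_addr;          /* packet already written to DMA memory */
      d->length   = len;
      d->flags    = 1;                 /* hand ownership of this slot to the NIC */

      r->head = next;
      nic_write_doorbell(r->head);     /* NIC DMAs the buffer, then advances tail */
      return 0;
  }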



DMA can work in both directions. You could put information in a specific location in main memory, then ask a peripheral to read it.

But equally a peripheral can expose its own memory and ask the host to write into it.

Cheap devices tend to do the former because it avoids the need to have expensive memory built in. They can just “borrow” system memory. More expensive, performance optimised, devices tend to do the latter.

It’s also worth mentioning that DMA tends to work between every device attached to the PCIe bus. So Microsoft’s DirectStorage API seems to be using this feature, by having the GPU directly read data from an SSD, without the data ever touching the CPU or main memory.


There are a number of mechanisms at work for such use cases. You can flush caches manually or have the memory uncached; you also need to make sure some accesses aren't reordered. If you are interested, you might search e.g. for ARM device memory. https://www.embedded.com/dealing-with-memory-access-ordering...


I/O to the peripherals can generally be done with either memory-mapped I/O or a separate I/O address space (port-mapped I/O) accessed with different instructions than memory accesses. x86 can do either, but some architectures don't have a separate I/O space. PCI allows devices to require I/O access to be used, so on systems without it, it's got to be emulated by the controller somehow. For x86, the I/O space is only 16-bit, vs at least 32-bit for MMIO (but many devices allow a 64-bit MMIO address). For devices like GPUs with large amounts of RAM, you can feasibly map the whole memory with MMIO and access it directly, whereas with I/O ports, you'd probably need to write a destination address to one port and then read/write from another port. If you wanted concurrency, you'd need multiple pairs of access ports or locking, rather than letting the memory subsystems arbitrate access.

As others have said, you'd normally configure the MMIO space for uncached access, or you'd need to be careful to force the memory ordering you need. The device specific interfacing requirements would be the guide there. Devices can indicate if their MMIO ranges are prefetchable or not, which should indicate if stray reads would cause side effects or not.

One bonus of MMIO is DMA could interface with other devices, whereas I don't think devices are allowed to drive the I/O bus like that.
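
Roughly what the two access styles look like from C on x86 Linux. This is a sketch only: the port number and register offset are made up, and the MMIO pointer would come from mmap()ing the device's BAR (e.g. its "resource0" file in sysfs):

  /* Sketch of port-mapped vs. memory-mapped access on x86 Linux.
     The port number and register offset are invented for illustration. */
  #include <stdint.h>
  #include <sys/io.h>                  /* inb/outb/ioperm - x86 Linux only */

  #define EXAMPLE_PORT 0x3f8           /* hypothetical I/O port */

  uint8_t read_via_port_io(void)
  {
      ioperm(EXAMPLE_PORT, 1, 1);      /* ask the kernel for port access (needs root) */
      return inb(EXAMPLE_PORT);        /* dedicated IN instruction, 16-bit port space */
  }

  uint8_t read_via_mmio(volatile uint8_t *bar)
  {
      /* With MMIO the device register is just an address: an ordinary load,
         routed to the device instead of RAM. "volatile" stops the compiler
         from caching or reordering the access. */
      return bar[0x10];                /* hypothetical register offset in the BAR */
  }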


How do you think they should appear?

It's interesting to compare to embedded processors without a memory management unit, like this STM32 reference manual, see p.68 and following:

https://www.st.com/resource/en/reference_manual/dm00124865-s...

Everything looks like a memory address. Note that it's not actually memory, the processor just diverts requests for that address to the peripheral instead of memory. But on that little ARM processor, if you want to write to actual RAM, that's memory addresses 0x2001 0000 to 0x2001 BFFF. Data in the onboard Flash memory is at 0x0020 0000 - 0x002F FFFF. If you want to talk to something on a serial port, write to registers from 0x4001 1000 - 0x4001 13FF. If you want to show something on an attached LCD, or pull a buffer from the USB or Ethernet peripherals, or work with GPIO, or do anything at all, really, it's at some memory offset. This chip has some DMA, you can set it up to automatically push from one peripheral memory space to actual RAM or vice versa. But everything happens at a region of memory.
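
The bare-metal flavour of that is just a volatile pointer to a fixed address. A small sketch using the USART range quoted above; the register offset and layout here are simplified for illustration, not taken from the manual:

  /* On an MCU a peripheral register is simply a fixed address you cast to
     a volatile pointer; the store goes out on the bus to the peripheral,
     not to RAM. */
  #include <stdint.h>

  #define USART1_BASE 0x40011000u      /* in the serial-port range quoted above */
  #define USART_DR (*(volatile uint32_t *)(USART1_BASE + 0x04)) /* illustrative data register */

  void uart_send_byte(uint8_t b)
  {
      /* A real driver would first poll a status register for "TX empty". */
      USART_DR = b;
  }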


I imagine that you have regions of normal RAM reserved for peripheral I/O, and that the peripheral directly copies its data to or from that region using DMA. For the CPU and the driver developer, this means there is no special handling required for these ranges of memory. If the PCIe bus is slow, the CPU is not slowed down because the transfer can be async.

This is perhaps the DMA scenario you describe in the last two lines, my thinking is that it would make sense to do this all the time, at least when transfers are large.


You need more than just a shared area to put the data (which main memory, with DMA for devices to access it, works great for); you also need a way to coordinate on what data is ready to send or ready to receive. That could be in main memory and polled, but it usually makes sense to signal the device directly that something is ready to be sent, so the device isn't constantly issuing reads to memory to poll. For the device to signal the host, you'd generally raise an interrupt somehow, and then the host would ask the device what was ready. Usually you want that host-to-device communication to be direct, and memory-mapping the I/O is a straightforward way to do that. (Using DMA to store/retrieve the data also means the device doesn't need a whole lot of buffer space: enough to store a packet to be sent and maybe two packets to receive, assuming you'll have time to DMA the first before the second is done; if the bus may be more congested, you'll need a bigger receive buffer.)


Such regions of memory are usually marked as uncachable (UC) or write-through (WT) or sometimes write-combine (WC). These specify what type of caching is allowed, and if you somehow configure an incorrect caching type you're going to have a bad time.


I've never given much thought to how much PCIe lanes are available to my hardware. However, with the advent of NVME drives I realise that there are only so many drives one can install.


This is a crucial consideration when choosing hardware (especially Intel vs AMD) [0]. Of note is the AMD Threadripper family supporting up to 56 PCIe lanes directly to the CPU. On the datacenter/server side AMD Epyc supports 128 CPU PCIe lanes [1]. Note this is per socket so with multi-socket systems these numbers increase accordingly.

Massive amounts of I/O such as this are instrumental for high and extremely dense I/O applications whether it be for storage, network, GPUs, etc (or combinations of all, of course). I know it's been a consideration for companies such as Netflix, Cloudflare, and Nvidia.

[0] https://www.cgdirector.com/guide-to-pcie-lanes/

[1] https://www.amd.com/en/processors/epyc-7002-series


> On the datacenter/server side AMD Epyc supports 128 CPU PCIe lanes [1]. Note this is per socket so with multi-socket systems these numbers increase accordingly.

However, AMD Epyc uses half of the 128 lanes on each socket to talk to the other socket, so on two-socket systems each socket has only 64 PCIe lanes available, for a total of 128 lanes, the same as with a single socket. On the other hand, newer AMD Epyc processors can (depending on the motherboard) use only 48 lanes to talk to the other socket, freeing another 16 lanes per socket (for a total of 160 PCIe lanes) at the cost of slower inter-socket communication; and they also have an extra single PCIe lane per socket to be used to connect to a BMC. The article at https://www.servethehome.com/why-amd-epyc-rome-2p-will-have-... has a great explanation of all that.


> On the other hand, newer AMD Epyc processors can (depending on the motherboard) use only 48 lanes to talk to the other socket, freeing another 16 lanes per socket (for a total of 160 PCIe lanes) at the cost of slower inter-socket communication; and they also have an extra single PCIe lane per socket to be used to connect to a BMC.

For people that aren't reading the whole article, I want to make it clear that "slower" is only relative to other newer chips. The lanes on the old chips were half as fast, so even 48 lanes beats them by a factor of 1.5x.


Oh wow! I didn't know that. I'm assuming this is for DMA between the two sockets (and potentially cross-socket PCIe device routing)?

I still remember building a two socket Xeon workstation some years ago and being puzzled that video wasn't initializing. Turns out one CPU wasn't quite seated correctly and the GPU was in a slot wired to it. I wonder if this architecture avoids that?


> I'm assuming this is for DMA between the two sockets (and potentially) cross-socket PCIe device routing?

It's more than just that. The memory is attached directly to each socket, so for instance you could have 64GB of memory on each socket for a total of 128GB; for a core in one socket to access memory which happens to be attached to the other socket, it has to go through these inter-socket links. More than that, for a core in one socket to access memory in the same socket but which has been cached somewhere in the other socket, the cache coherence traffic has to go through these inter-socket links.

> Turns out one CPU wasn't quite seated correctly and the GPU was in a slot wired to it. I wonder if this architecture avoids that?

No, each PCIe link is wired to only one of the CPU sockets (PCIe is point-to-point, not a bus like classic PCI), or to an auxiliary chip which is wired to only one of the CPU sockets; if that CPU is not seated correctly, what you saw could happen. The architecture which avoids that is the older one in which all CPUs were wired together in a bus, with PCI and memory attached to an auxiliary chip also on that bus.
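
As an aside, on Linux you can check which socket a given PCIe device hangs off via the "numa_node" attribute in sysfs, which helps when pinning work near the hardware it talks to. A small sketch (the device address is a placeholder; -1 means the platform didn't report a node):

  /* Sketch: print which NUMA node a PCIe device is attached to, using the
     "numa_node" attribute Linux exposes in sysfs. */
  #include <stdio.h>

  int main(void)
  {
      const char *dev = "0000:41:00.0";    /* placeholder BDF address */
      char path[128];
      int node = -1;

      snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/numa_node", dev);
      FILE *f = fopen(path, "r");
      if (f) {
          if (fscanf(f, "%d", &node) != 1)
              node = -1;
          fclose(f);
      }
      printf("%s is on NUMA node %d\n", dev, node);
      return 0;
  }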


It’s for cache coherency between the two sockets, in addition to remote device (eg, PCIe card local to the remote socket) and memory access.


PCIe switches are a thing, and devices don't need all their lanes if you're okay with less max bandwidth.

It ends up being the same as any other interface: there's a limit to how many devices can be connected before the limit for non-blocking simultaneous bandwidth is reached, but that merely means that bandwidth can be a bottleneck if you go above that.


Below switches, which on the high end can have their own root complex, the more common thing is lane bifurcation and multiplexing.

This holds especially true for consumer motherboards and PCIe breakout boards.


Speaking from experience: bifurcation is very uncommon and support is absolutely terrible, with most motherboards lacking the feature or having it be broken. Even when it is supported it's usually a pain...

Multiplexers just let you switch between connected devices - they won't let you use both simultaneously. It's hot-swapping without having to physically change devices.


Bifurcation is much better supported these days due to the prevalence of mining.

And yes, multiplexing is basically “time sharing” and the PLX chips operate on the physical layer only. PCIe switches have an internal bus and buffers and actually decode the packets to know where to send them to; they can also mediate between different versions of PCIe connected to the same switch, so connecting a PCIe 2.0 device to a switch would not impact other 3.0/4.0 devices, whilst a multiplexer would always operate at the lowest “speed”.

But PLX chips were pretty much the only thing you could get on consumer motherboards, at least back when multi-GPU setups were common and SLI/XFire motherboards were a thing (that said, PLX did help quite a bit more with XFire setups than SLI due to the lack of an external cross bridge between GPUs in later XFire revisions).

The chipset on your motherboard can also provide its own PCIe lanes; however, at least on Intel chipsets it's not a classical PCIe switch, it's closer to the silicon in the CPU that runs the root complex.


Didn't realize PCIe gen 6 was this close. Seems like they should have skipped a few generations, like CAT8 did, instead of making new versions so close to each other.


Each upgrade makes it harder to make a chip IP/motherboard that is in spec due to the increasing frequency and signal integrity requirements.

There's a reason gen4 took so long to roll out - it has been finalized for 4 years (with the manufacturers working on support for much longer), and it's still not ubiquitous. I'm sure you'd rather have a gen4 or gen5 board when gen6 comes out, rather than be stuck on gen3.

(Plus, for gaming loads there are diminishing returns - outside reducing texture upload times, GPUs in gaming scenarios do not use a lot of PCIe bandwidth.)


I wouldn't really hold ethernet cables up as a model of skipping unnecessary generations. There's 6, then there's 7, then they came out with 6A and 7A, and 7 is unpopular and 7A unheard of, and then there are apparently two kinds of 8 with different plug and shielding requirements?


There was a 7-year gap between 3 and 4, and they're catching up.


Also consider that you need the CPU to be compatible, and this is where the innovation struggles to be implemented in real life.


It seems to me there's a lot of overhead in this protocol. How much more performance could a new-from-scratch protocol achieve?


Indeed there is, and you would be correct in that assumption! Take a look at CXL. It's currently optimized for NVMe, Ethernet, or co-processors (GPUs.) It uses the PCIe Gen 5.0 physical layer (but then again most SerDes IPs are general purpose and go into all different types of controllers.) It's essentially an alternate protocol that is negotiated in PCIe link up.

What's nice is that it still keeps everything good about PCIe - like the electrical and mechanical spec, but does away with the old legacy things that nobody likes about PCIe. However, I'm not super familiar with CXL, so maybe it just looks nice from far away.



