Okay, that is just stupidly wide bandwidth :-). As a systems guy I really don't see terabytes per second of main-memory bandwidth coming in the next 5 years. GDDR6, the new "leading edge", is nominally 768GB/s, which sounds great, but how many transactions per second can you push through a CPU/GPU memory controller? Are we going to see 1152-bit-wide memory buses? (1024 bits w/ECC) That is over 1/3 of all the pins on a high-end Xeon, and you're going to need probably another 512 pins worth of ground return. HBM soldered to the top of the die? Perhaps, that is where GPUs seem to be headed, but my oh my.
I'm sure on the plus side there will be a huge resale market for "combination computer and room heater" :-)
Wow, okay then. I wonder if that sort of setup will be like the quad itanium motherboard setups that needed massive sustained airflow to keep running...
Storage cards for many fast NVMe/SSD drives in hot-swap bays? I haven't done much with this, but from the outside it looked like the PCIe slot would be the real limiting factor if you got a bunch of NVMe-based drives that could each pretty much saturate the connection.
Databases running in RAM will benefit enormously; just running an ISAM DB on a RAM drive massively increases performance. RDBMSs run a lot of stuff in RAM as well, so all in all I think we will see some decent DB performance gains.
Maybe I'm missing something but how does a newer version of the PCI-e spec directly lead to enormous benefits for a database that's running entirely in RAM?
In fact, on the surface, I would think that "traditional" databases (i.e., those residing on actual storage devices) would be more likely to benefit enormously as a result of this "upgrade" -- because of the massive increase in available bandwidth for storage and networking, of course.
--
(I'll certainly admit that I don't follow or keep up with developments in new generations / iterations of Intel processors / hardware architectures, but I'm assuming that Intel's Xeon CPUs still access RAM via their own memory controller. Intel has definitely done (more than) their fair share of stupid shit over the years, but I'm guessing they haven't recently decided that, now, the proper place for system memory is on the PCI-e bus!)
It's quite possible that in the near future, memory expansion will happen mostly over the PCI-e bus.
Vendors are moving towards bringing the main ram closer, to increase bandwidth, reduce power and reduce latency. However, this will necessarily mean that the amount of ram a CPU has will be fixed -- there will be no more dimm slots.
So what if you need more ram than the market provides on the top-end cpu? Samsung has your back, with CXL.memory. [0]
This might first see use cases in the semi-high-end enterprise space, where there are PCI peripherals and SoCs that can spend that number of pins for the speedup provided but can't afford the more exotic transports.
Stupid question from me: how the hell do they prototype and debug this? It's very far from my EE knowledge. Communication analysers will show the eye diagram and tell you a lot about stray capacitance and inductance, but I know that most 10GSa/s+ devices cost more than a house and have a tendency to live in a lab for a long time. It seems to my mind like there are so many difficulties in driving pci-e fast, from timing to voltage drop and current limitations. All of this must be implemented, somehow, if the spec is finalised. Right? They wouldn't write it up otherwise (right?!).
So -- how do you prove that your infrastructure can blow the socks off everything without building two of them and testing?
Any communication protocol has functional layers to them, and at the very bottom is what's called the Phy layer (or physical layer.) While pretty much every layer can be implemented in digital logic, the physical layer has a lot of analog circuitry. (It's programmable analog circuitry, so it's technically mixed signal.) Most wired communication protocols that you are familiar with (including SAS/SATA, USB, HDMI/DP, PCIe/NVMe, and Ethernet) all use what's known as a "SerDes Phy" which is short for Serializer Deserializer. It takes parallel data in, and serializes it to transmit data, and does the reverse when receiving data. In the example of PCIe Gen 4, it takes 32 bits of data in at 500 Mb/s (per bit) and serializes it to 16 Gb/s.
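To make the rate math concrete, here's a toy model of that 32:1 serialization step using the Gen 4 numbers above -- purely illustrative; real Phys are mixed-signal circuits, and the LSB-first bit ordering here is just an assumption:

    # Toy 32:1 serializer model using the PCIe Gen 4 figures quoted above.
    # Real SerDes Phys are analog/mixed-signal; this only illustrates the
    # rate arithmetic and the parallel-in/serial-out idea.
    PARALLEL_WIDTH = 32                          # bits presented per parallel clock
    PARALLEL_RATE = 500e6                        # each parallel bit runs at 500 Mb/s
    LINE_RATE = PARALLEL_WIDTH * PARALLEL_RATE   # 16 Gb/s on the wire

    def serialize(words):
        """Flatten 32-bit parallel words into one serial bit stream (LSB first)."""
        for word in words:
            for i in range(PARALLEL_WIDTH):
                yield (word >> i) & 1

    print(f"line rate = {LINE_RATE / 1e9:.0f} Gb/s")   # -> 16 Gb/s
    print(list(serialize([0b1011]))[:8])               # -> [1, 1, 0, 1, 0, 0, 0, 0]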
Because the SerDes Phy is such a common part of all communication protocols, it's often reused in other designs. For example - a GPU may use the same SerDes Phy design for its PCIe lanes as it does for its HDMI ports (at least on the transmit side.) They are however programmed differently.
Because these designs see so much reuse in so many different chips, there are companies that specialize in designing SerDes Phys, and their development is usually ahead of protocol development. Though in today's day and age, it's super complicated, and consists of way more than shift registers and a PLL. But there are SerDes designs that exist in products that are capable of 112Gb/s (not sure if 224Gb/s exists outside of test chips yet.)
You can buy modern FPGAs that have PCIe Gen 5, but with a SerDes Phy capable of going way faster. And if you want, you could bypass that hard block and talk straight to the SerDes, and program your own PCIe 6.0 controller.
But to answer your question on how they prototype it and test it, they run computer simulations that are exceedingly complicated and slow, and once a test chip comes back from the fab, eye diagrams are usually collected using what's known as a repeating signal scope. Since it's insanely difficult to sample a 112Gb/s signal at any reasonable resolution, it's sampled statistically (in time) over billions of samples, and plotted by knowing the frequency of the target signal. An eye diagram is what comes out of a traditional NRZ (non return to zero) where each cycle represents either a 1 or a 0. In an eye diagram, you see all permutations of bit transitions overlayed on each other, since you're sampling billions of cycles statistically. This is important because you can see other anomalies of your signal like jitter and rise/fall times.
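For anyone curious what "sampled statistically" means in practice, here's a minimal numpy sketch of the accumulation idea (an illustration only -- the waveform model is trivial, with no edges or realistic jitter, and none of this reflects how any actual scope is built):

    import numpy as np

    # Statistical eye-diagram accumulation sketch. Samples are taken at
    # arbitrary instants, folded back into one unit interval using the known
    # symbol rate, and binned into a 2D histogram; the eye is the density plot.
    UI = 62.5e-12                                  # 16 Gb/s -> 62.5 ps per bit
    rng = np.random.default_rng(0)
    pattern = np.array([0, 1, 1, 0, 1, 0, 0, 1])   # known repeating test pattern

    def sample_once():
        t = rng.uniform(0, len(pattern) * UI)      # random sampling instant
        bit = pattern[int(t // UI) % len(pattern)]
        v = bit + rng.normal(0, 0.05)              # ideal level plus noise
        return (t + rng.normal(0, 1e-12)) % UI, v  # fold time into one UI (+ jitter)

    hist = np.zeros((64, 64))                      # voltage bins x time bins
    for _ in range(200_000):
        t, v = sample_once()
        ti = int(t / UI * 63)
        vi = int(np.clip((v + 0.25) / 1.5, 0, 1) * 63)
        hist[vi, ti] += 1                          # dense regions trace out the "eye"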
There's a lot more to it, and this comment is already getting to be a wall of text but definitely feel free to ask more questions!
Do you happen to know why 10Gb Ethernet is still so expensive when PCIe and USB involve much higher bitrate PHYs by the dozen? Is it just a business thing? Or do 10GbE PHYs do something funky that has prevented them from sliding down the price curve, even after all this time?
You can get 10 Gb networking for fairly affordable prices with say, Mikrotik hardware.
I think the reason why it costs a lot is that packets are still tiny, and ports are many. A quality 16 port, 10 Gb switch is expected to have no internal bottleneck, able to handle every port being used to full capacity at once. And that works out to 320 Gbps bandwidth in the whole switch.
Then there's that thanks to the Internet, we're still stuck with 1500 byte packets, which means best case we're dealing with 812,744 pps, and worst case, we're dealing with 14,880,952 pps. And that's on one port.
With the latter number, your time budget per packet is 67ns. And if you start looking, context switches, cache misses, locks, etc all have times also measured in nanoseconds.
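A quick back-of-the-envelope check on those numbers, assuming standard Ethernet framing (18 bytes of header/FCS, 64-byte minimum frame, 20 bytes of preamble and inter-frame gap):

    # Packets per second and per-packet time budget on a 10 Gb/s link.
    LINK_BPS = 10e9

    def pps(payload_bytes):
        # bytes on the wire: payload + 18 B header/FCS (min frame 64 B) + 20 B preamble/IFG
        frame_bytes = max(payload_bytes + 18, 64) + 20
        return LINK_BPS / (frame_bytes * 8)

    for payload in (1500, 46):                     # full-MTU vs minimum-size frames
        rate = pps(payload)
        print(f"{payload:>5} B payload: {rate:>12,.0f} pps, {1e9 / rate:6.1f} ns per packet")
    # -> 1500 B: ~812,744 pps (~1230 ns); 46 B: ~14,880,952 pps (~67 ns)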
It's getting very, very tight and requires extremely careful programming to actually make this work. And of course in modern times, we have firewalls, NAT, statistics, routing, etc all of which cost precious time out of this budget, and which are all user-configurable, so it's something that's difficult to just shove into a purpose-built chip to take that load off the CPU.
PCIe has the advantage that it doesn't have to deal with half the crazy stuff networking involves.
PCIe is a packet switched network with point-to-point routing, firewalls, and link aggregation. Everything is done on dedicated silicon because it simply can't be done in software -- forget having a cache miss, the PCIe bus itself might be the mechanism to serve the cache miss. Or ship a bunch of 400Gbps ethernet traffic... speaking of which, picking a bandwidth fight with PCI express is probably not the best move ever :)
> 16 port, 10 Gb ... 320 Gbps bandwidth in the whole switch
That's a nice switch. Expensive, but nice. Let's do the math on a modern low-end consumer CPU:
PCIe: 20x 32Gb/s x2 = 1280Gbps
Chipset: 8x 16Gb/s x2 = 256Gbps
Display: 4x 4x 12Gb/s = 192Gbps
...and if we compared the switch to a CPU in its own price bracket, this would go from silly to ludicrous.
Look, I'm not asking for a battery of 400GbE ports on every laptop, but USB can deliver 10Gb/s across a few meters for a few dollars while Ethernet wants a few hundred dollars to ship the same bits over the same distance, and that's unfortunate.
People keep trying to make excuses for Ethernet, but I don't think it deserves their efforts. I think something's broken in the consumer networking industry -- it feels like they gave up on consumers entirely after 1GbE, and I think we could all benefit if someone breathed some life back into that corner of the market.
Distance and signal integrity would be my guess. The longer you transmit the signal, the more complicated and expensive it is to maintain signal integrity. There are out-of-spec USB cables that manage to get longer distance by converting to fiber optic (meaning it's an active cable) and even then it's extremely expensive for not that much cable.
Intense preemphasis and equalization are table stakes for high speed serial these days. Have been since PCIe3. By Gen5, the fundamental lobe goes to 32GHz, and the mandates are steep. It has to be able to fish the signal up after the dielectrics have eaten all but a few parts per thousand of the high frequency energy! Compared to modern transceivers -- these new ones or even, if you'll indulge my speculation, the transceivers in your own computer that are going unused by the dozen -- compared to those, dual simplex 10Gb/s just looks... old. The price doesn't really match up, and I wonder if it's a business thing or if there is an unfortunate design choice that baked in some manufacturing difficulty.
I don't know anything about what you just said, as I don't have that level of digital/analog design knowledge to know how to respond. All I know is that USB 3.0 cable length tops out around 18 meters, but to get there the cable becomes incredibly thick, heavy and inflexible. ePCIe cables are measured in cm. Ethernet can go up to 100m at 10GbE, or more if you're willing to sacrifice some speed (or faster if you sacrifice some cable length).
Could switching to PCIe instead of Ethernet for the digital signaling have better performance? I don't know, but I doubt it. Ethernet is hitting 100Gbps and higher, and I suspect designing cabling that can carry that is just as much of a challenge as the analog design piece. Not to mention that USB 3.2 and Ethernet have very similar speeds; it's just that Ethernet can manage it over much longer distances at significantly lower cost. I'm skeptical PCIe has some kind of magic bullet here. These protocols are optimized for totally different use cases.
As someone without a background in EE I find it fascinating that we are able to build hardware that operates on data at such high speeds that it needs to be debugged using statistical analysis.
How is it even possible to have correctly functioning circuits at such high speeds? I'm guessing at the very bottom of the Phy layer there are transistors switching at incredibly fast speeds. I understand that these transistors have parameters such as delay, rise, and fall time. Things most certainly don't happen instantaneously; a few logic ICs I played with also have a propagation delay specified in their datasheets of a few nanoseconds.
Take the deserializer as an example, running at 16Gb/s, which if I understand correctly equates to 1 bit every 62.5ps (0.0625ns). How does any circuit affected by the parameters I just talked about handle signals at such high speeds?
(Please go easy on me, I know some of my assumptions are probably wrong or oversimplified but I would love to learn more)
Select 30nm CMOS processes already had fT (frequency at which transistor no longer has gain) of 300GHz, so the component-level limits are probably flirting with THz by now. Of course, you need to knock off an order of magnitude to build useful comparators/etc out of those, and that gives you... 100GHz, right about what we see on those high speed SerDes :)
One more thing: CMOS historically wasn't the fastest logic family. Bipolar ECL was faster, but it burned a ton of idle current, so it ran into power limitation long before CMOS.
In either case, thermal limits prevent you from using these speeds across the whole chip, but just because your thermal budget can't afford 10,000,000,000 speed demons doesn't mean it can't afford 1000 of them in a few SerDes :)
But yeah, decades of Moore's law have not gone to waste, and those discrete logic chips are frozen in time. It's a neat reminder of how far we have come!
Is it the SerDes component that also does the word encoding (8b10, etc., forgot the proper terminology for this) ?
I work on the Hardware/Software interface quite a bit (more in the past than now) and this is always fascinating, I'm on the Software side, thanks for the details!
The encoding takes 8 bits and turns them into 10 for DC balance and to help the clock recovery system. You are basically losing 20% of your bandwidth to encoding overhead.
PCIe Gen 3 and above have switched to 128b/130b encoding for much less overhead.
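Rough per-lane payload rates under the two line codes, using the public GT/s figures and ignoring the packet-level overhead discussed further down:

    # Effective per-lane data rate after line coding only.
    # Gen 1/2 use 8b/10b (20% overhead); Gen 3+ use 128b/130b (~1.5%).
    GENS = {             # generation: (GT/s per lane, payload bits, coded bits)
        1: (2.5, 8, 10),
        2: (5.0, 8, 10),
        3: (8.0, 128, 130),
        4: (16.0, 128, 130),
        5: (32.0, 128, 130),
    }

    for gen, (gt, data, coded) in GENS.items():
        effective = gt * data / coded              # Gb/s of payload per lane
        overhead = (1 - data / coded) * 100
        print(f"Gen {gen}: {gt:5.1f} GT/s -> {effective:5.2f} Gb/s per lane "
              f"({overhead:4.1f}% coding overhead)")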
At a higher level, considering that all protocols are serial these days, how does a multi-lane protocol (PCIe x4 or x16, Ethernet x4) use the multiple lanes, is it multiple independent requests going on each lane?
PCIe "shreds" each packet across multiple lanes to reduce latency.
Generally, if you are wondering what PCIe does, the answer is usually "they do the hardest, fastest, highest-performance thing and they do it so well that nobody notices."
I'm not in EE, but I'd imagine that you start with x1 lane and work your way up.
Each PCIe lane is an independent SerDes. It's not like a parallel memory controller where the lines share a clock; when you have x16 lanes, you have 16 independent streams you're shoving to the device downstream.
We build a test chip with the serdes and some control logic. One wafer might give us 50 chips. We package them and build a board with two of the chips and have them talk to each other.
PCI SIG has regular validation events and beyond that the lab where I cut my teeth is well renowned for hosting plugfests for cutting edge technologies.
Yep, and the highest data rates are only really possible on the best processes (12nm, 7nm, 5nm) , which is a fun chunk of change to spend for a tape out.
Modern high end serdes have very complex signal mgmt and monitoring tools built in. You can detect the link quality, pull eye diagrams/etc from them.
So a lot of the reason for using some of these high end scopes/analyzers is bypassed by the testing/diagnostics functions on the serdes itself. And then of course there are a ton of signal tuning parameters which get programmed during link training.
Even modern analyzers have problems with this, and I've seen vendors that instead of even trying to reconstruct the signal at the endpoint use virtual transmission models and require one to probe at the source.
This is one of those areas that I bump into every couple years, and it's frequently quite educational.
I suspect it's heavily sampled, and maybe not only on the time axis. They would already have a reconstructed clock available, so it would be straightforward to sync up a time aperture, and the voltage aperture could be swept by adding an offset into the output comparator(s). They could then scan the ranges of both parameters to build up the eye diagram with minimal additional hardware. This would make it a brilliant oscilloscope if you want an eye diagram of a high speed serial bus, but for just about any other task it would be a bit rubbish. There isn't a UXR hiding in every PCIe 5+ lane.
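Roughly the sweep you'd do, sketched in software (the signal model and the numbers here are invented; real eye monitors live in the SerDes hardware, as mentioned elsewhere in the thread):

    import random

    # Sweep-based eye scan sketch: for each (sample phase, slicer threshold)
    # point, count how often the sliced sample disagrees with the known bit.
    # sample_at() is a stand-in for hardware we obviously don't have here.
    PATTERN = (1, 0, 1, 1, 0, 0, 1, 0)

    def expected_bit(i):
        return PATTERN[i % len(PATTERN)]

    def sample_at(i, phase):
        """Fake analog sample: ideal level plus noise that grows near the UI edges."""
        sigma = 0.05 + abs(phase - 0.5) * 0.3
        return expected_bit(i) + random.gauss(0, sigma)

    def error_rate(phase, threshold, n=2000):
        errors = sum((sample_at(i, phase) > threshold) != bool(expected_bit(i))
                     for i in range(n))
        return errors / n

    # Coarse 16x16 grid; the low-error region in the middle is the open eye.
    grid = [[error_rate(p / 15, t / 15) for p in range(16)] for t in range(16)]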
Haha, yes, I'd love to see what slathering vaseline on a high speed diff pair does to the eye. Link equalization is the real MVP! It'll be fun to see how this evolves with 32Gbps and PAM4.
You better make isolation a subpart of 'some signal conditioning' or you'll be looking at some expensive repairs. That's the hard part of making a scope: isolating the source and the scope guts while maintaining signal integrity.
How long does it usually take to get consumer products of new PCIe specs? Fast PCIe Gen 4 is only just getting affordable. Like $350 for 2 TB NVMe ssds.
Also, I remember playing around with PCI implementation on FPGAs over a decade ago and timing was already not easy. What goes into creating a PCIe Gen4/5 device these days? How can you actually achieve that when you're designing it? Are people just buying the chipsets from a handful of producers because it's unachievable for normal humans?
EDIT: What's inside the spec differences between say gen 3 and 6 that allows for so many more lanes to be available?
It really is something else. I used to run a scaled-out Exchange system circa 2008, and we spent a lot of time optimizing I/O. I think 150 IOPS was the target per-disk performance.
Only now (PCIe 4 and very recent controllers etc) are the very latest top-end NVMe drives hitting around 150k IOPS (which isn't stopping manufacturers from claiming ten times that; WD's NVMe drive tests at around 150-200k IOPS and yet they claim 1M) and only in ideal circumstances... reads and writes coming out of the SLC cache, which is typically under 30GB, often a lot smaller except in the highest-end drives.
Many drives that claim to reach that sort of performance are actually using Host Backed Cache, ie stealing RAM.
IOPS on SSDs drops precipitously once you exhaust any HBC, controller ram, SLC cache, mid-level MLC cache...and start having to hit the actual QLC/TLC. In the case of a very large database, a lot of IO would be outside cache (though certainly any index, transaction, logging, etc IO would likely be in cache.)
More than 500k IOPS was downright common for PCIe gen3 SSDs. 1M IOPS is totally achievable with one high-end gen4 SSD.
If you meant IOPS with a queue depth of one, or sustained write IOPS, then you need to specify those extra conditions before calling vendors liars.
No high-end drives implement the NVMe Host Memory Buffer feature, let alone rely on it for maximum performance.
You can buy an Enterprise QLC drive that is fully capable of 800k IOPS for reads. TLC drives that can do twice that are available. Those drives don't have SLC caching.
I would love to pick up from the 200k IOPS laptop quote and demo a RAM drive and then saturate the RAM drive into swapping - I don't know how you could do this on stock distros or Windows but it would make a great executive suite demo of the issues.
There aren't more lanes available. The generations are getting faster just by increasing the transfer clock rate, up to PCIe 5, and in PCIe 6 by increasing the number of bits per transfer. The way they doubled the speed every generation was pretty basic: the timing tolerances were chopped in half every time. The allowable clock phase noise in PCIe 4 is 200x less than in PCIe 1. The miracle of progress, etc.
That miracle is somewhat over. They're not going to be able to drive phase noise down below 1 femtosecond, so 6.0 changes tactics. They are now using a fancy encoding on the wire to double the number of bits per symbol. Eventually, it will look more like wifi-over-copper than like PCI. Ethernet faster than 1gbps has the same trend, for whatever it's worth.
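One way to see the change of tactics in numbers -- a sketch using only the public GT/s figures, with the assumption that everything before 6.0 is NRZ and 6.0 is PAM4:

    import math

    # Unit interval per generation, and how 6.0 doubles the bits without a
    # faster symbol clock by moving from NRZ (2 levels) to PAM4 (4 levels).
    GENS = {1: 2.5, 2: 5.0, 3: 8.0, 4: 16.0, 5: 32.0, 6: 64.0}   # GT/s per lane

    for gen, gt in GENS.items():
        levels = 4 if gen >= 6 else 2
        bits_per_symbol = math.log2(levels)
        symbol_rate = gt / bits_per_symbol          # Gbaud actually on the wire
        ui_ps = 1000.0 / symbol_rate                # unit interval in picoseconds
        print(f"Gen {gen}: {gt:5.1f} GT/s, {symbol_rate:5.1f} Gbaud, UI = {ui_ps:6.2f} ps")
    # Gen 5 and Gen 6 both run 32 Gbaud (UI ~31 ps); 6.0 doubles bits per symbol instead.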
Never. Power consumption for 10Gbps Ethernet on copper is huge compared to 1Gbps, so it is not sexy for laptops. Consumer switches with 10Gbps ports don't exist -- price per port is way too high versus 1Gbps. Even 2.5 & 5Gbps switches are rare and very expensive, so the 2.5Gbps Intel and Realtek LAN-on-motherboard chips are there, but not useful.
At the same time, 10 Gbps on fiber is low power and great speed, but the form factor of SFP+ is simply too big for any laptop of 2022. In theory USB-C 3.2 @ 10Gbps to SFP+ fiber adapters are possible, but fiber is not popular outside server rooms, so there is no market for it.
As I said, never. It is not a matter of PCIe speeds, but a technology one: copper is power hungry and requires new cabling, and fiber has no installed cabling base in homes.
10GBase-T power requirements are dropping every year as the MACs get more efficient. It's also of course dependent on cable length. Besides saving power, this has also enabled 10GBase-T SFP+ modules in recent years.
So, there probably isn't a good reason for not putting it in a laptop. Like wifi, just speed shift it based on load, etc. AKA see 802.3az
So, the old max power draw models from 15 years ago, don't really apply entirely.
Those Mikrotik SFP+ RJ 10Gbps modules get VERY hot. Idle temp is usually in the 60-70C range in my experience. They have guidance about not putting them next to each other in switch ports: https://wiki.mikrotik.com/wiki/S%2BRJ10_general_guidance#Gen...
I know about that option, but the price per port, including RJ45 transceiver, is huge compared to 1 Gbps ports. Also it is limited to 4 ports only - even for home use it is not enough, all the switches in my home (more than 3) are 8 ports and above.
Well, the idea is that you only wire trunks, wifi APs, and SANs w/ 10GbE, and light a select few wall jacks that matter.
~$120 per port (assuming 309s, the recommended interleaved spacing, and amortizing the switch cost over # ports) isn't very expensive if (a) you need the capability & (b) already have copper in the walls.
Drywall finishing alone will run you that much. :) But yeah, new builds should definitely be fiber.
> Power consumption for 10Gbps Ethernet on copper is huge compared to 1 Gbps, so it is not sexy for laptops.
How often do you have your laptop plugged into networking but not power? And if that's an important use, I could see manufacturers spicing it up with some power over ethernet to not only fix but reverse the problem.
Though it feels like ethernet ports on laptops are already unsexy at any speed. And you could have the 'balanced' power profile limit the port to 1Gbps on battery.
Many laptops have a thunderbolt port which serves a similar purpose. On TB4 I get 15gbps in practice, and I can bridge it to ethernet using either a dock or a PC (I use a mac mini with a 10g port to bridge TB to 10ge).
10Gb Ethernet requires at least CAT6 or CAT6a cabling. CAT5e is insufficient. For home use, that means that the newer 2.5Gb and 5Gb standards, which do work over 5e, are more reasonable for personal computers.
A USB4 peripheral interface is $15 and a 10g ethernet IC is like $80 at least, plus a little box, SFP+ thing, other physical items. Seems hard to get the price down that low and still make a profit. Plus the buyer still needs an expensive cable on both sides.
A 10G PCIe card is only $100 retail so $150 seems fair for an external version. Newer dongles will include a captive USB4 cable. And consumers use base-T not SFP+.
> How long does it usually take to get consumer products of new PCIe specs?
Like 2 years.
When PCIe 3.0 was getting popular, 4.0 was finalized. When 4.0 was getting popular, 5.0 was finalized. Now that PCIe 5.0 is coming out (2022, this year), PCIe 6.0 is finalized.
There was a much bigger gap between 3.0 and 4.0. PCIe 3.0 was available with Sandy or Ivy Bridge, so 2011/2012. PCIe 4.0 was introduced with Zen 2 in 2019.
We seem to be back to a faster cadence now however.
With the rise of GPU-compute, a lot of the supercomputers are playing around with faster I/O systems. IBM pushed OpenCAPI / NVLink with Nvidia, and I think that inspired the PCIe ecosystem to innovate.
PCIe standards are including more and more coherent-memory options. It seems like PCIe is trying to become more like Infinity Fabric (AMD) / UltraPath Interconnect (Intel).
One bus to rule them all! It would make some sense if it's fast enough. I'm not saying we will see memory in a PCIe form factor, though given how memory sizes keep growing, maybe down the line it goes that way.
> It seems like PCIe is trying to become more like Infinity Fabric (AMD) / UltraPath Interconnect (Intel).
So far CXL is pretty strictly a leader/follower type approach, not really suitable for a more symmetric relationship like you have in a multi-socket system. But maybe one day..
With sufficiently fast memory bandwidth, does dedicated GPU memory become irrelevant? AFAIK integrated graphics on laptops still share the main memory, but at what point does that become a standard setup on desktops too?
If anything, the higher-bandwidths of future PCIe-specs would allow the CPU to access the dedicated GPU VRAM at its full speed, rather than PCIe-limited speeds.
It is not just about getting a product out (i.e. a PCIe 6.0 SSD), but also the platform support (i.e. Intel / AMD motherboard support for PCIe 6.0).
Product launches are highly dependent on platform support. So far Intel and AMD don't have any concrete plans for PCIe 6.0, but I believe Amazon could be ahead of the pack with their Graviton platform. Although I am eager to see Netflix's Edge Appliance serving up to 800Gbps if not 1.6Tbps per box.
I recently bought a few Zen 2 and Zen 3 HPE servers, and found out only via trial and error that HPE sells Zen 2 servers without Gen4 motherboard support!
It seems they took the original Zen motherboards with Gen3 and just swapped out the CPU. Only the Zen 3 has a refreshed motherboard. Makes me now check things more carefully to be sure.
Same, but worse: my vendor sent me an outdated motherboard that officially didn't support the Zen 2 CPU, and it randomly wouldn't boot; they then sent me a patched BIOS so it would boot consistently.
We needed that machine yesterday, and it's seen maybe 2 minutes of downtime in the past couple years, so I couldn't send it back. My fault for not specifying I wanted the .v2 of the barebones when ordering it over the phone I guess..
I've not done PHY development personally, but these interfaces are called SerDes, short for Serializer/Deserializer. Outside of dedicated EQ hardware, everything on the chip is done in parallel, so nothing needs to run at a multi-GHz clock.
I think that these days there's a lot of convergence going on - everything is essentially serdes in some form - some chips just have N serdes lanes and let you config them for PCIe/ether/data/USB/etc as you need them, much as more traditional SoCs config GPIOs between a bunch of other functions like uarts/spi/i2c/i2s/pcm/...
> How long does it usually take to get consumer products of new PCIe specs?
Personally I'm expecting this spec to drive pcie 5.0 adoption into consumer space.
Tbh consumers don't need this throughput. But given that the consumer space has remained stuck around 20 lanes off the CPU (plus some for the chipset), the 5.0 and 6.0 specs will be great for those wanting to build systems with more peripherals. A 1x 16GBps link is useful for a lot.
I'd be leery of dismissing the potential consumer demand. That much throughput could be put to good use for a myriad of personal and business functions, and software tends to fill whatever hardware can provide. It's like every prediction about users not needing X amount of RAM or CPU or DPI or network or storage space.
Having that much throughput suggests paging and caching across multiple disks, or using giant models (ml or others) with precomputed lookups in lieu of real-time generation. At any rate, all it takes is a minor inconvenience to overcome and the niche will be exploited to capacity.
> Having that much throughput suggests paging and caching across multiple disks, or using giant models (ml or others) with precomputed lookups in lieu of real-time generation.
Those sound to me like situations where latency will be the bottleneck, even on something like PCIe 3.0 x4.
I can currently only name one consumer use for these super high speed data transfers, and that's loading 3d assets and textures from SSD to GPU in real time based on where you're looking.
Signal integrity is becoming a serious issue for these higher speed links. So it's a compromise between how many traces you need vs. the signal speed.
For the server market, GPU's and other accelerators, high speed NIC's and SSD's have an unending appetite for bandwidth, so there I expect a lot of interest. But I think we'll see an increasing gap between this and the consumer market, for better or worse.
PCIe Gen 5 is now available in the latest Intel desktop processors. There are very few lanes, so it can really only run a single GPU, but that covers a lot of the potential market.
Looks like the desktop 600-series chipset just supports Gen 3 & 4[1], and PCIe Gen 5 ports are only available on Alder Lake desktop[2], not mobile[3], processors.
Not strictly true. The only thing that got huge delay was 4.0, which was due to technical difficulties. 5.0 is by and large an extension of 4.0, and arrived roughly on schedule.
PCIe 6.0 was in research for a long time because they knew 5.0 would be the end of the way things were working previously.
I think PCIe bandwidth has suddenly become super relevant. It used to be that videogames were not bottlenecked by PCIe bandwidth, and GPUs were the most data intensive thing on the bus.
Nowadays there's all sorts of business cases that require high bandwidth, and the devices to fill them also exist. From NVMe arrays to crazy interconnects to hyper connected machine learning accelerators, PCIe is enabling it all.
Now if only PCIe had woken up two years earlier and spent a bit more time on development… then PCIe 4 would be the equivalent to this new PCIe 6 standard.
What a coincidence, my boss just asked me to buy the membership.
Keywords: PAM4 and 128GB/s per x16 link (full width; per my understanding, not 256GB/s as the title says), which means it can be used to make a 1Tbps NIC, or any traffic at that scale.
from my reading, we will see pcie6 products mid-2023.
I just spoke to a friend who, as a consultant, was tasked with verifying that a "motherboard" for a projector conformed to the HDMI specs. He'd gotten a bunch of measurements and numbers in a document, and asked if he couldn't get the full spec as it would be easier to understand what to do.
He was told they had the specification but it was at headquarters in Belgium, on a computer that was not connected to the internet. He'd have to fly there, view it on the computer and copy out what was needed... which was exactly what someone had done to make the document he'd gotten.
From what I gather the HDMI group wants to ensure something marked HDMI just works when plugged into another HDMI device. I guess this gate-keeping is part of it.
The DRM often conflated with HDMI (HDCP) existed years before the first HDMI cable was sold. Or were you saying DRM when it comes to a physical port specification? In which case, that wouldn't really be digital rights management.
Right, but for a device that's to be sold on the market, would you trust some random PDF from Google?
Whether it's a low-volume, expensive product like the one my friend got the job for, or a consumer product, having to do a recall and board replacement will not be a fun experience for those involved.
Yes it has that, but also no it's not the same. DDR5 has no link-level error correction. The faster and faster DDR5 bus is entirely unprotected on consumer devices.
The on chip ECC for DDR5 isn't because of faster memory; it's because of denser memory. It can rowhammer itself, and fox it up silently. And the cheaper brands can start shipping chips with defects like they do with flash, relying on the ECC to paper over it.
Reading the articles, I think the FEC is being used to protect link integrity -- it's different from ECC on DRAM, which also protects the contents (against things like row hammer and cosmic rays).
One question I always have about new PCIe standards is how much cruft is caused by the insistence on backwards compatibility with previous PCIe versions? If you were designing a brand-new protocol to do the same things that PCIe does without having to maintain backwards compatibility, what would that new, optimized protocol look like? How much more performant could it be relative to PCIe?
Check out CXL, it uses the same electrical standards but new protocols on top. There's a good servethehome video on it too. The end goal is to enable non-local memory pooling and other flexibilites to devices. It's a bit of the next evolution of things like DPUs and other datacenter mish-mashing of hardware.
FWIW, PCI-e has never had backwards compatibility with PCI. They're two vastly different protocols that were somewhat standardized by the same standards body. So, PCI-e has always been vastly more performant than PCI since 2004.
They're not very relevant nowadays, but PLX and others make bidirectional PCI-PCIe bridges that I think were mostly used so that companies could ship transitional products -- e.g., do a new board layout that accommodates the PCIe slot/power/whatever, but add the bridge IC to it so that you don't have to respin your super fancy custom ASIC.
> what would that new, optimized protocol look like?
Exactly like PCIe 6.0, but with a slightly simpler negotiation process and none of the fallback modes.
> How much more performant could it be relative to PCIe?
It would have the same performance, but be slightly cheaper to implement.
When two PCIe devices are connected to each other, they don't start in any data transfer mode, they start in a complex autonegotiation mode [0], where they find out the width of the link, the maximum protocol version that each side supports, and the maximum speed that the connection between them can support. They were forward-thinking enough to make sure that this process can support pretty much any kind of protocol that does packet switching over LVDS signaling. Since packet switching over LVDS is still the best game in town, they are not overly burdened by backwards compatibility, other than having to also carry all the previous versions with them in the interface controllers. 6.0 changed to a completely different kind of packet system to support FEC, and no-one other than the implementors of the interface controllers has to care.
Because PCIe is so flexible, it won't be unseated until there is some radically new kind of signaling that is better than LVDS on copper, say some kind of optical link. And even then, the successor is almost certain to be a variant of PCIe, just with a new physical layer.
[0]: PCIe LTSSM. I sadly cannot link the spec here because it requires registration but have the diagram someone kindly ripped for SO: https://i.stack.imgur.com/QYyCM.png
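In spirit, the speed/width part of that negotiation boils down to something like the cartoon below (grossly simplified -- the real LTSSM has many more states and starts at Gen 1 rates before retraining upward; this is not code from any spec):

    from dataclasses import dataclass

    # Cartoon of PCIe link negotiation: each end advertises what it supports,
    # and the link comes up at the highest common speed and widest common width.
    @dataclass
    class Port:
        max_gen: int      # highest data-rate generation supported
        max_width: int    # widest link (lane count) supported

    def train(a: Port, b: Port):
        return min(a.max_gen, b.max_gen), min(a.max_width, b.max_width)

    host = Port(max_gen=6, max_width=16)
    old_device = Port(max_gen=3, max_width=4)
    print(train(host, old_device))    # -> (3, 4): a Gen 6 host still talks Gen 3 x4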
The problem with PCIe is not bandwidth, it is the limit on lanes in consumer PCs: 20 lanes from the CPU and a few more from the southbridge is not enough when the GPU alone is usually linked on a 16-lane connection. The easy way out is to reduce GPU lanes to 8; that leaves plenty of bandwidth for NVMe SSDs and maybe for 10 or 25Gbps NICs (it's about time).
For servers it is a different story, but the recent fast move from PCIe ver 3 to ver 5 improved the situation 4x, doubling again is nice, but it does not seem that much of a deal. Maybe moving NICs from the usual 8 lane to a lot less (8 lanes of ver 3 means 2 lanes of ver 5 or a single lane of ver 6) will also make some difference.
But doubling the bandwidth per lane allows one to use half as many lanes to a GPU and maintain bandwidth. As you mention, it allows an eight lane GPU to be a viable option. And better yet, due to how PCIe handles variable number of lanes between host and device, different users with the same CPU, GPU, and even motherboard can choose to run the GPU at eight lane with a couple of four-lane SSDs, or at sixteen lane for even more bandwidth if they don't need that bandwidth elsewhere.
An 8-lane GPU has been viable for a long time (benchmarks of PCIe 8x versus 16x show about a 5% perf difference), but it did not change the physical layout the motherboard manufacturers use; you cannot use the existing lanes any way you want, on some motherboards you cannot even split them the way you want between physical connectors, and video card manufacturers continue to push 16x everywhere.
It's an AMD problem as well. It's an absolute nightmare trying to research a computer today. What ports can you use in what circumstances? Which slots go to the CPU directly and which go to a chipset?
Which lanes are disabled if you use nvme-slot 2. Which slot has which generation etc. A proper nightmare.
And while we are at it, dedicating pci-lanes to nvme-slots must be one of the most boneheaded decisions in modern computers. Just use a pci-card with up to four nvme-slots on it instead.
Maybe it's because I bought a "gaming" motherboard, but the manual was pretty clear (to my understanding at least) as to what configuration of M.2 drives and PCIe lanes would run at what version, what went to the CPU and what went to the chipset.
The issue is cross comparison, not reading from the manual of a single board. In the past you could look at the physical motherboard layout and could compare it with another motherboard. Today, it's merely a suggestion. You take the physical layout with a huge grain of salt because it's nothing but lies. Which means going to the website of every single potential motherboard you wish to purchase, downloading their PDF, and digging through enough caveat fine print documentation to make your eyes bleed.
It's not just segmentation. Laptop buyers are not going to pay for 64 lanes. A regular Intel SKU of the 12th generation has 28 PCIe 4.0/5.0 lanes. A Xeon has 64, does not have 5.0, and costs way more, partly because it has 4189 pins on the bottom, which is insane.
> The problem with PCIe is not bandwidth, it is the limit in lanes on consumer PCs... For servers it is a different story
EPYC/Threadripper is awesome but also explicitly not the segment being discussed. But since it was mentioned it's 64 PCIe lanes per socket on Intel (up to 8 sockets/512 lanes) instead of 128 lanes for single socket/up to 160 lanes for dual socket on Epyc. Gonna cost you a damn arm and leg and you better know your NUMA though.
In the consumer segment AMD has 16 PCIe lanes for graphics and up to 4 PCIe lanes for NVMe drives; the chipset link is an additional x4 PCIe instead of a different interconnect. This comes to 20 PCIe lanes for user devices and 4 lanes worth of PCIe bandwidth off the chipset. https://i.imgur.com/8Aug02l.png
Intel offers something similar, 16 PCIe lanes for graphics and 4 PCIe lanes for NVMe. They opt for a proprietary DMI connection to the chipset, which is equivalent to 8 lanes of PCIe 3.0 bandwidth or 4 lanes of PCIe 4.0 bandwidth. https://i.pcmag.com/imagery/reviews/070BdprI2Ik2Ecd2wzo0Asi-...
Each offers splitting the x16 for the GPU into 2 x8 as well as oversubscribing the downstream PCIe bandwidth from the chipset, devices through the chipset on each obviously hit different latency penalties than CPU direct lanes as well. In the end the offerings are both pretty much identical in the consumer space.
With the bandwidth you do get more flexibility though, and with pci express bifurcation you can get the bandwidth equivalent of 4 pcie4 16x from a single pcie6 16x.
And that's great since the vast majority don't need that much bandwidth anyway.
Today you typically have a whole slew of devices sharing 4x to the cpu. More bandwidth would open up for more usb and perhaps cheaper onboard 10gig Ethernet etc.
The term gigatransfers per second is essentially Gb/s, so 2 bits x 32 billion per second is 64 billion per second. This is done using PAM4, which stands for Pulse Amplitude Modulation, and 4 is the number of valid voltage levels. Normally we use what's called NRZ or Non Return to Zero, where there are only two voltage levels: 0 and 1.
They did pull something cheeky though -- the 256 GB/s is for a PCIe x32 connection. It exists in the spec, but I've literally never seen it in the wild.
> The term gigatransfers per second is essentially Gb/s
It's not really that either because PCIe 2.0 was "5GT/s" despite being only 4 gigabits per second.
You could make some kind of argument about pre- and post-encoding bits, but that still falls down in other circumstances. Gigabit ethernet has five voltage levels per lane, 125 milllion times a second. What's its MT/s if the answer isn't 125?
> It's not really that either because PCIe 2.0 was "5GT/s" despite being only 4 gigabits per second.
Sort of, depends on where you're measuring. PCIe 2.0 runs at 5GT/s, which is the speed that the Phy layer runs at, but it uses an 8b/10b encoding, so the Data Link layer sees about 4Gb/s. For Gen 3.0 and 4.0 (and I think 5.0) the Phy layer uses a 128b/130b encoding, which has much less overhead. So technically it's about 63Gb/s but we round up. But that's kind of a useless measurement because there's additional overhead from Acks/Nacks on the Data Link Layer and TLP headers on the transaction layer (which depends on the size of your TLPs, and whether or not you're using infinite credits.)
edit: I'm not sure which encoding the Phy layer on PCIe Gen 6.0 uses, since it's PAM4. The (approximately) 63 Gb/s on the data link layer assumes 128b/130b encoding.
> What makes you say that? The chart says x16.
Maybe that metric includes both directions. Either way, it's misleading, as you only get 128 GB/s (before protocol overhead) on PCIe 6.0 x16: (64Gb/s * 16 lanes) / 8 bits per byte = 128 GB/s.
They're definitely considering both directions for the total bandwidth number.
But I don't think that's what's happening with the gigatransfers. When they launched 5.0, they were clearly counting one differential pair of pins: "Delivers 32 GT/s raw bit rate and up to 128 GB/s via x16 configuration"
But my point is, even though they were already doing that, they were claiming 32GT/s last generation. But it's 64GT/s now. I don't think that's counting more pins, I think they're saying that each transfer is two transfers.
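For concreteness, here's the arithmetic both headline numbers come from (raw rates, ignoring coding and packet overhead):

    # PCIe 6.0 x16 headline bandwidth, per direction and counting both directions.
    GT_PER_LANE = 64                 # 64 GT/s per lane (32 Gbaud PAM4, 2 bits/symbol)
    LANES = 16

    per_direction_GBps = GT_PER_LANE * LANES / 8     # 128 GB/s each way
    both_directions_GBps = per_direction_GBps * 2    # 256 GB/s, the marketing figure
    print(per_direction_GBps, both_directions_GBps)  # -> 128.0 256.0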
I'd be surprised if any of this mattered for them, since their workload (at least the "copy movie files from disk to network" part of it) is embarrassingly parallel.
Unless they're really squeezed on power or rack space budget, I would imagine they'd do just fine being a generation back from the bleeding edge.
They are doing 400Gbps per box already [1], and they have an 800Gbps network card waiting to be tested; with PCIe 6.0 they could do 1.6Tbps, assuming they somehow figure out how to overcome the CPU and memory bottlenecks. Sapphire Rapids with HBM3 would give 2TB/s :)
>Unless they're really squeezed on power or rack space budget
I think this is the case for their open connect appliances (or whatever they call them). They want to try to maximize throughput on a single device so they don't have to colocate so much equipment
The headline number is network I/O, but IIRC, the bottleneck is really in/around system RAM. Since most of the content isn't in cache and neither the disks nor the network cards have sufficient buffering, you need to have the disk write to system RAM and indicate readiness, then the network card reads from system RAM. NUMA bandwidth and latency can be an issue there too.
But, I don't think there was room on the pci-e lanes to do something like put a gpu in to use as a DMA buffer to get more ram bandwidth.
I think the next Epyc generation should have PCIe 5 and DDR5, both of which should help.
If an nvme drive supports the controller memory buffer (CMB) feature, an RNIC can do a peer to peer transfer.
From what I recall of the netflix storage node that was linked from HN a few months back, the current generation has 4 x 100 Gb mellanox ethernet ports (CX6, PCIe Gen 4) and somewhere around 20 to 30 PCIe gen 3 NVMe drives.
Assuming they can figure out how to do peer to peer transfers, scaling up by a factor of 4 doesn't seem implausible.
There are a lot of disks in a Netflix content appliance. The latest slide says 18 drives, each with PCIe 3.0 x4. There are 4 PCIe 4.0 x 16 nics, and there's all your PCIe lanes. You could get PCIe 4 drives, but it's not the bottleneck at the moment.
But, RAM needs at least twice the bandwidth as your network, because you can't have the NIC read from the disk directly, you need to have the disk DMA to ram, and the NIC DMA from ram, and (normal system) ram isn't dual ported, so reads and write contend. If you need to do TLS in software, you touch ram 4 times (disk read, cpu read, cpu write, nic read), so ram bandwidth is an even bigger bottleneck.
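A rough model of that accounting, with the touch counts from the paragraph above (illustrative only; kTLS/NIC TLS offload and peer-to-peer DMA change the multipliers):

    # DRAM traffic needed per unit of network traffic, for two serve paths:
    #   disk DMA in + NIC DMA out                    -> 2 RAM touches
    #   plus CPU read + CPU write for software TLS   -> 4 RAM touches
    TOUCHES = {"disk->RAM->NIC (TLS offloaded)": 2, "software TLS in the path": 4}

    for path, touches in TOUCHES.items():
        for net_gbps in (400, 800, 1600):
            ram_GBps = net_gbps * touches / 8        # Gb/s of traffic -> GB/s of DRAM
            print(f"{path:32s} @ {net_gbps:4d} Gb/s network "
                  f"-> ~{ram_GBps:6.0f} GB/s of RAM traffic")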
I thought with things like DirectStorage (the equivalent of GPUDirect) and SPDK this wasn't the case any more (no more CPU intervention, every device has its own programmable DMA engines?)
What use cases does this level of bandwidth open up? I'm not able to understand why the article thinks SSDs are one of them. PCIe already provides more than enough bandwidth for the fastest SSDs; am I missing some forward-looking advancement?
CXL puts PCIe in competition with the DDR bus. The bandwidth was already there (now doubly so), but CXL brings the latency. That's exciting because the DDR bus is tightly linked to a particular memory technology and its assumptions -- assumptions which have been showing a lot of stress for a long time. The latency profile of DRAM is really quite egregious, it drives a lot of CPU architecture decisions, and the DDR bus all but ensures this tight coupling. CXL opens it up for attack.
Expect a wave of wacky contenders: SRAM memory banks with ultra low worst-case latency compared to DRAM, low-reliability DRAM (not a good marketing name, I know) where you live with 10 nines of reliability instead of 20 or 30 and in exchange can run it a lot faster or cooler, instant-persistent memory that blurs the line between memory and storage, and so on.
Thanks, that is quite an interesting technology I wasn't aware of. Apparently Samsung already made a CXL RAM module for servers in 2021 (1). I wonder how Intel Optane would have been if it had used CXL (assuming it didn't).
Side note but AMD devices' (laptops/NUCs) lack of thunderbolt or pcie access is why I'm quite hesitant to buy a portable AMD device which is quite unfortunate. I really hope AMD/their partners can offer a solution soon now that thunderbolt is an open standard.
Plenty of people are betting that future Optane will use CXL; it makes perfect sense. Sapphire Rapids doesn't support CXL.mem so we may have to wait for 4th gen Optane on Granite Rapids.
Current SSDs, but there is literally nothing stopping people from putting PCIe switches in place and maxing out literally any PCIe link you can create.
The limit then becomes the amount of RAM (or LLC cache if you can keep it there) bandwidth in the machine unless one is doing PCIe PtP. There are plenty of applications where a large part of the work is simply moving data between a storage device and a network card.
But, returning to PtP, PCIe has been used for accelerator fabric for a few years now, so a pile of GPGPU's all talking to each other can also swamp any bandwidth limits put in place between them for certain applications.
Put the three together and you can see what is driving ever higher PCIe bandwidth requirements after PCIe was stuck at 3.0 for ~10 years.
From what I understand, internally an SSD's bandwidth can be easily scaled by spreading reads and writes across arbitrarily large numbers of NVRAM chips within the SSD.
So, you can just create SSDs that saturate whatever bus you connect them to.
In a sense then, it is the bus specification itself that limits SSD throughput.
You can make SSDs arbitrarily fast by just making them wider/more parallel. The reason it seems like PCIe 4 or 5 is "fast enough" is because the SSDs are co-designed to suit the host bus. If you have a faster host bus, someone will market a faster SSD.
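Back-of-the-envelope for that argument (the per-channel number is a made-up round figure for illustration, and the link rates are approximate raw x4 numbers):

    import math

    # How many parallel NAND channels it takes to fill a given host link.
    CHANNEL_READ_GBps = 1.2          # assumed sustained read per channel (illustrative)
    HOST_LINKS_GBps = {"PCIe 4.0 x4": 8, "PCIe 5.0 x4": 16, "PCIe 6.0 x4": 32}

    for link, link_GBps in HOST_LINKS_GBps.items():
        channels = math.ceil(link_GBps / CHANNEL_READ_GBps)
        print(f"{link}: ~{channels} channels to saturate ~{link_GBps} GB/s")
    # The controller just stripes across more channels as the host link gets faster.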
> PCIe already provides more than enough bandwidth for the fastest SSDs
Today, yes, and for a few tomorrows as well. But even when a standard is announced as finalized, it can be a long time (years even) until it makes its way onto motherboards in the consumer space. By which time, the current goalposts may start looking closer than expected.
I'm just glad they have one number -- no endless revisions and renaming of past releases -- and with that, thank you PCI-SIG.
Current top-of-the-line SSDs are close to maxing out 4 lanes of PCIe Gen 4. Gen 6 will make it a lot easier to find room for a few 100Gb/s Ethernet connections, which are always nice for faster server-to-server networking, as well as making it easier to use PCIe-only storage.
Internally passing around multiple 4k video outputs would use this. Maybe you don't want the output via the port on the back of the card but want to pass it through internally to some other peripheral? I think this is how thunderbolt ports work right (happy to be corrected)?
Good news for everyone and hopefully this will not interfere with the PCIe 5.0 rollout, even though the 4.0 one is not fully adopted by the market yet.
Not that I'd be able to use it, but it's a pity they make the spec available to members only ($4,000/year membership) or they sell it at ridiculous prices https://pcisig.com/specifications/order-form
I know that other specifications and even ISO standards are provided for a fee https://www.iso.org/store.html and perhaps something similar should be applied for open source software to avoid similar issues like faker.js and colors.js