Optical PCIe 7.0 connection hits 128 GT/s (tomshardware.com)
204 points by WithinReason 7 months ago | 133 comments



Marvell was recently showing off a PCIe 6.0 Alaska-P retimer that's good for PCB, copper cable, & optical interconnect. Doesn't include optical transceivers but does show the growing interest in optical PCIe. https://www.servethehome.com/marvell-extending-pcie-gen6-rea...

Right now optical seems exotic & expensive but we seem near a severe tipping point. Copper keeps facing increasingly challenging signal integrity problems, requiring expensive & energy-consuming retimers. Meanwhile we think we can keep scaling optical down, integrating silicon photonics, getting increasingly lower pJ/b energy costs, without the range & signal integrity issues. Not super duper deep but this 2 year old Cadence blog post goes into it, and it seems indeed to be where things are heading. https://community.cadence.com/cadence_blogs_8/b/breakfast-by...


How much of this article is written by the author instead of Cadence itself? The use of certain adjectives to describe how excellent some numbers are seems weird coming from a generic news editor. I don't know if I can trust this is as groundbreaking as it sounds.


Also, very sparse on technical details. Even basics are missing, like - what's optical signal like? CWDM? multiple polarizations? QAM? How many fibers?

I mean - one can build a 16-fiber, CWDM16 monstrosity that would transport an 8x 128GT/s PCIe link using a mere 8 Gbit/s per channel - all of which is technology that was "groundbreaking" circa 2005.
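A quick back-of-the-envelope check of that 8 Gbit/s figure, as a sketch only. It assumes the 16 fibers are split 8 per direction (the comment doesn't say), that CWDM16 means 16 wavelengths per fiber, and that 1 GT/s is treated as 1 Gbit/s of raw line rate with encoding overhead ignored. The function name is made up for illustration.

```python
def per_channel_gbps(lanes, gt_per_lane, fibers, wavelengths_per_fiber, duplex_split=True):
    """Raw Gbit/s each optical channel must carry in one direction."""
    # Raw one-direction line rate, treating 1 GT/s as 1 Gbit/s (no encoding overhead).
    total_gbps = lanes * gt_per_lane
    # Assumption: half the fibers transmit and half receive.
    fibers_per_direction = fibers // 2 if duplex_split else fibers
    channels = fibers_per_direction * wavelengths_per_fiber
    return total_gbps / channels


rate = per_channel_gbps(lanes=8, gt_per_lane=128, fibers=16, wavelengths_per_fiber=16)
print(f"{rate:.1f} Gbit/s per channel")
```

If all 16 fibers carried one direction instead (`duplex_split=False`), the per-channel rate would halve again.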


So I guess what this makes me wonder is: Why are we using electrical signals to connect the data lanes between components and computers these days, rather than moving everything to optical for data movement (obviously power would stay electrical, but that's already on separate lines)? I assume there's an element of cost, and once the photons get where they're going they have to be turned back into electrical signals to actually be used until such time as we get around to getting pure light based computers working (someday but not yet...), but that must not overwhelm the advantages or we wouldn't be looking at this being developed.


> I assume there's an element of cost, and once the photons get where they're going they have to be turned back into electrical signals to actually be used until such time as we get around to getting pure light based computers working (someday but not yet...)

You got it. We can't make optical transceivers as good as electrical ones. Not as small or power-efficient.

They require significantly different fabrication processes, and we don't know how to fab them into the same chip as electrical ones. I mean: you can either have photonics, or performant digital (or analog) electronics.

We've gotten really, really good at making small electronics, per the latest tech coming out of Intel & TSMC. We are... not that good at making photonics.


> They require significantly different fabrication processes, and we don't know how to fab them into the same chip as electrical ones.

There are actually a few commercial fabs that will monolithically integrate the photonics, analog electronics, and digital electronics, all in the same CMOS process. See for example GF’s process:

https://www.cmc.ca/globalfoundries-fotonix-45spclo/

Integrating good optical sources in silicon remains a challenge, but companies like Intel have mastered hybrid bonding and other packaging techniques. TSMC too has a strong silicon photonics effort.


This sounds like something that could help us become really good at photonics though. The major issue when trying to displace the good old silicon transistors is that we have invested so much in the tech and they are just so good that competing technologies with higher potential but starting at the bottom of the S curve are simply not competitive. PCIe is widely used, if it switches to photonics, despite the current shortcomings this is very encouraging for photonics.


> We can't make optical transceivers as good as electrical ones. Not as small or power-efficient.

I was under the impression that for 10Gb and above network transceivers, optical SFPs weren't getting as hot as copper ones. Is that difference related to something else?


There's a distinction between carrying high speed electrical signals across a PCB and carrying them over 30m or 100m (10GBASE-T range). Those "long reach" electrical transceivers are chock full of both analog and digital wizardry to both push and decipher what becomes quite a mushy signal, which is why it's so energy intensive.

You can also think about it another way: SFPs are also connected with high bandwidth electrical links; for 10GE that signal is a pure straight 10.3125 Gbaud. Yet the SFPs don't heat up as much. You can also look up 10Gbase-KR, which is "stretching those plain PCB signals as far as we possibly can", as well as DAC cables and their ranges.

State of the art [cf. https://www.xilinx.com/products/technology/high-speed-serial... ] for SERDES blocks (= what makes your short-range PCB electrical link) is ca. ≤ 150Gbaud at PAM4 (2 bits per baud), i.e. ca. 300Gbit/s, but you need error correction at that point. PCIe 7.0 pulls back to a safe (and cheaper to manufacture) 64Gbaud with PAM4 to get its 128GT/s.
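The baud-vs-bit-rate relationship used above can be sketched as follows. This only illustrates the arithmetic (symbol rate times bits per symbol); the function name is made up, and error-correction or encoding overhead is ignored.

```python
import math


def line_rate_gbps(gbaud, levels):
    """Raw bit rate for a given symbol rate and number of amplitude levels."""
    bits_per_symbol = math.log2(levels)  # PAM4 has 4 levels, so 2 bits per symbol
    return gbaud * bits_per_symbol


pcie7 = line_rate_gbps(64, 4)  # PCIe 7.0 lane: 64 Gbaud with PAM4
pcie5 = line_rate_gbps(32, 2)  # PCIe 5.0 lane: 32 Gbaud with NRZ (2 levels)
print(pcie7, pcie5)
```

So 64 Gbaud at PAM4 gives the 128 GT/s quoted for PCIe 7.0, while an NRZ link needs one symbol per bit.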


> Not as small or power-efficient.

I wonder what the latency for switching medium is these days too (for the super small transceivers). To my understanding optical is better for attenuation than electric (less noise, and thus easier to shove more frequencies and higher frequencies on the same pipe), and can be faster (both medium dependent, neither yet approaching the upper bound of c).

I'm imagining the latency incurred by the transceiver is eventually offset by the gains in the signal path (for signal paths relevant to circuit boards and ICs).


Depending on how you do the actual modulation, optical modulation does not add any significant latency (there can be no processing involved, and the extra transmission length you'd need for the modulation, i.e. electro-optic conversion, RF amplifiers..., is negligible).

The big issues are really: 1. Photonic waveguides are much larger than electronic ones (due to the wavelength). 2. You lose dynamic range in EO conversion (shot noise is significant at optical frequencies). 3. Co-integration of electronic and photonic components is nontrivial due to the different materials and processes. 4. Power efficiency of EO conversion is also not that great.

Where photonics shines is for transmission of high frequencies (i.e. a lot of data) over long distances and being immune to EM interference. So there is certainly a tradeoff in the transmission distance at which to go optical, and as data rates keep going up that tradeoff length has become shorter and shorter. Intel, Nvidia, AMD et al. all do research into optical interconnects.


I seem to recall information travels slower in fiber (optical) vs wires (electric), resp. ~2c/3 vs ~c, or am I remembering it wrong? Or is it a significantly different optical medium?

If so, does that matter at all here? Dunno if that holds up for such kind of devices and/or at these scales (much shorter distance, but also much higher speed).


IIRC the lower speed in fiber-optic cables has to do with the refractive index of the glass, and maybe some bouncing introduced by curvature.

I'm not sure what kind of refractive indices are possible in much smaller photonic circuits, particularly if it's not practical to develop and run everything in a permanent vacuum.


In general that's correct, however it's not quite as simple. For latency you need to consider group velocity (not phase velocity), which depends on the waveguide and frequency of the EM wave. I'm actually not quite sure what it is for electric waves, but for most photonic structures it's very similar to the phase velocity. The phase velocity is about c/1.5 for silica, but more like c/3.5-c/2 for most materials used in photonic integration.

I'm not an expert on integrated electronic circuits, but I guess the difference could matter depending on application.


I'm thankful you posted this and adjacent chain of responses and I enjoyed learning what you shared =)

Viva la HN


Photonic wavelengths are shorter than electronic wavelengths. Why are photonics waveguides bigger then?


Sorry, this is a somewhat handwavy explanation that glosses over many details, but the size of components in electronic circuits (i.e. transistors and connections) fundamentally depends on the de Broglie wavelength of the electrons (some tens/hundreds of pm), while the size of a photonic circuit depends on the wavelength of light (order of 1 µm). So one can make much more compact electronic circuits than photonic ones. Note that this is somewhat different for transmission lines, which are larger for electronics (depending on the RF carrier frequency) than for optics.


What’s an electronic wavelength in this context? What’s its size? Photonic ones I assume are in near-IR, on the order of a micrometer.


Optical fiber communication operates at 1.55 micrometers, or 193THz. Electronics operates in the radio part of the electromagnetic spectrum, at GHz. There might be no fixed size or wavelength in either case, but the shortest wavelength for radio signals that can be transmitted with acceptable loss with today’s technology is on the order of millimeters.

Which raises the question: why are operating wavelengths smaller but “waveguides” bigger in optical fiber communication? In fact, fiber itself is a waveguide and its diameter is tens of micrometers.
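The 1.55 micrometer / 193THz pairing is just wavelength = c / frequency. A small sketch of that conversion follows; the helper names are made up, and these are free-space wavelengths (inside glass or a PCB dielectric the wavelength is shorter).

```python
C_M_PER_S = 299_792_458  # speed of light in vacuum, m/s


def wavelength_m(freq_hz):
    """Free-space wavelength in meters for a given frequency."""
    return C_M_PER_S / freq_hz


def frequency_hz(wavelength):
    """Frequency in Hz for a given free-space wavelength in meters."""
    return C_M_PER_S / wavelength


optical = wavelength_m(193e12)  # telecom C-band carrier
radio = wavelength_m(10e9)      # a 10 GHz electrical signal
print(f"193 THz -> {optical * 1e6:.2f} um")
print(f"10 GHz  -> {radio * 1e3:.1f} mm")
```

A 10 GHz signal has a free-space wavelength of about 3 cm, roughly four orders of magnitude longer than the optical carrier.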


The full optical wave is contained in the dielectric conductor. This conductor needs a minimum cross section for the wave to propagate. If it is too small then the wave cannot propagate. There is also a maximum cross section if you want single mode operation.

You get to this result if you take the electromagnetic wave equation - a partial differential equation - and solve it for your transmission line configuration.

The proper analogy in the realm of electrical waveguides is the hollow waveguide. The hollow waveguide supports TE- and TM-modes but not TEM modes just like a dielectric conductor. The size is also a function of the dielectric constant ε.

What we mostly use are TEM waveguides like microstrips or coaxial cables. The difference between electrical waveguides that support TEM modes and waveguides that support TE/TM modes is that the former have two independent potential planes and the latter only one. Also, TEM waveguides do not have a lower cutoff frequency. A TEM wave with any frequency can propagate on any microstrip configuration.

This is not true for TE/TM waves.

What's important to understand is that for microstrips/coaxial cables the power isn't transferred in the metal but in the space (dielectric) around the metal - see Poynting vector. So what happens if you have a second conductor in that space? You get crosstalk! So TEM transmission lines do not contain the wave like hollow waveguides or optical fibers (edit: ok coaxial cables do, microstrips don't)

Now the question, how big is the microstrip? Is it just the width of the signal conductor? No, it is not.

Edit: The width of the metal lines in a chip is given by the current it must carry - current density requirement, electro-migration issues. Power lines are wide because they have to supply power to the circuit but logic traces in CMOS technology only carry negligible amount of current. In circuits like RF power amplifiers with bipolar transistors the trace width is much larger because it has to carry a much larger current. But again, microstrip lines do not have a lower cutoff frequency.


electrical connections don't require electrons fly all the way across dielectric materials


> To my understanding optical […] can be faster (both medium dependent) […]

The speed of light in optical fiber (for all types based on glass, ignoring minuscule differences) is 68% that of air/vacuum. And that's not changing, and no state-of-the-art high speed applications are being developed on plastic fibre or free air.

So, latency wise, on runs with non-negligible length, optical will lose out to electrical, which is generally quite close to the speed of light. (Except of course after some point the electrical signal is just noise, and if you factor in delay caused by amplifiers/repeaters it becomes a much harder question.)

There's this kinda-famous story of some HFT company pulling a copper cable across some bay, because they'd gain some nanoseconds compared to the fiber they had.

The transceiver latency for "long-range" links (well, call 100m long range for copper…) is actually worse for the copper links, as the whole DSP getup you need for that takes a few symbol times to process. Optical transceivers are just optodiodes and "simple" amplifiers, the latency is much less than a symbol.

(symbol = unit of transmission, roughly 1 bit on "old" fibre [≤ 25G lane rate], ca. 4 bit for 10Gbase-T, will vary more for faster connections.)


The speed of electrical waves in a chip is ~c/2.

The relative permittivity εr of SiO2 is ~4.

c = c0 / sqrt(εr)

c = 1 / sqrt(ε0εr × μ0μr), and in vacuum εr = μr = 1, which gives c0 = 1 / sqrt(ε0μ0).

But the frequency needs to be sufficiently high in order to observe wave propagation, let's say >10GHz.

For low frequencies the electric conductor behaves more like a RC chain.
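Plugging numbers into that formula, as a rough sketch: the helper names are made up, μr is assumed to be 1, and this is the idealized wave speed in a uniform dielectric, ignoring conductor geometry and losses.

```python
import math

C0_MM_PER_PS = 0.299792458  # speed of light in vacuum, mm per picosecond


def velocity_fraction(eps_r, mu_r=1.0):
    """Wave speed in a medium as a fraction of c0: 1 / sqrt(eps_r * mu_r)."""
    return 1.0 / math.sqrt(eps_r * mu_r)


def delay_ps_per_mm(eps_r):
    """Propagation delay per millimeter of line, in picoseconds."""
    return 1.0 / (C0_MM_PER_PS * velocity_fraction(eps_r))


sio2_fraction = velocity_fraction(4.0)  # SiO2, eps_r ~ 4
sio2_delay = delay_ps_per_mm(4.0)
print(f"v = {sio2_fraction:.2f} c0, delay = {sio2_delay:.2f} ps/mm")
```

With εr ~ 4 this gives c0/2, i.e. a bit under 7 ps per millimeter, which is slower than the 68% of c quoted for glass fiber elsewhere in the thread.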


Considering the PCIe context, I was assuming we were talking about off-chip connections, i.e. diff microstrips/striplines in FR4 and air. εr is 2.6-ish there.

But you're right, I might have accidentally mixed in some radio connection bits, with the HFT company anecdote.


There's single-mode hollow-core fibers - which constrain the light via non-refractive / non-dielectric physics magic, i.e. without slowing down, to the hollow core. Not yet commercially available, I think.


Nice! But even after they're commercially available, putting fibers into places has massive inertia, along the lines of ≥10 years… and on short ranges (inter-chip connections up to building/city networks) you really don't care :)


> I assume there's an element of cost,

This assumption is very correct. Optical interconnects are extraordinarily expensive relative to copper. We have the art of manufacturing copper PCBs and connectors mastered. Putting optical interconnects into a system requires that the signal go through transceivers at either end as well as external optical cables, which are not integrated into the PCB. It’s extra components and complexity everywhere.

The reason optical interconnects are being explored here is that next gen PCIe is so extremely fast that the signals cannot travel very far in PCBs without significant losses. PCBs built for these speeds require special, expensive materials on the layers with those signals. They might require retimer chips to amplify the signal past a certain distance. These limitations may not apply to consumer motherboards with a single GPU very near to the CPU, but datacenter motherboards might need to connect many GPUs across a large chassis. The distances involved could require multiple retimer chips along the way and very expensive PCBs. Going to optical interconnects could introduce much more flexibility into where the GPUs or other add in cards are located.


As a side note: while the O in OCuLink does stand for optical, that variant is not in use. Every OCuLink connector and cable is just an ordinary copper interconnect.

Meanwhile Samtec has had PCIe active optical cables since 2012; it's a very niche application currently.


Since thunderbolt is related to PCIe, there's this that goes into copper vs optical there: https://en.wikipedia.org/wiki/Thunderbolt_(interface)#Copper...


Intel's product name for Thunderbolt was initially "Light Peak".


I believe it was originally supposed to be optical.


So we're moving towards crystals from Stargate? Neat


We're a lot closer to the limits of copper than I realized. Apparently motherboard designers have to make the length of clock/data traces for DDR5 memory as close to equal as possible, otherwise the entire bus just doesn't work right (maybe this isn't news to other folks).


You have to do that for all high-speed multipin interfaces. Generally, once you get past roughly 100 MHz, it is good practice to match the lengths as well as possible, even though it probably doesn't matter until a couple of gigahertz (it depends on many factors, but generally, signals move at 15 cm per nanosecond on PCBs).
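To put a number on why length matching matters, here is a rough sketch using the 15 cm/ns figure above. The 5 mm mismatch is a made-up example value, DDR5-6000 is used only as a convenient per-pin rate, and the function names are hypothetical.

```python
def skew_ps(length_mismatch_mm, speed_cm_per_ns=15.0):
    """Arrival-time difference caused by a trace length mismatch."""
    speed_mm_per_ps = speed_cm_per_ns * 10.0 / 1000.0  # 15 cm/ns = 0.15 mm/ps
    return length_mismatch_mm / speed_mm_per_ps


def unit_interval_ps(transfers_per_second):
    """Duration of one bit time (unit interval) on a single pin."""
    return 1e12 / transfers_per_second


mismatch_mm = 5.0              # hypothetical mismatch between two traces
ui = unit_interval_ps(6000e6)  # DDR5-6000: 6000 MT/s per pin
skew = skew_ps(mismatch_mm)
print(f"skew {skew:.1f} ps, UI {ui:.1f} ps, {100 * skew / ui:.0f}% of a bit time")
```

At 100 MT/s the same 5 mm would be about 0.3% of a bit time, which is why it only starts to hurt at gigahertz rates.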


I know when silicon photonics were brand new, they had a big limitation because they only had designs that fired out of the plane of the chip, not across it. That limits you to interconnects and not cross-chip signaling. And since for a CPU you need a heat sink on top that means you have to fire down, toward the motherboard.

Also it turns out the speed of light in glass is not that impressive. So encoding and decoding at the ends eats up the speed advantage. That’s my impression as to why a lot of high profile articles on optical logic came out shortly thereafter. What if we just keep it as light for longer?


Changing from light to electricity (and vice versa) is relatively slow, expensive, and cumbersome.

Additionally, we don’t have a decent way of transferring significant power over fiber optics.

So since everything has to have copper power fed to it anyway, unless there is some compelling reason (like distance) to make optical fiber's disadvantages worth it, copper only is usually simpler and better.

At least for now.


If memory serves the original plan for Light Peak was power wires and shielding braided around a fiber optic core.


Hence why we ended up with USB-C cables.

Less fragile, similar data rate, simpler BOM, less sensitive to dirt and debris, plenty of power.


What's the minimum bend radius for plastic fiber these days?


Larger than ‘getting kinked in a desk drawer’. Though smaller than when the market made the decision of course.

You still need the copper wires to do power delivery, so either you end up with an even thicker cable, or multi-purpose the copper cables for signaling too.


Around 10 mm for OS2 cables.


GT/s = gigatransfers per second.


What's a transfer? That like a packet or a single bit?


One bit, but it's a bit of the underlying signal layer which has a 1-2% redundancy over the actual data. PCIe 2.0 and earlier encode 8b data in 10b signal. 3.0 to 5.0 encode 128b data in 130b signal. 6.0 and 7.0 do a more complicated thing: https://pcisig.com/blog/pcie%C2%AE-60-specification-webinar-...

Also the speed is per lane, eg an x8 slot / port / device is called that because it has 8 lanes, which all transfer in parallel.
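The encoding overheads above translate into usable bandwidth roughly like this. A sketch only: it covers PCIe 1.0 to 5.0 with the 8b/10b and 128b/130b encodings mentioned, leaves out 6.0 and 7.0 because their scheme is more involved, and ignores packet-level protocol overhead. The table and function names are made up for illustration.

```python
# (generation, raw GT/s per lane, data bits, line bits)
ENCODINGS = [
    ("1.0", 2.5, 8, 10),
    ("2.0", 5.0, 8, 10),
    ("3.0", 8.0, 128, 130),
    ("4.0", 16.0, 128, 130),
    ("5.0", 32.0, 128, 130),
]


def usable_gbytes_per_s(gt_per_s, data_bits, line_bits, lanes=1):
    """Usable GB/s in one direction after line-encoding overhead."""
    usable_gbits = gt_per_s * data_bits / line_bits
    return usable_gbits / 8 * lanes


for gen, rate, data_bits, line_bits in ENCODINGS:
    per_lane = usable_gbytes_per_s(rate, data_bits, line_bits)
    x8 = usable_gbytes_per_s(rate, data_bits, line_bits, lanes=8)
    print(f"PCIe {gen}: {per_lane:.3f} GB/s per lane, {x8:.2f} GB/s for x8")
```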


So, it's still bits all the way down? Whether we're talking about app, IP, or LL it's always bits/s, each level bringing a cost due to encapsulation. And then at PHY there's baud.

Is non-ISO unit "T" / "transfer" a marketing term or really specialised jargon? "transfer" just doesn't click in my mind, at best "a transfer" (countable) is about moving a sizeable aggregate chunk that has some semantic meaning, not a single fundamental quantum of information.

Unrelated: "gigatesla per second" is such a mind-boggling unit.


It's really specialized jargon that underscores the fact that it's raw bandwidth, not usable bandwidth.


Also, another common usage of transfers/second is with RAM. DDR5 6000MHz RAM is actually 6000MT/s, and the clock actually runs at 3000MHz ("DDR" == "double data rate").


> it's a bit of the underlying signal layer which has a 1-2% redundancy over the actual data. PCIe 2.0 and earlier encode 8b data in 10b signal. 3.0 to 5.0 encode 128b data in 130b signal. 6.0 and 7.0 do a more complicated thing

Though the exact details of the overhead don't matter very much. They add 6% extra bits, good enough.

The part I want to call out as complicated/confusing is that a PCIe 7.0 lane puts out a voltage 64 billion times per second, but because each voltage is based on two bits that counts as 128 billion "transfers".


Yeah, the overhead isn't a big deal now that the overhead is single digit. Back when it was 20% in PCIe 2 it was a much bigger discrepancy.


Back then they were adding the overhead on top of the baseline speed, not subtracting it. With 1 and 2 you got the full 1/4 and 1/2 GB/s of data per lane, but then 3 was only .985 instead of 1. So I'd argue that 6% for PCIe 6 and 7 is the most meaningful the overhead has ever been.


Edit: Nope, I misread. As reply notes, 16GB/s/lane.

So... That's about 16 terabytes per second per lane. AKA more bandwidth than I can imagine any use for, though I'm sure we will find ways to take advantage...

(Seriously, that's enough to move 16 largish laptop drives every second, on a single lane.)


>So... That's about 16 terabytes per second per lane.

If you assume 1T/s = 1b/s, 128GT/s is 128Gb/s = 16GB/s, not 16TB/s


Oops, yep, I misread that. Thanks for the correction


Assuming it was 16 TB/s.... Imagine a JIT data lake loading stuff into main memory like brrr...

Actually at that point, a pcie7 nvme would be faster than ddr6

https://www.pcworld.com/article/2237799/ddr6-ram-what-you-sh...

That said, per-pin, 16GB/s seems to be the same ballpark as contemporary (to pcie7) main or graphics memory..... Like, actually more if I'm reading this right?

https://www.anandtech.com/show/21287/jedec-publishes-gddr7-s...


A transfer is 1 action to the size of the width of the channel.


Then you have to define what an action is.


And once that becomes generally available, operating systems will eat the bandwidth in an instant and any speed-up to be gained on a desktop will be completely negated.

It seems like we're stuck at a pre-set level of latency, which is just within what people tolerate. I was watching a video of someone running Windows 3.11 and noticed that the windows close instantly, whereas on Windows 10 and 11 I've never seen there NOT be a small delay between the user clicking close and the window disappearing.


> It seems like we're stuck at a pre-set level of latency,

I booted and used an old computer recently. Not Windows 3.11 old, but old enough to have a mechanical hard drive.

The experience was anything but low latency. It’s easy to forget just how slow mechanical hard drives were in the past.

Modern desktops are extremely fast. Closing a window and having a tiny delay doesn’t bother me in the slightest because it has zero impact in my workflow.

I can launch code editors quickly, grep through files at an incredible rate, and compile large projects with a dozen threads in parallel. Getting worked up over a split second delay in closing a window is not a concern in the slightest.

Regardless, it has nothing to do with next generation PCIe bandwidth. I don’t understand why this is the top voted comment on this otherwise interesting article. Is HN just a place to find creative ways to be cynical and complain about things these days?


My first laptop with an SSD booted into games so much faster that I didn’t even get mad if the machine crashed while playing cooperative games. I’ll be back online in 45 seconds guys.


I agree with you, but I also agree with GP. Raw compute is many orders of magnitude faster, so for exactly those things you mention it's super awesome.

But for user interfaces at least, it really does feel like things are slower, or at least no faster than they were. As he mentions - at a level just within what we will tolerate.

As far as code editors - I don't know, Sublime (and Notepad!) is fast, but IntelliJ, VS Code and such still feel pretty 'heavy.' And I still sometimes have that experience of my computer not being able to keep up with my typing rate which is dumb. I don't even type fast.


> which on Windows 10 and 11 I've never seen there NOT be a small delay between the user clicking close and the window disappearing.

Isn't that delay related to the default animations? On my particular machine with animations disabled, if I click the minimize button, the window disappears instantly. This is your standard win11 on a shitty enterprise laptop running some kind of 11th gen i7u with the integrated graphics and a 4k external display.

Maximization is sometimes janky, but I guess it's because the window needs to redraw its contents at the new size.


Modern operating systems render to buffers on the GPU and then composite them, which I would guess adds some latency (although likely unnoticeable).


It's not unnoticeable. Ever notice how on Windows, when you start to drag a window, your cursor disappears for a frame? That's Windows replacing your cursor with a software-rendered one so it doesn't appear ahead of the window. But drag anything else (i.e. browser tabs, text highlighting) and you'll quickly notice it lagging behind the cursor. Why? Because the cursor is a hardware overlay that can be moved before the composition is actually complete. The composition lags one frame behind. In other words, the price of the compositor is lagging one frame behind. That may not sound like much, but it is, especially when most displays are only 60 FPS.

Of course, it's only one of the contributing factors to the total latency of things like keystrokes: https://danluu.com/input-lag/


Unless the issue is that your setup cannot composite at 60 fps (don’t get me wrong, not pretending that Windows isn’t at fault if that’s the case), then neither double buffering nor software cursors introduce delay.

Unless your goal is tearing updates (a whole other discussion), then your only cause of latency is missed frame deadlines due to slow or badly scheduled rendering.

There is no need to switch to software cursor rendering unless you want to render something incompatible with the cursor plane, e.g. massive buffers or underlaying the cursor under another surface. Synchronization with primary plane updates is not at all an issue.


> Synchronization with primary plane updates is not at all an issue.

While I wouldn't be surprised if this is technically true in a hardware sense, software-wise, Windows knows where the cursor is before it's finished rendering the rest of the screen, and updates the hardware layer that contains the cursor before rendering has finished.


> While I wouldn't be surprised if this is technically true in a hardware sense, software-wise, Windows knows where the cursor is before it's finished rendering the rest of the screen

The earlier you sample the cursor position and update the cursor plane, the more the position is out of date once the next scanout comes around, increasing the perceived input delay.

The approach that leads to the smallest possible input latency is to sample the cursor position just before issuing the transaction that updates the cursor position and swaps in the new primary plane buffer (within Linux, this is called an atomic commit), whereas you maximize content consistency with still very good input latency by sampling just before the composition started.

Note that "composition" does not involve rendering "content" as the user perceives it, but just placing and blending already rendered window content, possibly with a color transform applied as the pixels hit the screen. Unless Microsoft is doing something weird, this should be extremely fast. <1ms fast.


> The earlier you sample the cursor position and update the cursor plane, the more the position is out of date once the next scanout comes around, increasing the perceived input delay.

No, the cursor position is more up-to-date than the rest of the screen because it doesn't need to wait for a GPU pipeline to finish after it's moved.

> Unless Microsoft is doing something weird, this should be extremely fast. <1ms fast.

Look, I'm saying this is what's going on. (not to scale)

    ... | vsync                                                         ...
    ...  | cursor updated for frame 0                                   ...
    ...   | frame 0 scanout                                             ...
    ...     | frame 1 ready                                             ...
    ...                                   | vsync                       ...
    ...                                    | cursor updated for frame 1 ...
    ...                                     | frame 1 scanout           ...
    ...                                       | frame 2 ready           ...

Frames are extremely fast to render, but they arrive the frame after they were originally scheduled, because GPU pipelines are asynchronous. However, the cursor position arrives immediately because the position of the hardware layer can be synchronously updated immediately before scanout. The effect is that updates to the cursor position are (essentially) displayed 1 frame sooner than updates to the rest of the screen. If you actually try any of the tests I mentioned in my original comment you'll see this for yourself.


> Unless Microsoft is doing something weird, this should be extremely fast. <1ms fast.

And it should also be scheduled for near the end of the frame period, not happening right at the start.

But all this stuff is hard to do right and higher refresh rates make it simpler to do a good job.


Pekka Paalanen wrote a nice blogpost about the concept of repaint scheduling with graphs: https://ppaalanen.blogspot.com/2015/02/weston-repaint-schedu... (note that the Weston example gives a whopping 7ms for composition).

I'm making some assumptions about your chart as it is not to scale, but it looks like the usual worst-case strategy. Given a 60Hz refresh rate and a 1ms composition time an example of an optimized composition strategy would look something like this:

    +0ms      vblank, frame #-1 starts scanout

    +15.4ms   read cursor position #0, initiate composite #0
    +16.4ms   composition buffer #0 ready
    +16.5ms   update cursor plane position #0 and attach primary plane buffer #0
    +16.6ms   vblank, frame #0 starts scanout

    +32.1ms   read cursor position, initiate composite #1
    +33.1ms   composition buffer #1 ready
    +33.2ms   update cursor position and attach primary plane buffer #1
    +33.3ms   vblank, frame #1 starts scanout

In this case, both the composite and the cursor position are only 1.2ms old at the time the GPU starts scanning them out, and hardware vs. software cursor has no effect on latency. Moving the cursor update closer would make the cursor out of sync with the displayed content, which is not really worth it.
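The timeline above can be reproduced by working backwards from each vblank. This is only an illustrative sketch, not Weston's or any compositor's actual scheduler: `repaint_schedule` is a made-up helper, the 0.1ms margins are read off the table, and the vblank times are the table's rounded values rather than exact multiples of 1/60 s.

```python
def repaint_schedule(vblank_ms, composite_ms=1.0, margin_ms=0.1):
    """Work backwards from a vblank to the latest safe input-sampling time."""
    commit = vblank_ms - margin_ms  # update cursor plane, attach primary buffer
    ready = commit - margin_ms      # composition buffer must be ready by here
    sample = ready - composite_ms   # read cursor position, start compositing
    return {"sample": sample, "ready": ready, "commit": commit, "vblank": vblank_ms}


for vblank in (16.6, 33.3):
    s = repaint_schedule(vblank)
    age = s["vblank"] - s["sample"]
    print(f"sample +{s['sample']:.1f}ms, ready +{s['ready']:.1f}ms, "
          f"commit +{s['commit']:.1f}ms, vblank +{s['vblank']:.1f}ms, "
          f"input age {age:.1f}ms")
```

The input age is simply the composition time plus the two margins, so it stays at 1.2ms regardless of refresh rate as long as composition reliably finishes in time.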

(Games and other fullscreen applications can have their render buffer directly scanned out to remove the composition delay and read input at their own pace for simulation reasons, and those applications tend to be the subject at hand when discussing single or sub-millisecond input latency optimizations.)

> Frames are extremely fast to render, but they arrive the frame after they were originally scheduled, because GPU pipelines are asynchronous.

The display block is synchronous. While render pipelines are asynchronous, that is not a problem - as long as the render task completes before the scanout deadline, the resulting buffer can be included in that immediate scanout. Synchronization primitives are also there when you need them, and high-priority and compute queues can be used if you are concerned that the composition task ends up delayed by other things.

Also note that the scanout deadline is entirely virtual - the display block honors whatever framebuffer you point a plane to at any point, we just try to only do that during vblank to avoid tearing.

> If you actually try any of the tests I mentioned in my original comment you'll see this for yourself.

While it might be fun to see if Microsoft screwed up their composition and paint scheduling, that does not change that it is not related to GPUs or the graphics stack itself. Working in the Linux display server space makes me quite comfortable in my understanding of GPUs' display controllers.


> that does not change that it is not related to GPUs or the graphics stack itself. Working in the Linux display server space makes me quite comfortable in my understanding of GPU's display controllers.

I didn't mean to suggest some sort of fundamental limitation in GPUs that makes it impossible to synchronize this. If you take a look at my previous comments, you'll see me explicitly pointing out that I'm talking about Windows, specifically, and I'm only using it as an example of how short a latency is still perceptible. How exactly that latency happens is almost certainly not a hardware issue, however, and I never meant to imply such.


Hardware planes will hopefully reduce the latency again as it will allow windows to skip the compositor and allow mapping parts of the framebuffer to a window buffer.

I believe there's some work on Linux already for them, but I'm not so sure on Windows. I would be surprised if macOS doesn't already use them in some capacity given Apple's obsession with delegating everything to firmware on a co-processor.


> Hardware planes will hopefully reduce the latency again as it will allow windows to skip the compositor and allow mapping parts of the framebuffer to a window buffer.

Hardware planes are great but there are a limited number of them. Right now I believe Windows only uses them for the mouse cursor, and exclusive fullscreen.


Arguably the only window that needs to be in a hardware plane is the window currently in focus.


Windows 11 has pushed the use of MPO for windowed games and video display.


> It seems like we're stuck at a pre-set level of latency,

Bandwidth isn't latency, and PCIe 7.0 running as fast as 128 GT/s is no statement at all about its latency. I remember this great analogy from university: a truck carrying a full load of backup tapes across a country has amazing bandwidth but atrocious latency.

(I still agree with your sentiment, it's just that PCIe is not one of the problems in this regard. The connection between bandwidth becoming available and being eaten up vs. latency is a red herring; it's all about properly engineering software for responsiveness.)


If your Win27k startup is an 8K 120fps video of a butterfly transforming into a Windows logo, then it is latency.

Btw, all bandwidth is built to reduce latency, isn't it? A bit of philosophy, heh.


Latency and bandwidth are often in tension. (And guaranteeing low latency can eat up a big chunk of theoretically available bandwidth, due to overhead.)

The canonical example is probably a dial-up modem or other slow link between two locations. The latency is under 1 second to send one byte over the modem. But it's probably faster to just ship a hard disk if you want to send 100 gigabytes from one location to the other, even though the latency might be hours or even days, until the first byte arrives.
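To put rough numbers on that, here is a small sketch. The 56 kbit/s modem speed, the 0.2 s modem latency, the two-day courier time, and the 100 MB/s disk read speed are all assumptions picked for illustration, not measurements:

```python
def total_time_s(size_bytes, bytes_per_s, latency_s):
    """Time until the last byte has arrived: latency plus serialization time."""
    return latency_s + size_bytes / bytes_per_s

SIZE = 100e9  # 100 gigabytes, as in the example above

# Assumed link parameters (hypothetical, for illustration only)
MODEM = {"bytes_per_s": 56_000 / 8, "latency_s": 0.2}
COURIER = {"bytes_per_s": 100e6, "latency_s": 2 * 24 * 3600}  # disk is read after delivery

modem_days = total_time_s(SIZE, **MODEM) / 86400
courier_days = total_time_s(SIZE, **COURIER) / 86400

print(f"modem:   first byte after {MODEM['latency_s']} s, done after {modem_days:.0f} days")
print(f"courier: first byte after 2 days, done after {courier_days:.1f} days")
```

With these assumptions the modem wins on latency by a huge margin and still loses on total transfer time by months.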

In practice, you can send lots of tiny little packets with lots of overhead (but low latency) or you can send lots of big heavily buffered packets with low overhead (but with high latency).

This is why multiplayer game protocols often consist of a constant stream of tiny UDP packets containing events like "character moved 40 units east at game time ..." or "character fired weapon at game time ...." Even a 10 kilobyte bulk state update is going to cost at least a few milliseconds, more probably tens or even hundreds of milliseconds over some wireless connection. And that's a very noticeable lag.


Another good example is the memory in your computer. DDR is much lower latency, and GDDR is much higher bandwidth.


No, neither of these are true. If Win27k startup is an 8k 120fps video, it is either latency or stutter if you don't have enough bandwidth. You can absolutely design a system with priorities set such that latency is above stutter-/drop-free playback, and if you do, the startup time will be unaffected by that bandwidth.

And, no, not all bandwidth is built to reduce latency. There is a lot of bulk, best-effort traffic - for example, YouTube and Netflix proactively distributing videos between datacenters across the world. (They totally do that before anyone ever clicks play, they have enough data to know what is likely to be needed where.)

The same applies to your YouTube/Netflix playback at home. It doesn't need to be low latency. The only effect of latency is a longer time between you clicking play and playback actually starting. From there onwards, you just need enough bandwidth to keep the buffer filled, and you can do that quite a bit ahead of reaching playback position. Latency is a real non-issue there.

Same locally for bulk copying files around. If your OS & FS are designed well, latency only shows up at the beginning of the operation. Most file systems were designed when data was on rotating rust, and that's dealt with by readahead and the like.


GT/s is a measure of latency (not total system latency, but the bus itself is only adding 128 billionth of a second). In fact it does not say anything about bandwidth if you don't know how many bits in a transfer.


Throughput != latency, and is often a tradeoff against latency (e.g. if you send stuff in big batches, a database can process 100k tx/sec, but one by one it's 1k tx/sec at most).
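A toy model of that tradeoff, assuming each batch pays a fixed cost plus a small per-transaction cost. The 1 ms and 10 µs figures are made up to land in the same ballpark as the numbers above; they are not from any real database:

```python
def batch_stats(batch_size, fixed_overhead_s=1e-3, per_item_s=10e-6):
    """Return (throughput in tx/s, time for the whole batch in s).

    fixed_overhead_s and per_item_s are hypothetical costs per batch
    and per transaction, chosen only for illustration.
    """
    batch_time_s = fixed_overhead_s + batch_size * per_item_s
    return batch_size / batch_time_s, batch_time_s

for size in (1, 100, 10_000):
    throughput, latency_s = batch_stats(size)
    print(f"batch={size:>6}: {throughput:>9.0f} tx/s, {latency_s * 1e3:.2f} ms per batch")
```

In this model bigger batches raise throughput, but every transaction in a batch waits for the whole batch, so the per-transaction latency grows with it.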


I'm sorry, but you're multiply wrong. First, a "transfer" is not a term in the PCIe spec; if anything, there's "transaction". But GT/s does not refer to transactions as you seem to be implying, and in fact "GT" does not have an assigned long form in the PCIe base specification. The term is introduced / defined like this:

| The primary Link attributes for PCI Express Link are:

| · The basic Link – PCI Express Link consists of dual unidirectional differential Links, implemented as a Transmit pair and a Receive pair. A data clock is embedded using an encoding scheme (see Chapter 4) to achieve very high data rates.

| · Signaling rate – Once initialized, each Link must only operate at one of the supported signaling levels. For the first generation of PCI Express technology, there is only one signaling rate defined, which provides an effective 2.5 Gigabits/second/Lane/direction of raw bandwidth. The second generation provides an effective 5.0 Gigabits/second/Lane/direction of raw bandwidth. The third generation provides an effective 8.0 Gigabits/second/Lane/direction of raw bandwidth. The data rate is expected to increase with technology advances in the future.

| · Lanes – A Link must support at least one Lane – each Lane represents a set of differential signal pairs (one pair for transmission, one pair for reception). To scale bandwidth, a Link may aggregate multiple Lanes denoted by xN where N may be any of the supported Link widths. A x8 Link operating at the 2.5 GT/s data rate represents an aggregate bandwidth of 20 Gigabits/second of raw bandwidth in each direction. This specification describes operations for x1, x2, x4, x8, x12, x16, and x32 Lane widths.

(from PCIe 4.0 base specification)

So, GT/s is used to be less ambiguous on multi-lane links.

Next,

> the bus itself is only adding 128 billionth of a second).

no, the bus does actually add more latency since almost all receivers need to reassemble the whole transaction (generally tens to hundreds of bytes) to checksum validate and then dispatch further to continue. This latency can show up multiple times if you have PCIe switches, but (unlike endpoints) these are frequently cut-through.

However, that latency is seriously negligible compared to anything else in your system.

> In fact it does not say anything about bandwidth if you don't know how many bits in a transfer.

How many bits are in a transaction does in fact influence that latency mentioned right above, but has no impact on bandwidth. What does have an impact on available end-user bandwidth is how small you chunk longer transactions since each of them has per-transaction overhead.
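To illustrate the chunking point, a minimal sketch. The 24 bytes of per-transaction overhead is an assumed round number standing in for header, framing and CRC; the real figure depends on the PCIe generation and on which optional fields are in use:

```python
def payload_efficiency(payload_bytes, overhead_bytes=24):
    """Fraction of bytes on the link that carry payload.

    overhead_bytes is an assumed per-transaction cost, not a spec value.
    """
    return payload_bytes / (payload_bytes + overhead_bytes)

for payload in (64, 128, 256, 512):
    print(f"{payload:>4} B payload: {payload_efficiency(payload):.1%} of raw bandwidth")
```

The raw bandwidth is the same in every row; only the share left over for the end user changes with the chunk size.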

And finally —

> GT/s is a measure of latency

— absolutely not. It is a measure of raw bandwidth. It indirectly influences minimum and maximum latency, but those are complicated relationships especially on multi-lane links, and especially maximum latency depends on a whole host of factors from hardware capabilities, to BIOS and OS settings in PCIe config, to driver behavior.


PCIe 1.0 to 5.0 used NRZ (non return to zero) electrical signaling for transmission. In NRZ signaling at high rate and over long distances, there are challenges w.r.t clock recovery, DC balance and error correction. To deal with this, encoding is used.

Encoding basically means that a block of data (a sequence of zeros and ones) is represented as a sequence of electrical voltage changes (a block of symbols).

GT/s does stand for Giga Transfers per second. Here, the transfers refer to the number of symbols transferred per second, and not actual usable data bits per second.

We say GT/s instead of Gbps, because the actual usable bits/sec is determined by the encoding scheme used.

PCIe 1.0 and 2.0 encoded 8 data bits in 10 symbols (NRZ electrical signals). That's 20% overhead.

PCIe 3.0 to 5.0 encoded 128 bits of data in 130 symbols. That's a much lower overhead of 1.54%.
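For anyone who wants to check those two percentages, the arithmetic is just the share of line bits that don't carry data. A quick sketch (the function name is mine):

```python
from fractions import Fraction

# (data bits, line bits) per encoded block
ENCODINGS = {"8b/10b": (8, 10), "128b/130b": (128, 130)}

def overhead(data_bits, line_bits):
    """Share of transmitted bits that are encoding overhead."""
    return 1 - Fraction(data_bits, line_bits)

for name, (data_bits, line_bits) in ENCODINGS.items():
    print(f"{name}: {float(overhead(data_bits, line_bits)):.2%} overhead")
```

Note this measures overhead relative to the line rate; relative to the payload, 8b/10b costs 25% extra bits.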

PCIe 6.0 (& yet to be standardized PCIe 7.0) use PAM4 for signaling and don't require any encoding on top, hence it is written as 1b/1b. (Btw, in PAM4, each symbol is 2 bits.)

You can see similar NRZ signaling with encoding in SATA, Ethernet, Fibre Channel etc. Btw, PAM4 with NRZ is used as well!

Coming to latency, latency is the time it takes for a single bit of usable data to transfer from A to B. Many factors affect this latency. Signaling medium's speed of transmission (a fraction of speed of light), signaling medium's length, signaling frequency (MHz, GHz etc. of voltage switching), encoding scheme (encoding overhead, clock recovery or its failure and hence retransmissions, error detection/correction quality or its failure and hence retransmissions) - each of these things affects the latency of usable data.

GT/s = Signal Frequency x Bits per cycle.

Remember, PAM4 encoding in PCIe6.0 has 2 bits per cycle (2 bits per symbol).


You're confusing GT and Gbaud. GT/s = symbol rate (Gbaud) × bits per symbol.

GT/s is in fact G"bit"/s, before line coding (I left that can of worms unopened because line coding wasn't relevant to the bandwidth vs. latency discussion.) PCIe 6.0 is "64GT/s", but only 32 Gbaud, since as you correctly point out it uses PAM-4.
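A small sketch of that relationship, under the reading used in this comment (GT/s as the per-lane bit rate before line coding). NRZ has 2 signal levels and PAM4 has 4; the helper name is mine:

```python
import math

def symbol_rate_gbaud(gt_per_s, levels):
    """Symbol rate when each symbol carries log2(levels) bits."""
    return gt_per_s / math.log2(levels)

# (generation, per-lane rate in GT/s, signaling levels)
LINKS = [("PCIe 5.0", 32, 2), ("PCIe 6.0", 64, 4), ("PCIe 7.0", 128, 4)]

for name, rate, levels in LINKS:
    print(f"{name}: {rate} GT/s -> {symbol_rate_gbaud(rate, levels):.0f} Gbaud")
```

Under this reading PCIe 5.0 and 6.0 run at the same symbol rate, and the doubling comes entirely from the extra bit per symbol.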

> GT/s does stand for Giga Transfers per second.

If you have a citable source for this, that'd be nice — it's not in the PCIe spec, and AFAIK the term is not used elsewhere.


Here's a pcisig article talking about symbol rate in terms of GT/s:

https://pcisig.com/blog/pci-express®-50-architecture-channel...


That's great, but…

https://pcisig.com/pci-express-6.0-specification "64 GT/s raw data rate and up to 256 GB/s via x16 configuration"

The symbol rate for 6.0 is only 32 Gsym/s. So GT/s can't be symbol rate. (And the references to PCIe 6.0 putting it at "64 GT/s" seem to be far more common, and in particular the PCIe (4.0, newest I have access to) specification explicitly equates GT/s with data rate.)

My takeaway (even before this discussion) is to avoid "GT/s" as much as possible since the unit is really not well defined.

(And, still, I don't even know if there is a definition of it anywhere. I can't find one. The PCIe spec uses it without defining it, but it is not a "textbook common" unit IMHO. If you are aware of an actual definition, or maybe even just a place¹ that can confirm the T is supposed to mean "transfer", I'd appreciate that!)

¹ yes I know wikipedia says that too, but their sources are… very questionable.

P.S.: I really don't even disagree with you, because ultimately I'm saying "GT/s is confusing and can be interpreted different ways". The links from each of us just straight up conflict with each other in their use of GT/s. Yours uses it for symbol rate, mine uses it for data rate. ⇒ why I try to avoid using this unit at all.


> PCIe 6.0 (& yet to be standardized PCIe 7.0) use PAM4 for signaling and doesn't required any encoding on top –hence it is written as 1b/1b. (btw, in PAM4, each symbol is 2 bits).

Nothing on top if you exclude the error correction bits, which I don't think you should.


I do totally agree about the relative merits of bandwidth vs latency. However, I also still think GT/s is generally accepted as an abbreviation for gigatransfers per second and that the PCIe spec is assuming it as such. I also note you have had to pull in lots of additional specifications to describe the bandwidth of the complete Link, which supports my assertion that it is not a pure function of the GT/s.


I think we're having some communication/understanding issues, but that's OK. To be clear my main issue with GT/s is that even the PCI SIG doesn't agree with itself and uses the term in conflicting ways (see discussion in sibling thread.)

As far as I can research, GT/s is a "commoner's unit" that someone invented and started using at some point, but there is no hard reliable definition of it. Nowadays it seems to be used for RAM and PCIe (and nothing else really), though some search results I found claim it was also used for SCSI.


Part of me wonders if this is the natural outcome of things. The world gets faster, so we're more efficient, now we have more memory and time in a datacenter, let's do it on our servers so we can charge a monthly fee! That works, but now things are slow until the next boost (say bandwidth/storage/whatever) - now we're more efficient, let's shove ads on it, there's plenty of end-user CPU cycles we can profit from.

It seems like enshittification is not just the inevitable outcome, but almost desirable (from a profit standpoint), and thus things getting faster for you only help me (the vendor) if I can extract _more_ value by things being faster - otherwise why would I spend money to make things better?


Yes we should ban any progress for 10 years and everything would be so cheap.

Then one leap, next os eats it all, but cheap again.

/s


The problem isn't that computers are more capable, it's that they're being used inefficiently.


Here I am still on PCI-E 3.0...


Most hardware (NVMe drives, GPUs, etc) doesn't run at more than 4.0 speeds anyway. The primary advantage of 5.0 and higher is that it'll allow that hardware to use fewer CPU lanes, eg what requires 4.0 x4 could use 6.0 x1.


> eg what requires 4.0 x4 could use 6.0 x1

FWIW, this is only true for newer hardware, i.e. if you plug a PCIe Gen3 x16 device into a PCIe Gen4 x8 slot, the device will only run at Gen3 x8, even though the bandwidth the slot provides is in the same ballpark as Gen3 x16.

So in this scenario we'll have to wait until devices themselves move to Gen4 to make use of the higher bandwidth.
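A simplified sketch of why that happens: the link trains at the highest generation and the widest width that both ends support. Real link training has more rules (for example, both sides must share a supported width), so treat this as an approximation; the function names are mine:

```python
# Raw per-lane rate in GT/s by PCIe generation
RATE_GT_S = {1: 2.5, 2: 5.0, 3: 8.0, 4: 16.0, 5: 32.0, 6: 64.0}

def negotiate(device, slot):
    """Each side is (generation, lane count); the link uses what both support."""
    return min(device[0], slot[0]), min(device[1], slot[1])

def raw_gt_s(link):
    """Raw rate summed over all lanes of a (generation, lane count) link."""
    gen, lanes = link
    return RATE_GT_S[gen] * lanes

device, slot = (3, 16), (4, 8)
link = negotiate(device, slot)
print(f"slot offers {raw_gt_s(slot):.0f}, device could use {raw_gt_s(device):.0f}, "
      f"link trains at Gen{link[0]} x{link[1]} = {raw_gt_s(link):.0f} (raw, summed over lanes)")
```

In this example the slot and the device each could move the same raw amount, but the negotiated link only gets half of it.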


It could still help connecting the chipset.

Not all pci-e lanes on your motherboard are created equal: Some are directly attached to the CPU others are connected to the chipset, which in turn is connected to the CPU.

It's possible to convert a single 5.0 x16 connection coming from the CPU to 2 4.0 x16 connections.


I'm not saying the device itself would negotiate a higher PCIe version. I'm saying that the 4.0 x4 M.2 NVMe slot on your mobo would map to only one 6.0 CPU lane.


Huh, I was pretty sure that you need an extra chip in-between. Otherwise, the 4 CPU lanes will just drop to 4.0 level.


Yes you do.


You'd need a bunch of those chips to make good use of a bunch of 6.0 lanes on the CPU, and at that point you're paying so much for the converters that I'm skeptical it would almost ever be worth it.

I expect consumer machines to keep doing some conversion and expansion in the chipset, but nowhere else. I expect servers to directly attach almost everything and drop down to smaller lane counts for large numbers of devices.

It's worth noting that when Kioxia first put out PCIe 5.0 EDSFF drives, they were marketing them as being optimized for 2 lanes at the higher speed.


It kinda worked that way in the past too. IIRC, newer AGP graphics cards would be keyed for 4x or 8x slots but would barely use more bandwidth than the original 2x slots provided.


We didn't really bottleneck AGP8X until nVidia dropped the 200 series GeForce GPU. Then PCI-E was pretty much a requirement.


Can we stack them? If we had 8x NVMe drives bundled, I assume we'd need PCIe 7.


It felt like we were on 3 for a long time, and then all of a sudden got 4 through 6 (and soon 7) in quick succession. I'd be curious to know what motivated that - maybe GPGPU taking off?


the big use cases are inter-computer communication and nvme ssds. pcie 4x16 gets you 400 gbps Ethernet. 6x16 will be 1.6 tbps. for SSDs, it's the difference between needing 4 and 1 lanes to saturate your bandwidth. a modern 2u server at this point can have an ungodly amount of incredibly fast storage, and can expose all that data to everyone else without a bandwidth bottleneck.


Definitely data centre usage of some sort.


AI/GPU communication is definitely driving it forward now. It is a speed race for how quickly you can move data around.


Really? I hadn't heard of GPU or GPGPU pushing bandwidth recently. Networking certainly does. 400GbE cards exceed PCIe 4.0 x16 bandwidth, 800 is here, and 1.6 apparently in the works. Disk too though, just because a single disk (or even network phy) may not max out a PCI slot does not mean you want to dedicate more lanes than necessary to them because you likely want a bunch of them.


We are at PCIe5 in the Dell XE9680. We add in 8x400G cards and they talk directly to the network/8x GPUs (via RoCEv2).

800G ethernet is here at the switches (Dell Z9864F-ON is beautiful... 128 ports of 400G), but not yet at the server/NIC level, that comes with PCIe6. We are also limited to 16 chassis/128 GPUs in a single cluster right now.

NVMe is getting faster all the time, but is pretty standard now. We put 122TB into each server, so that enables local caching of data, if needed.

All of this is designed for the highest speed available today that we can get on the various bus where data is transferred.


I wonder if any of this trickles down into cloud providers reducing costs again. After all if we have zounds of fast storage, surely slower storage becomes cheaper?


We do not directly compete with them as we are more of a niche based solution for businesses that want their own private cloud and do not want to undertake the many millions in capex to build and deploy their own super computer clusters. As such, our offerings should not have an impact on their pricing. But who knows… maybe long term we will. Hard to say.


NVLink 4.0, used to connect H100 GPUs today, is almost as fast as PCIe 7.0 (12.5 GB/s vs 16 GB/s). By the time PCIe 7.0 is available I’m sure NVLink will be much faster. So, yeah, GPUs are currently the most bandwidth hungry devices on the market.


Will the lead time still be 50+ weeks though? My guess is yes.


Same.

I upgraded to a 4 TB NVMe drive that theoretically reads/writes at up to 7.7 GB/s, but I only get 3.5 GB/s because my CPU is still an i9-9900K running PCI-E 3.0.
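Back-of-the-envelope check on those numbers, assuming the drive sits on the usual x4 link (my assumption, not stated above) and ignoring protocol overhead beyond the 128b/130b line coding:

```python
def usable_gb_s(gt_per_s, lanes, data_bits=128, line_bits=130):
    """Per-direction bandwidth in GB/s after line coding.

    Packet and protocol overhead are ignored, so real throughput is lower.
    """
    return gt_per_s * lanes * data_bits / line_bits / 8

gen3_x4 = usable_gb_s(8.0, 4)    # PCIe 3.0 runs at 8 GT/s per lane
gen4_x4 = usable_gb_s(16.0, 4)   # PCIe 4.0 runs at 16 GT/s per lane

print(f"PCIe 3.0 x4 ceiling: {gen3_x4:.2f} GB/s")
print(f"PCIe 4.0 x4 ceiling: {gen4_x4:.2f} GB/s")
```

So roughly 3.5 GB/s is about what a 3.0 x4 link can deliver once protocol overhead is taken off the ceiling, and the drive's rated 7.7 GB/s only fits on 4.0 x4 or better.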

Planning on upgrading once the next Intel generation drops.


It's worth being clear about what Cadence are announcing here (Cadence sells VLSI tooling and libraries) - they have 2 things:

- cells for a chip to send/receive at 128Gb/s; this solution requires 8 of them running in parallel (like 8 PCIe lanes)

- a module that takes 8 lanes in/out and drives/receives a single fiber


GigaTeras?


GigaTransfers.

128 GT is 8192 Gbps


For DDR RAM, which uses 64 parallel lines. PCI-E transmits one bit per transfer, which has its practical bandwidth reduced after error correction.
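The two readings side by side, as a sketch. It assumes raw bits only, before any line coding or error-correction overhead; the function name is mine:

```python
def raw_gbit_per_s(gt_per_s, bits_per_transfer):
    """Raw bit rate for a given transfer rate and bus width."""
    return gt_per_s * bits_per_transfer

print(raw_gbit_per_s(128, 64))  # 64-bit-wide DDR-style bus
print(raw_gbit_per_s(128, 1))   # a single PCIe lane, one bit per transfer
```

The 8192 figure only appears with a 64-bit-wide bus; for one PCIe lane the same 128 GT/s is 128 Gbit/s raw.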


Damn. Just put together 2 7900X3D + H13SAE-MF + RTX 4070 Ti SUPER + 100GbE-SR4 NICs with PCIe 5.0 and ATX 3.0. Mostly for bare-metal NIC load testing. Already obsolete.

In other news, I'm getting a Xeon with 4 whole GiB of DDR2 ECC RAM shipped from China that has 3 ISA slots. ;D


>7900X3D

What purpose could this possibly serve? Enjoying worse performance than a 7900X?


What's your problem, bucko? 2x the L3 and a lower price tag.


But that L3 is only on 6 of the cores, the other 6 don't have it. The non-3d cores can clock like 10% higher or more.


What kind of 100GbE throughput are you getting?


The invisible kind because not all the parts have arrived because WiredZone and Next International are terrible Supermicro VARs to be avoided.


> In other news, I'm getting a Xeon with 4 whole GiB of DDR2 ECC RAM shipped from China that has 3 ISA slots. ;D

There is no DMA there, so your Sound Blaster wouldn't work out of the box.


[flagged]


What do you think people come to this site for?


Apparently you have an answer in mind, do tell?


Eating all this popcorn while I read about people who are super deeply invested in AI or crypto.


Can't they use proper units? I refuse to even read this.


These are the standard and correct units.


No, GT/s is not, regardless of the downvotes. Please point me to the standard that defines it if I'm wrong. Let's keep using the boring GB/s, since we are still talking about data.


The PCIe standard uses GT/s: https://pcisig.com/specifications/pcie-70-specification-vers...

This is not the same as Gb/s. There's a few percentage points difference due to error correction.


Thanks.



