The headline number is network I/O, but IIRC, the bottleneck is really in and around system RAM. Since most of the content isn't in cache and neither the disks nor the network cards have sufficient buffering, you need to have the disk write to system RAM and indicate readiness, then have the network card read from system RAM. NUMA bandwidth and latency can be an issue there too.
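If it helps to see the shape of that path, here's a minimal sketch in Python of the conventional serving loop being described (the file path and port are made up for illustration). Even with a zero-copy send, the payload still crosses the memory bus twice: disk DMA in, NIC DMA out.

    import socket

    # Hypothetical content file and port, purely for illustration.
    CONTENT_PATH = "/var/flix/segment_0001.mp4"
    PORT = 8080

    srv = socket.create_server(("", PORT))
    while True:
        conn, _ = srv.accept()
        with conn, open(CONTENT_PATH, "rb") as f:
            # socket.sendfile() uses the kernel's sendfile(2) where available,
            # so userspace never copies the data -- but it still transits
            # system RAM twice: disk -> page cache (DMA in), then
            # page cache -> NIC (DMA out). That double crossing is the RAM
            # bandwidth cost described above.
            conn.sendfile(f)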
But I don't think there were enough spare PCIe lanes to do something like add a GPU to use as a DMA buffer for extra memory bandwidth.
I think the next Epyc generation should have PCIe 5 and DDR5, both of which should help.
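Back-of-envelope on how much those two bumps buy, using nominal peak numbers (roughly 2 GB/s per PCIe 4.0 lane vs ~4 GB/s for 5.0, 25.6 GB/s per DDR4-3200 channel vs 38.4 GB/s per DDR5-4800 channel, and assuming the next parts keep at least 8 memory channels; real sustained bandwidth is lower):

    PCIE4_PER_LANE_GBS = 2.0     # ~1.97 GB/s per PCIe 4.0 lane
    PCIE5_PER_LANE_GBS = 4.0     # PCIe 5.0 doubles the per-lane rate
    DDR4_3200_PER_CH_GBS = 25.6  # 3200 MT/s * 8 bytes
    DDR5_4800_PER_CH_GBS = 38.4  # 4800 MT/s * 8 bytes

    channels = 8  # assumption: at least as many channels as current Epyc
    print("DDR4 x8ch:", channels * DDR4_3200_PER_CH_GBS, "GB/s")  # 204.8
    print("DDR5 x8ch:", channels * DDR5_4800_PER_CH_GBS, "GB/s")  # 307.2
    print("x16 slot, PCIe4:", 16 * PCIE4_PER_LANE_GBS, "GB/s")    # 32
    print("x16 slot, PCIe5:", 16 * PCIE5_PER_LANE_GBS, "GB/s")    # 64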
If an NVMe drive supports the controller memory buffer (CMB) feature, an RNIC can do a peer-to-peer transfer straight from the drive, bypassing system RAM.
From what I recall of the Netflix storage node writeup that was linked on HN a few months back, the current generation has 4 x 100 Gb Mellanox Ethernet ports (CX6, PCIe Gen 4) and somewhere around 20 to 30 PCIe Gen 3 NVMe drives.
Assuming they can figure out how to do peer-to-peer transfers, scaling up by a factor of 4 doesn't seem implausible.
There are a lot of disks in a Netflix content appliance. The latest slide says 18 drives, each with PCIe 3.0 x4. There are 4 PCIe 4.0 x16 NICs, and there go all your PCIe lanes. You could get PCIe 4 drives, but they're not the bottleneck at the moment.
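Quick lane-count sanity check on those numbers (drive and NIC counts from the comment above; the ~128-lane figure is the nominal total an Epyc platform exposes):

    # Lane budget, using the counts quoted above.
    drive_lanes = 18 * 4    # 18 NVMe drives at PCIe 3.0 x4
    nic_lanes = 4 * 16      # 4 NICs at PCIe 4.0 x16
    total = drive_lanes + nic_lanes
    print(drive_lanes, nic_lanes, total)  # 72 + 64 = 136

    # An Epyc platform nominally exposes on the order of 128 usable lanes,
    # so this is essentially everything (presumably some devices sit on x8
    # links or behind a switch) -- which is why there's no room left for,
    # say, a GPU to use as an extra DMA buffer.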
But RAM needs at least twice the bandwidth of your network, because the NIC can't read from the disk directly: the disk has to DMA into RAM and the NIC has to DMA out of RAM, and (normal system) RAM isn't dual-ported, so reads and writes contend. If you need to do TLS in software, you touch RAM four times (disk DMA in, CPU read for encryption, CPU write of the ciphertext, NIC DMA out), so RAM bandwidth is an even bigger bottleneck.
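Rough numbers, assuming a 400 Gb/s target (the 4 x 100 Gb ports mentioned above), nominal 8-channel DDR4-3200, idealized pass counts, and that the peer-to-peer case pairs with NIC-side crypto:

    NET_GBIT = 400            # serving target, Gb/s
    net_gbyte = NET_GBIT / 8  # 50 GB/s of payload

    # Memory-bus passes per byte served:
    scenarios = {
        "p2p disk->NIC (CMB), crypto on the NIC": 0,
        "sendfile + NIC TLS offload (disk->RAM->NIC)": 2,
        "software TLS (disk in, CPU read, CPU write, NIC out)": 4,
    }

    ddr4_8ch_peak = 8 * 25.6  # ~204.8 GB/s nominal peak
    for name, passes in scenarios.items():
        ram_traffic = passes * net_gbyte
        print(f"{name}: {ram_traffic:.0f} GB/s of RAM traffic "
              f"({ram_traffic / ddr4_8ch_peak:.0%} of nominal peak)")

At 400 Gb/s, software TLS already wants ~200 GB/s of RAM traffic, i.e. nearly all of the nominal DDR4 peak before the OS touches anything else, which is the point being made.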
I thought that with things like DirectStorage (the equivalent of GPUDirect) and SPDK this wasn't the case any more (no more CPU intervention; every device has its own programmable DMA engine?)