> vs. aggregate CPU bandwidth of up to 150-200 GB/s

for streaming data into a CPU you'll be lucky to get double-digit bandwidths. peak figures are ~50 GB/s per socket, but for anything more than a memcpy it drops off like a cliff. then you also have NUMA issues, bank conflicts, and TLB misses if your data is big enough..
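to put a number on "anything more than a memcpy": the usual yardstick is a STREAM-style triad, which already streams three arrays instead of one. a minimal sketch (array names, the OpenMP setup, and how you'd time it are my own assumptions for illustration):

    #include <cstddef>

    // STREAM-style "triad": 2 streaming reads + 1 streaming write per element.
    // effective GB/s = 3 * n * sizeof(double) / elapsed_seconds / 1e9
    void triad(double *a, const double *b, const double *c, double s, size_t n) {
        #pragma omp parallel for schedule(static)
        for (size_t i = 0; i < n; ++i)
            a[i] = b[i] + s * c[i];
    }

time that against a plain memcpy of the same data and you'll see the cliff.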

i've written code that sustains >270 GB/s on high-end GPUs - it's not trivial, but it can be done.

you are correct, though, about the quantity of GPU memory available on an average GPU. the AMD S9150 has 16 GB of RAM - very high for a GPU, but nothing compared to high-end servers.

> PCIe 4.0, 16 lanes would give 30 GB/s

afaik it's not shipping in anything yet, so we're limited to ~6 GB/s for GPU <-> host transfers.. :/
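for reference, this is roughly how you'd check the host <-> device number on your own box. a rough sketch only - the buffer size is arbitrary and it assumes CUDA; the pinned allocation via cudaMallocHost is what gets you near the bus limit, pageable memory is noticeably slower (and a real measurement should do a warm-up copy first):

    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        const size_t bytes = 256ull << 20;        // 256 MB, arbitrary
        void *h, *d;
        cudaMallocHost(&h, bytes);                // pinned host buffer
        cudaMalloc(&d, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start); cudaEventCreate(&stop);
        cudaEventRecord(start);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("H2D: %.1f GB/s\n", bytes / (ms * 1e6));

        cudaFree(d);
        cudaFreeHost(h);
        return 0;
    }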

> With more realistic setting, CPU would be even more ahead.

depends. getting a good fraction of peak bandwidth on a GPU is fairly straightforward - coalesce your accesses. some algorithms need to be.. "massaged" into performing reads/writes like this, but in my experience a large portion of them can be.
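to make "coalesce accesses" concrete, here's a toy sketch - the same copy with two access patterns. the kernels and the stride are made up for illustration; timing them side by side shows the gap:

    // coalesced: thread i touches element i, so a warp's 32 loads fall in
    // a few contiguous 128-byte memory transactions
    __global__ void copy_coalesced(float *out, const float *in, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // strided gather: consecutive threads touch addresses `stride` floats
    // apart, so one warp's loads scatter across many cache lines and most
    // of each transaction is wasted
    __global__ void copy_strided(float *out, const float *in, long n, int stride) {
        long t = blockIdx.x * (long)blockDim.x + threadIdx.x;
        long i = t * stride;
        if (i < n) out[t] = in[i];
    }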

getting a decent fraction of peak on a CPU is a totally different ballgame, however.

IMO, if the data can persist on the GPU, then this could be a big win.




Well, don't set NUMA to interleave! Instead, fill all of the first socket's memory first, then all of the second socket's, and so on. Use 2 MB or 1 GB pages (you don't want a TLB miss every 4 kB!). DRAM-wise, prefetch for each memory channel to cover DRAM-internal penalties. I think DRAM bank-switch penalties span 256 bytes every 4, 8, or 16 kB, assuming 4 memory channels. Things are variable, and that's what makes it hard and annoying. Don't overload a single memory channel: the worst case, channel-wise, is reading 64 aligned bytes and skipping the next 192, again assuming 4x 64-bit memory channels per CPU socket. Correct me if I'm wrong, but I think a single memory channel fills a single 64-byte cache line.
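A rough sketch of the "don't interleave" part, assuming Linux + libnuma (node numbers and sizes are placeholders, and the huge pages here come from the transparent-huge-page madvise hint rather than explicit 2 MB/1 GB hugetlbfs reservations):

    #include <numa.h>          // libnuma, link with -lnuma
    #include <sys/mman.h>
    #include <cstddef>

    // give each socket its own slab instead of interleaving pages
    // round-robin, and ask for 2 MB pages to cut down on TLB misses
    void *alloc_on_socket(size_t bytes, int node) {
        void *p = numa_alloc_onnode(bytes, node);   // policy: bind to this node
        if (p) madvise(p, bytes, MADV_HUGEPAGE);    // hint: back with 2 MB pages
        return p;                                   // release with numa_free(p, bytes)
    }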

And no matter what you do, don't write to the same cache lines from different threads, especially across NUMA regions. Also avoid locks and even atomic operations. Try to ensure that PCIe DMA also happens in the local NUMA region, as in the sketch below.
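For that last point, a sketch assuming libnuma plus a CUDA-style runtime, and that you've already looked up which node the GPU hangs off (e.g. from the device's numa_node entry in sysfs): bind the thread to that node before pinning, so the staging buffer's pages land locally and the DMA never crosses the socket interconnect.

    #include <numa.h>
    #include <cuda_runtime.h>

    // bind the calling thread's memory policy to the GPU's local node before
    // pinning, so the host staging buffer is allocated on that node and
    // host<->device DMA stays off the QPI link
    void *pinned_local_buffer(size_t bytes, int gpu_numa_node) {
        struct bitmask *nodes = numa_allocate_nodemask();
        numa_bitmask_setbit(nodes, gpu_numa_node);
        numa_bind(nodes);                 // cpu + memory affinity to that node
        numa_free_nodemask(nodes);

        void *h = nullptr;
        cudaMallocHost(&h, bytes);        // page-locked, now node-local
        return h;
    }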

I'm impressed you're getting 100 GB/s of CPU bandwidth. It's hard to avoid QPI saturation.



