> vs. aggregate CPU bandwidth of up to 150-200 GB/s

for streaming data into a CPU you'll be lucky to get double-digit bandwidths. peak figures are ~50 GB/s per socket, but for anything more than a memcpy it drops off like a cliff. then you also have NUMA issues, bank conflicts, and TLB misses if your data is big enough..
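to put a number on "anything more than a memcpy": the usual yardstick is a STREAM-style triad, which already streams three arrays instead of one. a minimal sketch (array names, the OpenMP setup, and how you'd time it are my own assumptions for illustration):

    #include <cstddef>

    // STREAM-style "triad": 2 streaming reads + 1 streaming write per element.
    // effective GB/s = 3 * n * sizeof(double) / elapsed_seconds / 1e9
    void triad(double *a, const double *b, const double *c, double s, size_t n) {
        #pragma omp parallel for schedule(static)
        for (size_t i = 0; i < n; ++i)
            a[i] = b[i] + s * c[i];
    }

time that against a plain memcpy of the same data and you'll see the cliff.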

i've written code that sustains >270 GB/s on high-end GPUs - it's not trivial, but it can be done.

you are correct, though, about the quantity of GPU memory available on an average GPU. the AMD S9150 has 16 GB of RAM - very high for a GPU, but nothing compared to high-end servers.

> PCIe 4.0, 16 lanes would give 30 GB/s

afaik it's not shipping in anything yet, so we're limited to ~6 GB/s for GPU <-> host transfers.. :/
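for reference, this is roughly how you'd check the host <-> device number on your own box. a rough sketch only - the buffer size is arbitrary and it assumes CUDA; the pinned allocation via cudaMallocHost is what gets you near the bus limit, pageable memory is noticeably slower (and a real measurement should do a warm-up copy first):

    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        const size_t bytes = 256ull << 20;        // 256 MB, arbitrary
        void *h, *d;
        cudaMallocHost(&h, bytes);                // pinned host buffer
        cudaMalloc(&d, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start); cudaEventCreate(&stop);
        cudaEventRecord(start);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("H2D: %.1f GB/s\n", bytes / (ms * 1e6));

        cudaFree(d);
        cudaFreeHost(h);
        return 0;
    }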

> With more realistic setting, CPU would be even more ahead.

depends. getting a good fraction of peak bandwidth on a GPU is fairly straightforward - coalesce your accesses. some algorithms need to be.. "massaged" into performing reads/writes like this, but in my experience a large portion of them can be.
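to make "coalesce accesses" concrete, here's a toy sketch - the same copy with two access patterns. the kernels and the stride are made up for illustration; timing them side by side shows the gap:

    // coalesced: thread i touches element i, so a warp's 32 loads fall in
    // a few contiguous 128-byte memory transactions
    __global__ void copy_coalesced(float *out, const float *in, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // strided gather: consecutive threads touch addresses `stride` floats
    // apart, so one warp's loads scatter across many cache lines and most
    // of each transaction is wasted
    __global__ void copy_strided(float *out, const float *in, long n, int stride) {
        long t = blockIdx.x * (long)blockDim.x + threadIdx.x;
        long i = t * stride;
        if (i < n) out[t] = in[i];
    }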

getting a decent fraction of peak on a CPU is a totally different ballgame, however.

IMO, if the data can persist on the GPU, then this could be a big win.




Well, don't set NUMA to interleave! Instead, fill all of the first socket's memory first, then all of the second socket's, and so on. Use 2 MB or 1 GB pages (you don't want a TLB miss every 4 kB!). DRAM-wise, prefetch for each memory channel to cover DRAM-internal penalties. I think DRAM bank-switch penalties span 256 bytes every 4, 8, or 16 kB, assuming 4 memory channels. Things are variable, and that's what makes it hard and annoying. Don't overload a single memory channel: the worst case, channel-wise, is reading 64 aligned bytes and skipping the next 192, again assuming 4x 64-bit memory channels per CPU socket. Correct me if I'm wrong, but I think a single memory channel fills a single 64-byte cache line.
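A rough sketch of the "don't interleave" part, assuming Linux + libnuma (node numbers and sizes are placeholders, and the huge pages here come from the transparent-huge-page madvise hint rather than explicit 2 MB/1 GB hugetlbfs reservations):

    #include <numa.h>          // libnuma, link with -lnuma
    #include <sys/mman.h>
    #include <cstddef>

    // give each socket its own slab instead of interleaving pages
    // round-robin, and ask for 2 MB pages to cut down on TLB misses
    void *alloc_on_socket(size_t bytes, int node) {
        void *p = numa_alloc_onnode(bytes, node);   // policy: bind to this node
        if (p) madvise(p, bytes, MADV_HUGEPAGE);    // hint: back with 2 MB pages
        return p;                                   // release with numa_free(p, bytes)
    }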

And no matter what you do, don't write to the same cache lines from different threads, especially across NUMA regions. Also avoid locks and even atomic operations. Try to ensure that PCIe DMA also happens in the local NUMA region, as in the sketch below.
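For that last point, a sketch assuming libnuma plus a CUDA-style runtime, and that you've already looked up which node the GPU hangs off (e.g. from the device's numa_node entry in sysfs): bind the thread to that node before pinning, so the staging buffer's pages land locally and the DMA never crosses the socket interconnect.

    #include <numa.h>
    #include <cuda_runtime.h>

    // bind the calling thread's memory policy to the GPU's local node before
    // pinning, so the host staging buffer is allocated on that node and
    // host<->device DMA stays off the QPI link
    void *pinned_local_buffer(size_t bytes, int gpu_numa_node) {
        struct bitmask *nodes = numa_allocate_nodemask();
        numa_bitmask_setbit(nodes, gpu_numa_node);
        numa_bind(nodes);                 // cpu + memory affinity to that node
        numa_free_nodemask(nodes);

        void *h = nullptr;
        cudaMallocHost(&h, bytes);        // page-locked, now node-local
        return h;
    }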

I'm impressed you're getting 100 GB/s of CPU bandwidth. It's hard to avoid QPI saturation.



