I'm having trouble seeing the value add here. From the Anandtech article [1] the...

DigitalJack · on July 26, 2016

The writer had some of the same thoughts, and found this:

"The performance differential was actually more than I expected; reading a file from the SSG SSD array was over 4GB/sec, while reading that same file from the system SSD was only averaging under 900MB/sec, which is lower than what we know 950 Pro can do in sequential reads. After putting some thought into it, I think AMD has hit upon the fact that most M.2 slots on motherboards are routed through the system chipset rather than being directly attached to the CPU. This not only adds another hop of latency, but it means crossing the relatively narrow DMI 3.0 (~PCIe 3.0 x4) link that is shared with everything else attached to the chipset."

protomok · on July 26, 2016

Thanks, I missed this. I noticed that PCIe 3.0 x1 bandwidth (985 MB/s) is just above what AMD reported in their throughput test. I wonder if AMD used a system with the SSD connected to an M.2 slot on the motherboard with just a PCIe 3.0 x1 link.
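For reference, the 985 MB/s figure can be reproduced with a back-of-the-envelope calculation. This is a minimal sketch, assuming only the raw PCIe 3.0 signalling rate (8 GT/s per lane) and 128b/130b encoding; packet and protocol overhead are ignored, so real throughput is somewhat lower. The function name is made up for illustration.

```python
# Theoretical PCIe 3.0 bandwidth per direction (assumption: protocol
# overhead such as packet headers is ignored, only line encoding counts).
GT_PER_S = 8e9          # raw transfers per second per lane
ENCODING = 128 / 130    # 128b/130b line encoding efficiency


def pcie3_bandwidth_mb_s(lanes: int) -> float:
    """Return theoretical bandwidth in MB/s (1 MB = 10**6 bytes)."""
    bits_per_s = GT_PER_S * ENCODING * lanes
    return bits_per_s / 8 / 1e6


for lanes in (1, 4, 16):
    print(f"x{lanes}: ~{pcie3_bandwidth_mb_s(lanes):,.0f} MB/s")
```

The x4 result is also roughly the ceiling of the DMI 3.0 link mentioned in the quote, which everything attached to the chipset has to share.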

I should clarify that I do think this onboard SSD concept could be really compelling for certain use cases, such as needing to store several hundred gigs of data which needs to be randomly accessed.

voltagex_ · on July 27, 2016

I wonder if this would be useful to companies like Pixar and Weta Digital - I imagine a big speedup in frame rendering time or a reduction in number of build machines required would be worth lots of cash.

ksec · on July 27, 2016

That is why I think SSDs should be attached directly to the CPU, not to the chipset. This would solve a whole lot of problems.

gambiting · on July 27, 2016

Weren't old HDD DMA drives exactly like this? And we moved away from it because copying anything between drives caused something like 50% CPU usage? Having a separate chipset managing your drives seems like a good idea, but the bus should definitely be faster in this case.

rtkwe · on July 26, 2016

It takes the CPU (and all the buses in between) out of the request pipeline.

Sanddancer · on July 26, 2016

The current pipeline for such a request would be the GPU sending a request for more data, the CPU receiving it, the kernel figuring out who should handle the request, the handler then making a request to the kernel for the data on the SSDs, the SSDs sending the data back to main memory, the kernel telling the handler where the memory is, then the handler telling the kernel to send the memory to the video card.

In contrast, the route with the on-board SSDs is that the GPU makes the request, the ASIC that handles NVMe requests the data from the memory chips, the memory chips send it to the ASIC, and the data goes into memory. There are a lot fewer steps there, and a lot fewer places where delayed interrupts, etc. can introduce lag.
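The two routes described above can be laid side by side to make the difference in step count concrete. This is only an illustrative sketch: the step wording is paraphrased from the comment, and the flag marking which steps involve the host CPU or kernel is one reading of that description, not a measurement.

```python
# Each step is (description, involves_host_cpu). The boolean flags are an
# assumption based on the description above, not profiling data.
HOST_PATH = [
    ("GPU sends a request for more data", False),
    ("CPU receives the request", True),
    ("kernel works out which handler owns the request", True),
    ("handler asks the kernel for the data on the SSDs", True),
    ("SSDs send the data to main memory", False),
    ("kernel tells the handler where the data is", True),
    ("handler tells the kernel to send the data to the video card", True),
]

ONBOARD_PATH = [
    ("GPU makes the request", False),
    ("NVMe ASIC requests the data from the memory chips", False),
    ("memory chips send the data to the ASIC", False),
    ("data goes into memory", False),
]


def cpu_steps(path):
    """Count the steps in which the host CPU or kernel takes part."""
    return sum(1 for _, involves_cpu in path if involves_cpu)


for name, path in (("host", HOST_PATH), ("on-board", ONBOARD_PATH)):
    print(f"{name}: {len(path)} steps, {cpu_steps(path)} involving the CPU")
```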

protomok · on July 26, 2016

With DMA, and given that the GPU is a PCIe bus master, I don't expect much userland/kernel/interrupt activity. Here is my naive implementation of "traditional" 8K video streaming:

(CPU) WHILE video not done

  1 - Initiate DMA from SSD -> main memory with some large (say 128MB) chunk of 8k video

  2 - Signal GPU to begin DMA transfer

  3 - Wait for interrupt from GPU

(GPU) WHILE video not done

  1 - Wait for CPU to indicate data is ready

  2 - Initiate DMA from main memory -> local GPU memory over PCIE 3.0 x16 link

  3 - Issue interrupt to CPU

  4 - Render chunk of video
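The two loops above can be mimicked in software to check that the handshake makes sense. This is a rough simulation, not driver code: two threads stand in for the CPU and GPU, queues stand in for the "data ready" signal and the interrupt, and log lines stand in for the DMA transfers and rendering. `VIDEO_CHUNKS` and all function names are hypothetical.

```python
import queue
import threading

VIDEO_CHUNKS = 4  # hypothetical length of the video, in chunks


def cpu_loop(data_ready, gpu_done, log):
    for i in range(VIDEO_CHUNKS):
        log.append(f"cpu: DMA chunk {i} SSD -> main memory")  # CPU step 1
        data_ready.put(i)                                     # CPU step 2
        gpu_done.get()                                        # CPU step 3
    data_ready.put(None)  # sentinel: video is done


def gpu_loop(data_ready, gpu_done, log):
    while True:
        i = data_ready.get()                                  # GPU step 1
        if i is None:
            break
        log.append(f"gpu: DMA chunk {i} main memory -> GPU memory")  # step 2
        gpu_done.put(i)                                       # GPU step 3
        log.append(f"gpu: render chunk {i}")                  # GPU step 4


def run():
    """Run both loops to completion and return the event log."""
    data_ready, gpu_done, log = queue.Queue(), queue.Queue(), []
    threads = [
        threading.Thread(target=cpu_loop, args=(data_ready, gpu_done, log)),
        threading.Thread(target=gpu_loop, args=(data_ready, gpu_done, log)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


for line in run():
    print(line)
```

Because the GPU raises its interrupt before rendering, the CPU can start fetching the next chunk while the current one is being rendered, so the relative order of those two log lines varies from run to run.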

banachtarski · on July 26, 2016

What you're missing is that PCIe 3.0 x16 is emphatically NOT plenty of bandwidth. It's more than what you might otherwise get, of course, but having a texture loaded locally on the GPU die is orders of magnitude different from reading it from main memory. It's also not "low latency", at least not relatively speaking. I would have killed for hardware like this when I worked with terrain rendering in the past.

frozenport · on July 26, 2016

The software portion also adds tremendous value. It's not trivial to supply the GPU with data in an efficient manner. In my work I have a 30 GB volume I need to Fourier filter and remap.