
I'm having trouble seeing the value add here. From the Anandtech article [1], they are using 2x 512GB Samsung 950 Pro SSDs, attached over PCIe 3.0 x4 via M.2 connectors and a PCIe switch. The drives are presumably using NVMe.

The demo claims that without the SSDs they were rendering raw 8K video at 17 fps, and that the SSDs improved rendering to over 90 fps. How can this be such a significant improvement over accessing the same SSDs connected directly to a motherboard? The graphics card would have a PCIe 3.0 x16 connection...plenty of bandwidth and very low latency.
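Some back-of-envelope numbers (assuming uncompressed 8-bit RGBA, i.e. 4 bytes/pixel, which is an assumption on my part) suggest the bus really is in play:

```python
# Rough bandwidth math for raw 8K (7680x4320) video, assuming 4 bytes/pixel
width, height, bytes_per_pixel = 7680, 4320, 4
frame_bytes = width * height * bytes_per_pixel        # 132,710,400 bytes/frame

for fps in (17, 90):
    print(f"{fps:>2} fps -> {frame_bytes * fps / 1e9:.2f} GB/s")
# 17 fps -> 2.26 GB/s
# 90 fps -> 11.94 GB/s

# PCIe 3.0 is ~985 MB/s per lane after 128b/130b encoding
print(f"PCIe 3.0 x16 ceiling: ~{985e6 * 16 / 1e9:.2f} GB/s")
# PCIe 3.0 x16 ceiling: ~15.76 GB/s
```

So under those assumptions, 90 fps of raw 8K sits uncomfortably close to the x16 ceiling, and far beyond a chipset-shared x4 link.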

Maybe I'm missing something?

[1] - http://www.anandtech.com/show/10518/amd-announces-radeon-pro...




The writer had some of the same thoughts, and found this:

"The performance differential was actually more than I expected; reading a file from the SSG SSD array was over 4GB/sec, while reading that same file from the system SSD was only averaging under 900MB/sec, which is lower than what we know 950 Pro can do in sequential reads. After putting some thought into it, I think AMD has hit upon the fact that most M.2 slots on motherboards are routed through the system chipset rather than being directly attached to the CPU. This not only adds another hop of latency, but it means crossing the relatively narrow DMI 3.0 (~PCIe 3.0 x4) link that is shared with everything else attached to the chipset."


Thanks, I missed this. I noticed that PCIe 3.0 x1 bandwidth (985 MB/s) is just above what AMD reported in their throughput test...I wonder if AMD used a system with the SSD connected to an M.2 slot on the motherboard with just a PCIe 3.0 x1 link.
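For reference, the per-lane figure falls straight out of the PCIe 3.0 signaling rate (8 GT/s with 128b/130b line coding):

```python
# PCIe 3.0 per-lane usable throughput
raw_rate = 8e9                         # 8 GT/s per lane
payload_bits = raw_rate * 128 / 130    # 128b/130b line coding overhead
mb_per_lane = payload_bits / 8 / 1e6
print(f"x1: {mb_per_lane:.1f} MB/s")             # x1: 984.6 MB/s
print(f"x4: {mb_per_lane * 4 / 1000:.2f} GB/s")  # x4: 3.94 GB/s
```

That matches both the ~985 MB/s x1 figure and the ~4 GB/s DMI-class x4 ceiling from the quoted explanation.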

I should clarify that I do think this onboard SSD concept could be really compelling for certain use cases, such as needing to store several hundred gigs of data which needs to be randomly accessed.


I wonder if this would be useful to companies like Pixar and Weta Digital - I imagine a big speedup in frame rendering time or a reduction in the number of build machines required would be worth a lot of cash.


That is why I think SSDs should be attached directly to the CPU, not to the chipset. That would solve a whole lot of problems.


Weren't old DMA-attached HDDs exactly like this? And didn't we move away from that because copying anything between drives caused something like 50% CPU usage? Having a separate chipset manage your drives seems like a good idea, but the bus should definitely be faster in this case.


It takes the CPU (and all the buses in between) out of the request pipeline.


The current pipeline for such a request would be: the GPU sends a request for more data; the CPU receives it; the kernel figures out who should handle the request; the handler makes a request to the kernel for the data on the SSDs; the SSDs send the data back to main memory; the kernel tells the handler where the memory is; and finally the handler tells the kernel to send the memory to the video card.

In contrast, the route with the on-board SSDs is that the GPU makes the request, the ASIC that handles NVMe requests the data from the flash chips, and the flash chips send it back through the ASIC straight into GPU memory. There are a lot fewer steps there, and a lot fewer places where delayed interrupts, etc. can introduce lag.


With DMA, and given that the GPU is a PCIe bus master, I don't expect much userland/kernel/interrupt activity...here is my naive implementation of "traditional" 8K video streaming:

(CPU) WHILE video not done

  1 - Initiate DMA from SSD -> main memory with some large (say 128 MB) chunk of 8K video
  2 - Signal GPU to begin DMA transfer
  3 - Wait for interrupt from GPU

(GPU) WHILE video not done

  1 - Wait for CPU to indicate data is ready
  2 - Initiate DMA from main memory -> local GPU memory over PCIe 3.0 x16 link
  3 - Issue interrupt to CPU
  4 - Render chunk of video
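For comparison, the SSG path described upthread would collapse to something like this (same notation; hypothetical, based on my reading of the parent comment):

(GPU) WHILE video not done

  1 - Initiate DMA from on-board SSD -> local GPU memory
  2 - Render chunk of video

No CPU round trip, no main-memory staging, no shared chipset link.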


What you're missing is that PCIe 3.0 x16 is emphatically NOT plenty of bandwidth. It's more than what you might otherwise get, of course, but having a texture loaded locally on the GPU die is orders of magnitude different from reading it from main memory. It's also not "low latency," at least not relatively speaking. I would have killed for hardware like this when I worked on terrain rendering in the past.
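For a sense of scale (the HBM number is an assumed round figure for a Fiji-class card like the SSG, not from the article):

```python
# Local GPU memory vs. the bus feeding it, order-of-magnitude only
hbm_gb_s = 512.0        # assumed HBM bandwidth, Fiji-class GPU
pcie_x16_gb_s = 15.75   # PCIe 3.0 x16 theoretical ceiling
print(f"local memory is ~{hbm_gb_s / pcie_x16_gb_s:.0f}x the bus")
# local memory is ~33x the bus
```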


The software portion also adds tremendous value. It's not trivial to supply the GPU with data in an efficient manner. In my work I have a 30 GB volume I need to Fourier filter and remap.
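As an illustration of why that's non-trivial, here's a minimal sketch (NumPy; the shapes and filter are entirely hypothetical, not the poster's actual pipeline) of streaming a large volume through a Fourier filter one slab at a time - the kind of chunking you're forced into when the data dwarfs GPU memory:

```python
import numpy as np

def fourier_filter_slabs(volume, keep_fraction=0.5):
    """Low-pass filter each z-slab via FFT; illustrative only."""
    out = np.empty_like(volume)
    for z in range(volume.shape[0]):               # stream one slab at a time
        spec = np.fft.fftshift(np.fft.fft2(volume[z]))
        h, w = spec.shape
        kh, kw = int(h * keep_fraction / 2), int(w * keep_fraction / 2)
        mask = np.zeros((h, w), dtype=bool)        # keep only low frequencies
        mask[h//2 - kh:h//2 + kh, w//2 - kw:w//2 + kw] = True
        out[z] = np.fft.ifft2(np.fft.ifftshift(spec * mask)).real
    return out

# Tiny stand-in for a 30 GB volume: 4 slabs of 64x64
vol = np.random.rand(4, 64, 64).astype(np.float32)
print(fourier_filter_slabs(vol).shape)  # (4, 64, 64)
```

Every slab has to make the full host-memory-to-GPU round trip on a conventional setup, which is exactly the traffic an on-board SSD would shortcut.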



