GpuScan and SSD-To-GPU Direct DMA (hatenablog.com)
197 points by matsuu on Sept 18, 2016 | 26 comments



There is no explanation of how it works. Does it work on top of existing APIs in user space? Or is there a custom kernel driver bypassing user space?

I've done some high throughput streaming from HD/SSD to GPU before, and it's pretty easy to beat the naive solution but getting the most out of it would require kernel space code.

I was doing random-access streaming of textures, using memory-mapped files for input and background threads that memcpy into persistent/coherent mapped pixel buffers on the CPU. This was intended to take advantage of the buffer cache (works great when a page is reused) and was designed for random access. If I had been working on a sequential/full-file upload, my solution would have been entirely different.
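A minimal sketch of the memory-mapped input side of that approach, in Python rather than the original code (which isn't shown here). The file name, tile size, and the `read_tile` helper are all hypothetical; the upload into GPU-mapped buffers is left out.

```python
import mmap
import os
import tempfile

def read_tile(mapped, tile_index, tile_size):
    """Return one fixed-size tile from a memory-mapped file as bytes."""
    start = tile_index * tile_size
    end = start + tile_size
    if tile_index < 0 or end > len(mapped):
        raise IndexError("tile index out of range")
    # Slicing an mmap copies only the requested bytes; pages that were
    # touched recently are typically served from the OS page cache.
    return mapped[start:end]

def demo():
    tile_size = 4096   # hypothetical tile size in bytes
    tile_count = 8
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "textures.bin")  # hypothetical file name
        # Write a small file where tile i is filled with the byte value i.
        with open(path, "wb") as f:
            for i in range(tile_count):
                f.write(bytes([i]) * tile_size)
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
                # Random-access order, with tile 5 requested twice.
                return [read_tile(mapped, i, tile_size)[0] for i in (5, 0, 7, 5)]
```

Calling `demo()` reads tiles 5, 0, 7 and 5 again; the repeated tile is the case where page reuse pays off.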

Edit: here's the source: https://github.com/kaigai/ssd2gpu

It has a custom kernel module.


Its kernel module provides some special APIs. The userspace application (PostgreSQL) is enhanced to use them. From the user's point of view, SQL is still the interface for accessing the data.


This is very interesting in light of AMD's recent announcement of their "Solid State Graphics", i.e. a GPU with an SSD duct-taped on: http://www.anandtech.com/show/10518/amd-announces-radeon-pro...


This would be incredibly useful for distributed machine learning - imagine a Tensorflow implementation that almost entirely bypasses CPU.


For most applications, getting training images onto the GPU isn't the bottleneck by far. Training the Inception model, for example, handles batches of 32 images (299x299x3) in 1.2 seconds. That's a pretty boring ~300KB * 32 ~= 10MB/sec of read bandwidth off the SSD for Imagenet. Even dealing with "real" images is probably only 10x that, which is trivial to get over the PCI bus.
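The back-of-envelope figure can be checked with a few lines of Python. This sketch assumes one byte per channel (uncompressed 8-bit pixels), which the comment doesn't state; the function name is made up for illustration.

```python
def read_bandwidth_mb_per_s(width, height, channels, batch_size, seconds_per_batch):
    """Raw pixel bytes consumed per second, in MB/s (1 MB = 10**6 bytes).

    Assumes one byte per channel, i.e. uncompressed 8-bit images.
    """
    bytes_per_image = width * height * channels
    bytes_per_batch = bytes_per_image * batch_size
    return bytes_per_batch / seconds_per_batch / 1e6

# Numbers from the comment: 32 images of 299x299x3 every 1.2 seconds.
inception = read_bandwidth_mb_per_s(299, 299, 3, 32, 1.2)
```

Under that assumption the result is roughly 7 MB/s, the same order of magnitude as the ~10MB/sec quoted above.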

The question would be whether we can turn the crank on the design of models to make it possible to do something really cool given access to very high-speed SSD storage.


I was thinking the same thing, but is SSD to GPU faster than RAM to GPU? In many (not all) cases you buy a tonne of RAM and load your entire dataset into memory once and then iterate over it as necessary.

You also lose the flexibility of doing any sort of data modification or augmentation. One domain where your data usually doesn't fit in RAM is image recognition, but often you want to do things like apply random flips, crops and change hues before training to make the neural net less sensitive to those changes, which you can't really do with this.


SSD is probably not as fast as RAM, but it's much, much cheaper, on the order of 10x per gigabyte. With an SSD-GPU bridge you can have fast access to a multi-TiB training set on a single machine.

Data pre-processing is indeed an issue, but hue adjustment/flipping/cropping could be implemented as Tensorflow operations, on the GPU. Similarly with input decompression - it would either have to be done on GPU, or the data would have to be stored uncompressed.


As long as the average bandwidth isn't a bottleneck, it's not going to matter - at worst, you're just going to need to prefetch (and due to SSD latency, that's likely optimal regardless).
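A rough sketch of that kind of prefetching, using a background thread and a bounded queue from the Python standard library. `load_batch` is a hypothetical stand-in for whatever actually reads a batch from the SSD, and `depth` is an arbitrary buffer size.

```python
import queue
import threading

def prefetching_reader(load_batch, num_batches, depth=2):
    """Yield batches in order while a background thread loads ahead."""
    buffer = queue.Queue(maxsize=depth)  # bounded: at most `depth` batches waiting
    done = object()                      # sentinel marking the end of the stream

    def worker():
        try:
            for i in range(num_batches):
                buffer.put(load_batch(i))  # blocks while the buffer is full
        finally:
            # Always signal the end so the consumer never hangs, even if
            # load_batch raised (the stream is then simply cut short).
            buffer.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            return
        yield item
```

For example, `list(prefetching_reader(lambda i: i * i, 5))` yields the five results in order while loading overlaps with consumption. As long as average load time stays below average consume time, the consumer rarely waits.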


RAM-to-GPU is always faster than SSD-to-GPU. This is a solution for the situation where the data does not fit in RAM (or where the user does not have the budget to buy enough RAM; in fact, an Intel SSD 750 (400GB) can be purchased for 300 USD).


For the scenario you're targeting, databases, this makes a tonne of sense: database data regularly exceeds the size of RAM, and the operations you want to do on the data are pretty static in the sense that they're the SQL operators.

In deep learning you are usually doing a lot more custom processing and your datasets are usually not as big, such that just buying more RAM is often cost effective.


So, if I understand correctly, data is being loaded directly from the SSD to the GPU and then filtered by the GPU before the CPU handles the more difficult queries.

Neat.


This is very awesome. If further developed + made into a feasible option for PostgreSQL, this has potential to do interesting things to TPC benchmarks. :)


See also https://developer.nvidia.com/gpudirect and to some extent https://en.wikipedia.org/wiki/NVLink.

NVLink is in the Power9 servers Google is using.


AFAIK Intel is stonewalling NVLink on their CPUs so they can (try to) sell Knights Landing. Quite a shame, although it might hurt them in the long run if they drive more institutes to buy ARM plus Tesla or Power plus Tesla clusters.


This is a perfect example of why we need healthy competition in the CPU market.

(Disclaimer: I don't have any way to verify whether the parent post is true, but I think the point stands regardless of whether this specific case is true or not.)


Do we have any data that Google is actually using their POWER9 boxes? I always read that as investing in anything not Intel (see also RISC-V), just to be in a better negotiating position with Intel for the vast number of CPUs that they buy.


Allegedly they have Power8 servers in their data centres:

Maire Mahoney, engineering manager at Google and now a director of the OpenPower Foundation, confirmed to The Next Platform that Google does indeed have custom Power8 machines running in its datacenters and that developers can deploy key Google applications onto these platforms if they see fit. Mahoney was not at liberty to say how many Power-based machines are running in Google’s datacenters or what particular workloads were running in production (if any).[1]

It's pretty unclear what that actually means, though.

[1] http://www.nextplatform.com/2016/04/06/inside-future-google-...


AFAIK POWER9 is not even out yet.


I'm really hoping that Optane delivers on the hype, in which case our durable storage could be just 10x slower than RAM. At least, I imagine that it would be really helpful for speeding up even this approach.


I hope this brings us closer to widespread external GPUs, where you could use a slower-than-PCIe bus like Thunderbolt 3 or USB 3.1 to upload all assets to the eGPU's SSD during a one-time loading screen.


Direct Direct Memory Access? That's pretty direct.


Redundancy makes it pretty indirect


My headache is painful. It might be called "SSD-to-GPU P2P DMA".


Amazing results! We need more of that kind of thinking - GPU/SSD accelerate all the things!


Who is providing the DMA engine in this case? Has the GPU access to PCIe device memory?


The NVMe SSD acts as the DMA controller in this case. All the GPU does is map its own device memory onto the PCI BAR area.



