GpuScan and SSD-To-GPU Direct DMA

exDM69 · on Sept 18, 2016

There is no explanation how it works. Does it work on top of existing APIs in user space? Or is there a custom kernel driver bypassing user space?

I've done some high throughput streaming from HD/SSD to GPU before, and it's pretty easy to beat the naive solution but getting the most out of it would require kernel space code.

I was doing random access streaming of textures using memory mapped files for input and copying to persistent/coherent mapped pixel buffers on the CPU with memcpy with background threads. This was intended to take advantage of the buffer caches (works great when a page is reused) and intended for random access. If I would have been working on a sequential/full file upload, my solution would be entirely different.

Edit: here's the source: https://github.com/kaigai/ssd2gpu

It has a custom kernel module.

kaigai · on Sept 18, 2016

Its kernel module provides some special APIs. The userspase application (PostgreSQL) is enhanced to use them. From the point of user view, SQL still has been the interface to access the data.

zokier · on Sept 18, 2016

This is very interesting in the light of recent AMD announcement of their "Solid State Graphics", ie GPU with SSD ducktaped on: http://www.anandtech.com/show/10518/amd-announces-radeon-pro...

foobar2020 · on Sept 18, 2016

This would be incredibly useful for distributed machine learning - imagine a Tensorflow implementation that almost entirely bypasses CPU.

dgacmu · on Sept 18, 2016

For most applications, getting training images onto the GPU isn't the bottleneck by far. Training the Inception model, for example, handles batches of 32 images (299x299x3) in 1.2 seconds. That's a pretty boring ~300KB * 32 ~= 10MB/sec of read bandwidth off the SSD for Imagenet. Even dealing with "real" images is probably only 10x that, which is trivial to get over the PCI bus.

The question would be whether we can turn the crank on the design of models to make it possible to do something really cool given access to very high-speed SSD storage.

Eridrus · on Sept 18, 2016

I was thinking the same thing, but is SSD to GPU faster than RAM to GPU? In many (not all) cases you buy a tonne of RAM and load your entire dataset into memory once and then iterate over it as necessary.

You also lose the flexibility of doing any sort of data modification or augmentation. One domain where your data usually doesn't fit in RAM is image recognition, but often you want to do things like apply random flips, crops and change hues before training to make the neural net less sensitive to those changes, which you can't really do with this.

foobar2020 · on Sept 18, 2016

SSD is probably not as fast as RAM, but it's much much cheaper, in the order of 10x per gigabyte. With SSD-GPU bridge you can have fast access to a multiple TiB training set, on a single machine.

Data pre-processing is indeed an issue, but hue adjustment/flipping/cropping could be implemented as Tensorflow operations, on the GPU. Similarly with input decompression - it would either have to be done on GPU, or the data would have to be stored uncompressed.

emn13 · on Sept 18, 2016

As long as the average bandwidth isn't a bottleneck, it's not going to matter - at worst, you're just going to need to prefetch (and due to SSD latency, that's likely optimal regardless).

kaigai · on Sept 19, 2016

RAM-to-GPU is always faster than SSD-to-GPU. It is a solution to help a situation when data size does not fit RAM size (or when user has less budget to purchase enough RAM. In fact, we can purchase Intel SSD 750 (400GB) with 300USD).

Eridrus · on Sept 19, 2016

For the scenario you're targeting: databases, this makes a tonne of sense, database data regularly exceeds the size of RAM and the operations you want to do on the data are pretty static in the sense that they're the SQL operators.

In deep learning you are usually doing a lot more custom processing and your datasets are usually not as big, such that just buying more RAM is often cost effective.

witty_username · on Sept 18, 2016

So, if I understand correctly, data is being loaded directly from the SSD to the GPU and then filtered by the GPU before the CPU handles the more difficult queries.

Neat.

justinclift · on Sept 18, 2016

This is very awesome. If further developed + made into a feasible option for PostgreSQL, this has potential to do interesting things to TPC benchmarks. :)

nl · on Sept 18, 2016

See also https://developer.nvidia.com/gpudirect and to some extent https://en.wikipedia.org/wiki/NVLink.

NVLink is in the Power9 servers Google is using.

m_mueller · on Sept 18, 2016

AFAIK Intel is stonewalling NVLink on their CPUs so they can (try to) sell Knight's landing. Quite a shame, although it might hurt then in the long run if they drive more institutes to buy ARM plus Tesla or Power plus Tesla clusters.

pcwalton · on Sept 18, 2016

This is a perfect example of why we need healthy competition in the CPU market.

(Disclaimer: I don't have any way to verify whether the parent post is true, but I think the point stands regardless of whether this specific case is true or not.)

monocasa · on Sept 18, 2016

Do we have any data that Google is actually using their POWER9 boxes? I always read that as investing in anything not Intel (see also RISC-V), just to be in a better negotiating position with Intel for the vast number of CPUs that they buy.

nl · on Sept 18, 2016

Allegedly they have Power8 servers in their data centres:

Maire Mahoney, engineering manager at Google and now a director of the OpenPower Foundation, confirmed to The Next Platform that Google does indeed have custom Power8 machines running in its datacenters and that developers can deploy key Google applications onto these platforms if they see fit. Mahoney was not at liberty to say how many Power-based machines are running in Google’s datacenters or what particular workloads were running in production (if any).[1]

It's pretty unclear what that actually means, though.

[1] http://www.nextplatform.com/2016/04/06/inside-future-google-...

gpderetta · on Sept 18, 2016

AFAIK POWER9 is not even out yet.

carbocation · on Sept 18, 2016

I'm really hoping that Optane delivers on the hype, in which case our durable storage could be just 10x slower than RAM. At least, I imagine that it would be really helpful for speeding up even this approach.

Razengan · on Sept 18, 2016

I hope this brings us closer to widespread external GPUs, where you could use a slower-than-PCIe bus like Thunderbolt 3 or USB 3.1 to upload all assets to the EGPU's SSD during a one-time loading screen.

foobarbecue · on Sept 18, 2016

Direct Direct Memory Access? That's pretty direct.

flamedoge · on Sept 18, 2016

Redundancy makes it pretty indirect

kaigai · on Sept 19, 2016

My headache is painful. It might be called "SSD-to-GPU P2P DMA".

musha68k · on Sept 18, 2016

Amazing results! We need more of that kind of thinking - GPU/SSD accelerate all the things!

MrBuddyCasino · on Sept 18, 2016

Who is providing the DMA engine in this case? Has the GPU access to PCIe device memory?

kaigai · on Sept 18, 2016

NVME-SSD performs as DMAC in this case. All GPU doing is mapping its own device memory on the PCI BAR area.