GPGPU Accelerates PostgreSQL (slideshare.net)
221 points by lelf on Dec 23, 2014 | 52 comments



So glad to see this coming along.

If any of the project team are reading, what I'd like to see most is GPU-accelerated point-in-polygon lookups in postGIS, ST_Contains and so forth.


It's likely GPUs are slower than CPUs for spatial data structures. Getting the data to the GPU and the results back just takes too long. Point-in-polygon is also very branchy in the general case. GPUs are really bad with branchy code, so it's very unlikely you could even GPU-accelerate such a query if it operates on point coordinates and a set of polygon vertices. At least it would be very hard.

Edit: right, after thinking about it, the branches can be optimized out. It could be fast if there is a set of sorted segments: just parallel compares and some boolean logic.

Which leaves the problem of getting the data to the GPU, because you can definitely stream the same comparisons on the CPU much faster (memory-bandwidth limited) than you can stream the data to the GPU over PCIe.

So for a 2-socket CPU system, such as a Xeon E5, I'd bet on the CPU. PCIe 4.0 with 16 lanes would give ~30 GB/s (not sure PCIe 4.0 is supported anywhere yet), vs. aggregate CPU bandwidth of up to 150-200 GB/s. A dual-socket Xeon E5 supports at least 1 TB of RAM (16x 16 GB buffered DDR4). 32 GB DDR4 memory modules exist as well, I think, and larger DIMM banks than 16 slots can be supported; that's just the number of slots on typical 2-socket mainboards.
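
As a back-of-envelope illustration of the gap being argued here, a rough sketch in C; the working-set size and the 30 GB/s and 150 GB/s figures are just the numbers quoted above, not measurements:

    /* Rough time to stream a working set once over PCIe vs. from local DRAM,
     * using the bandwidth figures quoted in this comment. */
    #include <stdio.h>

    int main(void)
    {
        double table_gb = 100.0;   /* hypothetical working set size */
        double pcie_gbs = 30.0;    /* PCIe 4.0 x16, as quoted above */
        double dram_gbs = 150.0;   /* aggregate dual-socket DRAM bandwidth */

        printf("PCIe transfer: %.2f s\n", table_gb / pcie_gbs);   /* ~3.3 s  */
        printf("DRAM scan:     %.2f s\n", table_gb / dram_gbs);   /* ~0.67 s */
        return 0;
    }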

With a more realistic setup, the CPU would be even further ahead. All of this is ignoring GPU latency issues, which can be anywhere from microseconds to tens of milliseconds in pathological cases.

Unless the data was on the GPU in the first place... I think a single GPU can currently have up to 12 GB of RAM. Maybe larger GPUs exist too. That's just not much RAM compared to what is typical for CPUs. Currently the smallest amount of RAM a standard dual-socket Xeon E5 v3 server can have is 64 GB, if all memory channels have at least one DIMM.


What's important to keep in mind is the progress that AMD has been making with their Heterogeneous System Architecture. With high-frequency RAM (especially once we see widespread availability of DDR4) shared between a CPU and GPU on a single die, all of this communication overhead goes away.

There is limited software support right now, because this architecture is very new, but on the benchmarks that take advantage of the on-die GPU, AMD's latest can keep up with and surpass much more expensive i7s. We're still at a point where it's unclear that AMD's HSA will take a commanding lead, but it's promising, especially considering the price/power requirements for an A10 (~$160 currently), vs the equivalent of a high end GPU and a Xeon.

You could, ignoring storage and peripherals (reasonable in a server farm arrangement) put together many more iGPU boxes than Xeon/dGPU boxes.

Even assuming moderate gains from GPU acceleration, high-throughput database servers could be made cheaper through this method.


Point-in-polygon actually seems like a good problem for the GPU. It can be calculated with a simple angle calculation performed in parallel against all segments. Trig functions are fairly heavyweight, and run serially or with limited threading on the CPU this becomes a large computation, making it a good candidate for offloading to the GPU.
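
For reference, a minimal C sketch of the angle-summation (winding number) approach described here; the vertex arrays and function name are illustrative, and each edge costs two atan2 calls, which is exactly the trig expense in question:

    /* Winding-number point-in-polygon test via summed signed angles.
     * A total of ~2*pi means the point is inside, ~0 means outside. */
    #include <math.h>
    #include <stdbool.h>
    #include <stddef.h>

    static const double PI = 3.14159265358979323846;

    bool point_in_polygon_angles(double px, double py,
                                 const double *vx, const double *vy, size_t n)
    {
        double total = 0.0;
        for (size_t i = 0; i < n; i++) {
            size_t j = (i + 1) % n;
            double a1 = atan2(vy[i] - py, vx[i] - px);
            double a2 = atan2(vy[j] - py, vx[j] - px);
            double d  = a2 - a1;
            /* wrap into (-pi, pi] so each edge contributes a signed angle */
            while (d >  PI) d -= 2.0 * PI;
            while (d <= -PI) d += 2.0 * PI;
            total += d;
        }
        return fabs(total) > PI;
    }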


Point-in-polygon is usually done with simple vector arithmetic, not trig.
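
To make that concrete, here is a minimal sketch of the usual even-odd (ray-crossing) test in C; only comparisons and one multiply/divide per edge, no trig, and the array names are illustrative:

    /* Even-odd rule: cast a horizontal ray from the point and count how many
     * polygon edges it crosses. An odd count means the point is inside. */
    #include <stdbool.h>
    #include <stddef.h>

    bool point_in_polygon(double px, double py,
                          const double *vx, const double *vy, size_t n)
    {
        bool inside = false;
        for (size_t i = 0, j = n - 1; i < n; j = i++) {
            /* does edge (j, i) straddle the horizontal line y = py? */
            if ((vy[i] > py) != (vy[j] > py)) {
                /* x coordinate where the edge crosses that line */
                double x = vx[j] + (py - vy[j]) * (vx[i] - vx[j]) / (vy[i] - vy[j]);
                if (px < x)
                    inside = !inside;
            }
        }
        return inside;
    }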


Surely a non-iterative angle-calculating method would only work with convex polygons?


> vs. aggregate CPU bandwidth of up to 150-200 GB/s

for streaming data into a CPU you'll be lucky to get double-digit bandwidths. peak figures are ~50 GB/s per socket, but for anything more than a memcpy, it drops off like a cliff. then you also have NUMA issues, bank conflicts, and TLB misses if your data is big enough..

i've written codes that sustain >270 GB/s on high end GPUs - it's not trivial, but it can be done.

you are correct though, about the quantity of GPU memory available on an average GPU. the AMD S9150 has 16 GB of RAM. very high for a GPU, but nothing compared to high end servers.

> PCIe 4.0, 16 lanes would give 30 GB/s

afaik it's not in anything, so we're limited to 6 GB/s for GPU <-> host.. :/

> With more realistic setting, CPU would be even more ahead.

depends. getting a good fraction of peak bandwidth on a GPU is fairly straightforward - coalesce accesses. some algorithms need to be.. "massaged" into performing reads/writes like this, but in my experience, a large portion of them can be.
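
a tiny OpenCL C sketch of what "coalesced" means in practice (kernel and buffer names are made up for illustration): adjacent work-items touch adjacent elements, so one memory transaction serves a whole warp/wavefront, whereas a strided pattern wastes most of each transaction:

    /* Coalesced: work-item i reads element i, so neighbours share transactions. */
    __kernel void scale_coalesced(__global const float *in,
                                  __global float *out, int n)
    {
        int i = get_global_id(0);
        if (i < n)
            out[i] = in[i] * 2.0f;
    }

    /* Strided: neighbouring work-items hit far-apart cache lines. */
    __kernel void scale_strided(__global const float *in,
                                __global float *out, int n, int stride)
    {
        int i = get_global_id(0);
        if (i * stride < n)
            out[i] = in[i * stride] * 2.0f;
    }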

getting a decent fraction of peak on a CPU is a totally different ballgame, however.

IMO, if the data can persist on the GPU, then this could be a big win.


Well, don't set NUMA to interleave! Instead allocate all of the first socket's memory first, then all of the second socket's memory, and so on. Use 2 MB or 1 GB pages (you don't want a TLB miss every 4 kB!). DRAM-wise, prefetch for each memory channel to cover DRAM-internal penalties. I think DRAM bank-switch penalties span 256 bytes, assuming 4 memory channels, every 4, 8 or 16 kB. Things are variable; that's what makes it hard and annoying. Don't overload a single memory channel. The worst case, memory-channel-wise, is to read 64 bytes aligned and skip the next 192 bytes. Again, assuming 4x 64-bit memory channels per CPU socket. Correct me if I'm wrong, but I think a single memory channel fills a single 64-byte cache line.

And no matter what you do, don't write to the same cache lines, especially across NUMA regions. Also avoid locks and even atomic operations. Try to ensure PCIe DMA also happens in the local NUMA region.
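
A minimal sketch (assuming Linux with libnuma; sizes and node ids are purely illustrative) of two of the suggestions above: binding a buffer to one NUMA node instead of interleaving, and backing memory with 2 MB huge pages to avoid a TLB miss every 4 kB:

    #include <numa.h>        /* link with -lnuma */
    #include <sys/mman.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support\n");
            return 1;
        }

        size_t sz = 1UL << 30;                   /* 1 GiB working set */

        /* All pages placed on node 0, so threads pinned there see local latency. */
        void *local = numa_alloc_onnode(sz, 0);

        /* Separate mapping backed by 2 MB huge pages (hugepages must be reserved). */
        void *huge = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (huge == MAP_FAILED)
            perror("mmap(MAP_HUGETLB)");

        /* ... stream through the buffers here ... */

        if (huge != MAP_FAILED)
            munmap(huge, sz);
        numa_free(local, sz);
        return 0;
    }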

I'm impressed by getting 100 GB/s of CPU bandwidth. It's hard to avoid QPI saturation.


If the data can stay on the GPU, then it is likely a win. GPUs have 8 GB or more now, so it depends on how much polygon data one has.


Not necessarily: even if the data is on the GPU and doesn't pay the PCIe transfer penalty, GPUs still have cache hierarchies, and those have latencies too. They can be worse than on CPUs, because GPU branch predictors and prefetchers are still fairly primitive compared to what CPUs are capable of. That means access patterns on a GPU can actually matter quite a bit: you end up having to change block size per GPU type, and code which works very well on one GPU doesn't work as well on another.

However, point-in-polygon is a fairly simple algorithm, and if each polygon mostly has < 40 vertices, I suspect a GPU might be faster. For more complex algorithms GPUs don't do as well, and with many more vertices I suspect GPUs won't do as well for point-in-polygon tests either.

In terms of raw theoretical FP processing power, GPUs look good - but when you start to do more complex things with them, i.e. when branching happens a lot, say with path tracing, they don't look as good. E.g. a dual-Xeon setup of 3.5 GHz quad-core chips (costing ~£950 each) is as fast at path tracing as a single NVidia K6000 costing ~£4100.


Here is a production-quality path tracer that for most users runs noticeably faster than competing CPU-based renderers: https://www.redshift3d.com/

Pragmatically, it produces results of similar quality quicker than CPU-based competitors.

It is really taking the high end rendering world by storm this year.


Erm??...

That's a biased renderer that uses all sorts of caching and approximations that no CPU-based renderer supports (VRay comes closest, with its ability to configure primary and secondary rays to use different irradiance cache methods), and as Redshift doesn't support CPU rendering, it's hardly a comparison worth talking about: you'd be comparing different algorithms. The pure brute-force numbers I've seen for it, without any caching, don't look any better than the other top CPU renderers doing brute-force MC integration.

Also, a quibble, but I guess by "high-end rendering world" you mean archviz (where VRay and 3DSMax are dominant) and a few small VFX studios who happen to be running Windows?


You're selling it a little short: pretty much everyone not doing feature films, i.e. game cinematics, commercials and product viz.


"Pretty much everyone"

Really?

I know companies like Blur are trialling it, but they're still using VRay. I know The Mill have done stuff with it, but they're still using Arnold too.


I misread; I thought you were saying not many studios were using Windows and VRay. Agree on Redshift.


> Getting the data to GPU and results back takes just too long

I was involved in a database research project recently, and this is exactly what we found: sure, GPGPUs and the like are much faster than CPUs for the right database queries, but the transfer overhead is so absolutely horrendous that it completely dwarfs any gains in execution time.


Did you throw AMD APUs into the project, or does their lack of x86_64 per-core performance kill the advantage of the shared memory pool?


No AMD stuff. We just had a box with a couple of Xeons, a Xeon Phi, and a K20.


Point-in-polygon is difficult in GIS because the polygon can have a very large number of vertices that all have to be checked. Does a GPU have sufficient bandwidth to load large polygons quickly? Do GPUs have to sample each vertex every time?

Often the best way is to divide polygons into tiles that have a limited number of vertices (this also makes indexing much more effective).


About a year ago there was a lot of press about a GPU database called MapD. It's not released yet, but most of their demos involve spatial data.

cf http://blogs.nvidia.com/blog/2013/11/20/juicing-big-data-sta...


Yeah this is a problem well suited to parallelism. Oracle 12c's spatial stuff can be configured to leverage AVX for these sorts of problems.

And let's admit it, writing SQL statements with `vector group by ...` makes you feel like a bad-ass.


You should submit data and queries that you would like accelerated.


As an aside, if this is of interest, you might want to look at the third run of the GPGPU course from Coursera/University of Illinois, which starts in January. See: https://www.coursera.org/course/hetero


It's great to see that they're using OpenCL. GPU computation desperately needs standardization, and this could help bring OpenCL drivers on par with CUDA.


A lot is lost in translation when it comes to OpenCL. Current nVidia and AMD GPUs are just wide SIMD machines, much like current x86 cores; GPUs are just wider, with much less cache and slower serial execution.


"Current nVidia and AMD GPUs are just wide SIMD machines"

That's not really true; you can kind of treat it that way, depending on the algorithm (e.g., matrix multiplication breaks down cleanly that way), but there are serious flexibility advantages in the "single-instruction-multiple-thread" model vs. SIMD. For example, consider streaming large numbers of hash lookups - difficult to express clearly with a pure vector-processing model.
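
A rough OpenCL C sketch of that hash-lookup example (the table layout, hash, and names are invented for illustration): each work-item follows its own probe chain and exits independently, which is natural in the SIMT model but awkward to phrase as pure fixed-width SIMD:

    /* Each work-item probes an open-addressed table for its own query key. */
    __kernel void probe(__global const uint *table_keys,
                        __global const uint *queries,
                        __global int        *slot_out,
                        uint table_size)
    {
        size_t gid = get_global_id(0);
        uint q = queries[gid];
        uint h = (q * 2654435761u) % table_size;      /* multiplicative hash */

        for (uint step = 0; step < table_size; step++) {
            uint s = (h + step) % table_size;         /* linear probing */
            if (table_keys[s] == q)  { slot_out[gid] = (int)s; return; }
            if (table_keys[s] == 0u) { slot_out[gid] = -1;     return; }  /* empty slot */
        }
        slot_out[gid] = -1;                           /* table full, not found */
    }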


Right, I should have mentioned GPUs have a ton of hardware threads. Then again, they have to: GDDR5 memory access can take a microsecond. Try latency like that on a generic CPU, hyperthreading or not...

So: GPUs are wide SIMD machines with a lot of hardware threads, massive branch latencies and glacial memory latencies. When a thread hits a branch or a memory stall, the hardware simply switches to another thread. GPUs don't care about serial execution performance.


Most servers I've used don't even have a GPU. It will be interesting to see how this and other GPGPU applications for server software will shape the server parks in the future.


In the scientific computing community, servers with GPUs are pretty common, and available off-the-shelf. See for example http://www.supermicro.com/products/nfo/gpu.cfm

So asking the folks that already use that stuff could give pretty accurate predictions.


GPUs on servers are a growing submarket. It's big enough for Amazon to offer them: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using_clu...


Only a matter of time I guess. I was tickled pink when I first logged into my most recent client's standard production box and saw that a 1TB Fusion I/O card was standard. Definitely altered the way I architected certain things.

I could get Oracle to sustain 1.2 GB/sec read/write pretty easily -- plus four other I/O channels that supported about 600 MB/sec each. It was fun figuring out the best way to organize everything such that CPU and I/O were optimally saturated.

Throw a GPGPU or Xeon Phi into the mix as well? Fwooaah, fun times.


This is very interesting, but not as a CPU saving optimization. I'm sure it does that very well, but that's not why it's Important. Rather, this seems to me like the next step toward the inevitable future of PostgreSQL as arbitrarily scalable, and as _the_ general query engine that ties together whatever physical data stores you happen to use.

It seems obvious to me that pushing the Foreign Data Wrapper layer with work like this is how we eventually break through the RDBMS scalability barrier of the individual host. In the future, I'm sure you'll see similar work where _the_ GPU won't be the GPU and the PCI bus won't be the PCI bus. Rather, they'll be _a_ host and the network. A database service (a database cluster in Postgres's nomenclature) will eventually run not just on a single machine, but on a single cluster of machines. Instead of a cluster of machines for redundancy, you'll have a cluster of clusters.

Postgres is really two things in one: a physical layer of bytes in pages in files, and a logical layer of queries on tables of records. The most important piece in the future will be the logical layer. The FDW layer will naturally be extended and generalized until it is fully as powerful as the current physical layer. At that point, it can be made THE api through which the logical layer accesses data. The current physical layer will then be nothing more than the default implementation of that general API.

At that point, we can move whole or partial tables to other hosts. Perhaps the autovacuum daemon will gain a sibling in the autosharding daemon. The query optimizer will need to care not just about disk IO, but network IO and will need to start considering the non-uniform performance characteristics of different tables. Some tables will be driven by Postgres's default physical storage engine. Others will be driven by other RDBMSs, or by NoSQL key/value or document stores, or other data stores. They may be on the same machine or a different one.

Postgres will transform into a query engine on top of whatever data stores best fit your workload. I expect the query engine will learn about columnar stores and be able to mix those in a single query with the row stores, key/value stores and document stores that it already understands. PostgreSQL will be a central point through which you can aggregate, analyze, and manipulate any and all of your data. It needn't be intrusive or disruptive: you can still use a normal redis client for your redis store, but you can also use Postgres to manipulate that data with SQL and to combine it with other tables, whole RDBMSs, other NoSQL stores, spreadsheets, web services, or anything else. Maybe it will even make things like Map/Reduce frameworks redundant.

I don't typically follow Postgres's internal discussions, so maybe this is already being discussed and planned. Or, maybe it's so obvious that nobody even needs to talk about it. Or, perhaps I'm just some wide-eyed idealist who doesn't understand the fundamental problems preventing such a thing from ever being practical.


I absolutely love the concept and really want to buy a graphics card just to play with this on my development box. I find it quite exciting how some applications are utilising graphics processing power.

But I can't help but wonder what the sysadmin's response is going to be when I start asking for additional graphics cards to be added to his perfectly built 2U database servers!


Size is nothing.

You are also asking the sysadmin to install a closed, unreliable kernel-level piece of software with that GPU.


> You are also asking the sysadmin to install a closed, unreliable kernel-level piece of software with that GPU

Unlike all of the closed network equipment they already manage.


NVidia supports particular server-targeted GPUs very well. Buy a Tesla or Quadro, use a supported kernel, and you'll be fine.


nvidia-server drivers are fairly stable.

it pains me to say that fglrx is still years behind them, however.


GPUs are increasingly being added to the same die as CPUs, so their being present by default in the future is very likely.


Which makes me wonder about the influence on the performance per watt ratio of such GPU-based solutions. To my knowledge, beefy cards are pretty power-hungry.


the price per core in terms of watts strongly favors the GPU. Even if a card is drawing 250 W, it's providing 2300+ cores; that is crazy.

http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-780...


That GPU has 12 cores, in the same way a desktop CPU has 4 cores: that's the number of units with concurrent, independent execution flow. Very wide, with a lot of execution resources, yes. Maybe even 5x the computing power of a desktop CPU, considering clock frequency as well.

"2300+ cores" is a VERY misleading way to represent GPU resources. You could also say GTX780 has 12 cores with 1/3rd clock frequency is an equally unfair and equally "true" way to express it, if you were trying to suggest CPUs are "better".


Yes, but each of those cores does very little compared to one core on a CPU.


doing very little times 2300 in parallel is sort of the point though. if you have a parallel-optimized job (like a sum/agg over many rows of a table), it is stupidly difficult to make it fast on a complex CPU where most of the execution paths remain unused: you must wait for the instruction pipeline to clear, and you can only process so many ops per cycle (what's a popular core count now on a big CPU system? 48-96 threads i think my UCS blades can run). when you are talking THOUSANDS of cores on a wee baby 250 W GPU card, of which I can put TWO in each system? That is enormously powerful for those parallel tasks.


In that case my 4-core is really -- waves hands -- a 576-core system: 4 cores, maybe 2 AVX 8-wide instructions executing (2 * 8 * 4), and maybe 3 stages in flight. And 3x the clock. Or something. So I'm getting a roughly comparable, completely meaningless 2 * 8 * 4 * 3 * 3 cores.

I'm not suggesting CPU resources should be counted like that, but that's closer to how GPU resources are counted. Sure, it sounds impressive, but do those 2304 cores really represent a fair comparison to, say, 4 CPU cores?


While we're making hand-wavey comparisons, I'll make one.

NVidia GPUs have a theoretical peak of about 3-5 TFlops for 250 watts. http://en.wikipedia.org/wiki/List_of_Nvidia_graphics_process...

Xeons have a theoretical peak of about 0.5-1 TFlops for 150 watts. http://www.microway.com/hpc-tech-tips/intel-xeon-e5-2600-v3-...
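
Putting those two quoted figures on a common axis, a rough flops-per-watt calculation (using the midpoints of the ranges above, nothing measured):

    #include <stdio.h>

    int main(void)
    {
        double gpu  = 4000.0 / 250.0;   /* ~4 TFLOPS at 250 W  -> ~16 GFLOPS/W */
        double xeon =  750.0 / 150.0;   /* ~0.75 TFLOPS at 150 W -> ~5 GFLOPS/W */

        printf("GPU : %.0f GFLOPS/W\n", gpu);
        printf("Xeon: %.0f GFLOPS/W\n", xeon);
        printf("ratio: ~%.1fx\n", gpu / xeon);
        return 0;
    }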

Is that completely apples-to-apples? Probably not, since the Xeon figure is probably for double-precision floating point versus single precision on the GPU. But for a lot of database applications that don't involve money, single-precision floats have a sufficient level of accuracy for the performance improvement to be attractive.

Yeah, the performance advantage isn't 100x like it used to be, but it's still enough that if you have racks and racks full of machines, a 3-10x improvement could be really substantial. Going from $10k/mo in rent to $1k/mo in rent at a datacenter could make or break an early-stage startup.

Further as things get cheaper they get used a lot more. Scientists have only two models: the ones they can run but don't really like and the ones they want to run. Adding fidelity to modeling codes isn't an absolute good but it's hard to argue that it makes the world worse.


This reminds me a little of the Netezza data warehouse appliance's architecture: a query planner in front of lots of little nodes, each with one disk, one CPU, and an FPGA. Every query is a full table scan: each node flashes the WHERE clause to the FPGA and slurps the whole disk through it.
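
A toy C sketch of that "filter while streaming the whole disk" idea (the row layout, file name, and predicate are invented for illustration): each node scans its entire local table and applies the pushed-down WHERE clause before shipping rows upstream:

    #include <stdio.h>
    #include <stdbool.h>

    typedef struct { int id; double amount; } Row;
    typedef bool (*Predicate)(const Row *);

    /* Stand-in for the predicate the planner pushes down to each node. */
    static bool where_amount_gt_100(const Row *r) { return r->amount > 100.0; }

    /* Full table scan: stream every row, emit only those passing the filter. */
    static void scan_and_filter(FILE *disk, Predicate pred)
    {
        Row r;
        while (fread(&r, sizeof r, 1, disk) == 1)
            if (pred(&r))
                printf("id=%d amount=%.2f\n", r.id, r.amount);
    }

    int main(void)
    {
        FILE *disk = fopen("table.bin", "rb");   /* hypothetical on-disk table */
        if (!disk) return 1;
        scan_and_filter(disk, where_amount_gt_100);
        fclose(disk);
        return 0;
    }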


Does anyone know if or when the GPGPU acceleration will be available in the normal Postgres install?

Is / will this acceleration be switched on by default?


Too far off in the distant future to tell.


This + Amazon RDS would be pretty awesome for mapping.


in other breaking news, grass is green and water is wet. Obviously throwing more power at the problem results in faster execution.

There's a limit to GPGPU acceleration though: the tiny amount of RAM. We need to adopt a shared-memory architecture like those found in game consoles. A single massive pool of RAM would further unlock potential power.


It's not that simple: when you look at architectures like AMD's Fusion platform, the differences in latencies and bandwidth play a huge role, making it a much more nuanced story. One of the papers showing this: http://link.springer.com/article/10.1007/s00450-012-0209-1#p...


The paper's unfortunately paywalled, but it's two years old, and AMD's Fusion platform has added a few interesting features in the interim, like being able to just throw a pointer over the wall to the GPU instead of copying the entire data structure over, which makes me wonder whether the paper needs to be revised to address this. GPGPU is a rapidly changing field, and just a couple of years is a big difference.



