Any good resources on where GPU DBs offer significant wins? Especially, but not only, if they win on cost efficiency for some workloads.
My naïve impression working at smaller scales is that in a reasonably balanced system, storage is often the bottleneck, and if it can deliver data fast enough the CPU probably won't break a sweat counting or summing and so on. The amount of interest in GPU databases, though, suggests that's not the case for some interesting workloads.
I don't know about practical benchmarks yet, but GPUs have superior parallelism and superior memory bandwidth compared to CPUs.
Classical "join" patterns, like merge-join, hash-join, and other such algorithms, have been run on GPUs with outstanding efficiency. Taking advantage of the parallelism.
A GPU merge-join is a strange beast though. See here for details: https://moderngpu.github.io/join.html . It's a weird algorithm, but it is clearly far more efficient than anything a traditional CPU could hope to accomplish.
In any case, it is clear that the high memory bandwidth of GPUs, coupled with their parallel processors, makes the GPU a superior choice over the CPU for the relational merge-join algorithm.
GPUs probably can't be used effectively in all database operations. But they can at LEAST be used efficiently for the equijoin, one of the most fundamental and common operations of a modern database. (How many times have you seen "JOIN ... ON Table1.id = Table2.id"??) As such, I expect GPUs to eventually be a common accelerator for SQL databases. The only problem now is building the software to make this possible.
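Roughly, in Thrust terms, an equijoin over unique, pre-sorted keys is just a keyed set intersection. A minimal sketch (my own illustration, assuming unique sorted integer keys; a real hash- or merge-join that handles duplicates is considerably more involved):

    #include <thrust/device_vector.h>
    #include <thrust/set_operations.h>

    // Semi-join on unique, sorted integer keys: keep the rows of table 1
    // whose key also appears in table 2, carrying table 1's payload along.
    void equijoin_unique_sorted(const thrust::device_vector<int>& keys1,
                                const thrust::device_vector<int>& vals1,
                                const thrust::device_vector<int>& keys2,
                                thrust::device_vector<int>& out_keys,
                                thrust::device_vector<int>& out_vals)
    {
        out_keys.resize(keys1.size());
        out_vals.resize(vals1.size());
        auto ends = thrust::set_intersection_by_key(
            keys1.begin(), keys1.end(),        // keys of table 1 (sorted, unique)
            keys2.begin(), keys2.end(),        // keys of table 2 (sorted, unique)
            vals1.begin(),                     // payload column of table 1
            out_keys.begin(), out_vals.begin());
        out_keys.resize(ends.first - out_keys.begin());
        out_vals.resize(ends.second - out_vals.begin());
    }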
At SQream (a GPU-accelerated data warehouse) we use the GPU for many operations, including sorting, aggregating, joining, transformations, projections, etc.
We augment the GPU algorithms with external CPU algorithms when the GPU implementations aren't ideal or can't run due to memory constraints.
GPU databases are brilliant for cases where the working set can live entirely within the GPU's memory; that's their traditional niche. For most applications with much larger (or more dynamic) working sets, the PCIe bus becomes a significant performance bottleneck.
That said, I've heard anecdotes from people I trust that heavily optimized use of CPU vector instructions is competitive with GPUs for database use cases.
> GPU databases are brilliant for cases where the working set can live entirely within the GPU's memory.
Probably true for current computer software. But there are numerous algorithms that allow groups of nodes to work together on database-joins, even if they don't fit in one node.
Consider Table A (1,000 rows) and Table B (1,000,000 rows). Let's say you want to compute A join B, but B doesn't fit in your memory (let's say you only have room for 5,000 rows). Well, you can split Table B into 250 pieces, each with 4,000 rows.
Table A (1,000 rows) + one Table B chunk (4,000 rows) is 5,000 rows, which fits in memory. :-)
You then compute A join B[0:4000], then A join B[4000:8000], etc. etc. In fact, all 250 of these joins can be done in parallel.
----------
As such, it's theoretically possible to perform database joins on parallel systems, even if the tables don't fit into any particular node's RAM.
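In code, the chunking idea is nothing more than an outer loop over slices of B. A rough sketch (join_on_gpu is a hypothetical helper that joins two chunks that fit in GPU memory; the name and signature are made up for illustration):

    #include <algorithm>
    #include <cstddef>
    #include <utility>
    #include <vector>

    struct Row { int key; /* ...payload columns... */ };

    // Hypothetical: joins all of A against one chunk of B on the GPU.
    std::vector<std::pair<Row, Row>> join_on_gpu(const std::vector<Row>& a,
                                                 const std::vector<Row>& b_chunk);

    std::vector<std::pair<Row, Row>> chunked_join(const std::vector<Row>& a,
                                                  const std::vector<Row>& b,
                                                  std::size_t chunk_rows)
    {
        std::vector<std::pair<Row, Row>> result;
        for (std::size_t off = 0; off < b.size(); off += chunk_rows) {
            const std::size_t end = std::min(off + chunk_rows, b.size());
            std::vector<Row> chunk(b.begin() + off, b.begin() + end);
            auto partial = join_on_gpu(a, chunk);            // A + chunk fit in memory
            result.insert(result.end(), partial.begin(), partial.end());
        }
        return result;   // union of all the partial joins (250 of them in the example above)
    }

Each iteration is independent, which is what lets you fan the partial joins out across GPUs or nodes.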
If you can afford it, much of the penalty from the PCIe bus goes away if you have a system with NVLink. You still need to transfer the data back to the CPU for the final results, but most of the filtering and reduction operations across GPU memory can be done over NVLink only.
I'm clueless on the amount of bandwidth needed for larger applications, will the eventual release of PCIe 4 & 5 have a big impact on this? Or will it still be too slow?
PCIe 3.0 x16, the current generation, has a bandwidth of 15.75 GB/s.
Assuming you have 16 GB of RAM on the GPU, it would theoretically take ~1 second to load the GPU with that amount of data. For huge data sets, you could saturate that link with 5 M.2 SSDs each running at 3,200 MB/s, assuming disk -> RAM -> GPU DMA.
Those would also require at least 5 PCIe 2x8 ports on a pretty performant setup. RAM bandwidth is assumed to be around ~40-60GB/s, so hopefully no bottlenecks there.
This assumes your GPU could chew through 16 GB of data in a second; GPUs have a theoretical memory bandwidth of between 450 and 970 GB/s.
Now realistically, per vendor marketing materials, the fastest DB I've seen allows one to ingest at 3 TB/hour, roughly 1 GB/second.
So there must be more to it than the theoretical 16 GB/second business. PCIe 4.0 x16 doubles that to 32 GB/s, but at this time it looks pointless to me.
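For what it's worth, the back-of-the-envelope math behind those figures (rounded, using the numbers quoted above):

    constexpr double pcie3_x16_gbps = 15.75;                       // GB/s over the bus
    constexpr double gpu_ram_gb     = 16.0;
    constexpr double fill_time_s    = gpu_ram_gb / pcie3_x16_gbps; // ~1.0 s to fill the card
    constexpr double ssd_gbps       = 5 * 3.2;                     // five NVMe drives ~ 16 GB/s, enough to keep the bus busy
    constexpr double ingest_gbps    = 3000.0 / 3600.0;             // "3 TB/hour" ~ 0.83 GB/s, an order of magnitude below the bus limit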
Haven't, but it's worth noting that the hardware probably accounts for them edging out MapD, since they're on a 5-node Minsky cluster featuring NVLink, hence, as arnon said, they benefit from 9.5x faster transfer from disk than PCIe 3.0. That blog has not yet tested MapD's IBM Power version -- it would be interesting to see how it compared on that cluster.
- They've built their database on Postgres for query planning, but for any query which does not match what they've accelerated on the GPU, they do not have the ability to fall back to running it in Postgres on the CPU.
https://youtu.be/oL0IIMQjFrs?t=3260
- Data is brought into GPU memory at table CREATE time, so the cost of transferring data from disk->host RAM->GPU RAM is not reflected. Probably wouldn't work if you want to shuffle data in/out of GPU RAM across changing query workloads. https://youtu.be/oL0IIMQjFrs?t=1310
Note that it's at the top of the list probably because it's running on a cluster. It would be awesome to see such a comparison on some standard hardware, like a large AWS GPU instance (eg1.2xlarge).
Also note that the dataset is 600GB, so it won't fit on a single GPU, not even close.
And the Postgres run was on 16GB of RAM and a rather slow SSD in a single drive configuration. Would have been interesting to see the results of either in memory or on a faster storage system.
The cost of GPUs doesn't make sense for the compute they offer.
According to the benchmark, the fastest 8-GPU node takes about 0.5 seconds. The cost of that node on AWS is about $24/hour. The 21-node Spark cluster takes 6 seconds, but it only costs $4/hour.
An additional benefit of Spark is that it can be used for a much wider variety of operations than a GPU.
This cost disadvantage restricts GPU processing to niche use cases.
> According to the benchmark, the fastest 8-GPU node takes about 0.5 seconds. The cost of that node on AWS is about $24/hour. The 21-node Spark cluster takes 6 seconds, but it only costs $4/hour.
Using your numbers, the GPU solution has half the cost for similar performance (scale the Spark cluster roughly 12x to match the 0.5 seconds and you're at about $48/hour vs. $24/hour). How does that not make sense?
> This cost disadvantage restricts GPU processing to niche use cases.
> The cost of GPUs doesn't make sense for the compute they offer.
This assumes AWS pricing. Build a farm of GPUs and buy in bulk, and you get a much better cost basis. GPU farms are becoming more and more of a thing now and are definitely less 'niche'.
I'm a bit out of date, but lots of databases are moving towards an in-memory model (or in-memory feature sets), which means that hard drive access times aren't a bottleneck. The AWS EC2 instances you see with 1TB+ of RAM are generally aimed at this sort of thing.
Presumably once you have all your data in memory then the CPU becomes a bottleneck again, and if you can ship out the number crunching to GPUs in an efficient manner (i.e. you don't waste loads of time shuffling data between RAM and GPU) then you'll see performance gains.
Yeah, but the very shipping of data back and forth to the GPU is usually a bottleneck no matter how clever you get. Moreover, you're limited to say 8 GPUs per box for a total of 100GB-ish in memory. You can operate on nearly 10TB on the largest AWS instances using CPUs. With AVX512 intrinsics, this translates into some serious potential on large in-memory datasets that renders GPUs less appealing.
As long as you are doing something more complicated than O(n), the shipping of data is going to be negligible.
In fact, that's why sorting on a GPU is almost always going to be worthwhile. Sorting is an O(n log n) operation, but the transfer is O(n).
A table join is O(n^2) per join, if I remember correctly (if you have 5 tables to join, it's a ^2 factor for each one). That means shipping the data (an O(n) operation) is almost always going to be negligible.
So I'd expect both join and sort to be pushed towards GPUs, especially because both joins and sorts have well known parallel algorithms.
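For the sort case the pattern really is just copy up, sort, copy back; a minimal Thrust sketch (illustrative only):

    #include <thrust/copy.h>
    #include <thrust/device_vector.h>
    #include <thrust/host_vector.h>
    #include <thrust/sort.h>

    void gpu_sort(thrust::host_vector<int>& keys)
    {
        thrust::device_vector<int> d_keys = keys;                  // O(n): transfer over PCIe
        thrust::sort(d_keys.begin(), d_keys.end());                // the sort itself runs on the GPU (radix sort for ints)
        thrust::copy(d_keys.begin(), d_keys.end(), keys.begin());  // O(n): transfer the result back
    }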
No, it's the arithmetic intensity of the operation in a roofline model sense [1] that indicates whether or not the GPU is worth it, and whether you are in a memory bandwidth or compute bound regime.
Asymptotic algorithmic complexity for a serial processor is meaningless here; it provides no indication of how a parallel machine (e.g., a PRAM or other, or perhaps trying to map it to the GPU's unique SIMD model) will perform on the problem.
The arithmetic intensity per byte of memory loaded or stored for sort or join is low. You can exploit memory reuse when data is loaded in the register file for sorting (for a larger working set), but you can do that for sorting on SIMD CPUs in any case with swizzle instructions (e.g., bitonic sorting networks). The GPU is only worth it to exploit the higher memory bandwidth to global memory here, if you're comparing a single CPU to a single GPU.
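For anyone unfamiliar, the roofline bound is attainable throughput = min(peak compute, arithmetic intensity × memory bandwidth). A toy calculation with round numbers (my own, purely illustrative) shows why a low-intensity operation like a sum is memory-bound on a GPU:

    constexpr double peak_flops = 10e12;               // ~10 TFLOP/s of GPU compute
    constexpr double mem_bw     = 900e9;               // ~900 GB/s of GPU memory bandwidth
    constexpr double intensity  = 0.25;                // ~1 op per 4-byte value loaded
    constexpr double roofline   = intensity * mem_bw;  // ~225 GFLOP/s: memory-bound, far below the 10 TFLOP/s peak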
> Asymptotic algorithmic complexity for a serial processor is meaningless here
It's not meaningless. It's a legitimate cap. Bitonic sort, for example, is O(n log^2(n)) comparisons, which is more comparisons than the O(n log(n)) of a classical mergesort.
-----------
Let me use a more obvious example.
Binary searching a sorted array should almost NEVER be moved to the GPU. Why? Because it is an O(log(n)) operation, while the memory transfer is an O(n) operation. In effect: it costs more to transfer the array than to perform a single binary search.
Only if you plan to search the array many, many, many times (such as in a relational join operator) will it make sense to transfer the data to the GPU.
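That batched case is exactly what Thrust's vectorized searches are for; roughly (just an illustration):

    #include <thrust/binary_search.h>
    #include <thrust/device_vector.h>

    // Transfer the sorted column once, then run millions of probes against it.
    void batched_search(const thrust::device_vector<int>& sorted_keys,
                        const thrust::device_vector<int>& probes,
                        thrust::device_vector<unsigned int>& positions)
    {
        positions.resize(probes.size());
        thrust::lower_bound(sorted_keys.begin(), sorted_keys.end(),
                            probes.begin(), probes.end(),
                            positions.begin());   // one binary search per probe, all in parallel
    }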
-------
In effect, I'm using asymptotic complexity to demonstrate that almost all algorithms of complexity O(n) or faster are simply not worth it on a GPU. The data transfer is an O(n) step. Overall work complexity greater than O(n) seems to benefit GPUs, in my experience.
> The GPU is only worth it to exploit the higher memory bandwidth to global memory here, if you're comparing a single CPU to a single GPU.
GPUs have both higher memory bandwidth and more numerous arithmetic units than a CPU.
The $400 Vega 64 has two stacks of HBM2, for ~500 GB/s to VRAM. It has 4096 shaders which provide over 10 TFLOPS of compute.
A $800 Threadripper 2950X has 16 cores / 32 threads. It provides four DDR4 memory channels for ~100 GB/s of bandwidth to RAM and ~0.5 TFLOPS of compute.
Arithmetic intensity favors GPUs. Memory intensity favors GPUs. The roofline model says GPUs are better on all counts. Aka: it's broken and wrong to use this model :-)
GPUs are bad at thread divergence. If an algorithm has high divergence (e.g., a minimax chess search), it won't be ported to a GPU easily. But if you have a regular problem (sorting, searching, matrix multiplication, table joins, etc.), it tends to work very well on a GPU.
Many, many algorithms have not been ported to GPUs yet, however. That's the main downside of using them. But it seems like a great many algorithms can, in fact, be accelerated with GPUs. I just read a paper that accelerated linked-list traversals on GPUs, for example (!!!). It was a prefix sum over linked lists.
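For the plain array case a prefix sum is a one-liner in Thrust; the linked-list variant from that paper is much more involved, but the building block looks like this:

    #include <thrust/device_vector.h>
    #include <thrust/scan.h>

    void prefix_sum(thrust::device_vector<int>& values)
    {
        // In-place inclusive scan: values[i] becomes values[0] + ... + values[i].
        thrust::inclusive_scan(values.begin(), values.end(), values.begin());
    }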
You can't use 10TFlops of compute on a GPU if you can't even feed it data quickly enough. The state of the art for throughput is Nvidia's NVLink and you're capped to a theoretical max of 160GB/sec. Given how trivial most analytics workloads are (computing ratios, reductions like sums, means, and variances, etc.) there's simply no way you're going to effectively max out the compute available on a GPU.
Searching sorted arrays is actually very common in these workloads. Why? Analytics workloads typically operate on timestamped data stored in a sorted fashion where you have perfect or near perfect temporal and spatial locality. Thus even joins tend to be cheap.
With Skylake and AMD Epyc closing in on 300 GB/sec and much better cost efficiency per GB of memory vs. GPU memory, the case for GPUs in this application seems dubious.
I will grant you that GPUs have a place in more complex operations like sorts and joins with table scans. They also blow past CPUs when it comes to expensive computations on a dataset (where prefetching can mask latencies nicely).
A good example of a dense sort + join GPU workload would be looking for "Cliques" of Twitter or Facebook users. A Clique of 3 would be three users, A, B, and C, where A follows B, B follows C, and C follows A.
You'd perform this analysis with two self-joins: the follower-followee table joined against itself, with the table appearing three times in the query.
----------
So it really depends on your workload. But I can imagine that someone who is analyzing these kinds of tougher join operations would enjoy GPUs to accelerate the task.
PCIe 3.0 x16 can push almost 16 GB/s in each direction. Add 4-6 NVMe drives to deliver that much, for a total of 32-40 PCIe lanes. Not really an option on an Intel platform that tops out at 48 lanes per CPU; it makes a bit more sense with the 128 lanes on AMD Epyc.
Alternatively, you can saturate the PCIe lanes with GPUs and load data from RAM.
GPUs have a very high memory bandwidth, and can be used to perform memory-intensive operations (think decompression, for example).
You can load compressed data up to the GPU, decompress it there, and run very complicated mathematical functions. This can be very beneficial when you run a JOIN operation.
Liked this post. Would love more detail on GPU query offloading and on how different the engine looks compared to one that would run queries on the CPU, as that seems to be the primary innovation
Interesting. MapD and this project both use Thrust under the hood and from what I gather, both attempting to address the same issue. Can anyone speak to the differences?
While I originally didn't get the case for GPU-accelerated databases, it makes more and more sense: bandwidth between GPU and CPU is steadily increasing, and the latency of GPU<->CPU syncs is shrinking, which makes the GPU an increasingly attractive option.
The MapD (now OmniSci) execution engine is actually built around a JIT compiler that translates SQL to LLVM IR for most operations, but it does use Thrust for a few things like sort. One performance advantage of using a JIT is that the system can fuse operators (like filter, group by and aggregate, and project) into one kernel, without the overhead of calling multiple kernels (and particularly the overhead of materializing intermediate results, versus keeping those results in registers). Outputting LLVM IR also makes it easy for us to run on the CPU as well, for the most part bypassing the need to generate GPU-specific code (for example, CUDA).
It does make coding on the system a bit more difficult, but we found that the performance gains were worth it.
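To illustrate what operator fusion buys you, compare an unfused and a fused filter-then-aggregate; this is just a generic CUDA sketch for illustration, not OmniSci's generated code:

    // Unfused: the filter kernel materializes its output in global memory,
    // and a second kernel (not shown) has to read it back to compute the sum.
    __global__ void filter_kernel(const float* in, float* out, int* count,
                                  int n, float threshold)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && in[i] > threshold)
            out[atomicAdd(count, 1)] = in[i];
    }

    // Fused: filter and aggregate in one pass; the intermediate value never
    // leaves registers. (A real kernel would reduce within the block before
    // touching the global atomic.)
    __global__ void filter_sum_kernel(const float* in, float* sum,
                                      int n, float threshold)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && in[i] > threshold)
            atomicAdd(sum, in[i]);
    }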
Also I would add that this system seems, at least for now, geared toward solving a specific set of problems Uber is grappling with around time-series data, whereas OmniSci is trying to tackle general purpose OLAP/analytics problems. Not that solving specific use cases is not great/valid, or that they don't plan to expand the functionality of AresDB over time to encompass broader use cases.
Uber's article mentions:
"...but as we evaluated the product, we realized that it did not have critical features for Uber’s use case, such as deduplication."
However, I also see they used Go quite extensively. Maybe they also did it for the productivity gains while implementing their additional features.
Does the LLVM IR output from the JIT compiler have to then be compiled to CUDA PTX with nvcc? I looked a while back for a JIT compiler for CUDA, but didn't find much.
Why not? The Thrust folks have done a lot of good work on implementing highly optimized radix sort on GPU.
That said, there is interesting academic work around GPU sort that achieves even higher performance than Thrust in many scenarios, and we are looking at the feasibility of incorporating a framework we have found particularly promising.
Can thrust sort operate on datasets larger than GPU and CPU memory or does it require manually combining smaller sort operations into larger sorted sequences akin to merge sort?
I'm surprised GPUs were the right call for this use case. Not to say GPUs aren't useful in DBs, but I/O and keeping the GPUs working hard quickly becomes a bottleneck. I'd assume the workloads must be non-trivial in order for this to be superior to general purpose CPUs using SIMD + in-memory data.
I have what may be a completely stupid question, but the imagination in me just posited it to me, so I thought I'd ask:
Can you translate raytracing to DB relations?
You have a raytracing engine, and many proven approaches to be able to determine what is visible from any given point.
What if the "point" was, in fact, "the query" and from that point - you apply the idea of raytracing to see what pieces of information are the relevant fields in a space of a DB that pulls in all those fields - and the shader translates to the "lens" or "answer" you want to have from that points perspective?
In theory, yes. Rays get bounced around from a light source until they hit a camera (or decay below a threshold). The net intensity is recorded for all the rays in the camera plane. Computing the net intensity is a reduction over all the rays hitting that pixel in the camera plane. Then you have NUM_BOUNCES many map steps to compute the position and intensity of the ray after each bounce. So in theory these map and reduce operations could be expressed in a database context.
In practice, does it make sense? Not really since each ray is not the same amount of work. One ray can go straight to the camera. Another could bounce many, many times going through many objects before hitting the camera or dying out altogether. GPUs are terrible at handling workloads with high thread divergence (some threads run much slower than others).
A GPU accelerated database with custom API for analytics? Jedox does that since 2007 and some of the biggest global companies use it. Jedox is not only an in-memory DB with optional GPU acceleration, it is a complete software suite with powerful tooling for ETL, Excel integration, Web-based spreadsheets and dashboarding.
see https://www.jedox.com and
https://www.jedox.com/en/resources/jedox-gpu-accelerator-bro...
Interesting post with lots of architectural details.
I like how they highlight the different existing things they tried and what the shortcomings were which led to building this, although not sure where I could use this to solve a problem.
How does it compare to ClickHouse? Isn't creating a proprietary query language going to be a problem for ad-hoc queries? Why create yet another language when the industry is standardizing on SQL or a subset of SQL?
ClickHouse is interesting - it's also an Apache project. I am surprised Uber didn't build AresDB on it. The stigma, unfortunately, of coming from Russia makes it hard for the project to gain mindshare in the Valley.
ClickHouse is licensed under the Apache License 2.0, but isn't an "Apache project" as in an ASF project. Maybe that's what you meant—just want to make that clear for other readers.
ClickHouse scan performance is through the roof, but it also seems fairly difficult to operate compared to the alternatives. For example, the concept of "deep storage" in Druid and Pinot makes rebalancing and replication trivial (at the expense of additional storage space). ClickHouse doesn't have that and requires more babysitting. And that's without even going into something like BigQuery, which is on a completely different level regarding operational simplicity if your use cases support it.
Also, if queries are heavily filtered by arbitrary dimensions, then ClickHouse starts to lose its edge compared to fully-indexed systems.
This makes ClickHouse fairly niche IMO, despite being exceptional at what it does.
We mostly use replicated MergeTree, SummingMergeTree, AggregatingMergeTree, etc. And yes, we do use the Kafka engine to consume directly from Kafka, in addition to standard ClickHouse inserters.
Does Uber use Salesforce? I'm asking because I'd say no way, given all the work their engineering team does: custom this and custom that. I know there is a random blog on the internet that says they did in 2016, but I think the source is wrong. They have smart people and a need for big data quickly; throwing money at enterprise Salesforce just seems backwards compared to how I see them work. I think it's cool they push technology, and I would love to know what the politics of that is like internally and how the higher-ups trust the engineers' vision.
They show up at Dreamforce, but I can't imagine what they are tracking with it. It seems odd to put your drivers and users in it, so that leaves some odd selling of the data they collect.
> GPU technology has advanced significantly over the years, making it a perfect fit for real-time computation and data processing in parallel.
I wouldn’t call GPUs perfect for performing analytics functions. It can be painful to force your data processing algorithms through shaders. What GPUs are is cheap in terms of dollar cost per flop and power consumption.
Presumably they write the algorithm-to-shader munging once, in the query engine, and then clients of the database don't have to worry about it at all. It looks like they didn't even write it, outsourcing it to Nvidia's Thrust.
This is cool to see. Check out PG-Strom for Postgres if you haven't heard about it, especially if you're curious about perf gains with a GPU in like-for-like comparisons.
One peeve, not loving the idea of having to use AQL and not having a command line analytical interface.
Does AQL support windowing functions, CTEs, subqueries?
> software powered by machine learning models leverages data to predict rider supply and driver demand; and data scientists use data to improve machine learning models for better forecasting.
Why does any of that require real-time analysis? What's the business need for this being instantaneous? Feels very NIH...
A friend, who works for Uber, was once explaining the difference between AirBnb’s tech stack and Uber’s even though they are both “marketplaces”
In short, Uber has a ton more infrastructure because they need to do more in real time. Calculating the price of a trip from point A to point B needs to be done in real time. Then there are other factors like Pool, surge, and various rider preferences that all need to be accounted for.
Uber is sort of unique in how much of their stack necessitates real-time analytics.
Most businesses don't require real time forecasting, but I would speculate that Uber likely uses it to decide when and where to activate surge pricing.
Ooh nice! I work at a GPU database company (kinetica.com) and we are targeting larger enterprises and trying to replace their legacy store and analytics stacks. We’re releasing our 7.0 database stack for general availability tomorrow.
I wonder if someone should design a DPU for database workloads. I think the GPU is used because it's available, not because it's optimized for the type of work people want to do with data.
It turns out that "graphics processing units" are good for more applications than just processing "graphics"... perhaps it would make sense to rename them something more inclusive?
I wouldn't call AMD a champion of OpenCL. Only OpenCL 2.0 is supported, and not completely on ROCm yet. OpenCL 2.0 isn't too hot on the AMDGPU-pro drivers either, and SPIR-V intermediate language doesn't really work well on AMD's GPUs (it either runs slowly or fails to run).
Frankly, it seems like AMD is going to be more successful with HCC and its HIP (CUDA-compatibility layer). It plays a distant 2nd place vs NVidia, but the CUDA-like single source environment is a superior development platform.
--------
OpenCL 2.2 seems to be only supported on Intel platforms. Either Intel GPUs, Intel CPUs (to AVX512), or Intel Altera FPGAs.
If I were to write OpenCL for AMD GPUs, I'd stick with OpenCL1.2 unless there was an absolute need for OpenCL 2.0 features. The OpenCL 1.2 stuff just seems more mature and better tested on AMD GPU Platforms. Hopefully ROCm changes that, but that's my opinion for the current state of affairs.
https://en.wikipedia.org/wiki/OpenCL
"When releasing OpenCL version 2.2, the Khronos Group announced that OpenCL would be merging into Vulkan in the future."
I guess CUDA is a safer and more stable bet for the near-term future if you need to choose between the alternatives...
Comments on this matter from the Khronos president:
> Khronos has talked about the convergence between OpenCL and Vulkan - a little clumsily as it turns out - as the message has been often misunderstood - our bad. To clarify:
> a. OpenCL is not going away or being absorbed into Vulkan - OpenCL Next development is active
> b. In parallel with a. - it is good for developer choice if Vulkan also grows its compute capabilities
> c. OpenCL can help b. through projects like clspv that test and push the bounds of what OpenCL kernels can be run effectively over a Vulkan run-time - and maybe inspire Vulkan compute capabilities to be expanded.
Very cool! I love geeking out on analytics tech and look forward to studying its design further. My take as I see it so far (please correct me if I'm wrong)-
* As a datapoint Pinot/Druid/Clickhouse can do 1B timeseries on one server. AresDB sounds like it's in the same ballpark here
* Pinot/Druid don't do cross table joins where AresDB can. My understanding is these are at (or near?) sub-second which would be a very distinguishing feature. I'm not sure how this will translate to when distributed mode is built out, as shuffling would become the bottleneck. Maybe there would be some partitioning strategy that within a partition allows arbitrary joining or something?
* Clickhouse can do cross table joins, but aren't going to be sub-second
* AresDB supports event-deduping. I think this can easily be handled by the upstream systems (Samza, Spark, Flink, ...) in a lambda architecture
* Reliance on fact/dimension tables.
- This design/encoding is probably to help overcome transfer from memory to GPU, which in my limited experience with Thrust was always the bottleneck.
- High-cardinality columns would make dimension tables grow very large and could become unmanageable (unless they are somehow trimmable?)
Regarding your second point: your intuition seems good, as Alipay apparently extended Druid to perform joins this way, with good performance [1]. Unfortunately it looks like they won't finish open-sourcing it, but it at least validates the idea.
What existing solutions solve their particular needs?
And don't forget that rolling out commercial products in an enterprise is often expensive and time consuming e.g. involving legal, procurement, architecture etc.
Well, it does depend company to company, but in a lot of places the role is or includes coming up with innovative ideas and to be smart in choosing which ones to implement.
Did someone say innovative database? PG-Strom, with PostgreSQL under the hood (what else?), has existed for over 4 years: https://github.com/heterodb/pg-strom