While a GPU has excellent memory bandwidth and lots of parallel processors, its connection to the rest of the system makes it a poor fit for most database work. If you can keep bitset indexes resident on the GPU, maybe. Otherwise, data transfer and kernel launches will dominate the timing.
But dual-socket EPYC systems have ~300 GB/s of bandwidth to main memory, and 1-4 TB machines are available right now.
Conclusion in brief: in order to "win" when trying to use an off-board accelerator, you need to look at each byte transferred more than once. If you only look at a byte once to compute your result, then it's going to be faster to just do the computation on the CPU.
The reason: the up-front cost of transferring that byte is high. The win comes from reusing that byte many times in the massively parallel architecture on the accelerator.
It still works if your goal is generating heavy OLAP reports, and those reports can be calculated per shard and then reduced at the end. If your query can be expressed in ten bytes and the answer in ten bytes, but in between there are several TB of data to run through in lots of parallel-sequential passes that never touch data that isn't already on the GPU, a GPU DBMS cluster seems optimal.
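As a rough back-of-envelope check on the "look at each byte more than once" rule above, here's a sketch where every bandwidth figure is an assumed ballpark number, not a measurement:

    # Back-of-envelope: how many times must the accelerator touch each byte
    # before shipping it over PCIe once beats just scanning it on the CPU?
    # All bandwidth numbers are rough assumptions for illustration.
    PCIE_GBPS = 12.0   # effective PCIe 3.0 x16 transfer rate, GB/s (assumed)
    CPU_GBPS = 300.0   # dual-socket EPYC main-memory bandwidth, GB/s (from upthread)
    GPU_GBPS = 900.0   # on-board GPU memory bandwidth, GB/s (assumed)

    def accelerator_wins(passes: int, data_gb: float = 1.0) -> bool:
        """True if one PCIe transfer plus `passes` GPU scans beats `passes` CPU scans."""
        gpu_time = data_gb / PCIE_GBPS + passes * data_gb / GPU_GBPS
        cpu_time = passes * data_gb / CPU_GBPS
        return gpu_time < cpu_time

    for passes in (1, 5, 10, 40, 100):
        print(passes, accelerator_wins(passes))

With those made-up numbers the break-even lands at a few dozen passes per byte, i.e. "reuse that byte many times", not just twice.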
——
For more regular workloads, though, it seems like a proper GPU-targeted DBMS architecture wouldn’t use single “huge” GPUs, but rather would shard data (and thus workloads) across a cluster of commodity boxes with commodity GPUs, such that each GPU holds just enough data for query traffic to that GPU to barely saturate the PCI-e lanes (rough sizing sketch below).
Which is another way to say: maybe Intel’s Phi was ahead of its time. Consider a Phi as an entire computer (because it is, and you can program it as such), and a computer containing more than one of them as a cluster backplane with a shared SAN (because it is, and you can access it from the Phis as such). I have a feeling this architecture will see a comeback, either in this form or in the (roughly equivalent) form of backplane-connected clusters of highly-multicore single-board computers.
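Here's that sizing argument as a sketch; the PCIe rate, GPU memory size, query rate, and result fraction are all made-up assumptions, chosen only to show the shape of the calculation:

    # Rough sizing: how much data can one commodity GPU "own" before shipping
    # results back over PCIe becomes the bottleneck? All inputs are assumptions.
    PCIE_GBPS = 12.0        # effective PCIe 3.0 x16 bandwidth per GPU (assumed)
    GPU_MEM_GB = 16.0       # on-board memory per commodity GPU (assumed)
    RESULT_FRACTION = 0.01  # fraction of scanned bytes returned over PCIe (assumed)
    QUERIES_PER_SEC = 50.0  # query arrival rate per shard (assumed)

    # Each query scans its whole shard on-GPU but ships only a small result back,
    # so PCIe saturates when QUERIES_PER_SEC * shard_gb * RESULT_FRACTION == PCIE_GBPS.
    pcie_limited_gb = PCIE_GBPS / (QUERIES_PER_SEC * RESULT_FRACTION)
    shard_gb = min(pcie_limited_gb, GPU_MEM_GB)
    print(f"shard per GPU: {shard_gb:.0f} GB "
          f"(PCIe-limited at {pcie_limited_gb:.0f} GB, memory-limited at {GPU_MEM_GB:.0f} GB)")

The point is just that shard size per GPU falls out of balancing on-board memory against PCIe egress at your expected query rate.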
It can be used to optimize things like counts and other aggregates over a standard Postgres table. It also provides a foreign-data-wrapper column store that speeds up transfers to the GPU, which helps certain analytical queries. Additionally, it can use the GPU as a "co-processor" to do large amounts of row filtering before handing the final work off to the CPU.
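To make the co-processor idea concrete, here's a minimal sketch of "filter on the GPU, aggregate on the CPU" using CuPy purely as a stand-in; it illustrates the pattern, not the extension's actual code path, and the column contents and sizes are invented:

    # Ship one column to the GPU, evaluate the WHERE-style predicate there in
    # parallel, and hand only the surviving rows back for the final aggregate.
    # Illustration only; not the extension's real code path.
    import numpy as np
    import cupy as cp  # assumes a CUDA-capable GPU with CuPy installed

    def gpu_filter_then_cpu_aggregate(amounts: np.ndarray, threshold: float) -> float:
        d_amounts = cp.asarray(amounts)          # host -> device copy (the expensive part)
        mask = d_amounts > threshold             # predicate runs in parallel on the GPU
        survivors = cp.asnumpy(d_amounts[mask])  # only matching rows cross PCIe again
        return float(survivors.sum())            # final aggregate on the CPU

    amounts = np.random.default_rng(0).uniform(0, 100, size=10_000_000)
    print(gpu_filter_then_cpu_aggregate(amounts, threshold=95.0))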
I didn't really understand how you managed to get all these extensions working together. Last time I spoke to people at Citus, CitusDB was not fully compatible with cstore_fdw and it's a bit confusing to use TimescaleDB and CitusDB together. Do you mind letting us know the company that you're working at or writing a blog post about it?
Yeah, I can't imagine everything working perfectly when using various combinations of timescale/citus/pipelinedb functionality.
cstore_fdw is... yeah, there might be other edge cases there, but frankly it's not a particularly good solution, as it severely underperforms vs. any mature column-store DB. That said, there's an argument for using it to avoid a more complicated ETL pipeline from the production relational DB to an analytics DB.
Perhaps there was a communication glitch; to my knowledge cstore_fdw has always worked with Citus. Here's an excerpt from the docs:
"Cstore_fdw is developed by Citus Data and can be used in combination with Citus, a postgres extension that intelligently distributes your data and queries across many nodes so your database can scale and your queries are fast."
That may be the case, but if you look at the commit history of cstore_fdw on GitHub, it's not that active. Also, none of the paid Citus customers were using cstore_fdw with CitusDB last year to my knowledge, so that was not their focus even though it was the initial idea. The Citus engineer I asked wasn't even sure whether cstore_fdw can be used with the distributed version of Citus.
My conclusion was that most Citus users use it for OLTP workloads, since PG doesn't really have a distributed query planner that works smoothly with OLAP workloads.
We pay close attention to data growth patterns and schema changes to keep around 70% of queries (which actually bring in a little over 90% of business revenue) under 50 milliseconds. The remaining 30% of queries are either low priority or backend jobs, for which a latency range of 200 ms to 1.5 seconds is viable for our use case.
My team and I are re-architecting/upgrading our data infrastructure at work, so it's good to know this kind of performance is possible at a scale quite a bit above where we are now. I also really like your approach of keeping your main revenue queries under a hard latency limit.
I am doing most of the stuff mentioned in that paper right now in my project with the PipelineDB extension, using the stack below (rough setup sketch after the list):
PostgreSQL 11 +
CitusDB extension (horizontal scaling and sharding) +
cstore_fdw extension (columnar data storage) +
file_fdw (to treat file as table) +
PipelineDB extension (streaming computations and materialized views) +
TimescaleDB extension (timeseries data storage)
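For anyone curious, a minimal sketch of enabling that stack on a single node looks something like this; the extension names are real, while the connection string, database name, and the exact shared_preload_libraries line are placeholder assumptions, not my exact config:

    # Minimal single-node sketch of enabling the stack above. Assumes
    # postgresql.conf already preloads the extensions that need it, e.g.:
    #   shared_preload_libraries = 'citus,timescaledb,pipelinedb,cstore_fdw'
    # Connection string and database name are placeholders.
    import psycopg2

    EXTENSIONS = [
        "citus",        # horizontal scaling and sharding
        "cstore_fdw",   # columnar storage via foreign data wrapper
        "file_fdw",     # treat flat files as tables
        "pipelinedb",   # streaming computations / continuous views
        "timescaledb",  # time-series storage
    ]

    conn = psycopg2.connect("dbname=analytics user=postgres")
    conn.autocommit = True
    with conn.cursor() as cur:
        for ext in EXTENSIONS:
            cur.execute(f"CREATE EXTENSION IF NOT EXISTS {ext};")
    conn.close()

Whether all of these coexist cleanly is exactly what's being debated upthread, so treat the combination as something to verify, not a known-good recipe.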
Also, the PG 12 release will have many optimizations for table partitioning, query planning, and parallel query execution.