
Great write-up. SQL is the most underrated language.

I am doing most of the stuff mentioned in that paper right now in my project with the following stack:

PostgreSQL 11 +

CitusDB extension (horizontal scaling and sharding) +

cstore_fdw extension (columnar data storage) +

file_fdw (to treat file as table) +

PipelineDB extension (streaming computations and materialized views) +

TimescaleDB extension (timeseries data storage)

Also, the PG 12 release will have many optimizations for table partitioning, query planning, and parallel query execution.
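
In case it helps anyone, here's roughly what bootstrapping that stack looks like (the order and settings are illustrative, and Citus, TimescaleDB and PipelineDB also need to be listed in shared_preload_libraries first):

    -- assumes shared_preload_libraries already includes citus, timescaledb, pipelinedb
    CREATE EXTENSION citus;        -- sharding and distributed tables
    CREATE EXTENSION cstore_fdw;   -- columnar storage via a foreign data wrapper
    CREATE EXTENSION file_fdw;     -- expose flat files as tables
    CREATE EXTENSION pipelinedb;   -- continuous (streaming) views
    CREATE EXTENSION timescaledb;  -- hypertables for time-series data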




One thing that's missing is GPU acceleration, like MapD.

Being able to group by a column and do a count(*) in 100 ms is mind-blowing. I would pay for a MapD-style backend for PostgreSQL.


While a GPU has excellent memory bandwidth and lots of parallel processors, its connection to the rest of the system makes it not ideal for database work. If you can keep bitset indexes on the GPU, maybe. But data transfer and kernel launch overhead will dominate the timing.

But dual-socket EPYC systems have ~300 GB/s of bandwidth to main memory, and 1-4 TB machines are available right now.
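
Back-of-envelope (my numbers, assuming PCIe 3.0 x16 at ~16 GB/s):

    1 TB / ~16 GB/s  (PCIe 3.0 x16)         ≈ 60+ s just to ship the data to the GPU
    1 TB / ~300 GB/s (dual-socket EPYC RAM) ≈ 3-4 s to scan it in place on the CPU

So for a single-pass scan, the transfer alone loses before the GPU does any work.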


I wrote a paper on pretty much that problem: "Evaluation of Streaming Aggregation on Parallel Hardware Architectures", https://www.scott-a-s.com/files/debs2010.pdf

Conclusion in brief: in order to "win" when trying to use an off-board accelerator, you need to look at each byte transferred more than once. If you only look at a byte once to compute your result, then it's going to be faster to just do the computation on the CPU.

The reason: the up-front cost of transferring that byte is high. The win comes from reusing that byte many times on the massively parallel architecture of the accelerator.
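
A rough way to state that break-even (my notation, not the paper's): with B bytes to process, r the number of times each byte is touched, and link/accelerator/CPU bandwidths BW_link, BW_accel, BW_cpu, offloading wins roughly when

    B/BW_link + r*B/BW_accel  <  r*B/BW_cpu

With r = 1 the B/BW_link term alone usually blows the budget; the inequality only tips in the accelerator's favour as r grows.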


Still works if your goal is generating heavy OLAP reports, and your reports can be calculated per-shard and then reduced at the end. If your query can be expressed in ten bytes, and the answer can be expressed in ten bytes, but in between there’s several TBs of data to run through in tons and tons of parallel-sequential passes that don’t involve any data that’s not already on the GPU, a GPU DBMS cluster seems optimal.

——

For more regular workloads, though, it seems like a proper GPU-targeted DBMS architecture wouldn’t use single “huge” GPUs, but rather would shard data (and thus workloads) across a cluster of commodity boxes with commodity GPUs, such that each GPU holds just enough data for query traffic to that GPU to barely saturate the PCI-e lanes.

Which is another way to say: maybe Intel’s Phi was ahead of its time. Consider a Phi as an entire computer (because it is, and you can program it as such), and a computer containing more than one of them as a cluster backplane with a shared SAN (because it is, and you can access it from the Phis as such). I have a feeling this architecture will see a comeback, either in this form, or in the (roughly equivalent) form of backplane-connected clusters of highly multicore single-board computers.


Take a look at this:

https://www.amd.com/en/products/professional-graphics/radeon...

With that technology, maybe it would be possible to keep a complete replica on the GPU.


pg-strom does that and more:

https://github.com/heterodb/pg-strom

It can be used to optimize things like counts and other aggregates over a standard Postgres table. It also provides a foreign data wrapper column store that optimizes certain analytical queries by greatly speeding up GPU transfers. Additionally, it can use the GPU as a "co-processor" to do large amounts of row filtering before handing the final work off to the CPU.
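
As a rough sketch of what that looks like (table and column names are made up; pg_strom also has to be in shared_preload_libraries):

    CREATE EXTENSION pg_strom;
    SET pg_strom.enabled = on;        -- let the planner consider GPU paths

    EXPLAIN
    SELECT device_id, count(*)
      FROM readings
     GROUP BY device_id;              -- plan may show GpuScan / GpuPreAgg nodes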


Isn't BlazingDB working on something like this?


What's your data volume?


Around 7 TB and growing


I didn't really understand how you managed to get all these extensions working together. Last time I spoke to people at Citus, CitusDB was not fully compatible with cstore_fdw and it's a bit confusing to use TimescaleDB and CitusDB together. Do you mind letting us know the company that you're working at or writing a blog post about it?


Yeah, I can't imagine everything working perfectly when using various combinations of timescale/citus/pipelinedb functionality.

cstore_fdw is... yeah, there might be other edge cases there, but frankly, it's not a particularly good solution, as it severely underperforms vs. any mature column store DB. Though there's an argument for using it to avoid more complicated ETL from the production relational DB to an analytics DB.


Perhaps there was a communication glitch; to my knowledge cstore_fdw has always worked with Citus. Here's an excerpt from the docs:

"Cstore_fdw is developed by Citus Data and can be used in combination with Citus, a postgres extension that intelligently distributes your data and queries across many nodes so your database can scale and your queries are fast."

https://github.com/citusdata/cstore_fdw
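
For reference, basic usage per that README looks roughly like this (the column definitions are just an example):

    CREATE EXTENSION cstore_fdw;
    CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

    CREATE FOREIGN TABLE events (
        event_time  timestamptz,
        customer_id bigint,
        amount      numeric
    )
    SERVER cstore_server
    OPTIONS (compression 'pglz');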


That may be the case, but if you look at the commit history of cstore_fdw on GitHub, it's not that active. Also, none of the paid Citus customers were using cstore_fdw with CitusDB last year to my knowledge, so that was not their focus even though it was the initial idea. The Citus engineer I asked was not even sure whether cstore_fdw can be used with the distributed version of Citus.

My conclusion was that most Citus users use it for OLTP workloads, since PG doesn't really have a distributed query planner that works smoothly with OLAP workloads.


That’s cool, what sort of query latency do you get? (For a variety of queries).


We pay close attention to data growth patterns and schema changes to keep around 70% of queries (which actually bring in a little over 90% of business revenue) under 50 milliseconds. The remaining 30% of queries are either low priority or backend jobs, for which a latency range of 200 ms to 1.5 seconds is viable for our use case.


Oh wow that’s really quite good.

My team and I are re-architecting/upgrading our data infrastructure at work, so it's good to know this kind of performance is possible at a scale quite a bit above where we're at now. I also really like your approach of keeping your main revenue queries under a hard latency limit.



