One SQL to Rule Them All: Management of Streams and Tables (arxiv.org)
260 points by dwenzek on May 31, 2019 | 35 comments



Great write-up. SQL is the most underrated language.

I am doing most of the stuff mentioned in that paper right now in my project with PipelineDB extension.

PostgreSQL 11 +

CitusDB extension (horizontal scaling and sharding) +

cstore_fdw extension (columnar data storage) +

file_fdw extension (to treat files as tables) +

PipelineDB extension (streaming computations and materialized views) +

TimescaleDB extension (timeseries data storage)

Also, the PG 12 release will have many optimizations for table partitioning, query planning, and parallel query execution.
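As a rough illustration of where PipelineDB fits in that stack, a continuous aggregate over a stream looks something like the sketch below (PipelineDB 1.x extension syntax from memory; the page_views/pv_counts names are made up, check the docs for your version):

    -- a stream is declared as a foreign table on the pipelinedb server
    CREATE FOREIGN TABLE page_views (
        url       text,
        viewed_at timestamptz
    ) SERVER pipelinedb;

    -- a continuous view incrementally materializes the aggregate as rows arrive
    CREATE VIEW pv_counts WITH (action = materialize) AS
        SELECT url, count(*) AS views
        FROM page_views
        GROUP BY url;

    -- writes to the stream update the view; reads are plain SQL
    INSERT INTO page_views (url, viewed_at) VALUES ('/home', now());
    SELECT * FROM pv_counts ORDER BY views DESC LIMIT 10;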


One thing that's missing is GPU acceleration, like MapD.

Being able to group by a column and do a count(*) in 100 ms is mind-blowing. I would pay for a MapD-style backend for PostgreSQL.


While a GPU has excellent memory bandwidth and lots of parallel processors, its connection to the rest of the system makes it not ideal for database work. If you can have bit set indexes on the GPU, maybe. But data transfer and kernel launch will dominate in the timing.

But dual-socket EPYC systems have ~300 GB/s of bandwidth to main memory, and 1-4 TB machines are available right now.


I wrote a paper on pretty much that problem: "Evaluation of Streaming Aggregation on Parallel Hardware Architectures", https://www.scott-a-s.com/files/debs2010.pdf

Conclusion in brief: in order to "win" when trying to use an off-board accelerator, you need to look at each byte transferred more than once. If you only look at a byte once to compute your result, then it's going to be faster to just do the computation on the CPU.

The reason: the up-front cost of transferring that byte is high. The win comes from reusing that byte many times in the massively parallel architecture on the accelerator.
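A rough back-of-envelope with my own numbers (not from the paper): PCIe 3.0 x16 moves on the order of 16 GB/s, so shipping 1 GB to the accelerator costs roughly 60 ms before any computation starts, while a CPU socket with ~100 GB/s of memory bandwidth can stream that same 1 GB in roughly 10 ms. A single-pass aggregation therefore loses on the transfer alone; only multi-pass work over data already resident on the accelerator amortizes the copy.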


Still works if your goal is generating heavy OLAP reports, and your reports can be calculated per-shard and then reduced at the end. If your query can be expressed in ten bytes, and the answer can be expressed in ten bytes, but in between there’s several TBs of data to run through in tons and tons of parallel-sequential passes that don’t involve any data that’s not already on the GPU, a GPU DBMS cluster seems optimal.

——

For more regular workloads, though, it seems like a proper GPU-targeted DBMS architecture wouldn’t use single “huge” GPUs, but rather would shard data (and thus workloads) across a cluster of commodity boxes with commodity GPUs, such that each GPU has just enough data held on it for query traffic to that GPU to barely saturate the PCI-e lanes.

Which is another way to say: maybe Intel’s Phi was ahead of its time. Consider a Phi as an entire computer (because it is, and you can program it as such), and a computer containing more than one of them as a cluster backplane with a shared SAN (because it is, and you can access it from the Phis as such). I have a feeling this architecture will see a comeback, either in this form, or in the (roughly equivalent) form of backplane-connected clusters of highly multicore single-board computers.


Take a look at this:

https://www.amd.com/en/products/professional-graphics/radeon...

With that technology, maybe it would be possible to keep a complete replica on the GPU.


pg-strom does that and more:

https://github.com/heterodb/pg-strom

It can be used to optimize things like counts and other aggregates over a standard Postgres table. It also provides a foreign data wrapper column store to optimize certain analytical queries by greatly speeding up GPU transfers. Additionally, it can use the GPU as a "co-processor" to do large amounts of row filtering before handing the final work off to the CPU.
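For anyone curious what that looks like in practice, a minimal sketch (pg_strom is loaded via shared_preload_libraries; the table and column names here are illustrative, and exact plan-node names may differ by version, so check the pg_strom docs):

    -- with pg_strom loaded, eligible aggregates show GPU custom-scan nodes
    -- (e.g. GpuPreAgg) in the query plan instead of a plain aggregate
    EXPLAIN
    SELECT device_id, count(*), avg(temperature)
    FROM sensor_readings
    GROUP BY device_id;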


Isn't BlazingDB working on something like this?


What's your data volume?


Around 7 TB and growing


I didn't really understand how you managed to get all these extensions working together. Last time I spoke to people at Citus, CitusDB was not fully compatible with cstore_fdw and it's a bit confusing to use TimescaleDB and CitusDB together. Do you mind letting us know the company that you're working at or writing a blog post about it?


Yeah, I can't imagine everything working perfectly when using various combinations of timescale/citus/pipelinedb functionality.

cstore_fdw is... yeah, there might be other edge cases there, but frankly, it's not a particularly good solution as it severely underperforms vs. any mature column store DB. Though - there's an argument for using it in order to avoid more complicated ETL from the production relational DB to an analytics DB.


Perhaps there was a communication glitch; to my knowledge cstore_fdw has always worked with Citus. Here's an excerpt from the docs:

"Cstore_fdw is developed by Citus Data and can be used in combination with Citus, a postgres extension that intelligently distributes your data and queries across many nodes so your database can scale and your queries are fast."

https://github.com/citusdata/cstore_fdw


That may be the case, but if you look at the commit history of cstore_fdw on GitHub, it's not that active. Also, none of the paid Citus customers were using cstore_fdw with CitusDB last year to my knowledge, so that was not their focus even though this was the initial idea. The Citus engineer that I asked this question was not even sure if cstore_fdw can be used with the distributed version of Citus.

My conclusion was that most Citus users use it for OLTP workloads, since PG doesn't really have a distributed query planner that works smoothly with OLAP workloads.


That’s cool, what sort of query latency do you get? (For a variety of queries).


We pay close attention to data growth patterns and schema changes to keep around 70% of queries (which actually bring a little over 90% of business revenue) under 50 milliseconds. The remaining 30% of queries are either low priority or backend jobs, for which a latency range of 200 ms to 1.5 seconds is viable for our use case.


Oh wow that’s really quite good.

My team and I are re-architecting/upgrading our data infrastructure at work, so it’s good to know this kind of performance is possible at a scale quite a bit above where we’re at now. I also really like your approach of keeping your main revenue queries under a hard latency limit.


A really interesting proposal to unify stream processing and relational querying and to extend SQL with a semantics over time-varying relations.

In particular, the authors (from the Apache Beam/Flink/Calcite projects) stress the need to:

* smooth the frontier between streams and tables by using a single concept of time-varying relations;

* make explicit the difference between the event and processing times, using monotonic watermarks to tie the two;

* control when these continuously varying relations are materialized, avoiding unnecessary work and producing only timely outcomes.

These ideas are not new per se, but they are pushed further here and nicely combined.

However, I'm wondering if SQL is the right tool for the task. For instance, Listing 2 seems complex when compared to the query expressed in plain English. In particular, I disagree that "anyone who understands enough SQL to solve a problem in a non-streaming context still has the knowledge required to solve the problem in a streaming context as well."

A must-read paper for anyone who is keen on stream processing, Spark, Flink, Beam, Kafka Streams...
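For readers who don't have the paper open, the flavor of the proposal is roughly the following (my paraphrase from memory of the TUMBLE/EMIT constructs, not the paper's exact listing):

    -- hourly per-team totals over an unbounded stream, with each window's
    -- result emitted once the watermark passes the end of the hour
    -- (paraphrased sketch, not verbatim from the paper)
    SELECT team, SUM(score) AS total
    FROM Scores
    GROUP BY team, TUMBLE(event_time, INTERVAL '1' HOUR)
    EMIT AFTER WATERMARK;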


> However, I'm wondering if SQL is the right tool for the task. For instance, Listing 2 seems complex when compared to the query expressed in plain English

For me Listing 2 is more complex and less intuitive than Listing 1 (CQL). Yet, it is a general problem when we try to adopt relational (and SQL) concepts for solving such kinds of tasks (windowing, grouping, aggregation, etc.). One solution is to switch to some kind of column algebra rather than relational algebra, as described in [1] and [2] (it has also been applied to stream processing).

[1] Joins vs. Links or Relational Join Considered Harmful: https://www.researchgate.net/publication/301764816_Joins_vs_...

[2] From Group-By to Accumulation: Data Aggregation Revisited: https://www.researchgate.net/publication/316551218_From_Grou...


Thanks for the two links. I will have a look.


I agree that listing 2 is a little verbose, but once you understand the syntax (not much different from vanilla SQL), it's pretty powerful. I think writing queries this way will ultimately allow for more expressiveness than something more concise. That isn't to say that this particular SQL is necessarily the best, but I think it works well for the problem at hand.


Sure, SQL is powerful, and I agree that any alternative would also bring its own complexity. However, my concerns are more about encapsulation than about expressiveness and conciseness. Even if a watermark is used the same way in different queries, I will have to use the same query clause again and again, as I already have to with join conditions. Not a fundamental issue, but something which makes it more difficult for me to focus on the kernel of a query.


> However, I'm wondering if SQL is the right tool for the task.

In my anecdotal experience, yep, it's a data analysis language everyone in the industry knows - BAs, product managers, developers, and data scientists.


Having a standard SQL extension definitely eases adoption and the ability to migrate from one tool to another. But rather than adding and adding features to SQL, maybe we need a simpler way to extend/compose features.

On that point, I see a contradiction in the future work section of the paper, where it says that "Experience has also shown that pre-built solutions are never sufficient for all use cases; ultimately, users should be able to utilize the power of SQL to describe their own custom-windowing TVFs". But how can a SQL user add their own custom operator?


It seems that the main proposal is to make SQL watermark-aware. This additional "knowledge" allows groups and windows to be produced while taking into account some late and out-of-order records.


> A time-varying relation is just what it says: a relation that changes over time ... Critically, the full suite of existing SQL operators remain valid on time-varying relations

Any relation by definition is supposed to change in time. So why do we need to introduce a special kind of relation? The authors essentially confirm that there is no difference between classical relations and "time-varying" relations.

Also, it is not clear how this relates to EPL (in CEP), which is also based on SQL.


According to relational theory, relations are immutable sets of tuples. The way you “change” a relation is by assigning a new relation to the name, e.g.:

R1 <- R1 union R2

Is the relational algebra way of “inserting” R2 into R1. Of course, most RDBMS’s (and the SQL family of languages) are only reasonably faithful implementations of relational theory under some practical constraints.
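In plain SQL the same "reassignment" is just an insert (trivial sketch; note the algebra's union deduplicates, while INSERT ... SELECT keeps bag semantics):

    -- SQL's mutation of a named table plays the role of R1 <- R1 union R2
    INSERT INTO r1
    SELECT * FROM r2;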

But those differences are important, and I think the authors are proposing a new relational model (though I don’t think this is a totally new idea; I’m not a database researcher). One that incorporates time and the history of a relation as first-class citizens of the theory, and that allows SQL to operate natively on both traditional relational-theoretic relations and streams of data. Pretty cool! Can anyone who knows more than me comment on how this relates to Kafka tables, which seem to me a similar concept?


Relations in most systems don’t implicitly store the time a change occurred or the past versions of a row, nor do they have a model for building a “now table” from an “event table”. When people talk about time-varying tables or temporal tables, they are describing event sourcing in a SQL context.
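A minimal PostgreSQL-flavored sketch of deriving a "now table" from an append-only event table (the account_events table and its columns are made up for illustration):

    -- latest state per key, derived on the fly from the event log
    SELECT DISTINCT ON (account_id)
           account_id, balance, event_time
    FROM account_events
    ORDER BY account_id, event_time DESC;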


Is the point of this paper that SQL should be used in streaming data pipelines? Aren’t people doing this already?


Not yet time to combine SQL and NoSQL.


What does this article/paper have to do with "NoSQL"?


Commenting to read again.


You can use the favorite link under the submission title to effectively bookmark it. Your favorites are available in your profile.


Or an upvote.


thanks grzm



