Hacker News new | past | comments | ask | show | jobs | submit login
Query Engines: Push vs. Pull (justinjaffray.com)
148 points by jsnell on May 1, 2021 | hide | past | favorite | 35 comments



I'm doing a rewrite[0] of OctoSQL[1], partly to make it push-based instead of pull-based, and for now it's been much simpler to work with, won't get into the detailed reasoning in this comment.

However, if somebody is interested, this is a great paper on how to architect a query compiler: https://www.cs.purdue.edu/homes/rompf/papers/tahboub-sigmod1...

I found it very enlightening and simple to implement. The paper also delves into details why the push-based architecture has more room for optimizations.

[0]:https://github.com/cube2222/octosql/tree/redesign

[1]:https://github.com/cube2222/octosql/


The truth is you need almost everything to do push and pull. Just as pull alone does "forks" inefficiently, push alone does "joins" inefficiently.

People shy away from this conclusion because doing everything both ways is quite hard and requires fancier abstractions to make nice than people are familiar with.


> The truth is you need almost everything to do push and pull. Just as pull alone does "forks" inefficiently, push alone does "joins" inefficiently.

That's really is a simple and intuitive summary. Thank you for that insight!


Thanks, glad to share it!


This is exactly why data patterns like Observables (https://github.com/tc39/proposal-observable) and ReactiveX (http://reactivex.io/) are so powerful. You can design your interface to be subjective to both. :)


It's a bit more complicated than this, because Observables in both frameworks are only required to implement push-style interfaces (that's the public interface of event listeners by design), and if a consumer wishes to pull at its own rate from an arbitrary source, the only general solution is to create an infinite buffer that materializes the pusher's outputs, which is the exact performance tradeoff described in the (excellent) OP article.

http://reactivex.io/documentation/operators/backpressure.htm... describes ReactiveX's approach to the problem in great detail. To quote the section that essentially reaches the same conclusion as the grandparent post pointing out how "forks" (using the same pull-style data source twice) inherently require buffering/materialization or sampling:

> When a cold Observable is multicast (when it is converted into a connectable Observable and its Connect method is called), it effectively becomes hot and for the purposes of backpressure and flow-control it should be treated as a hot Observable. Cold Observables are ideal for the reactive pull model of backpressure implemented by some implementations of ReactiveX (which is described elsewhere). Hot Observables typically do not cope well with a reactive pull model, and are better candidates for other flow control strategies, such as the use of the operators described on this page, or operators like Buffer, Sample, Debounce, or Window.

There are great visuals if you expand the RxJS examples in the link above.


Mind explaining the fork vs joins inefficiency thing a bit further?

I feel like there's some wisdom here, but I don't think I grocked it


With forks (in the article DAGs vs trees), pure pulling means you pull the same thing twice.

With joins, pushing only works if both sides push at the same time.

In general, joins work by a push on one side triggering a pull on the other side, and then both are combined.

For forks there should be some subscribe mechanism to turn pulls into pushes that can cheaply fan out to all subscribers.

(Side note, the mitigations ought to be completely symmetric. This is a sign there is more to the problem that I am missing.)


I believe joins work without simultaneous pushing. One side can build first, the other side can probe in a pushed fashion after but not necessarily simultaneously.

“Trigger” is not necessarily the relationship, between build and probe sides though it could be. The DAG paths that produce/push the probe can be triggered by the completion of the build. “Happens after” is more strictly apt.


People have been vacillating on this forever and the truth is... (drum roll) It depends.


The DuckDB folks are migrating from pull to push and put together this interesting documentation of their reasoning! They use a vectorized model instead of a compiled one, so it's another interesting comparison point. It seems like push will simplify how they handle parallelism.

https://github.com/duckdb/duckdb/issues/1583


(I'm the author of this post) That's a great document I hadn't seen, thanks for sharing!


> This is the oldest and most well-known query execution model, and was first proposed in 1994.

Is the author younger than 25. Not picturing a world of databases before 1994? Did the world even exist?

I kid but regardless of when a paper describing some process was published the “oldest” query execution model is decidedly a lot older than 26 years.


Author updated it to:

> This is the oldest and most well-known query execution model, and is named for the paper which standardized its conventions in 1994.

Since the paper references previous databases, I'd rather assume incorrect expression from the blog author than ignorance.


Yes my intention was that it's more or less the reference version of that model, certainly databases like System R existed many years before Volcano :) I appreciate the correction.


Seriously though, the article was very well written. I knew it was a cheap shot. Sadly for some of us 1994 doesn’t seem like that long ago.


It continually disappoints me in IT how the same things are rediscovered over and over again. I suppose its because you can get an IT job without much IT education, just make some web sites and you're an expert.


Nicely explained. I like the idle operators/row dichotomy. At a system level push based systems suffer from the same weakness as event driven systems: incompleteness. You can never be sure you haven't missed an event because it was dropped during system startup, network partition, failover, queue full, buffer overflow, packet drop, etc. This doesn't always matter, but with a pull based system you know the result is based on all the data, but as you point out, that has a cost too.


Could you please explain at more length how push/pull plays a role here?

With push-based you'd still usually use lossless transports, which could also have a "closing packet" which tells the consumer that "this stream is done", so I don't see how pull-based is better here.

Re queue full: If I understand you correctly, queues are only a problem when you start having multiple parallel stages one after the other, with queues in-between. Then you have queueing both with push-based and pull-based architectures.


I am only guessing as to the earlier poster's intent. But, I took the earlier comment to be about distributed system application architecture. The statements make some sense to me if implicitly considering boundaries between the application and DB query engine, rather than the internal elements of the query engine itself.

Through this lens, a pull-based architecture is your conventional N-tier application where the stateless application issues queries and consumes results. The individual consumer (and producer) agents are independent and decoupled, allowed to handle failures with simple restart/retry while in control of their own transactions.

Conversely, a push-based architecture is often rendered into some enterprise message bus with queues and brokers. The applications can only subscribe to the queues and pub/sub system, but lack any real control fault recovery or transactional boundaries.


Ah yes, you're totally right, when I think about this in terms of i.e. Prometheus vs Graphite instead of query engines then it makes much more sense.

Thanks!


This sounds a bit like Jon Gjenset's thesis project Noria

> Noria observes that, by having developers provide the set of queries their application will make in advance, the database can be smarter about how to execute those queries. In particular, it can choose to pre-compute, and incrementally maintain, the results for queries. This allows Noria to answer those queries quickly, and essentially obviates the need for application caches.

https://pdos.csail.mit.edu/projects/noria.html


This is a gem, for me, for its conciseness:

> Crossing a boundary from a pull system to a push system requires polling its state, and crossing a boundary from a push system to a pull system requires materialization of its state.


Indexed aka Materialized Views in Microsoft and Oracle have been doing “push” for two decades now since the infamous TPCC debacle. It seems strange not to mention them (if only to rule them out).


I like the use of generator functions for the pull-based example- the syntax closely mirrors the push-based example, which drives home the point that the difference between push and pull lies entirely in whether you compose operators via an iterator API or a callback API. You can also see this in how the two usage examples are inside-out versions of each other.

When you look at it from the perspective of a language designer or compiler, things start to look even more similar:

* Values that live across iterations are stored in a closure (push) or generator object (pull)

* In the first half of an iteration, before producing a result, operators dispatch to their producers via a return address on the call stack (push) or callbacks (pull)

* In the second half of an iteration, while building up a result, operators dispatch to their consumers via callbacks (push) or a return address on the call stack (pull)

The difference the article mentions between "unrolling" the two approaches comes from the level of abstraction where inlining is performed. The push model produces natural-looking output when inlining the closures. The less-natural result of inlining the pull model's `next` functions is where the control flow in loops shows up. (See for example the performance benefits of adding a push-model complement to Rust's usual pull-model Iterator trait: https://github.com/rust-lang/rust/pull/42782#issuecomment-30...)

But, you can get the same natural-looking output from the pull model if you perform inlining at the level of generators! This is a less familiar transformation than function inlining, but it's still fundamentally the same idea, of erasing API boundaries and ABI overhead- which in this case is a more straightforward formulation of the various loop optimizations that can sometimes get rid of the control-flow-in-loops produced by inlining `next` functions.

The difference around DAGs is similar. In the push model, a DAG just means a producer calling into more than one consumer callback. In the pull model, this is tricky because you can't just yield (i.e. return from `next`) to more than one place- return addresses form a stack! (And thus the dual problem for the push model shows up for operations with multiple inputs, like the merge join mentioned in the article.)

Overall I think a more flexible approach might be to generalize from `yield` and iterator APIs, to algebraic effects (or, lower level, delimited continuations)- these let you build both push and pull-like APIs for a single generator-like function, as well as hybrid approaches that combine aspects of both, by removing the requirement for a single privileged call stack.


I'm new to this terminology, but it looks a bit like lazily computing a data stream vs. eagerly computing it, doesn't it? Or am I wrong.


It is a little a bit like that. But the regular formulations of "eager" and "lazy" assume expressions that are evaluated once (or at at least if evaluated the again will yield the same result).

It's better to think of this stuff with constantly changing underlying data, which adds a lot of "color" to the domain.


i was just reading about lean manufacturing and push vs pull is also used there .. i wonder how far is it from query systems


What caught my attention is the claim of push based queries to be easier to generate imperative code, where I could learn about how do that?


Both the linked papers (Neumann[0] and Shaikhha et al[1]) go into this a bit. I personally found figures 3 and 4 in the Neumann paper particularly helpful.

[0] https://www.vldb.org/pvldb/vol4/p539-neumann.pdf

[1] https://arxiv.org/pdf/1610.09166.pdf


LINQ also seems to have a push-based implementation, given its FROM-WHERE-SELECT syntax: https://docs.microsoft.com/en-us/dotnet/csharp/programming-g...


I don't think that is correct. Underlying much of LINQ is IEnumerable<T>, which implements the classic Current and MoveNext() iterator pattern. The foreach keyword uses IEnumerable<T> to "pull" elements one at a time.

LINQ's syntax is just syntactic sugar on top of chained method calls to various interface extension methods.


Also IQueryable, and also the choice of ASTs vs closures, and also lazily evaluated, which means a different plan with a different strategy may intercede


Ahh, you are right. My eyes somehow drifted away from the _pull_ example with yield above the "basic push-based query engine" header, that is the one which I've found familiar.


LINQ is a way of representing queries using monadic construction, with lambdas passed to each operator either directly as closures (for LINQ in memory) or as syntax trees (for execution on databases or elsewhere).

Thanks to the monadic abstractions, they are actually agnostic as to whether the implementation is push or pull or a hybrid. The query is only evaluated on demand, at which point the entire graph is available and can be planned however.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: