This is very intriguing, I cannot help but wonder how this operation would have ...

vgt · on Sept 1, 2016

While both Spark and BigQuery do the shuffle step in-memory, there are some differences[1]:

- BigQuery's execution is pipelined (step 2 can start before step 1 has finished)

- BigQuery's in-memory shuffler is not co-tenant to compute

And, of course, it's one thing to have the software and hardware; operating them is another. BigQuery provides a fully managed, fully encrypted, HA, redundant service that is constantly and seamlessly maintained and upgraded [2].

[1] https://cloud.google.com/blog/big-data/2016/08/in-memory-que...

[2] https://cloud.google.com/blog/big-data/2016/08/google-bigque...

(Disclosure: I work on BigQuery.)

ap22213 · on Sept 1, 2016

Would pipelining help much when the processing job is CPU bound (all cores maxed out)?

Sorry - what does co-tenant to compute mean?

vgt · on Sept 1, 2016

Pipelining helps with hanging chads, i.e. the tail latency of work steps. If you have a slow worker (due to, say, data skew), the entire job slows down. All the other workers sit idle, waiting for that one worker to finish its piece. Read [0] to see what Dataflow does; BigQuery/Dremel do very similar things to deal with this issue. BigQuery also doesn't have to wait for ALL workers to finish step 1 before proceeding to step 2.
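To make the straggler effect concrete, here is a rough toy model in Python. It is only a sketch under loud assumptions: the function names and timings are made up for illustration, each worker's step 2 is assumed to depend only on that same worker's step 1 output, and nothing here reflects how BigQuery or Spark actually schedule work.

```python
# Toy model (an assumption, not BigQuery's or Spark's real scheduler):
# worker i spends step1[i] seconds on step 1 and step2[i] seconds on step 2,
# and step 2 on worker i only needs worker i's own step 1 output.

def barrier_finish_time(step1, step2):
    """Step 2 starts only after every worker has finished step 1."""
    return max(step1) + max(step2)


def pipelined_finish_time(step1, step2):
    """Each worker starts step 2 as soon as its own step 1 is done."""
    return max(a + b for a, b in zip(step1, step2))


step1 = [10, 2, 2, 2]  # worker 0 is the step 1 straggler (e.g. data skew)
step2 = [1, 1, 1, 6]   # worker 3 is the step 2 straggler

print(barrier_finish_time(step1, step2))    # 16
print(pipelined_finish_time(step1, step2))  # 11
```

In this model the pipelined finish time can never exceed the barrier one, because max(a + b) <= max(a) + max(b). A real shuffle has cross-worker dependencies that this sketch ignores, so treat the numbers as an illustration only.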

By co-tenant to compute, I mean that in Spark the processing nodes themselves handle the shuffle. This can cause non-obvious bottlenecks. BigQuery handles the shuffle outside of the processing nodes [1].

[0] https://cloud.google.com/blog/big-data/2016/05/no-shard-left...

[1] https://cloud.google.com/blog/big-data/2016/08/in-memory-que...

paulasmuth · on Sept 1, 2016

I can't speak for BigQuery, but I can for our open-source BigQuery alternative, EventQL [0].

On the upside, massively parallel SQL databases allow you to express what probably took Facebook a lot of code in a single SQL query.

On the downside, with any database that supports streaming inserts and dynamic resharding, you will eventually have to write each entry more than once. The more manual approach with Spark avoids this.

[0] https://eventql.io

vgt · on Sept 1, 2016

It's exciting to see new, innovative open source contributions to the big data space. BigQuery has moved on in many ways from the ColumnIO and Dremel concepts described in the Dremel paper [0], with new versions of many of these components (in-memory shuffle, new storage, new ingest, execution engine, etc.), so I'd love to learn where EventQL places itself on that gradient.

[0] http://static.googleusercontent.com/media/research.google.co...

danpalmer · on Sept 1, 2016

My guess is that it wouldn't be too bad, but at the volumes Facebook are handling, I suspect a lot of their speed comes from custom optimisations they can make for their use cases, which BigQuery wouldn't have.