Hacker News

This is very intriguing. I can't help but wonder how this operation would have performed on something like Google BigQuery. I know it's highly unlikely that Facebook would ever load its data into Google Cloud Platform, but it would be an interesting comparison.



While both Spark and BigQuery do the shuffle step in memory, there are some differences [1]:

- BigQuery's execution is pipelined (it doesn't wait for step 1 to finish before starting step 2)

- BigQuery's in-memory shuffler is not co-tenant to compute

And, of course, having the software and the hardware is only part of it. BigQuery provides a fully managed, fully encrypted, highly available, redundant service that is continuously and seamlessly maintained and upgraded [2].

[1] https://cloud.google.com/blog/big-data/2016/08/in-memory-que...

[2] https://cloud.google.com/blog/big-data/2016/08/google-bigque...

(disclosure: I work on BigQuery)


Would pipelining help much when the processing job is CPU bound (all cores maxed out)?

Sorry - what does co-tenant to compute mean?


Pipelining helps with "hanging chads", i.e. stragglers that dominate the tail latency of a work step. If you have a slow worker (due to, say, data skew), the entire job slows down: all the other workers sit idle, waiting for that one worker to finish its piece. Read [0] to see what Dataflow does; BigQuery/Dremel do very similar things to deal with this issue. BigQuery also doesn't have to wait for ALL workers to finish step 1 before proceeding to step 2.
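To make the straggler effect concrete, here is a rough toy model in Python. It is only a sketch under simplifying assumptions and not how BigQuery or Spark actually schedule work: each step-1 task is assumed to produce one independent unit of step-2 work, step-2 tasks all cost the same, and the function names and numbers are made up for illustration.

```python
import heapq
import math


def barrier_makespan(stage1_finish, stage2_cost, stage2_workers):
    """Step 2 starts only after every step-1 task has finished."""
    start = max(stage1_finish)
    waves = math.ceil(len(stage1_finish) / stage2_workers)
    return start + waves * stage2_cost


def pipelined_makespan(stage1_finish, stage2_cost, stage2_workers):
    """Each step-1 output is handed to step 2 as soon as it is ready."""
    # Min-heap of the times at which each step-2 worker becomes free.
    free_at = [0.0] * stage2_workers
    heapq.heapify(free_at)
    makespan = 0.0
    for ready in sorted(stage1_finish):
        worker_free = heapq.heappop(free_at)
        finish = max(ready, worker_free) + stage2_cost
        heapq.heappush(free_at, finish)
        makespan = max(makespan, finish)
    return makespan


# Assumed workload: seven step-1 tasks finish at t=10, one skewed task at t=30.
stage1_finish = [10.0] * 7 + [30.0]
barrier = barrier_makespan(stage1_finish, stage2_cost=5.0, stage2_workers=2)
pipelined = pipelined_makespan(stage1_finish, stage2_cost=5.0, stage2_workers=2)
print(barrier, pipelined)  # 50.0 35.0
```

In this model the straggler still bounds the finish time, but with pipelining only its own step-2 work is left once it completes, instead of the whole second step.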

By co-tenant to compute, I mean that processing nodes themselves handle the shuffle in Spark. This can cause non-obvious bottlenecks. BigQuery handles shuffle outside of the processing nodes [1].

[0] https://cloud.google.com/blog/big-data/2016/05/no-shard-left...

[1] https://cloud.google.com/blog/big-data/2016/08/in-memory-que...


I can't speak for BigQuery, but I can for our open-source BigQuery alternative, EventQL [0].

On the upside, massively parallel SQL databases allow you to express in a single SQL query what probably took Facebook a lot of code.

On the downside, with any database that supports streaming inserts and dynamic resharding, you will eventually have to write each entry more than once. The more manual approach with Spark avoids this.
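A toy Python model of that write amplification, purely as an illustration: the split policy below (a partition that exceeds a capacity is split in two and all of its entries are rewritten) and the routing to the smallest partition are assumptions made for the sketch, not a description of EventQL's or BigQuery's actual resharding.

```python
def streaming_writes(num_entries, capacity):
    """Count physical writes when overfull partitions are split and rewritten."""
    partitions = [0]  # number of entries held by each partition
    writes = 0
    for _ in range(num_entries):
        # Assumed routing: send the entry to the currently smallest partition.
        idx = partitions.index(min(partitions))
        partitions[idx] += 1
        writes += 1  # the initial streaming insert
        if partitions[idx] > capacity:
            size = partitions[idx]
            writes += size  # every entry is rewritten into the new partitions
            partitions[idx] = size // 2
            partitions.append(size - size // 2)
    return writes


entries = 8
total_writes = streaming_writes(entries, capacity=4)
print(total_writes, total_writes / entries)  # 13 1.625
```

A batch job that knows the full data size up front can pick its partitioning once and write each entry a single time, which is the trade-off described above.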

[0] https://eventql.io


It's exciting to see innovative new open-source contributions to the big data space. BigQuery has moved on in many ways from the ColumnIO and Dremel concepts described in the Dremel paper [0], with new versions of many of these components (in-memory shuffle, new storage, new ingest, a new execution engine, etc.), so I'd love to learn where EventQL places itself on that spectrum.

[0] http://static.googleusercontent.com/media/research.google.co...


My guess is that it wouldn't be too bad, but at the volumes Facebook is handling, I suspect a lot of their speed comes from custom optimisations for their use cases, which BigQuery wouldn't have.



