
"While running on 20 TB of input, we discovered that we were generating too many output files (each sized around 100 MB) due to the large number of tasks."

I could be completely missing something here, but to decrease the number of output files you can coalesce() the RDD before you write. For example, say you have a 20-node cluster with 10 executors per node, and the RDD action is split into 20000 tasks, leaving you with 20000 partitions (or more). You can coalesce to reduce that down to 200 partitions. Or, if you're really motivated, you could even shuffle (moving data across executor JVMs) down to 20 partitions.
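
For concreteness, here's roughly what I mean (a minimal sketch in the Scala RDD API; the paths, transform, and partition counts are made up, not from the article):

  // Assume sc is the usual SparkContext from spark-shell.
  val rdd = sc.textFile("hdfs:///input")   // hypothetical input, ~20000 partitions
    .map(_.toUpperCase)                    // stand-in for the real transform
  // Narrow coalesce: merges down to 200 partitions (and output files)
  // without triggering a shuffle.
  rdd.coalesce(200).saveAsTextFile("hdfs:///out-200")
  // With shuffle = true, data is redistributed, so you can go all the
  // way down to 20 partitions if you want very few files.
  rdd.coalesce(20, shuffle = true).saveAsTextFile("hdfs:///out-20")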

What am I missing?




"Remove the two temporary tables and combine all three Hive stages into a single Spark job that reads 60 TB of compressed data and performs a 90 TB shuffle and sort."

"As far as we know, this is the largest real-world Spark job attempted in terms of shuffle data size"

I'm far, far from a world-class engineer, but I regularly do 90 TiB shuffle sorts. I must seriously be missing something here.


Have you run into any of the issues mentioned in the article? Some of them are regressions; which version of Spark were you running?

Out of the linked issues, these all seem like they would be "easy" to hit given enough data:

https://issues.apache.org/jira/browse/SPARK-13279

https://issues.apache.org/jira/browse/SPARK-13850

https://issues.apache.org/jira/browse/SPARK-13958

https://issues.apache.org/jira/browse/SPARK-14363


Are you using Spark? That's the context.


Yes


Coalescing down to a smaller partition count does decrease the number of output files. But it also decreases parallelism, which isn't what you want when processing a dataset this large.

Coalescing makes more sense when some stage of the pipeline dramatically shrinks the data (e.g. grepping error lines out of all the log files), so that later stages can handle the remainder with far fewer executors. A sketch of that pattern follows.
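
(The paths and partition count here are illustrative, not from any real job.)

  // The filter drops most of the data, so the few surviving records
  // don't need thousands of partitions downstream.
  val errors = sc.textFile("hdfs:///logs")   // thousands of partitions in
    .filter(_.contains("ERROR"))             // shrinks the data dramatically
    .coalesce(16)                            // narrow merge, no shuffle
  errors.saveAsTextFile("hdfs:///errors")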

(disclosure: Spark committer)


Yeah, I didn't quite get that part either.



