
"While running on 20 TB of input, we discovered that we were generating too many output files (each sized around 100 MB) due to the large number of tasks."

I could be completely missing something here, but to decrease the number of output files you can coalesce() the RDD before you write. For example, say you have a 20-node cluster with 10 executors per node, and the RDD action is split into 20000 tasks, leaving you with 20000 partitions (or more). You can coalesce to reduce that down to 200 partitions. Or, if you're really motivated, you could even shuffle (moving data across executor JVMs) down to 20 partitions.
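
For concreteness, here's roughly what I mean (a minimal sketch in the Scala RDD API; the paths, transform, and partition counts are made up, not from the article):

  // Assume sc is the usual SparkContext from spark-shell.
  val rdd = sc.textFile("hdfs:///input")   // hypothetical input, ~20000 partitions
    .map(_.toUpperCase)                    // stand-in for the real transform
  // Narrow coalesce: merges down to 200 partitions (and output files)
  // without triggering a shuffle.
  rdd.coalesce(200).saveAsTextFile("hdfs:///out-200")
  // With shuffle = true, data is redistributed, so you can go all the
  // way down to 20 partitions if you want very few files.
  rdd.coalesce(20, shuffle = true).saveAsTextFile("hdfs:///out-20")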

What am I missing?




"Remove the two temporary tables and combine all three Hive stages into a single Spark job that reads 60 TB of compressed data and performs a 90 TB shuffle and sort."

"As far as we know, this is the largest real-world Spark job attempted in terms of shuffle data size"

I'm far, far from a world-class engineer, but I regularly do 90 TiB shuffle sorts. I must seriously be missing something here.


Have you run into any of the issues mentioned in the article? Some of them are regressions; which version of Spark were you running?

Out of the linked issues, these all seem like they would be "easy" to hit given enough data:

https://issues.apache.org/jira/browse/SPARK-13279

https://issues.apache.org/jira/browse/SPARK-13850

https://issues.apache.org/jira/browse/SPARK-13958

https://issues.apache.org/jira/browse/SPARK-14363


Are you using Spark? That's the context.


Yes


Coalescing down to a smaller partition count does decrease the number of output files. But it also decreases parallelism, which isn't what you want when processing a dataset this large.

Coalescing makes more sense when some stage of the pipeline dramatically shrinks the data (e.g. grepping error lines out of all the log files), so that later stages can handle the remainder with far fewer executors. A sketch of that pattern follows.
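
(The paths and partition count here are illustrative, not from any real job.)

  // The filter drops most of the data, so the few surviving records
  // don't need thousands of partitions downstream.
  val errors = sc.textFile("hdfs:///logs")   // thousands of partitions in
    .filter(_.contains("ERROR"))             // shrinks the data dramatically
    .coalesce(16)                            // narrow merge, no shuffle
  errors.saveAsTextFile("hdfs:///errors")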

(disclosure: Spark committer)


Yeah, I didn't quite get that part either.



