You know, sometimes you are in that uncomfortable spot where you have too much data for a single laptop but too little to justify running a whole computing cluster.

That is the kind of spot where you max out everything you can max out and just go take a break when something intensive is running.




This - honestly, depending on the task, hundreds of GB can still be in the "single computer" realm, because setting up a cluster just isn't worth it in terms of time, money, and administration overhead. However, parallel + out-of-core computation doesn't necessarily imply a cluster: single-node Spark or something like Dask works fine if you're in the Python world.
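For the Python route, here's a minimal Dask sketch of out-of-core processing on one machine (the file glob and the "key"/"amount" columns are made-up placeholders, not anything specific):

    # Out-of-core aggregation with Dask on a single machine.
    import dask.dataframe as dd

    # Lazy read: nothing is loaded until .compute(), so the
    # dataset can be much larger than RAM.
    df = dd.read_csv("data/*.csv")

    # This builds a task graph; at compute time, partitions
    # stream through memory a chunk at a time, using all
    # local cores in parallel.
    total = df.groupby("key")["amount"].sum()
    print(total.compute())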


Setting up an ad hoc (aka standalone) Spark cluster with a bunch of machines you have control over is a ridiculously trivial task, though. You designate one machine as the master and run start-master.sh on it; on every other machine you run start-worker.sh spark://master:7077 and it joins as a worker of that master. Then you just submit jobs to the master and that's all.
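A minimal PySpark sketch of a job pointed at such a standalone master (the host name and input path below are placeholder assumptions):

    # Submit with: spark-submit --master spark://master-host:7077 job.py
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("spark://master-host:7077")  # assumed master URL
        .appName("adhoc-standalone-demo")
        .getOrCreate()
    )

    # Hypothetical input path; the work is distributed across
    # every worker registered with the master.
    df = spark.read.csv("/shared/data/*.csv", header=True, inferSchema=True)
    df.groupBy("key").count().show()

    spark.stop()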


Spark is slow, though. On the other hand, Pandas is also extraordinarily slow :D


Then you remote into a workstation, as someone else in this thread said they did.



