You know, sometimes you are in that uncomfortable spot where you have too much data for a single laptop but too little to justify running a whole computing cluster.

That is the kind of spot where you max out everything you can max out and just go take a break when something intensive is running.




This - honestly, depending on the task, hundreds of GB can still be in the "single computer" realm, because setting up a cluster just isn't worth it in terms of time, money, and administration overhead. However, parallel + out-of-core computation doesn't necessarily imply a cluster: single-node Spark or something like Dask works fine if you're in the Python world.
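For the Python route, here's a minimal Dask sketch of out-of-core processing on one machine (the file glob and the "key"/"amount" columns are made-up placeholders, not anything specific):

    # Out-of-core aggregation with Dask on a single machine.
    import dask.dataframe as dd

    # Lazy read: nothing is loaded until .compute(), so the
    # dataset can be much larger than RAM.
    df = dd.read_csv("data/*.csv")

    # This builds a task graph; at compute time, partitions
    # stream through memory a chunk at a time, using all
    # local cores in parallel.
    total = df.groupby("key")["amount"].sum()
    print(total.compute())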


Setting up an ad hoc (aka standalone) Spark cluster with a bunch of machines you have control over is a ridiculously trivial task, though. You designate one machine as the master and run start-master.sh on it; on every other machine you run start-worker.sh spark://master:7077 and it joins as a worker of that master. Then you just submit jobs to the master and that's all.
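A minimal PySpark sketch of a job pointed at such a standalone master (the host name and input path below are placeholder assumptions):

    # Submit with: spark-submit --master spark://master-host:7077 job.py
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("spark://master-host:7077")  # assumed master URL
        .appName("adhoc-standalone-demo")
        .getOrCreate()
    )

    # Hypothetical input path; the work is distributed across
    # every worker registered with the master.
    df = spark.read.csv("/shared/data/*.csv", header=True, inferSchema=True)
    df.groupBy("key").count().show()

    spark.stop()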


Spark is slow, though. On the other hand, Pandas is also extraordinarily slow :D


Then you remote into a workstation, as someone else in this thread said they did.



