
I really like using Slurm. The documentation is great (https://slurm.schedmd.com) and the model is pretty straightforward, at least for the mostly single-node jobs I used it for.

You can launch jobs from the command line, with config in Bash comments, through the REST API, by linking to their library, and I think a few more ways.
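To make the "config in Bash comments" route concrete, here is a minimal sketch that builds a batch script with `#SBATCH` lines and hands it to `sbatch`. The helper names (`render_batch_script`, `submit`) and the example job values are my own placeholders, not anything from Slurm itself, and `submit` only works on a machine with Slurm installed.

```python
import subprocess
import tempfile


def render_batch_script(job_name, cpus, time_limit, command):
    """Return a Bash script whose #SBATCH comment lines carry the job config."""
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --time={time_limit}",
        "",
        command,
        "",
    ])


def submit(script_text):
    """Write the script to a temp file and pass it to sbatch.

    Hypothetical helper: requires a working Slurm installation.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
        f.write(script_text)
    result = subprocess.run(
        ["sbatch", f.name], check=True, capture_output=True, text=True
    )
    return result.stdout.strip()
```

Something like `submit(render_batch_script("train", 4, "01:00:00", "python train.py"))` would then queue the job; the same options can also be passed directly as `sbatch` command-line flags.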

I found it pretty easy to set up and administer. Scaling in the cloud was way less developed when I used it, so I just hacked together a simple script that scaled the cluster up and down based on the job queue size.
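For illustration, queue-based scaling of that kind can be sketched roughly as below. This is not my original script: the function names, the `jobs_per_node` ratio, and the node limits are all assumptions, and actually adding or removing nodes depends on your cloud provider, so that part is left out.

```python
import math
import subprocess


def pending_job_count():
    """Count pending jobs by asking squeue for one job ID per line.

    Requires Slurm; -h drops the header, -t filters by state, -o sets the format.
    """
    out = subprocess.run(
        ["squeue", "-h", "-t", "PENDING", "-o", "%i"],
        check=True, capture_output=True, text=True,
    ).stdout
    return len([line for line in out.splitlines() if line.strip()])


def desired_nodes(pending_jobs, jobs_per_node, min_nodes, max_nodes):
    """Pick a node count that covers the queue, clamped to [min_nodes, max_nodes].

    jobs_per_node is an assumed packing ratio, not something Slurm reports.
    """
    needed = math.ceil(pending_jobs / jobs_per_node)
    return max(min_nodes, min(max_nodes, needed))
```

Running `desired_nodes(pending_job_count(), 4, 1, 10)` from a cron job and comparing the result to the current cluster size is about as simple as this gets. A real version would probably also want to count running jobs and add some delay before scaling down.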

What do you like better, and for what use case? Mine was for a group of researchers training models, and the feature I wanted most was an approximately fair distribution of resources (cores, GPU hours, etc.).
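By "approximately fair" I mean something like the sketch below: users who have consumed more than their share get pushed down the queue. The `2 ** (-usage / share)` shape is, as far as I recall, what Slurm's classic fair-share documentation describes, but treat that as an assumption. The function names and the example numbers are made up for illustration.

```python
def fair_share_factor(usage_fraction, share_fraction):
    """Return a value in (0, 1]: 1.0 for no usage, 0.5 when usage equals the share.

    usage_fraction: this user's part of all recently consumed resources.
    share_fraction: the part of the cluster this user is entitled to.
    """
    if share_fraction <= 0:
        return 0.0
    return 2.0 ** (-usage_fraction / share_fraction)


def rank_by_fairness(users):
    """Order user names so the most under-served user comes first.

    users maps a name to a (usage_fraction, share_fraction) tuple.
    """
    return sorted(
        users, key=lambda name: fair_share_factor(*users[name]), reverse=True
    )
```

With two users who each own half the cluster, `rank_by_fairness({"alice": (0.6, 0.5), "bob": (0.1, 0.5)})` puts bob ahead of alice, since he has used far less than his share.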



