A practitioner's guide to testing and running GPU clusters

gdiamos · 2024-08-16T05:36:21 1723786581

Glad to see the use of SLURM

The number of times I see people trying to reinvent the HPC wheel astounds me

gdevenyi · 2024-08-16T12:07:12 1723810032

Sadly the fine article yada yadas the installation and integration of the GPUs with the scheduling of SLURM

csris · 2024-08-16T15:34:45 1723822485

Together's Slurm configs model GPUs as GRES resources [1]. Jobs then specify the number of GPUs that they need with flags like `--gres=gpu:8`.

[1]: https://slurm.schedmd.com/gres.html

The-Toon · 2024-08-16T04:56:49 1723784209

Sorry to hijack the thread, but how would one get into managing GPU clusters? Modern GPUs are expensive, so it seems difficult to build a homelab to play around with them. Is learning how to run software on a cluster at the end-users level + playing around with VMs enough experience to enter the field?

michaelt · 2024-08-16T09:26:25 1723800385

> Modern GPUs are expensive, so it seems difficult to build a homelab to play around with them.

Simulate a system with multiple high-end GPUs by setting up a system with one low-end GPU, breaking all the video outputs, and plugging it into a timeswitch that makes it lose power 3 times a week.

Learn about industry norms by deciding it's crucial you have read access to production data, but at the same time that your users are doing ad-hoc experimentation and they can barely meet the code maturity requirements of a test environment.

Fill your storage with training data for a project, then have that project "de-prioritised" so the data isn't being used, but it also can't be deleted. Reorganise the department so it's not even clear whose data it is any more.

Broaden your experience to the entire mlops lifecycle by memorising the crucial shibboleth: "I don't understand labelbox's pricing"

bluedino · 2024-08-16T12:36:35 1723811795

This is a little too realistic. You left out the quarreling and politics between users of your system that is short on resources.

marcyb5st · 2024-08-16T06:40:10 1723790410

(Disclaimer, I'm working for Google) In my opinion this is a great question. I believe you can go two ways:

1) Get your hands on few physical computers and "old" gpus (like nVidia 1000 series or something like that). Put them as part of a K8S cluster and you have an amazing setup to play around with. Try to squeeze as much flops from your hardware as you can. Bonus points if you also architect around failures and test that by pulling off network cables.

2) Using some Cloud provider use preemptible/spot instances with a bunch of GPUs for few hours at the time. Not sure with other clouds, but with GCP you can create a GKE nodepool that only uses spot instances and in conjuction with cluster autoscaler makes what I described very easy and you don't really have to clean much after you're done messing around for the day. GPUs like K80, T4, or P4s are relatively cheap and, if you use them for just 10s of hours a month, you can get away with a bill of 10s of dollars [1].

Either option works fine (IMHO).

Another option I am unsure about because I never tried it is to use something like [2] to multiplex your GPU(s) so that you can pretend you have more GPUs to mess around with. However, if your goal is to learn how to manage/write software for multi-gpus/multi-machine clusters this is somewhat limiting because 1) it doesn't teach you much about data locality/compact placement since transfers between virtual GPUs will be extremely fast (i.e. they are sharing the same VRAM pool after all) and 2) you will still have a single machine (even if you are using multiple virtual K8s nodes).

[1] 3 instances with 4 GPUs each used for 24h in a month cost you 46$ according to: https://cloud.google.com/products/calculator/?hl=en&dl=CiRhY...

[2] https://github.com/NVIDIA/k8s-device-plugin

immibis · 2024-08-16T11:02:49 1723806169

I don't know about GPU clusters specifically but when the purpose is to learn clusters you typically build a cluster out of the cheapest nodes you can get, even if the whole cluster's performance is less than a single machine with sane specifications. Non-GPU clusters for education are often built from single-board computers (Raspberry Pis or even cheaper). So that would be one way to approach it - find a bunch of really old machines with really old GPUs that are worthless for doing actual work on. Some single-board computers even have PCIe slots, so you can install a GPU into an SBC (often with only 1x performance). You could try nodes with iGPUs if that's suitable for your use case (they use different memory topology so it might not be). You could only put one GPU in each node, instead of as many as possible like a real cluster would have; if inter-GPU communication on the same node is an important part of your experiment, then you could only put 2 GPUs in each node. You might want to scale down the network to keep a similar performance ratio to a real high-end system - maybe force each port to 100Mbps.

latchkey · 2024-08-16T07:06:31 1723791991

There isn't really a school for this stuff. The way I learned was to go work for a company that was building out GPU compute.

It is a lot more than just software, especially on the high end of things.

gessha · 2024-08-16T16:16:33 1723824993

Meanwhile in job seeking land:

- obligatory “Senior” in title

- requires 3-5 years of building physical GPU infrastructure

latchkey · 2024-08-16T16:28:19 1723825699

Straight up, these are excuses.

I never finished college. I also never let what was written job descriptions stop me from applying for a job that I wanted. I'm sure that the denials I have received have been all because of my own lack of performance during the interview.

I believe strongly in the whole "If there is a will, there is a way." If you get denied for one interview and you really want the job, you should try again at a later date. Find out what caused you to fail, fix it, and come back stronger.

I'm guessing that the number of people on the planet who've been hands-on in building large scale physical GPU infrastructure are in the low thousands. It isn't some huge field. We, as an industry, need people who really want to do this stuff, and can learn it quickly.

ProTip: you don't need experience with GPUs. I had zero when I started deploying 150,000 of them. What I had was an innate ability to figure shit out based on my other experiences, and that is what got me hired in the first place. I took ownership over the project and made it happen, no matter what it took. That's what people are looking for. Be a doer, not a talker.

timzaman · 2024-08-16T04:53:43 1723784023

decent starter guide for 28 nodes scale. Would be cute to do a follow up around how to do health checks. Eg catching your transceivers from overheating, etc.