
> I don't think NCCL + Deepspeed/FSDP are really an alternative to Scuda

I put "remote" in quotes because they're not direct equivalents, but from a practical standpoint they're the current alternative approach.

> they all require the models in question to be designed for distributed training. They also require a lot of support in the libraries being used.

IME this has changed quite a bit. Between improved support for torch FSDP, Deepspeed, and especially HF Accelerate wrapping each of them for transformer models, it's been a while since I've had to put much (if any) work in.

That said, if you're running random training scripts it likely won't "just work", but as larger models become more common I see a lot more torchrun, accelerate, deepspeed, etc. in READMEs and code.

> This has been a struggle for data scientists for a while now. I haven't seen a good solution to allow a data scientist to work locally, but utilize GPUs remotely, without basically just developing remotely (through a VM or Jupyter)

Remotely, as in over the internet? 400 Gb Ethernet is already slower than PCIe 5.0 x16 (forget SXM). A 10 Gb internet connection is another 40x slower on top of that (plus latency impacts).
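As a back-of-envelope check on those numbers (approximate link rates, not measurements):

```python
# Rough link rates in Gb/s; PCIe 5.0 x16 is ~64 GB/s unidirectional.
pcie5_x16_gbps = 64 * 8   # ~512 Gb/s
eth_400_gbps = 400        # 400 Gb Ethernet
inet_10_gbps = 10         # 10 Gb internet connection

# 400G Ethernet already trails a local PCIe 5.0 x16 slot...
assert eth_400_gbps < pcie5_x16_gbps

# ...and a 10 Gb link is 40x slower still.
ratio = eth_400_gbps / inet_10_gbps
print(ratio)  # → 40.0
```

And that's before counting round-trip latency, which hits small transfers far harder than raw bandwidth does.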

Remote development over the internet with scuda would be uselessly slow.





