
> Training a single AI model can cost hundreds of thousands of dollars (or more) in compute resources

Why don't they buy their own hardware for this part? The training process doesn't need to be auto-scalable or failure-resistant or distributed across the world. The value proposition of cloud hosting doesn't seem to make sense here. Surely at this price the answer isn't just "it's more convenient"?


because you are trading speed for cash.

Say you have $8M in funding, and you need to train a model to do x

You can either:

a) gain access to a system that scales on demand and gives you instant, actionable results.

b) hire an infrastructure person, someone to write a K8s deployment system, another person to come in and throw that all away, another person to negotiate and buy the hardware, and another to install it.

Option b can be the cheapest in the long term, but it carries the most risk of failing before you've even trained a single model. It also costs time, and if speed to market is your thing, then you're shit out of luck.


Why in the world do you need a Kubernetes deployment system to run a single, manual, one-time (or a handful of times), high-compute job?


Because when all you have is a hammer, everything looks like a nail.

We have become so DevOps- and cloud-dependent that everyone has forgotten how to just run big systems cheaply and efficiently.


Because that high-compute job needs to be distributed across many, many machines, and if you're using cheap preemptible instances you have to handle machines dropping out and rejoining while that single job is running.

It's not something you can launch by hand. Perhaps Kubernetes isn't the best solution, but you definitely need some automation.
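
To make that concrete: the heart of the automation is checkpoint-and-resume, so a preempted worker can reload the latest state and carry on instead of restarting the whole run. A minimal PyTorch sketch (the tiny model, path, and step counts are placeholders, not anyone's real setup):

    import os
    import torch
    import torch.nn as nn

    CKPT = "checkpoint.pt"        # hypothetical path on shared/durable storage
    TOTAL_STEPS = 100_000         # made-up numbers for illustration
    SAVE_EVERY = 1_000

    model = nn.Linear(512, 512)   # stand-in for the real network
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    def load_checkpoint():
        # resume from the last checkpoint if one exists, else start at step 0
        if not os.path.exists(CKPT):
            return 0
        state = torch.load(CKPT)
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optim"])
        return state["step"] + 1

    start = load_checkpoint()
    for step in range(start, TOTAL_STEPS):
        # ... real forward/backward pass goes here ...
        if step % SAVE_EVERY == 0:
            torch.save({"model": model.state_dict(),
                        "optim": optimizer.state_dict(),
                        "step": step}, CKPT)

Multiply that by hundreds of machines, spot interruptions, and shared storage, and you see why people reach for an orchestrator of some kind.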


Because it's not a one-time operation.

Also, how else do you sensibly deploy and manage a multi-stage programme on >500 nodes?

I mean, we use AWS Batch, which is far superior for this sort of thing. SLURM might work for real steel (your own hardware), as would Tractor from Pixar.
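
Roughly what a multi-stage submission looks like from Python with boto3, in case it's not obvious why this beats hand-rolling it; the queue and job-definition names below are made up, but dependsOn is how the stages get chained:

    import boto3

    batch = boto3.client("batch")

    # names are placeholders for whatever you've registered in Batch
    stage1 = batch.submit_job(
        jobName="train-stage-1",
        jobQueue="gpu-spot-queue",
        jobDefinition="trainer:3",
        arrayProperties={"size": 500},   # fan out across ~500 nodes
        containerOverrides={"command": ["python", "train.py", "--stage", "1"]},
    )

    # stage 2 only starts once every child of stage 1 has finished
    batch.submit_job(
        jobName="train-stage-2",
        jobQueue="gpu-spot-queue",
        jobDefinition="trainer:3",
        dependsOn=[{"jobId": stage1["jobId"]}],
        containerOverrides={"command": ["python", "train.py", "--stage", "2"]},
    )

Batch then handles placement across the compute environment and re-runs failed array children according to the job definition's retry strategy.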


If you're in a position where you need to train a large network: first, I feel bad for you. Second, you'll need additional machines to finish training in a reasonable amount of time.

ML distributed training is all about increasing training velocity and searching for good hyperparameters.
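
The data-parallel half of that is the easy part nowadays. A bare-bones PyTorch DDP skeleton, assuming GPUs and a torchrun launch (the toy model and numbers are only for illustration):

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    # launched with something like:  torchrun --nproc_per_node=8 train.py
    dist.init_process_group(backend="nccl")   # "gloo" if CPU-only
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(512, 512).cuda(local_rank)   # stand-in network
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(1000):
        x = torch.randn(32, 512).cuda(local_rank)   # each rank sees its own shard
        loss = model(x).sum()
        optimizer.zero_grad()
        loss.backward()       # gradients are all-reduced across workers here
        optimizer.step()

Adding machines mostly buys you samples per second, which you spend either on finishing one big run sooner or on running more hyperparameter trials in parallel.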

