
> Training a single AI model can cost hundreds of thousands of dollars (or more) in compute resources

Why don't they buy their own hardware for this part? The training process doesn't need to be auto-scalable or failure-resistant or distributed across the world. The value proposition of cloud hosting doesn't seem to make sense here. Surely at this price the answer isn't just "it's more convenient"?


because you are trading speed for cash.

Say you have $8M in funding, and you need to train a model to do x

You can either:

a) gain access to a system that scales on demand and gives you instant, actionable results.

b) hire an infrastructure person, someone to write a K8s deployment system, another person to come in and throw that all away, another person to negotiate and buy the hardware, and another to install it.

Option b can be the cheapest in the long term, but it carries the most risk of failing before you've even trained a single model. It also costs time, and if speed to market is your thing, then you're shit out of luck.


Why in the world do you need a Kubernetes deployment system to run a single, manual, one-time (or a handful of times), high-compute job?


Because when all you have is a hammer, everything looks like a nail.

We have become so DevOps- and cloud-dependent that everyone has forgotten how to just run big systems cheaply and efficiently.


Because that high-compute job needs to be distributed across many, many machines, and if you're using cheap preemptible instances you have to handle machines dropping out and rejoining while that single job is running.

It's not something you can launch by hand. Perhaps Kubernetes isn't the best solution, but you definitely need some automation.
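
To make that concrete: the heart of the automation is checkpoint-and-resume, so a preempted worker can reload the latest state and carry on instead of restarting the whole run. A minimal PyTorch sketch (the tiny model, path, and step counts are placeholders, not anyone's real setup):

    import os
    import torch
    import torch.nn as nn

    CKPT = "checkpoint.pt"        # hypothetical path on shared/durable storage
    TOTAL_STEPS = 100_000         # made-up numbers for illustration
    SAVE_EVERY = 1_000

    model = nn.Linear(512, 512)   # stand-in for the real network
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    def load_checkpoint():
        # resume from the last checkpoint if one exists, else start at step 0
        if not os.path.exists(CKPT):
            return 0
        state = torch.load(CKPT)
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optim"])
        return state["step"] + 1

    start = load_checkpoint()
    for step in range(start, TOTAL_STEPS):
        # ... real forward/backward pass goes here ...
        if step % SAVE_EVERY == 0:
            torch.save({"model": model.state_dict(),
                        "optim": optimizer.state_dict(),
                        "step": step}, CKPT)

Multiply that by hundreds of machines, spot interruptions, and shared storage, and you see why people reach for an orchestrator of some kind.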


Because it's not a one-time operation.

Also, how else do you sensibly deploy and manage a multi-stage programme on >500 nodes?

I mean, we use AWS Batch, which is far superior for this sort of thing. SLURM might work for real steel (your own hardware), as would Tractor from Pixar.
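
Roughly what a multi-stage submission looks like from Python with boto3, in case it's not obvious why this beats hand-rolling it; the queue and job-definition names below are made up, but dependsOn is how the stages get chained:

    import boto3

    batch = boto3.client("batch")

    # names are placeholders for whatever you've registered in Batch
    stage1 = batch.submit_job(
        jobName="train-stage-1",
        jobQueue="gpu-spot-queue",
        jobDefinition="trainer:3",
        arrayProperties={"size": 500},   # fan out across ~500 nodes
        containerOverrides={"command": ["python", "train.py", "--stage", "1"]},
    )

    # stage 2 only starts once every child of stage 1 has finished
    batch.submit_job(
        jobName="train-stage-2",
        jobQueue="gpu-spot-queue",
        jobDefinition="trainer:3",
        dependsOn=[{"jobId": stage1["jobId"]}],
        containerOverrides={"command": ["python", "train.py", "--stage", "2"]},
    )

Batch then handles placement across the compute environment and re-runs failed array children according to the job definition's retry strategy.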


If you're in a position where you need to train a large network: first, I feel bad for you. Second, you'll need additional machines to finish training in a reasonable amount of time.

ML distributed training is all about increasing training velocity and searching for good hyperparameters.
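
The data-parallel half of that is the easy part nowadays. A bare-bones PyTorch DDP skeleton, assuming GPUs and a torchrun launch (the toy model and numbers are only for illustration):

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    # launched with something like:  torchrun --nproc_per_node=8 train.py
    dist.init_process_group(backend="nccl")   # "gloo" if CPU-only
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(512, 512).cuda(local_rank)   # stand-in network
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(1000):
        x = torch.randn(32, 512).cuda(local_rank)   # each rank sees its own shard
        loss = model(x).sum()
        optimizer.zero_grad()
        loss.backward()       # gradients are all-reduced across workers here
        optimizer.step()

Adding machines mostly buys you samples per second, which you spend either on finishing one big run sooner or on running more hyperparameter trials in parallel.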

