> Training a single AI model can cost hundreds of thousands of dollars (or more) in compute resources
Why don't they buy their own hardware for this part? The training process doesn't need to be auto-scalable or failure-resistant or distributed across the world. The value proposition of cloud hosting doesn't seem to make sense here. Surely at this price the answer isn't just "it's more convenient"?
Say you have $8M in funding, and you need to train a model to do x
You can either:
a) rent access to a system that scales on demand and gives you instant, actionable results.
b) hire an infrastructure person, someone to write a K8s deployment system, another person to come in and throw that all away, another to negotiate and buy the hardware, and another to install it.
Option b can be the cheapest in the long term, but it carries the most risk of failing before you've even trained a single model. It also costs time, and if speed to market is your thing, then you're shit out of luck.
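The "cheapest in the long term" part is a simple break-even calculation. Here's a sketch of that arithmetic; every number in it is an illustrative assumption, not a real quote:

```python
# Back-of-envelope: when does buying hardware beat renting cloud compute?
# All dollar figures below are made-up assumptions for illustration.

def months_to_break_even(hw_capex, staff_monthly, cloud_monthly):
    """Months after which owning costs less than renting (ignoring resale)."""
    # Owning over t months: hw_capex + staff_monthly * t
    # Renting over t months: cloud_monthly * t
    savings_per_month = cloud_monthly - staff_monthly
    if savings_per_month <= 0:
        return None  # owning never pays off
    return hw_capex / savings_per_month

# Assumed: $2M of GPUs, $60k/mo of extra staff, $250k/mo cloud bill
print(months_to_break_even(2_000_000, 60_000, 250_000))  # ≈ 10.5 months
```

The catch, as the comment above says, is that the break-even math only works if nothing goes wrong during those months, and the staff line item hides the hiring risk.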
Because that high-compute job needs to be distributed on many, many machines, and if you're using cheap preemptible instances you have to handle machines dropping off and joining in while you're running that single job.
It's not something you can launch manually - perhaps Kubernetes is not the best solution, but you definitely need some automation.
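The core trick behind surviving preempted machines is checkpointing: persist progress atomically so a replacement worker can resume mid-job. Here's a minimal single-process sketch of that idea; the function names and the simulated "preemption" are illustrative, and real systems (e.g. torch.distributed's elastic launcher) automate the same pattern across a cluster:

```python
# Minimal sketch of preemption-tolerant training via checkpointing.
# train_step logic and file layout are assumptions for illustration.
import json
import os

CKPT = "checkpoint.json"

def load_checkpoint():
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)
    return {"step": 0, "loss": None}

def save_checkpoint(state):
    tmp = CKPT + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CKPT)  # atomic rename: a preemption mid-write loses nothing

def train(total_steps, preempt_at=None):
    state = load_checkpoint()            # resume wherever the last worker died
    while state["step"] < total_steps:
        if preempt_at is not None and state["step"] == preempt_at:
            return state                 # simulate the VM being reclaimed
        state["step"] += 1
        state["loss"] = 1.0 / state["step"]  # stand-in for a real training step
        save_checkpoint(state)           # persist progress every step
    return state

# A worker gets preempted at step 5; a replacement picks up from there.
train(10, preempt_at=5)
final = train(10)
print(final["step"])  # 10
```

In a real cluster you'd checkpoint to shared storage and far less often than every step, but the resume logic is the same.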
If you're in a position where you need to train a large network: first, I feel bad for you. Second, you'll need additional machines to train in a reasonable amount of time.
ML distributed training is all about increasing training velocity and searching for good hyperparameters.
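The hyperparameter-search half of that is embarrassingly parallel: independent trials map one-to-one onto machines. A toy random-search sketch, where the objective function is a made-up stand-in for a full training run:

```python
# Toy random search over hyperparameters. On a cluster, each trial dict
# would be shipped to its own worker; only the scores come back.
import random

def objective(lr, batch_size):
    # Assumed toy loss surface with its minimum near lr=0.01, batch_size=64.
    # In reality this is an entire training run's validation loss.
    return (lr - 0.01) ** 2 + ((batch_size - 64) / 64) ** 2

def random_search(n_trials, seed=0):
    rng = random.Random(seed)
    trials = [
        {"lr": 10 ** rng.uniform(-4, -1),
         "batch_size": rng.choice([16, 32, 64, 128])}
        for _ in range(n_trials)
    ]
    # Trials are independent, so n machines give close to an n-x speedup.
    results = [(objective(**t), t) for t in trials]
    return min(results, key=lambda r: r[0])  # lowest loss wins

best_loss, best_params = random_search(50)
print(best_params)
```

Data-parallel training of a single model is the harder half: that's where the machine-churn problems in the comment above actually bite.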