I have a custom ML (PyTorch) model that I would like to set up as a service/API: it should accept an input at any time and promptly return an output, and it should scale up automatically to thousands of requests per second. The model takes around a minute to load, and a single inference step takes around 100ms. The model is only called from my product's backend, so I have some control over request volume.
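For context, serving a single instance is the easy part; the scaling is what I'm stuck on. A minimal sketch of what I have in mind for one instance, assuming FastAPI/uvicorn and a hypothetical TorchScript file `model.pt` (all names and shapes are placeholders):

```python
# Minimal sketch, not a real implementation: FastAPI/uvicorn and "model.pt"
# are assumptions, and the input format is a placeholder for whatever the
# model actually expects.
from typing import List

import torch
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Pay the ~1 minute load cost once, when the process starts, not per request.
model = torch.jit.load("model.pt")  # hypothetical path to the exported model
model.eval()

class PredictRequest(BaseModel):
    features: List[float]  # placeholder input format

@app.post("/predict")
def predict(req: PredictRequest):
    with torch.no_grad():
        # Batch dimension of 1; adjust to the model's real input shape.
        output = model(torch.tensor(req.features).unsqueeze(0))
    return {"prediction": output.squeeze(0).tolist()}

# Run with e.g.: uvicorn main:app --host 0.0.0.0 --port 8001
```

The question is really about how to run many copies of something like this behind a layer that autoscales, given the one-minute startup time.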
I've been searching around and haven't found a clear standard/best way to do this.
Here are some of the options I've considered:
- Algorithmia (came across this yesterday, unsure how good it is and have some questions about the licensing)
- Something fancy with Kubernetes
- Writing my own load balancer and manually spinning up new instances when needed (a rough sketch of what that would involve is below)
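To make that last option concrete, this is roughly the kind of thing I'd have to write and maintain myself. It's a toy round-robin proxy in front of two workers running the FastAPI sketch above; the worker URLs are placeholders, and it has none of the health checking, retries, or instance provisioning a real setup would need:

```python
# Toy round-robin proxy (sketch only): forwards each POST to the next worker
# in the list. Worker URLs are placeholders; there is no health checking,
# no retry logic, and nothing that actually spins instances up or down.
import itertools
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.request import Request, urlopen

WORKERS = itertools.cycle(["http://localhost:8001", "http://localhost:8002"])

class RoundRobinProxy(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        upstream = next(WORKERS)  # pick the next worker in the rotation
        req = Request(
            upstream + self.path,
            data=body,
            headers={"Content-Type": self.headers.get("Content-Type", "application/json")},
        )
        with urlopen(req) as resp:
            payload = resp.read()
        # Relay the worker's response back to the original caller.
        self.send_response(resp.status)
        self.send_header("Content-Type", resp.headers.get("Content-Type", "application/json"))
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    ThreadingHTTPServer(("", 8000), RoundRobinProxy).serve_forever()
```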
Right now I'm leaning towards Algorithmia as it seems to be cost-effective and basically designed to do what I want. But I'm unsure how it handles long model loading times, or if the major cloud providers have similar services.
I'm quite new to this kind of architecture and would appreciate some thoughts on the best way to accomplish this!
Cortex automates all of the DevOps work: containerizing your model, orchestrating Kubernetes, and autoscaling instances to meet demand. We have a bunch of PyTorch examples in our repo, if you're interested: https://github.com/cortexlabs/cortex/tree/master/examples
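On your question about the one-minute load time: the model is loaded once when a replica (API instance) starts, not per request, so you only pay that cost at startup and when scaling out. The predictors in the examples are shaped roughly like the sketch below; treat it as illustrative rather than the exact interface, which depends on the Cortex version (the repo examples are the reference):

```python
# Rough shape of a predictor for a framework like Cortex (exact interface
# may differ by version). The model loads once per replica in __init__;
# predict() only pays the ~100ms inference cost per request.
import torch

class PythonPredictor:
    def __init__(self, config):
        # Runs once when a replica starts. "model_path" is a hypothetical
        # config key pointing at your exported weights.
        self.model = torch.jit.load(config["model_path"])
        self.model.eval()

    def predict(self, payload):
        # Runs per request; input format is a placeholder.
        with torch.no_grad():
            output = self.model(torch.tensor(payload["features"]).unsqueeze(0))
        return output.squeeze(0).tolist()
```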