Show HN: Cortex – Open-source alternative to SageMaker for model serving (github.com/cortexlabs)
65 points by calebkaiser on April 14, 2020 | 19 comments



Hey there, great work. I used Cortex as an inspiration when designing chantilly: https://github.com/creme-ml/chantilly, which is a much less ambitious solution tailored towards online machine learning models. Keep up the good work.


This certainly looks like a cleaner way to deploy an ML model than SageMaker. Couple of questions:

* Is this really for more intensive model inference applications that need a cluster? It feels like for a lot of my models, a cluster is overkill.

* A lot of the ML deployment tools (Cortex, SageMaker, etc.) don't seem to rely on first pushing changes to version control, then deploying from there. Is there any reason for this? I can't come up with a reason why this shouldn't be the default. For example, this is how Heroku works for web apps (and this is a web app at the end of the day).


You're 100% right that Cortex is designed for the production use-case. A lot of our users are running Cortex for "small" production use cases, since the Cortex cluster can include just a single EC2 instance for model serving (autoscaling allows deployed APIs to scale down to 1 replica). For ML use-cases that don't need an API (a lot of data analysis work, for example), Cortex is probably overkill.

As for your second question, we definitely want to integrate tightly with version control systems. Since right now we are 100% open source and don't offer a managed service, we don't have a place to run the webhook listeners. That said, most of our users version control their code/configuration (we do that with our examples as well: https://github.com/cortexlabs/cortex/examples), and it should be straightforward to integrate Cortex into an existing CI/CD workflow; the Cortex CLI just needs to be installed, and then running `cortex deploy` with the updated code/configuration will trigger a rolling update.

If you're referring to version control for the actual model files, Cortex is unopinionated about where those are hosted, so long as they can be accessed by your Predictor (what we call the Python file that initializes your model and serves predictions). If you're interested in implementing version control for your models, I'd recommend checking out DVC.
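
To make the Predictor part concrete, here's a rough sketch of what one of those Python files might look like. The class name, method signatures, and config keys below are illustrative assumptions for this comment, not Cortex's documented interface; the idea is just that the model is loaded once at startup and then reused to serve each request.

    # Illustrative Predictor sketch (names and config keys are assumptions,
    # not Cortex's documented API). The model is initialized once, then
    # reused for every prediction request.
    import pickle

    class PythonPredictor:
        def __init__(self, config):
            # load the model from wherever it's hosted (local disk, an S3 download, etc.)
            with open(config["model_path"], "rb") as f:
                self.model = pickle.load(f)

        def predict(self, payload):
            # payload is the parsed request body; the return value is the API's response
            return self.model.predict([payload["features"]]).tolist()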


Is it possible to partner with you to offer a managed service for Cortex? We are looking at your solution to offer our clients for deployment.


Great Caleb - makes sense. Thanks!


Has anyone used Cortex in production?

- Could you share your experiences?

- Why would one choose this over Docker, for instance?


I'm sure others will comment, but in the meantime, some people have written up their experiences using Cortex in production. I'd point you to AI Dungeon: https://medium.com/@aidungeon/how-we-scaled-ai-dungeon-2-to-...

We also have a pretty active Gitter channel: https://gitter.im/cortexlabs/cortex

As for your second question, Cortex uses Docker to containerize models. The rest of Cortex's features (deploying models as microservices, orchestrating an inference cluster, autoscaling, prediction monitoring, etc.) are outside Docker's scope.


>orchestrating an inference cluster, autoscaling, prediction monitoring,

Does this approach preclude the need for queuing (a la RabbitMQ) and/or a load balancer?


Yep! Cortex deploys load balancers on AWS and manages queueing.


This is super-exciting! I didn't know it could be this easy!

How do you handle API authentication? Is there a module that interfaces with AWS API Gateway, or with external API authentication?


Right now, users handle API auth by putting AWS API Gateway in front of Cortex, but incorporating API Gateway into Cortex to automate this is on our short-term roadmap.
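
For anyone wondering what that looks like from the client side, here's a hedged sketch of calling a deployed API through API Gateway with an API key. The endpoint URL and key are placeholders; the only concrete assumption is API Gateway's standard x-api-key header.

    import requests

    # Placeholder endpoint and key: API Gateway validates the key, then
    # forwards the request on to the deployed API's load balancer.
    response = requests.post(
        "https://<api-id>.execute-api.us-east-1.amazonaws.com/my-classifier",
        headers={"x-api-key": "<your-api-key>"},
        json={"features": [5.1, 3.5, 1.4, 0.2]},
    )
    print(response.json())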


The name Cortex is in use for the scalable Prometheus storage backend: https://github.com/cortexproject/cortex


... and for a lot of other things: https://en.wikipedia.org/wiki/Cortex


One of the things that has deterred me from SageMaker is how expensive it can be for a side project. Real-time endpoints start at $40-$50 per month, which would be a bit too much for a low-budget project on the side. I love the idea of using an open-source alternative, but I noticed that all of the systems combined for Cortex would be a bit more expensive. Do you have any tips on how to keep a model deployed cheaply for a side project using Cortex? I'd be fine with a little bit of latency on the first request, similar to how Heroku's free dynos work.


In general, Cortex will be significantly cheaper because you're only paying AWS for EC2 (the bulk of the bill) and the other AWS services used (a much smaller portion of the bill). With SageMaker, you're paying the EC2 bill plus a ~40% premium.

To keep the AWS bill as low as possible, Cortex supports inference on spot instances, which are unused instances that AWS sells at a steep (as in 90%) discount. The drawback is that AWS can reclaim the instance when needed, but with ML inference failover isn't as big of a deal, since you typically don't need to preserve state.

If you use spot instances, choose the cheapest instance type possible, and keep your autoscaler's minimum replicas at 1 (so it won't keep extra replicas idling), you should be able to deploy the model pretty cheaply. Significantly cheaper than with SageMaker, at the very least.
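
To put very rough numbers on that (purely illustrative prices, not actual AWS quotes):

    # Back-of-the-envelope monthly cost for a single small instance, assuming a
    # hypothetical $0.10/hr on-demand price and the ~90% spot discount mentioned above.
    on_demand_hourly = 0.10
    spot_hourly = on_demand_hourly * (1 - 0.90)
    hours_per_month = 730

    print(f"on-demand: ~${on_demand_hourly * hours_per_month:.0f}/month")  # ~$73
    print(f"spot:      ~${spot_hourly * hours_per_month:.0f}/month")       # ~$7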

There's some more info here: https://www.cortex.dev/cluster-management/spot-instances


Why would I use this over deploying the model to a Lambda function, aside from the lack of GPU support? (Not trying to be confrontational, I genuinely don't know.) Won't Lambda functions scale as needed? How does this compare cost-wise?


Great question. We actually experimented with Lambda before ever building Cortex. We ran into several issues, the three easiest to list are:

1. Size limits. Lambda limits deployment packages to 250 MB uncompressed, and puts an upper bound on memory of 3,008 MB. That's not nearly big enough for a lot of models, particularly bigger deep learning models.

2. As you mentioned, GPU inference isn't supported on Lambda, and for many models, GPUs are necessary for serving with acceptable latency.

3. Lambda instances can only serve one request at a time. With how slow ML inference can be (especially if you need to call another API or perform some I/O), it's easy to lock up Lambda instances for full seconds just to serve one prediction.

The TL;DR is that while Lambda works for some use cases, in general it lacks the flexibility and customizability needed for most inference use cases.


How does this compare to Kubeflow?


The simplest way to put it is that Kubeflow (whose team we have a ton of respect for) is a tool for helping devops engineers build their own ML deployment platform on Kubernetes, whereas Cortex is an ML deployment platform. Kubeflow plugs into an existing k8s cluster, whereas Cortex abstracts k8s away (and automates the AWS-layer devops too).

With Cortex, we wanted to build something so that developers can take a trained model (regardless of whether it was trained by their DS team or is a pre-trained model) and deploy it as a production API without needing to understand k8s. Because Cortex manages the k8s cluster, we can do the legwork for features like spot instances, request-based cluster autoscaling, GPU support, etc., and expose them as simple YAML configuration.



