Cortex – An ML model deployment platform that runs in your AWS account

sandGorgon · on Aug 1, 2019

You should be using Terraform as your configuration management language. Please do not invent your own. Your YAML is pretty much a pseudo-configuration management language.

That will become one of the biggest blockers because of the amount of automation that already exists.

By leveraging Terraform, you will also have the added benefit of getting all the other pieces of AWS/GCP/Azure components for free, and it is rock stable and production tested.

ospillinger · on Aug 1, 2019

Thanks for the feedback! We aren't trying to invent another infrastructure provisioning language, and I agree that Terraform would be the right choice if that were the case. Our YAML is more similar to the configuration of deployment tools like Netlify or CircleCI. We use CloudFormation and Kubernetes under the hood, but our goal is to provide a much higher abstraction for data scientists / ML engineers.

sandGorgon · on Aug 1, 2019

Not entirely. The abstractions are different between infrastructure deployment and the configuration YAML of CircleCI.

The declaration of deployment state is a very BIG and hard problem that has had millions of collective man hours spent over decades. I urge you not to think of it as a simple configuration.

In fact, it is so hard that AWS had to build a new language on top of TypeScript instead of relying on the CloudFormation templates that it already had.

https://docs.aws.amazon.com/cdk/latest/guide/home.html

What you are building makes sense - I would drop CloudFormation and surface Terraform all the way to the top.

So the way to use your tool is to install and use a new Terraform "provider".

ospillinger · on Aug 1, 2019

The Terraform provider idea is interesting, I'll think about it more carefully. Almost all of our deployment configuration under the hood is done with Kubernetes (which is focused on the declaration of deployment state). We modeled our configuration after Kubernetes for that reason, and we want to go beyond low-level infrastructure configuration by allowing users to configure prediction tracking, model retraining thresholds, and other more ML specific features using the same declarative paradigm and in the same configuration files.
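The "declarative paradigm" mentioned above can be sketched in a few lines: the user states the desired end state, and the tool works out which actions get there. This is only an illustration under stated assumptions, not how Cortex, Kubernetes, or Terraform actually implement it; `ApiSpec`, its fields, and `plan` are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ApiSpec:
    """Hypothetical desired state for one deployed model API."""
    model: str
    replicas: int


def plan(desired: Dict[str, ApiSpec], current: Dict[str, ApiSpec]) -> List[str]:
    """Compare desired state with current state and list the actions needed."""
    actions = []
    # Declared but not running yet: create it.
    for name in sorted(desired.keys() - current.keys()):
        actions.append(f"create {name}")
    # Present on both sides but with a different spec: update it.
    for name in sorted(desired.keys() & current.keys()):
        if desired[name] != current[name]:
            actions.append(f"update {name}")
    # Running but no longer declared: delete it.
    for name in sorted(current.keys() - desired.keys()):
        actions.append(f"delete {name}")
    return actions


desired = {"iris": ApiSpec("iris-v2", 2), "fraud": ApiSpec("fraud-v1", 1)}
current = {"iris": ApiSpec("iris-v1", 2), "legacy": ApiSpec("old-v1", 1)}
for action in plan(desired, current):
    print(action)
```

The user never writes "create" or "delete" steps; those fall out of the diff, which is the property the configuration files in this discussion rely on.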

sandGorgon · on Aug 1, 2019

Terraform has first-class support for K8s - https://www.terraform.io/docs/providers/kubernetes/index.htm... In fact, I would say that's what Terraform was built around, so there's nothing more maintained than the K8s provider.

In addition, it has EKS (https://www.terraform.io/docs/providers/aws/r/eks_cluster.ht...) as well as GKE providers (https://www.terraform.io/docs/providers/google/r/container_c...) in case you are so inclined.

dankohn1 · on Aug 1, 2019

Terraform slightly predates Kubernetes, May 21 vs. June 6, 2014. They were developed independently.

Links at: https://landscape.cncf.io/category=automation-configuration,...

joseph · on Aug 1, 2019

Well, CDK actually produces CloudFormation templates. Sorry, but I always feel the urge to jump in when people claim Terraform should be used instead of CloudFormation because of personal preferences. If you are AWS native and already using CloudFormation, I see no reason to switch. CloudFormation provides a ton of functionality out of the box and Amazon handles it for you. Rollbacks alone are a huge reason one might want to use it over Terraform.

sandGorgon · on Aug 1, 2019

Well, the reason to switch from CDK to Terraform is that your infrastructure management becomes a lot more cloud agnostic.

That's basically what Terraform is for anyway. If you wanted, you could have scripted using the AWS SDK in Python or something.

If that's not a concern, then I suppose CDK is as good as any (probably better, since it's in TypeScript... but then so is Pulumi).

fake-name · on Aug 1, 2019

You should be using an existing scripting language as your configuration language.

Seriously, every single fucking stupid infrastructure-deployment-tool/"platform"/whatever has its own dumb in-house language that winds up basically being a bad re-implementation of the programming language the tool is written in.

  - Puppet: Has a stupid ad-hoc config language.   
  - Terraform: Has a stupid ad-hoc config language.   
  - SaltStack: Has a stupid ad-hoc config language.   
  - Ansible: Has a stupid ad-hoc config language.

If you're even considering implementing a tool like this, use a goddamn existing language for your configuration files.

You don't need to use the entire language, but at least use the language's lexer/parser (cf. JSON/JavaScript). That way, all existing tooling for the language will work for the config files (ask me about how SaltStack happily breaks their API because you're not "supposed" to use it, despite the fact that they have public docs for it). Additionally, people won't need to figure out all the stupid corner cases in your terrible piecemeal configuration language.

Additionally, by making your configuration language an actual language, you also simplify a lot of the system design, because the configuration can act directly against your API. This means using your tool from other tools becomes much more straightforward, because the only interface you actually need is the API.
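As a rough sketch of the "use the language's lexer/parser, not the entire language" idea, here is what it could look like with Python: the config is a plain Python literal, and the standard library's `ast.literal_eval` parses it while refusing anything that is not a literal. The config keys and the `load_config` helper are hypothetical, made up for this example rather than taken from any tool in this thread.

```python
import ast

# Hypothetical deployment config written as a plain Python literal.
CONFIG_TEXT = """
{
    "name": "iris-classifier",
    "replicas": {"min": 1, "max": 4},
    "tags": ["ml", "demo"],
}
"""


def load_config(text: str) -> dict:
    """Parse config text with Python's own parser, accepting literals only."""
    # literal_eval raises ValueError on function calls, attribute access,
    # and other non-literal expressions, so the config cannot run code.
    value = ast.literal_eval(text.strip())
    if not isinstance(value, dict):
        raise ValueError("top-level config must be a dict")
    return value


config = load_config(CONFIG_TEXT)
print(config["replicas"]["max"])
```

Because the file is valid Python syntax, existing editors, formatters, and linters for the language work on it unchanged, which is the tooling benefit described above.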

sandGorgon · on Aug 1, 2019

You are right - Terraform does have the ecosystem, but the new kid on the block is Pulumi.

Pulumi = Terraform in TypeScript. That's good as well, but I was not sure if the OP is familiar with Pulumi.

chucky_z · on Aug 1, 2019

Pulumi made the mistake of immediately making remote state a paid-only feature. Even if it's not, in all the recent marketing I looked at, everything useful required payment; for getting started with a project, that's a non-starter.

On top of that, most of the worst parts of Terraform are no longer an issue with 0.12.

reilly3000 · on Aug 1, 2019

It's completely possible to host your state in S3 or a filesystem; it takes a bit of setup and there may be a few rough edges, but the effort or subscription is completely worth it. The secrets management alone makes it worth it, but their programming model is definitely the future. I think the fact that AWS just released their Cloud Development Kit is strong validation of the approach.

icedchai · on Aug 1, 2019

Yes!! Glad someone said this. Most of these languages, Terraform "HCL" specifically, are awful for anyone familiar with a real language.

mlevental · on Aug 1, 2019

https://github.com/cloudtools/troposphere

personjerry · on Aug 1, 2019

Man, I can't wait to pay for an ML dev service that handles the infrastructure dev service that I pay for to host my ML dev services.

tixocloud · on Aug 1, 2019

We're actually building infrastructure to serve ML as a SaaS and would love to get your input.

metabeard · on Aug 1, 2019

whoosh ;)

scribu · on July 31, 2019

A comparison with AWS's own offering (SageMaker Model Serving) would be helpful.

ospillinger · on July 31, 2019

We solve a similar problem to SageMaker but we are focused on developer experience and flexibility.

- Deployments are defined with declarative configuration and no custom Docker images are required (although you can use your own if you want)

- You have full access to the instances, autoscaling groups, security groups, etc

- Less tied to AWS (GCP support is in the works)

- We are working on higher level features like prediction monitoring, alerting, and model retraining

- It's open source and free vs SageMaker's ~40% markup

jmcminis · on Aug 1, 2019

Compare and contrast with Seldon on Kubeflow?

ospillinger · on Aug 1, 2019

My understanding is that Seldon and Kubeflow are more geared towards infrastructure engineers. Our goal is to hide the infrastructure tooling so that Kubernetes, Docker, or AWS expertise isn’t required. Cortex installs with one command, models are deployed with minimal declarative configuration, autoscaling works by default, and you don’t need to build Docker images / manage a registry.

jmcminis · on Aug 1, 2019

Thanks!

I bet you could get Cortex running on Kubeflow pretty easily since it's all K8s anyway.

ospillinger · on Aug 1, 2019

Good idea, I definitely think it's doable.

hamolton · on July 31, 2019

Has anyone used this? It looks useful!