Hi! I’m David Aronchick, PM on Kubeflow, I’m happy to answer any questions! I was one of the early PMs on Kubernetes, and we very much want to make this a community project, so please join us in thinking about what’s next!
NOTE: The name "flow" does not refer directly to TensorFlow; if anything, it's a nod to all the river themes that pop up in the ML community (e.g., FBLearner Flow).
Thank you! We still love Google Cloud ML Engine - it's perfect for those who want to run in the cloud and want a layer of abstraction. This is for people who want portable stacks and a bit more control, and/or who want to use their existing Kubernetes deployments (particularly on-premise or multi-purpose).
Off the top of my head, maybe a maintained "ml-engine aligned" kubeflow setup, to the extent that's possible.
The use case I'm thinking of is an ML dev team building on Kubeflow and proving out a system, then wanting to hand it off to a non-engineering team while washing their hands of any ongoing infrastructure ops responsibility.
Knowing that an "ml-engine aligned" kubeflow config would transfer cleanly (including the associated bells and whistles) would make that a much more attractive option.
Caveat: I'll admit I'm not keeping up on what's in the managed offering, but I'm assuming there are a number of value-adds of the type that end users like (visualizations, etc).
Yes, this is EXACTLY what we're trying to do! However, it's a bit early, so I can't say when or where we'll be able to get to it. Also, I should be clear: though I'm from Google, we would really like the same story to work with other clouds' hosted offerings as well, but we'll need their support to do so!
Interesting, though I don't see how it's better than a plain Docker image on Kubernetes; that's not much of a hassle these days either. And how is it different from what DL4J is already doing with Zeppelin, supporting both Keras and TF, with MXNet and PyTorch on the way?
> ...how [is it] better than a plain docker image over kubernetes?
Scalability for people with existing on-premise (or cloud-based) Kubernetes workflows, especially when it comes to training or heavy crunching.
That's not to say that Docker Machine/Swarm/Compose couldn't handle the same, but it's an extra step for Kubernetes users and pushes people onto a slightly different toolchain than minikube->K8s.
Correct! Many folks have more complicated deployments in the cloud, and we're trying to align (as close as humanly possible) your on-prem stack with your cloud stack, to minimize the pain in migration.
If you have a single container, and a simple pipeline, this may be a bit more than you need. We've just found that there are normally 5 or more services/systems that people wire together to create an ML stack, and that's what we're trying to solve for/simplify.
Correct! We're currently thinking a lot about orchestration of the various components, but for now our goal is to use the native loose coupling between services available in K8s. So if you wanted Spark for data processing, for example, you could start a Spark service and deployment, and feed its output into the TF CRD.
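To make that concrete, here's a rough sketch of what a TF CRD job manifest looked like around the time of the launch. This is a hypothetical example: the API group/version, field names (`replicaSpecs`, `tfReplicaType`), and the image are illustrative and vary by release, so check the repo for the current schema.

```yaml
# Hypothetical TfJob manifest (field names vary by version).
# A separately deployed Spark service would handle preprocessing;
# this resource just describes the TensorFlow training job.
apiVersion: "tensorflow.org/v1alpha1"
kind: "TfJob"
metadata:
  name: "example-training-job"
spec:
  replicaSpecs:
    - replicas: 1
      tfReplicaType: MASTER
      template:
        spec:
          containers:
            - name: tensorflow
              image: my-registry/my-training-image:latest  # hypothetical image
          restartPolicy: OnFailure
```

Because it's just another Kubernetes resource, you can wire it to other services (Spark, serving, etc.) with the usual Service/Deployment plumbing rather than a bespoke orchestrator.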
It's very close to how we think about ML internally, but not what we use. Your best bet is to read the TFX paper[1], which describes our internal thinking in great detail. (Though Kubeflow is not designed to be an externalization of TFX, we're very much working in collaboration with that team.)
We're absolutely looking at it! Please join our discussion, we'd love to talk about what you're building and if we can help and/or what you'd like us to OSS.
I am at KubeCon 2017 in Austin, TX, and yeah, based on the presentation it looks like an internal tool they just opened to the public, with some bold goals.
Hi! It was actually designed from the start to be an extension of github.com/tensorflow/k8s, and then it took on larger goals: making an entire ML stack (for any ML framework) easy to use, portable, and composable on K8s.
The controller part is a Custom Resource Definition. It can run on any cloud.
However, to benefit from GPUs you need to configure the controller correctly. Default configurations are provided for GCP and Azure; for other clouds you would need to set that up manually (not that it is very hard).
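For reference, a GPU-enabled replica is just a pod template that requests a GPU resource. This is a hedged sketch: the exact resource name (`nvidia.com/gpu` vs. the older `alpha.kubernetes.io/nvidia-gpu`), any driver volume mounts, and the image depend on your Kubernetes version and GPU device plugin.

```yaml
# Sketch: pod template fragment requesting one GPU.
# Resource name and driver setup depend on your cluster's
# Kubernetes version and installed GPU device plugin.
containers:
  - name: tensorflow
    image: my-registry/tf-gpu-training:latest  # hypothetical image
    resources:
      limits:
        nvidia.com/gpu: 1
```

On GCP and Azure the defaults handle this wiring for you; elsewhere you'd add the equivalent of the fragment above to the controller's config yourself.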
- GitHub: https://github.com/google/kubeflow
- kubeflow-discuss: https://groups.google.com/forum/m/#!forum/kubeflow-discuss
Disclosure: I work at Google on Kubeflow