I really wish there were more tutorials like this on how to set up Spark and other Big Data tools (e.g. TensorFlow) on cloud infrastructure, as that has personally been my primary barrier to starting work with very large amounts of data. (Most current tutorials require running a ton of console commands that are already obsolete.)
I took a Spark course on edX last year, but the environment was set up with a customized Vagrant config that had no real-world use. I definitely prefer the Kubernetes approach.
Just as an FYI, we[1] are working on an open source, cloud-based Machine Learning / Big Data platform that might be of interest to you. It's not all ready yet, but when it is, there will be a simple REST API that lets you define the kind of setup you want, "push a button", and have it all deployed. Our initial backend is AWS with plain-Jane EC2 nodes, but it will be possible to extend it to other configurations as well.
Right now we deploy a Spark/Hadoop cluster with Apache SystemML, Mahout and MLlib installed. Zeppelin will be coming to the stack, as will other tools like TensorFlow, SparkR, CaffeOnSpark, etc.
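Just to sketch the idea (the endpoint, fields and auth below are purely illustrative placeholders, not the real API, since nothing is published yet), the "push a button" call would look something like this:

    # Hypothetical sketch only: endpoint, fields and auth are placeholders,
    # not the actual (unreleased) API.
    import requests

    API = "https://your-deployment.example.com/api/v1"  # placeholder URL

    cluster_spec = {
        "backend": "aws-ec2",   # plain EC2 nodes, per the initial backend
        "workers": 4,
        "stack": ["spark", "hadoop", "systemml", "mahout", "mllib"],
    }

    resp = requests.post(
        f"{API}/clusters",
        json=cluster_spec,
        headers={"Authorization": "Bearer <token>"},  # placeholder auth
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json())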
We'll be offering our own hosted service based on this, but it'll be open source so you can deploy it in an environment of your own if you wish.
We'll do a "Show HN" when we have something ready, so keep an eye out if that sounds interesting.
I also plan to write up some tutorial and documentation based on our experiences building this out, but the priority right now is getting it built. :-)
If you'd like to do this in containers/Kubernetes, we'd love to highlight your work! Kubernetes runs great on AWS (as well as GCP, Azure and elsewhere), so no cloud migration required.
Shameless plug:
Pachyderm is another way to run big data workloads in Kubernetes. github.com/pachyderm/pachyderm. We spoke at KubeCon SF last year and v1.0 is coming out next month!
Awesome, we'll definitely be looking into that as we progress. "Version 1" is going straight to raw VMs, as we saw that as the path of least resistance to getting an MVP out (especially since nobody here has done much work with containers yet), but the container route is definitely something we'll be pursuing at some point.
I'm all for rolling your own if you're building one of these services or have existing infrastructure, but I personally like the simplicity of "here, you set this up".
Disclosure: I work at Google on Compute Engine (which underlies all of these).
Well, the included packages in the "dcos package" CLI repo are just the prepackaged frameworks. You can actually deploy Kubernetes on top of DCOS in the same way - it natively supports Docker containers as an execution method.
Second the motion. Ambari is actually one of the key elements of the tech stack we're using for our service. With the "Blueprints" feature and the REST API, it's a really convenient way to automate the provisioning of Spark/Hadoop clusters.
Even better, Ambari isn't actually limited to installing just Hadoop/Spark, etc. In principle, you could extend it to take care of installing pretty much anything.
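For a concrete (if trimmed-down) sketch of that flow against the Ambari REST API, something like the following works; the hostnames and credentials are placeholders, and a real blueprint needs the full set of components:

    # Sketch of Ambari Blueprints provisioning (Ambari 2.x-era REST API).
    # The blueprint is heavily trimmed; hostnames/credentials are placeholders.
    import requests

    AMBARI = "http://ambari-server:8080/api/v1"
    AUTH = ("admin", "admin")               # default credentials; change these
    HEADERS = {"X-Requested-By": "ambari"}  # header required by the Ambari API

    blueprint = {
        "Blueprints": {"stack_name": "HDP", "stack_version": "2.4"},
        "host_groups": [
            {"name": "master", "cardinality": "1",
             "components": [{"name": "NAMENODE"}, {"name": "SPARK_JOBHISTORYSERVER"}]},
            {"name": "worker", "cardinality": "3",
             "components": [{"name": "DATANODE"}, {"name": "NODEMANAGER"}]},
        ],
    }

    cluster_template = {
        "blueprint": "spark-cluster",
        "host_groups": [
            {"name": "master", "hosts": [{"fqdn": "master-1.example.com"}]},
            {"name": "worker", "hosts": [{"fqdn": "worker-1.example.com"},
                                         {"fqdn": "worker-2.example.com"},
                                         {"fqdn": "worker-3.example.com"}]},
        ],
    }

    # Register the blueprint (its name comes from the URL), then ask Ambari
    # to build a cluster from it.
    requests.post(f"{AMBARI}/blueprints/spark-cluster", json=blueprint,
                  auth=AUTH, headers=HEADERS).raise_for_status()
    requests.post(f"{AMBARI}/clusters/spark-demo", json=cluster_template,
                  auth=AUTH, headers=HEADERS).raise_for_status()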
If you're interested in Hortonworks Data Platform then check out Cloudbreak. It deploys HDP to major cloud platforms using Docker, and everything is fully open source.
Using EMR on AWS makes it as easy as a few clicks in the AWS console to spin up a cluster running Spark and Zeppelin. Plus you get direct access to any data stored in S3 on that same AWS account.
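If you'd rather script it than click through the console, the same thing looks roughly like this with boto3 (the release label, instance types and roles are illustrative; adjust for your account):

    # Spin up an EMR cluster with Spark and Zeppelin via boto3.
    # Release label, instance types and roles are illustrative defaults.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.run_job_flow(
        Name="spark-zeppelin-demo",
        ReleaseLabel="emr-5.0.0",        # pick a release that bundles both apps
        Applications=[{"Name": "Spark"}, {"Name": "Zeppelin"}],
        Instances={
            "MasterInstanceType": "m4.large",
            "SlaveInstanceType": "m4.large",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster up for notebooks
        },
        JobFlowRole="EMR_EC2_DefaultRole",        # default EMR instance profile
        ServiceRole="EMR_DefaultRole",
    )
    print(response["JobFlowId"])

From there the Zeppelin notebook can read and write s3:// paths directly.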
Lots of folks have been interested in how to run common stacks on Kubernetes; this is a great example of how to run a really common scenario (i.e. using your cluster to do Spark processing).
Given the failure mode of DataNodes in Hadoop (lose X replicas, adios), you'd probably want something like the upcoming PetSet in Kubernetes 1.3. The fact that losing nodes can take out your job is why we push so hard on having people use our HDFS connector for GCS: it's really nice to have a high-bandwidth shared object store that won't "crash".
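To make that concrete, here's a minimal PySpark sketch assuming the GCS connector is on the classpath (it ships preinstalled on Dataproc, for example); the bucket name is a placeholder:

    # Read from and write to GCS directly from Spark via gs:// paths,
    # so losing a worker never loses the data itself.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("gcs-demo").getOrCreate()

    df = spark.read.text("gs://my-bucket/input/*.txt")   # read straight from GCS
    counts = df.groupBy("value").count()
    counts.write.mode("overwrite").parquet("gs://my-bucket/output/counts")

    spark.stop()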
Am I the only one who thinks of https://en.wikipedia.org/wiki/Hindenburg_disaster when seeing the words "spark" and "zeppelin" together? (This type of airship is called a "Zeppelin" in my native language...)