I really wish there were more tutorials like this on how to set up Spark and other Big Data tools (e.g. TensorFlow) on cloud infrastructure, as that has personally been my primary barrier to starting work with very large amounts of data. (Most current tutorials require running a ton of console commands that are already obsolete.)
I took a Spark course on edX last year, but the environment was set up with a customized Vagrant config that had no real-world use. I definitely prefer the Kubernetes approach.
Just as an FYI, we[1] are working on an open source, cloud-based Machine Learning / Big Data platform that might be of interest to you. It's not all ready yet, but when it is, there will be a simple REST API that lets you define the kind of setup you want, "push a button", and have it all deployed. Our initial backend is AWS with plain-Jane EC2 nodes, but it will be possible to extend it to other configurations as well.
Right now we deploy a Spark/Hadoop cluster with Apache SystemML, Mahout and MLlib installed. Zeppelin will be coming to the stack, as will other tools like TensorFlow, SparkR, CaffeOnSpark, etc.
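Just to sketch the idea (the endpoint, fields and auth below are purely illustrative placeholders, not the real API, since nothing is published yet), the "push a button" call would look something like this:

    # Hypothetical sketch only: endpoint, fields and auth are placeholders,
    # not the actual (unreleased) API.
    import requests

    API = "https://your-deployment.example.com/api/v1"  # placeholder URL

    cluster_spec = {
        "backend": "aws-ec2",   # plain EC2 nodes, per the initial backend
        "workers": 4,
        "stack": ["spark", "hadoop", "systemml", "mahout", "mllib"],
    }

    resp = requests.post(
        f"{API}/clusters",
        json=cluster_spec,
        headers={"Authorization": "Bearer <token>"},  # placeholder auth
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json())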
We'll be offering our own hosted service based on this, but it'll be open source so you can deploy it in an environment of your own if you wish.
We'll do a "Show HN" when we have something ready, so keep an eye out if that sounds interesting.
I also plan to write up some tutorial and documentation based on our experiences building this out, but the priority right now is getting it built. :-)
If you'd like to do this in containers/Kubernetes, we'd love to highlight your work! Kubernetes runs great on AWS (as well as GCP, Azure and elsewhere), so no cloud migration required.
Shameless plug:
Pachyderm is another way to run big data workloads in Kubernetes. github.com/pachyderm/pachyderm. We spoke at KubeCon SF last year and v1.0 is coming out next month!
Awesome, we'll definitely be looking into that as we progress. "Version 1" is going straight to raw VMs, as we saw that as the path of least resistance to getting an MVP out (especially since nobody here has done much work with containers yet), but the container route is definitely something we'll be pursuing at some point.
I'm all for rolling your own if you're building one of these services or have existing infrastructure, but I personally like the simplicity of "here, you set this up".
Disclosure: I work at Google on Compute Engine (which underlies all of these).
Well, the included packages in the "dcos package" CLI repo are just the prepackaged frameworks. You can actually deploy Kubernetes on top of DCOS in the same way - it natively supports Docker containers as an execution method.
Second the motion. Ambari is actually one of the key elements of the tech stack we're using for our service. With the "Blueprints" feature and the REST API, it's a really convenient way to automate the provisioning of Spark/Hadoop clusters.
Even better, Ambari isn't actually limited to installing just Hadoop/Spark, etc. In principle, you could extend it to take care of installing pretty much anything.
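For a concrete (if trimmed-down) sketch of that flow against the Ambari REST API, something like the following works; the hostnames and credentials are placeholders, and a real blueprint needs the full set of components:

    # Sketch of Ambari Blueprints provisioning (Ambari 2.x-era REST API).
    # The blueprint is heavily trimmed; hostnames/credentials are placeholders.
    import requests

    AMBARI = "http://ambari-server:8080/api/v1"
    AUTH = ("admin", "admin")               # default credentials; change these
    HEADERS = {"X-Requested-By": "ambari"}  # header required by the Ambari API

    blueprint = {
        "Blueprints": {"stack_name": "HDP", "stack_version": "2.4"},
        "host_groups": [
            {"name": "master", "cardinality": "1",
             "components": [{"name": "NAMENODE"}, {"name": "SPARK_JOBHISTORYSERVER"}]},
            {"name": "worker", "cardinality": "3",
             "components": [{"name": "DATANODE"}, {"name": "NODEMANAGER"}]},
        ],
    }

    cluster_template = {
        "blueprint": "spark-cluster",
        "host_groups": [
            {"name": "master", "hosts": [{"fqdn": "master-1.example.com"}]},
            {"name": "worker", "hosts": [{"fqdn": "worker-1.example.com"},
                                         {"fqdn": "worker-2.example.com"},
                                         {"fqdn": "worker-3.example.com"}]},
        ],
    }

    # Register the blueprint (its name comes from the URL), then ask Ambari
    # to build a cluster from it.
    requests.post(f"{AMBARI}/blueprints/spark-cluster", json=blueprint,
                  auth=AUTH, headers=HEADERS).raise_for_status()
    requests.post(f"{AMBARI}/clusters/spark-demo", json=cluster_template,
                  auth=AUTH, headers=HEADERS).raise_for_status()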
If you're interested in Hortonworks Data Platform then check out Cloudbreak. It deploys HDP to major cloud platforms using Docker, and everything is fully open source.
Using EMR on AWS makes it as easy as a few clicks in the AWS console to spin up a cluster running Spark and Zeppelin. Plus you get direct access to any data stored in S3 on that same AWS account.
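If you'd rather script it than click through the console, the same thing looks roughly like this with boto3 (the release label, instance types and roles are illustrative; adjust for your account):

    # Spin up an EMR cluster with Spark and Zeppelin via boto3.
    # Release label, instance types and roles are illustrative defaults.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.run_job_flow(
        Name="spark-zeppelin-demo",
        ReleaseLabel="emr-5.0.0",        # pick a release that bundles both apps
        Applications=[{"Name": "Spark"}, {"Name": "Zeppelin"}],
        Instances={
            "MasterInstanceType": "m4.large",
            "SlaveInstanceType": "m4.large",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster up for notebooks
        },
        JobFlowRole="EMR_EC2_DefaultRole",        # default EMR instance profile
        ServiceRole="EMR_DefaultRole",
    )
    print(response["JobFlowId"])

From there the Zeppelin notebook can read and write s3:// paths directly.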
Lots of folks have been interested in how to run common stacks on Kubernetes; this is a great example of how to run a really common scenario (i.e. using your cluster to do Spark processing).
Given the failure mode of DataNodes in Hadoop (lose X replicas, adios), you'd probably want something like the upcoming PetSet in Kubernetes 1.3. The fact that losing nodes can take out your job is why we push so hard on having people use our HDFS connector for GCS: it's really nice to have a high-bandwidth shared object store that won't "crash".
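To make that concrete, here's a minimal PySpark sketch assuming the GCS connector is on the classpath (it ships preinstalled on Dataproc, for example); the bucket name is a placeholder:

    # Read from and write to GCS directly from Spark via gs:// paths,
    # so losing a worker never loses the data itself.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("gcs-demo").getOrCreate()

    df = spark.read.text("gs://my-bucket/input/*.txt")   # read straight from GCS
    counts = df.groupBy("value").count()
    counts.write.mode("overwrite").parquet("gs://my-bucket/output/counts")

    spark.stop()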
Am I the only one who thinks of https://en.wikipedia.org/wiki/Hindenburg_disaster when seeing the words "spark" and "zeppelin" together? (This type of airship is called a "Zeppelin" in my native language...)