Deeplearning4j does this already. It has a huge community, a Scala API, and model import from Keras. It's important to note that Spark is not an efficient computation layer -- it's best used for fast ETL. If you get that wrong, you're going to train slowly.
I respect what Intel is trying to do here, but it's going to take a lot more than "we built stuff" to get anyone to switch, let alone build a community around it.
To be fair to Intel, I can't wait to see what they do with accelerators and Phi, but I need to see more results first.
Competition in the space is definitely needed :D.
We have yet to see FPGAs and the Nervana acquisition really play out as well.
Benchmarks we ran ourselves show that we're faster than TensorFlow on multiple GPUs for a non-trivial image-processing task: https://github.com/deeplearning4j/dl4j-benchmark. That's the best apples-to-apples comparison we have for the moment.
When you're aiming to put deep learning into production, a bunch of other things are important too, notably integrations. DL4J comes with integrations for Hadoop, Kafka, and Elasticsearch as well as Spark. At the inference stage, we autoscale elastically as a microservice using Lagom and a REST API. Most frameworks are just libs that don't solve problems deeper in the workflow. Our tooling spans data pipelines with DataVec (reusable data preprocessing), model evaluation with Arbiter, and a GUI for monitoring heuristics during training.
I have to correct Chris here. He is talking about a lot of features that are in our enterprise version, SKIL.
We will offer a limited developer version of SKIL for free.
Think of SKIL as similar to GitLab or GitHub Enterprise.
In SKIL we also have auto-provisioning of a cluster and a higher-level interface for running deep learning workloads. It auto-configures most of the parameters, like the Spark worker native library path, and handles setting up things like a training UI as well as installation of the MKL and cuDNN libraries.
Optionally, you can also run a version of this with DC/OS and co, which ship a packaged Spark.
What we do have in DL4J are the raw components you can use to build these things, such as DataVec and dl4j-streaming, which covers our integration with Kafka.
Sure :D. It's still in my interest to disclose that we partner with Nvidia pretty closely, though. I would hope to see competition here, but we have a large vested interest in GPUs succeeding. Thanks for the sentiment though!
I started experimenting with DL4J about a month ago. The getting started example apps are actually pretty good. You can pretty much clone the repo and run them.
This is a classic vanity deep-learning framework that Intel built due to NIH syndrome. It's like DSSTNE. Doomed to be abandoned. I can't imagine a worse way to position a deep learning library than to say: this only works on CPUs. When you look at Intel's track record with software, especially their Trusted Analytics Platform this year, BigDL's prospects are poor. I'm just waiting for IBM to copy this move and come out with yet another deep learning lib: YADLL.
Comparing Spark and TensorFlow is sort of like comparing Numpy and Pandas. There is some overlap, but they are pretty different things.
Spark is a big data manipulation tool, which comes with a somewhat-adequate machine learning library. TensorFlow is an optimised math library with machine learning operations built on it.
Spark doesn't support GPU operations (although as you note Databricks has proprietary extensions on their own cluster). DeepLearning4J and various other libraries do similar things.
However, if you are building your own neural network architectures, then TF (which has a highly optimised distributed training mode) is more useful.
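To make the contrast concrete, here is a toy sketch in plain Python (stand-ins only; this is not the actual Spark or TF API, and the data is made up): Spark-style work is declarative record manipulation, while TF-style work is dense numerical kernels.

```python
# Spark-style: declarative ETL over keyed records
# (think df.groupBy("country").avg(), minus the cluster).
rows = [("us", 3.0), ("uk", 1.0), ("us", 5.0)]
sums, cnts = {}, {}
for k, v in rows:
    sums[k] = sums.get(k, 0.0) + v
    cnts[k] = cnts.get(k, 0) + 1
avg = {k: sums[k] / cnts[k] for k in sums}

# TF-style: optimised math primitives, e.g. a matrix multiply
# (here a naive pure-Python version for illustration).
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]
```

The point of the sketch: the first shape of problem wants a data-manipulation engine; the second wants hardware-accelerated linear algebra.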
For someone just getting started on Spark - what do you mean by "somewhat adequate"? Because I see MLLib (https://spark.apache.org/docs/2.0.2/mllib-guide.html) and a quick glance shows me a lot of overlap with TensorFlow.
At Google, their graph processing system (Expander) and deep learning framework (TensorFlow) are separate systems. Spark looks to have been built from the graph side (RDDs) first and is now getting ML components.
MLLib seems awesome, but the devil is in the details. Examples that have burnt me include LogisticRegression supporting only binary classification, the LibSVM support only handling import, the GBT implementation being weak compared to eg XGBoost, etc.
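For anyone who hasn't met the LibSVM format mentioned above, it's just a sparse text encoding: a label followed by `index:value` pairs. A minimal parser sketch (plain Python, not MLLib's loader; the function name is my own):

```python
def parse_libsvm_line(line):
    """Parse one LibSVM-format line: '<label> <idx>:<val> <idx>:<val> ...'
    Returns the label and a sparse {index: value} feature dict."""
    parts = line.split()
    label = float(parts[0])
    feats = {int(i): float(v) for i, v in (p.split(":") for p in parts[1:])}
    return label, feats

label, feats = parse_libsvm_line("1 3:0.5 7:1.2")
```

Reading it is the easy half; the complaint above is that Spark gives you no equivalent writer, so round-tripping data through this format means rolling your own export.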
A lot of the time it is fine though.
Graph support... hmm. GraphX is OK, but there are lots of things that eg NetworkX has that GraphX doesn't. In my experience, we've started a lot of projects with GraphX and abandoned them because GraphX's implementations didn't have the features we needed.
BTW, RDDs aren't graphs. I think you might be confusing the Spark directed-acyclic-graph (DAG) execution model with graph processing.
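The distinction is easy to blur, so here's a tiny sketch (plain Python, not Spark; class and method names are mine) of what the execution-DAG side actually is: transformations only record a plan, and nothing runs until an action fires. The DAG describes *computation*, not graph-structured *data*.

```python
class LazyDataset:
    """Minimal stand-in for a Spark-like lazy pipeline."""
    def __init__(self, data, plan=()):
        self._data, self._plan = data, plan

    def map(self, fn):          # transformation: extends the plan, no work yet
        return LazyDataset(self._data, self._plan + (("map", fn),))

    def filter(self, pred):     # transformation: extends the plan, no work yet
        return LazyDataset(self._data, self._plan + (("filter", pred),))

    def collect(self):          # action: walks the recorded plan and executes it
        out = self._data
        for op, fn in self._plan:
            out = [fn(x) for x in out] if op == "map" else [x for x in out if fn(x)]
        return out

ds = LazyDataset([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
```

Until `ds.collect()` is called, `ds` is just a plan; graph *processing* (GraphX-style vertices and edges) is a different layer built on top of data like this, not the DAG itself.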
TensorFlow doesn't have as many general-purpose ML algorithms. For example, I don't think there is a Random Forest in TF, and for 90% of ML problems an RF is what you need.
But if you are doing Neural Network stuff then TF is exactly what you need.
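For readers who haven't used one: a random forest is just bagging plus random feature subsets over many small trees. A from-scratch sketch with depth-1 trees (decision stumps), stdlib only - illustrative, not any library's implementation, and all names are mine:

```python
import random
from collections import Counter

def best_stump(X, y, feats):
    """Pick the single-feature threshold split with the best
    majority-vote accuracy on (X, y). Returns (feat, thresh, left_label, right_label)."""
    best, best_acc = None, -1.0
    for f in feats:
        for t in sorted({row[f] for row in X}):
            left = [yi for xi, yi in zip(X, y) if xi[f] <= t]
            right = [yi for xi, yi in zip(X, y) if xi[f] > t]
            if not left or not right:
                continue
            lmaj = Counter(left).most_common(1)[0][0]
            rmaj = Counter(right).most_common(1)[0][0]
            acc = (left.count(lmaj) + right.count(rmaj)) / len(y)
            if acc > best_acc:
                best, best_acc = (f, t, lmaj, rmaj), acc
    return best

def train_forest(X, y, n_trees=25, seed=0):
    rng, n, d = random.Random(seed), len(X), len(X[0])
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]      # bootstrap sample
        feats = rng.sample(range(d), max(1, d // 2))    # random feature subset
        s = best_stump([X[i] for i in idx], [y[i] for i in idx], feats)
        if s:
            trees.append(s)
    return trees

def predict(trees, x):
    votes = [l if x[f] <= t else r for f, t, l, r in trees]
    return Counter(votes).most_common(1)[0][0]

X = [[0, 0], [1, 1], [2, 2], [10, 10], [11, 11], [12, 12]]
y = ["a", "a", "a", "b", "b", "b"]
trees = train_forest(X, y)
```

Real forests grow deep trees and split on impurity rather than raw accuracy, but the bagging-plus-feature-subsampling skeleton is the same.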
I'll point out that this is TF Contrib Learn, not TF Learn[1], or one of many other places where things might be implemented. Makes things a bit confusing.
Thanks for that comment. Indeed we are looking at general-purpose ML (GBM and logit regression being our primary use cases). I was not looking at RDDs but rather GraphFrames.
I see that you worked with GraphX and abandoned it. This is disappointing - we were really looking forward to Spark GraphFrames with HBase as the OLTP data store for graph data.
In your situation, how did you overcome the problems in Spark? Did you use an accompanying toolkit to augment Spark, or did you build your own (hopefully not!)?
I think it's very hard to give general advice in this area. You are best off prototyping a deep spike into what you need, and seeing where things don't work.
Graph stuff is generally hard, so I don't think there is a magic bullet here.
I mean, even just the Spark-using-HBase bit is non-trivial to do in a way that provides adequate performance. There are 3(?) different connectors, with pluses and minuses for each one. Making sure data locality is working will depend on your YARN or Mesos setup, and debugging that is a nightmare.
In our case, we prefilter data in Spark then load into NetworkX. Works ok, mostly.
Our data sets have grown massively over the last few months and we now need a bigger solution. I think we will start off with a hosted solution like EMR - performance is not super critical right now (batch-mode training)... but developer productivity is key.
Yes, certainly label propagation type algorithms are more suited to Spark than TensorFlow (although of course the fast matrix operations in TF could work well for this).
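For context, label propagation is simple enough to sketch in a few lines. Here's a plain-Python toy (my own function and variable names; in Spark this shape maps onto Pregel-style iterations in GraphX): seed nodes keep their labels, and everyone else repeatedly adopts the most common label among their neighbours.

```python
from collections import Counter

def propagate(edges, seeds, n_iters=10):
    """edges: {node: [neighbors]}; seeds: {node: label}.
    Synchronous label propagation: unlabeled nodes adopt the most
    common label among their already-labeled neighbors each round."""
    labels = dict(seeds)
    for _ in range(n_iters):
        new = dict(labels)
        for node, nbrs in edges.items():
            if node in seeds:                 # seed labels stay fixed
                continue
            votes = Counter(labels[n] for n in nbrs if n in labels)
            if votes:
                new[node] = votes.most_common(1)[0][0]
        labels = new                          # synchronous update
    return labels

# Two chains hanging off two seeds: 0-1-2 labeled from 0, 3-4 from 3.
graph = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]}
result = propagate(graph, seeds={0: "x", 3: "y"})
```

Each iteration only touches a node and its neighbours, which is exactly why this family of algorithms distributes well over a partitioned graph.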
Spark is vastly different. TensorFlow focuses more on numerical computing and is a low-level tool.
Spark is more focused on "counting at scale with a functional DSL". Hence its focus on things like ETL and columnar processing à la dataframes.
As far as Spark doing "deep learning", what you should mean here is: "libraries in the ecosystem leverage Spark as a data access layer for doing the real numerical compute".
Spark can count things with functional programming. It's not meant for heavy numerical operations. They are working on this where they can, but you really can't beat a GPU or good ol' SIMD instructions on hardware.
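"Counting with functional programming" has a very specific shape - map each record, then merge partial results - and it's worth seeing how little it has in common with dense numerics. A local sketch (plain Python stdlib, not Spark; the data is made up):

```python
from functools import reduce
from collections import Counter

lines = ["to be or not to be", "to do is to be"]

# map: tokenize each line into a per-line count;
# reduce: merge the partial counts. Same shape as a Spark
# word count, just without partitions or a cluster.
counts = reduce(lambda a, b: a + b, (Counter(line.split()) for line in lines))
```

Merging `Counter`s is associative, which is the whole trick: it lets the reduce step happen per-partition and then across the cluster. None of that helps you multiply large matrices quickly - that's the GPU/SIMD side of the fence.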
Spark doesn't do GPU acceleration, which is super important if you don't have a lot of spare CPU capacity. The one time I tried to train a DL model on CPU it was 48x slower, and with communication overhead it would have taken more than 48 cores to match the single GPU. And given you want to do hyperparameter search on top of that, those CPU cores start adding up.
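The "more than 48 cores" claim follows from a simple back-of-envelope model (my own simplification, using the 48x figure from the comment above): if distributed CPU training loses some fraction of time to communication, the break-even core count scales up accordingly.

```python
def break_even_cores(gpu_speedup, comm_overhead):
    """Cores needed to match one GPU, assuming perfect scaling on the
    remaining (1 - comm_overhead) fraction of each core's time.
    gpu_speedup: single-GPU speedup over a single core (e.g. 48).
    comm_overhead: fraction of CPU time lost to communication (0..1)."""
    return gpu_speedup / (1.0 - comm_overhead)
```

For example, at a 48x single-core slowdown and 25% communication overhead you'd need 64 cores just to break even - before multiplying by however many hyperparameter configurations you want to run in parallel.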
Yes the Databricks stuff is available on their cluster. It isn't part of Spark the Open Source project.
Yes, as I said elsewhere there are plenty of projects to enable GPU usage via Spark. Have you actually tried them though? I have (eg https://github.com/IBMSparkGPU/GPUEnabler/issues/25 ) and there are... issues.
Spark is just a data access layer here. It's not even remotely GPU-friendly. Most people also still rely on Mesos or YARN for running distributed. The library you're using matters a lot. Mesos just added GPU support:
http://mesos.apache.org/documentation/latest/gpu-support/
YARN can sorta support it with node labeling for job completion, but it's still kinda hacky.
When Spark can run GPUs like this out of the box (without "production ready" buzzwords), then we're talking.
For now, Spark needs a companion library to work with GPUs though.
Intel is pushing their own chips for deep learning (i.e. Xeon and Xeon Phi). They claim that using MKL on these chips gets comparable performance to using Caffe/cuDNN on a high-end GPU.
Phi may yet win out. I saw a talk from an NLP researcher about Seq2Seq models; someone asked him what his wish for hardware was, and it was for GPU cores not to be as stuck in lock step with each other control-wise. Not sure if he knew about the Phis, though.
https://deeplearning4j.org https://github.com/deeplearning4j/ScalNet https://deeplearning4j.org/model-import-keras https://gitter.im/deeplearning4j/deeplearning4j