SystemML: Machine learning made easier (open source)

probdist · on Nov 5, 2015

I'm slightly concerned about the Hadoop/Spark + machine learning ecosystem right now.

There seem to be a lot of technologies and projects being built on the core foundation of in-memory distributed computation: MLlib in Spark, H2O, Apache Mahout/Samsara, and probably many others.

My impression is that Samsara and SystemML's DML are designed to supply the primitives you need to build a machine learning model with less awareness of the underlying system model. Ostensibly, this saves the developer of a new algorithm the headache of thinking about how the algorithm maps onto the distributed computation model, which is something you do have to consider when developing with H2O or Spark. However, many common off-the-shelf algorithms are already implemented and available, which suggests that working with the Spark or H2O backends directly is perhaps not that hard if you want to produce performant ML code.

As a computational scientist, learning a DSL to accelerate getting an algorithm of mine into production via Spark seems like a potential distraction from just getting better with Spark.

urlwolf · on Nov 5, 2015

The current big data ecosystem is a difficult place to be for companies that bet the house on Hadoop (Cloudera, Hortonworks, MapR), for sure.

There are new technologies coming at a breakneck pace. Check out Apache Flink if you need streams and the microbatches in Spark don't do it for you. The ML library there is not that far along yet, though.

rectang · on Nov 5, 2015

This project just entered the Apache Incubator 2 days ago. http://wiki.apache.org/incubator/SystemML

holdenk · on Nov 5, 2015

If anyone is interested in working on this (or other exciting Spark-related things) @ IBM, please reach out to me (my HN username with the letter k removed at us.ibm.com).

michaelsbradley · on Nov 5, 2015

Has there been any work toward integration with IPython/Jupyter?

nl · on Nov 5, 2015

Getting IPython/Jupyter connected to Spark isn't hard.

I did it by following a combination of these two guides:

http://ramhiser.com/2015/02/01/configuring-ipython-notebook-...

http://thepowerofdata.io/configuring-jupyteripython-notebook...

Edit: if this is interesting, then see https://news.ycombinator.com/item?id=10496385

acangiano · on Nov 5, 2015

My team at IBM is developing a tool that integrates several key open source data science technologies all in one. It's called the Data Scientist Workbench (http://datascientistworkbench.com). It doesn't support SystemML yet, but it already integrates IPython/R/Scala with Spark, for example. The idea is to have all the tools you'll need for data analysis and visualization, in the cloud, directly accessible anytime from your browser. We plan to include SystemML in the future.

draven · on Nov 5, 2015

The README in the GitHub repo has more information: https://github.com/SparkTC/systemml

niels_olson · on Nov 5, 2015

> written in Java

gotta go, see ya!