I want to use Smile, but its license prohibits its use at my company. I understand the project authors' intention to promote contributions to the project, but the requirements for obtaining a commercial license are a deal breaker for many.
I suspect the (L)GPL is a big part of why Java has more-or-less completely ceded the data space to Python. Corporate policy makes it painful for me to take a dependency on GPL (including LGPL), too, and that makes building and maintaining data applications in Java an absolute minefield, because so many projects are copyleft. Even the packages that are Apache, MIT or BSD licensed often require you to plug in netlib-java (LGPL) if you want decent performance.
It's encouraging to see a project like Tribuo come along, but it also feels like too little, too late. I'm already well underway migrating to Python, and I have yet to encounter any particular reason to look back.
This looks excellent; it will take a bit of time to go through and understand. The announcement blog post is actually really helpful and explains the problem you're solving well (and one I'm intimately familiar with, so I see the value here).
We have a strong focus on provenance: Tribuo models capture their input and output domains, along with the configuration necessary to rebuild the model. Tribuo is also more object oriented: nothing returns a bare float or int; you always get back a strongly typed prediction object which you can use without looking anything up. Tribuo also integrates happily with other ML libraries on the JVM, like TensorFlow and XGBoost, providing the same provenance/tracking benefits as standard Tribuo models, and we contribute fixes back to those projects to help support the ecosystem. Plus we can load models trained in Python via ONNX.
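To make the typed-prediction and provenance points concrete, here's a minimal sketch using Tribuo's documented classification API (the CSV path and response column name are placeholders, and exact signatures may differ between Tribuo versions):

```java
import org.tribuo.Model;
import org.tribuo.MutableDataset;
import org.tribuo.Prediction;
import org.tribuo.classification.Label;
import org.tribuo.classification.LabelFactory;
import org.tribuo.classification.sgd.linear.LogisticRegressionTrainer;
import org.tribuo.data.csv.CSVLoader;
import com.oracle.labs.mlrg.olcut.provenance.ProvenanceUtil;

import java.nio.file.Paths;

public class TribuoSketch {
    public static void main(String[] args) throws Exception {
        // Load a labelled CSV; "iris.csv" and "species" are placeholders.
        var csvLoader = new CSVLoader<>(new LabelFactory());
        var dataSource = csvLoader.loadDataSource(Paths.get("iris.csv"), "species");
        var trainData = new MutableDataset<>(dataSource);

        // Train a simple logistic regression model.
        Model<Label> model = new LogisticRegressionTrainer().train(trainData);

        // Predictions are strongly typed: no bare floats, no label-index lookups.
        Prediction<Label> prediction = model.predict(trainData.getExample(0));
        Label predicted = prediction.getOutput();
        System.out.println(predicted.getLabel() + " @ " + predicted.getScore());

        // The model carries its own provenance: data source, trainer config, etc.
        System.out.println(ProvenanceUtil.formattedProvenanceString(model.getProvenance()));
    }
}
```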
To your direct question: I've not benchmarked Smile against Tribuo. We are very interested in the upcoming Java Vector API - https://openjdk.java.net/jeps/338 - targeted at Java 16, which will let us accelerate computations that C2 or Graal don't autovectorise.
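For illustration, this is roughly what that looks like: a dot product written against the incubating Vector API (JDK 16+, run with --add-modules jdk.incubator.vector). This is a generic sketch of the API, not Tribuo code:

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class DotProduct {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // Vectorised dot product: processes SPECIES.length() floats per iteration,
    // then handles the remaining tail elements with a scalar loop.
    static float dot(float[] a, float[] b) {
        var acc = FloatVector.zero(SPECIES);
        int i = 0;
        int bound = SPECIES.loopBound(a.length);
        for (; i < bound; i += SPECIES.length()) {
            var va = FloatVector.fromArray(SPECIES, a, i);
            var vb = FloatVector.fromArray(SPECIES, b, i);
            acc = va.fma(vb, acc); // lane-wise acc += a[i..] * b[i..]
        }
        float sum = acc.reduceLanes(VectorOperators.ADD);
        for (; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f, 4f, 5f};
        float[] b = {5f, 4f, 3f, 2f, 1f};
        System.out.println(dot(a, b)); // 35.0
    }
}
```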
- There are various efforts on the JVM to build multidimensional arrays; we're talking to many of those teams to try to figure out a strategy for the whole platform. Ditto for dataframes, though Apache Arrow looks like a good baseline.
- We're not looking at other languages outside of the JVM at the moment, but we're continuing to contribute to TensorFlow Java and ONNX Runtime to improve their Java support. We could look at PyTorch inference support based on their Java API, but that overlaps pretty heavily with what ONNX Runtime supports (there's a small inference sketch after this list). Do you have any suggestions?
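For reference, ONNX Runtime's Java API already makes inference fairly compact. A minimal sketch, where the model file and the input name "input" are placeholders for a real exported model:

```java
import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtSession;

import java.util.Arrays;
import java.util.Map;

public class OnnxSketch {
    public static void main(String[] args) throws Exception {
        // The environment is a process-wide singleton.
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        // "model.onnx" is a placeholder path for a real exported model.
        try (OrtSession session = env.createSession("model.onnx", new OrtSession.SessionOptions())) {
            float[][] features = {{1.0f, 2.0f, 3.0f, 4.0f}};
            try (OnnxTensor tensor = OnnxTensor.createTensor(env, features);
                 OrtSession.Result result = session.run(Map.of("input", tensor))) {
                // Output shape/type depends on the model; a float matrix is assumed here.
                float[][] scores = (float[][]) result.get(0).getValue();
                System.out.println(Arrays.toString(scores[0]));
            }
        }
    }
}
```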
> We're not looking at other languages outside of the JVM at the moment, but we're continuing to contribute to TensorFlow Java and ONNX Runtime to improve their Java support. We could look at PyTorch inference support based on their Java API, but that overlaps pretty heavily with what ONNX Runtime supports. Do you have any suggestions?
Many models are deployed as RESTful endpoints, so a quick and easy way to deploy models as services with containers or serverless providers would be very useful - although admittedly you might not want that in the core project; it could be a good sidecar project to this (a rough sketch of what I mean follows). Given your focus on model provenance, extending that to model deployment and lifecycle management tools such as MLflow could also be very useful.
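Even something as small as this, using only the JDK's built-in HTTP server, covers a lot of cases. The predict stub here is hypothetical and stands in for a loaded Tribuo model; a real service would parse features out of the request body and call the model's predict method:

```java
import com.sun.net.httpserver.HttpServer;

import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class ModelServer {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/predict", exchange -> {
            // Read the request body (expected to contain the feature payload).
            String body = new String(exchange.getRequestBody().readAllBytes(), StandardCharsets.UTF_8);
            byte[] response = predict(body).getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, response.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(response);
            }
        });
        server.start();
    }

    // Hypothetical stub: a real service would deserialize features from the
    // request and return the output of a loaded model's predict call.
    static String predict(String requestBody) {
        return "{\"label\":\"example\",\"score\":1.0}";
    }
}
```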
We're thinking about the visualisation story. At the moment we use notebooks with the IJava kernel to explore things (our tutorials are Jupyter notebooks), but we're still working on the plotting angle.