I want to use Smile, but its license prohibits its use at my company. I understand the project authors' intention to promote contributions to the project, but the requirements for obtaining a commercial license are a deal breaker for many.
I suspect the (L)GPL is a big part of why Java has more-or-less completely ceded the data space to Python. Corporate policy makes it painful for me to take a dependency on GPL (including LGPL), too, and that makes building and maintaining data applications in Java an absolute minefield, because so many projects are copyleft. Even the packages that are Apache, MIT or BSD licensed often require you to plug in netlib-java (LGPL) if you want decent performance.
It's encouraging to see a project like Tribuo come along, but it also feels like too little, too late. I'm already well underway migrating to Python, and I have yet to encounter any particular reason to look back.
This looks excellent; it will take a bit of time to go through and understand. The announcement blog post is actually really helpful and explains the problem you're solving well (and one I'm intimately familiar with, so I see the value here).
We have a strong focus on provenance: Tribuo models capture their input and output domains, along with the configuration necessary to rebuild the model. Tribuo is also more object oriented: nothing returns a bare float or int; you always get back a strongly typed prediction object which you can use without looking anything up. Tribuo also integrates happily with other ML libraries on the JVM, like TensorFlow and XGBoost, providing the same provenance/tracking benefits as standard Tribuo models, and we contribute fixes back to those projects to help support the ecosystem. Plus we can load models trained in Python via ONNX.
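To make the typed-prediction and provenance points concrete, here's a minimal sketch using Tribuo's documented classification API (the CSV path and response column name are placeholders, and exact signatures may differ between Tribuo versions):

```java
import org.tribuo.Model;
import org.tribuo.MutableDataset;
import org.tribuo.Prediction;
import org.tribuo.classification.Label;
import org.tribuo.classification.LabelFactory;
import org.tribuo.classification.sgd.linear.LogisticRegressionTrainer;
import org.tribuo.data.csv.CSVLoader;
import com.oracle.labs.mlrg.olcut.provenance.ProvenanceUtil;

import java.nio.file.Paths;

public class TribuoSketch {
    public static void main(String[] args) throws Exception {
        // Load a labelled CSV; "iris.csv" and "species" are placeholders.
        var csvLoader = new CSVLoader<>(new LabelFactory());
        var dataSource = csvLoader.loadDataSource(Paths.get("iris.csv"), "species");
        var trainData = new MutableDataset<>(dataSource);

        // Train a simple logistic regression model.
        Model<Label> model = new LogisticRegressionTrainer().train(trainData);

        // Predictions are strongly typed: no bare floats, no label-index lookups.
        Prediction<Label> prediction = model.predict(trainData.getExample(0));
        Label predicted = prediction.getOutput();
        System.out.println(predicted.getLabel() + " @ " + predicted.getScore());

        // The model carries its own provenance: data source, trainer config, etc.
        System.out.println(ProvenanceUtil.formattedProvenanceString(model.getProvenance()));
    }
}
```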
To your direct question: I've not benchmarked Smile against Tribuo. We are very interested in the upcoming Java Vector API - https://openjdk.java.net/jeps/338 - targeted at Java 16, which will let us accelerate computations that C2 or Graal don't autovectorise.
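For illustration, this is roughly what that looks like: a dot product written against the incubating Vector API (JDK 16+, run with --add-modules jdk.incubator.vector). This is a generic sketch of the API, not Tribuo code:

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class DotProduct {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // Vectorised dot product: processes SPECIES.length() floats per iteration,
    // then handles the remaining tail elements with a scalar loop.
    static float dot(float[] a, float[] b) {
        var acc = FloatVector.zero(SPECIES);
        int i = 0;
        int bound = SPECIES.loopBound(a.length);
        for (; i < bound; i += SPECIES.length()) {
            var va = FloatVector.fromArray(SPECIES, a, i);
            var vb = FloatVector.fromArray(SPECIES, b, i);
            acc = va.fma(vb, acc); // lane-wise acc += a[i..] * b[i..]
        }
        float sum = acc.reduceLanes(VectorOperators.ADD);
        for (; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f, 4f, 5f};
        float[] b = {5f, 4f, 3f, 2f, 1f};
        System.out.println(dot(a, b)); // 35.0
    }
}
```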
- There are various efforts on the JVM to build multidimensional arrays; we're talking to many of those teams to try to figure out a strategy for the whole platform. Ditto for dataframes, though Apache Arrow looks like a good baseline.
- We're not looking at other languages outside of the JVM at the moment, but we're continuing to contribute to TensorFlow Java and ONNX Runtime to improve their Java support. We could look at PyTorch inference support based on their Java API, but that overlaps pretty heavily with what ONNX Runtime supports (there's a small inference sketch after this list). Do you have any suggestions?
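For reference, ONNX Runtime's Java API already makes inference fairly compact. A minimal sketch, where the model file and the input name "input" are placeholders for a real exported model:

```java
import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtSession;

import java.util.Arrays;
import java.util.Map;

public class OnnxSketch {
    public static void main(String[] args) throws Exception {
        // The environment is a process-wide singleton.
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        // "model.onnx" is a placeholder path for a real exported model.
        try (OrtSession session = env.createSession("model.onnx", new OrtSession.SessionOptions())) {
            float[][] features = {{1.0f, 2.0f, 3.0f, 4.0f}};
            try (OnnxTensor tensor = OnnxTensor.createTensor(env, features);
                 OrtSession.Result result = session.run(Map.of("input", tensor))) {
                // Output shape/type depends on the model; a float matrix is assumed here.
                float[][] scores = (float[][]) result.get(0).getValue();
                System.out.println(Arrays.toString(scores[0]));
            }
        }
    }
}
```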
> We're not looking at other languages outside of the JVM at the moment, but we're continuing to contribute to TensorFlow Java and ONNX Runtime to improve their Java support. We could look at PyTorch inference support based on their Java API, but that overlaps pretty heavily with what ONNX Runtime supports. Do you have any suggestions?
Many models are deployed as RESTful endpoints, so a quick and easy way to deploy models as services with containers or serverless providers would be very useful - although admittedly you might not want that in the core project; it could be a good sidecar project to this (a rough sketch of what I mean follows). Given your focus on model provenance, extending that to model deployment and lifecycle management tools such as MLflow could also be very useful.
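Even something as small as this, using only the JDK's built-in HTTP server, covers a lot of cases. The predict stub here is hypothetical and stands in for a loaded Tribuo model; a real service would parse features out of the request body and call the model's predict method:

```java
import com.sun.net.httpserver.HttpServer;

import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class ModelServer {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/predict", exchange -> {
            // Read the request body (expected to contain the feature payload).
            String body = new String(exchange.getRequestBody().readAllBytes(), StandardCharsets.UTF_8);
            byte[] response = predict(body).getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, response.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(response);
            }
        });
        server.start();
    }

    // Hypothetical stub: a real service would deserialize features from the
    // request and return the output of a loaded model's predict call.
    static String predict(String requestBody) {
        return "{\"label\":\"example\",\"score\":1.0}";
    }
}
```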
We're thinking about the visualisation story. At the moment we use notebooks with the IJava kernel to explore things (our tutorials are Jupyter notebooks), but we're still working on the plotting angle.