Hacker News
A foundation for scikit-learn at Inria (gael-varoquaux.info)
78 points by fermigier on Sept 19, 2018 | hide | past | favorite | 21 comments



Scikit-learn is a very nicely written library, and I could use plenty of superlatives to describe its wondrous API.

One thing I can't recommend enough is to extend their transformer base class by implementing its fit and transform methods. A simple example can be viewed here: https://gitlab.com/timelord/sklearn_transformers

which allows you to put your transformers into scikit-learn Pipelines and GridSearchCV (and more). scikit-learn leverages multiple cores via joblib, and Dask extends this implementation to effortlessly scale scikit-learn pipelines onto a cluster of servers. https://distributed.readthedocs.io/en/latest/joblib.html

By writing your own data transformations in the transformer format you can, by extension, leverage this great ecosystem.
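The pattern is roughly this (a minimal sketch using scikit-learn's BaseEstimator/TransformerMixin; the class name and parameters are made up for illustration, not taken from the linked repo):

```python
# A minimal custom transformer sketch; ClipTransformer and its
# low/high parameters are illustrative, not from the linked repo.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

class ClipTransformer(BaseEstimator, TransformerMixin):
    """Clip features to a fixed range; exposing low/high in __init__
    makes them tunable via GridSearchCV."""
    def __init__(self, low=-1.0, high=1.0):
        self.low = low
        self.high = high

    def fit(self, X, y=None):
        # Nothing to learn for clipping; a stateful transformer would
        # store fitted attributes here (trailing-underscore convention).
        return self

    def transform(self, X):
        return np.clip(X, self.low, self.high)

rng = np.random.RandomState(0)
X = rng.randn(50, 3)
y = (X[:, 0] > 0).astype(int)

# The custom step composes with built-in estimators in a Pipeline.
pipe = Pipeline([("clip", ClipTransformer()),
                 ("clf", LogisticRegression())])
pipe.fit(X, y)
```

Because the transformer inherits get_params/set_params from BaseEstimator, GridSearchCV can tune `clip__low` and `clip__high` like any built-in parameter.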

I think it's a great time to be a data scientist / engineer now.


Unfortunately scikit-learn is a mess without an alternative.

There is so much wrong with the API design of sklearn (how can one think "predict_proba" is a good function name?). I can understand this, since most of it was probably written by PhD students without the time and expertise to come up with a proper API; many of them without a CS background. Compare this to, e.g., the API of google/guava.

For example https://www.reddit.com/r/statistics/comments/8de54s/is_r_bet...

   Case in point, sklearn doesn't have a bootstrap crossvalidator despite the bootstrap being one of the most
   important statistical tools of the last two decades. In fact, they used to, but it was removed. 
   Weird right?
   ...
   > We don't remove the sklearn.cross_validation.Bootstrap class because few people are using it, 
   > but because too many people are using something that is non-standard (I made it up) and very very 
   > likely not what they expect if they just read its name. 
   > At best it is causing confusion when our users read the docstring and/or its source code. 
   > At worse it causes silent modeling errors in our users code base.
   ...
   Oh man, I thought of another great example. I bet you had no idea that 
   sklearn.linear_model.LogisticRegression is L2 penalized by default. 
   "But if that's the case, why didn't they make this explicit by calling it RidgeClassifier instead?" 
   Maybe because sklearn has a Ridge object already, but it exclusively performs regression? 
   Who knows (also... why L2 instead of L1? Yeesh). Anyway, if you want to just do unpenalized 
   logistic regression, you have to set the C argument to an arbitrarily high value, 
   which can cause problems. Is this discussed in the documentation? 
   Nope, not at all. Just on stackoverflow and github. 
   Is this opaque and unnecessarily convoluted for such a basic and crucial technique? Yup.
Or the following: https://www.reddit.com/r/haskell/comments/7brsuu/machine_lea...
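For concreteness, the large-C workaround the quote complains about looks like this (a sketch on synthetic data, just to show the shrinkage from the default L2 penalty; C is the inverse regularization strength, C = 1/lambda):

```python
# Sketch of the large-C workaround: a huge C approximates an
# unpenalized logistic regression fit. Data here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = (X[:, 0] + 0.5 * X[:, 1] + 0.5 * rng.randn(100) > 0).astype(int)

default = LogisticRegression(max_iter=1000).fit(X, y)  # implicit L2, C=1.0
near_unpenalized = LogisticRegression(C=1e9, max_iter=1000).fit(X, y)

# The default L2 penalty shrinks the coefficients toward zero.
print(np.linalg.norm(default.coef_), np.linalg.norm(near_unpenalized.coef_))
```

Note that on linearly separable data the near-unpenalized coefficients diverge, which is part of why "set C arbitrarily high" can cause problems.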


It's true that scikit-learn was started and originally written mostly by PhD students (most were in fact CS PhDs), and the API they designed is amazing! A lot of the python ML ecosystem has adopted it and uses it - fit, predict, transform. I don't think any language has something comparable.

4 years ago they removed a misleading class - and even at the time the documentation was clear about what it was doing. I'm not sure how this reveals some huge flaw about scikit-learn. At best it shows that the contributors can realize their mistakes and solve them, without even needing people to point it out? That's great!

Also, pointing to a bad implementation from 4 years ago, in a project which has since had way more funding for engineering time, and whose use has exploded, seems a bit misleading.


See the second link I posted. Even the three most basic functionalities are badly designed. If X is your input space and Y your output space, then fit should (after each call) return a function X -> Y, not modify some internal state.

Have you ever tried looking at pipeline cross-validation, where you have to pass the function a dict of parameters with underscore prefixes for each stage in the pipeline? Do that and you'll never call the API design amazing again.
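For anyone who hasn't seen it, the convention being criticized is that GridSearchCV addresses a step's parameter as "<step_name>__<param_name>" (step names and grid values below are arbitrary examples):

```python
# Double-underscore parameter addressing for a Pipeline in GridSearchCV.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, random_state=0)

pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression())])

param_grid = {
    "scale__with_mean": [True, False],  # parameter of the "scale" step
    "clf__C": [0.1, 1.0, 10.0],         # parameter of the "clf" step
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```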

There are examples of other bad design choices as well.

You are right, there is no alternative at the moment. Maybe Julia will do a better job; we will see.


> I can understand this, since most of it was probably written by PhD students; ... many of them without a CS background.

Of the top 4 contributors to scikit-learn, 3 have computer science graduate degrees and 1 has a degree in physics, so I'm not sure that the lack of a "CS background" is the root cause of the majority of the problems with the codebase (perceived or actual).

It may be more related to the nature of academic code in general, since most of it is meant as proof-of-concept rather than as general-use software (e.g., why Google's original code was refactored by Jeff Dean).


TBF, the major alternative to scikit-learn, doing it in R, would have you doing model <- glm(formula, family=binomial(link=logit)); predict(model, type="response"). R also doesn't make bootstrapping intuitive (there are a number of fairly easy to use packages, but nothing that's as canned as you'd hope it would be).

Granted I still use R for ML stuff, and I concede that silently regularizing regressions, and only being able to avoid doing so by hacking the penalty parameter, is terrifying.


scikit-learn does have its share of inconsistencies and strange omissions, but I have found it to be much, much easier to use than the alternatives.


Ease of use might be the criterion when you are a student. However, as soon as you start to depend on it for a living, you realise that scikit-learn has made enough serious mistakes to lose my trust, and I am forced to pay ~$15,000 for Matlab until some alternative is available.


Often I will do some prototyping with scikit-learn and then write my own implementation in numpy / scipy for something that goes into production. But I have used scikit-learn in production as well without issue. I have used MATLAB a bit and it is quite nice for figuring things out / prototyping. But the issue I have with it is that it's not typically intended for production software. So then you often need to reimplement your MATLAB code in whatever "platform" language you're using. That's why I've largely transitioned away from MATLAB.


Matlab does have code generation though. Dunno if the library you use supports it, but most code can be exported to C++ or CUDA code.


I was not aware of the links you shared pointing out the inconsistencies. I wonder how the authors themselves respond about these reddit posts (if at all?). Thank you for sharing!

Despite that, it does have some things that made it stand out for me across all other languages, such as the fit / transform / predict API spread across the library, and the usage of joblib as a back-end for speedup - this allows their models to be easily scalable on clusters with the use of Dask.

I still have confidence that their most used functions (e.g. RandomForest) and models are correctly implemented and provide great value in that regard.


The fact that LogisticRegression uses L2 is stated _very_ clearly in the documentation. Maybe someone who is not on mobile wants to check since when that has been the case?

I'm not sure if the backends in use actually allow for non-regularized fitting. I would assume so, but does anyone know?
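Not authoritative for every release, but at least in the versions I've checked, the default is quick to confirm programmatically:

```python
# The default penalty is visible via get_params(), no docs needed.
from sklearn.linear_model import LogisticRegression

print(LogisticRegression().get_params()["penalty"])
```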


What's wrong with predict_proba? The names seems very clear and self-explanatory to me.


I thought INRIA used OCaml everywhere and would have chosen Owl[1] (an OCaml library for numerical scientific computing and machine learning) as a project for this kind of foundation.

[1] https://github.com/owlbarn/owl


Inria is a public institution dedicated to research. There are many labs and people with separate goals. They are no more dedicated to OCaml than MIT is dedicated to Emacs.


An even better comparison would be with, say, the NSF. I am sure that this or that technology has been developed by NSF-funded researchers, but it would be absurd to assume that NSF-funded researchers in MIT use and promote the same things as NSF-funded researchers in Caltech because they're both affiliated with the NSF.


I personally only know one team in our building here that uses Coq (written in OCaml), and all the rest use (depending on their field) C/C++ (robotics/ROS), Matlab, or Python (TensorFlow/PyTorch/sklearn, ...).

Then again, it's not like I know the whole building, let alone the other parts of Inria ...


Inria is a public research institute. Researchers working there are responsible for so much more than OCaml; just citing programming languages off the top of my head, there are also Bigloo, Hop, Pharo, and Coq. There are probably others, and there are of course many more projects that are not programming languages.


Well, Coq is written in OCaml after all, so my point applies here.


For those curious, here are the languages used in the projects

bigloo - written in C and Scheme

hop - JS and Scheme

pharo - Smalltalk and C

Coq - written in OCaml

It appears that Inria does not exclusively use OCaml for its projects.


I'd love to have this same kind of library in Node.js.



