The author's beef seems to be "people use similar terminology across similar libraries/frameworks/platforms, but they don't behave identically and represent subtly different things." Maybe I don't do enough data science-ing, but isn't this super common? If I write a parser, I'd probably call the main function "parse"; if I write a database connector, I'd probably call the function that does the connecting "connect". I wouldn't expect those to work identically or mean the exact same abstraction across libraries. I personally love when things are named similarly so I can grok the meaning in a new codebase more quickly, even if things don't transfer identically.
>I personally love when things are named similarly so I can grok the meaning in a new codebase more quickly, even if things don't transfer identically.
I think it's a sign of good design when things are named similarly, since it makes them easier to use. And I'm not surprised when something acts differently from a similarly named thing elsewhere, because that surprise implies an expectation that shouldn't be there, especially when there is no standard both need to comply with, which is the case in the current ML/DS context.
There are surprises even when people implement the same spec or whitepaper, and people accept those. I'm not saying that's a good thing. But expecting identical behavior from standardless things just because the entities' names are similar is too much to ask. It would be great if they matched, for interchangeability's sake, but it's okay if they don't. It means more work (bridges, adapters, and whatnot), but different people took their own shot at something.
The field is not mature enough yet, but it is moving in that direction, with attempts at standard formats, interfaces, and protocols for both data (protobuf, dataframes, events) and models (PMML, PFA, ONNX, etc.).
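(To make the interchange-format idea concrete, here's a minimal sketch of exporting a trained PyTorch model to ONNX; the toy architecture and file name are illustrative, not from any of the projects mentioned above.)

```python
import torch
import torch.nn as nn

# A trivial model standing in for whatever was actually trained.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
model.eval()

# ONNX export traces the model with a dummy input of the right shape.
dummy_input = torch.randn(1, 4)
torch.onnx.export(model, dummy_input, "model.onnx")

# model.onnx can now be loaded by any ONNX-compatible runtime
# (onnxruntime, etc.), independent of the framework that trained it.
```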
The author doesn't really know what a cargo cult is. It means doing things similar to another group and expecting an unrealistically positive result. Not only do you have to prove that other ML libraries were imitating sklearn, but also that copying it wasn't useful. As another commenter said, "fit" and "predict" are simply common function names that easily convey meaning. It certainly has the positive effect of letting me know what the functions do. If that's cargo culting, then so is any program that has a "main" or "init" function with different arguments. Also, to refute their last point: PyTorch lacks a fit function because it's too low level for one, not because it isn't trying to cargo-cult.
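(To make the "too low level" point concrete: sklearn hides the whole optimization procedure behind a single fit call, while PyTorch expects you to write the training loop yourself, so there's no obvious place for a generic fit to live. A rough sketch, with illustrative model/data/hyperparameters:)

```python
import torch
import torch.nn as nn

# sklearn-style: the entire training procedure is one opaque call.
#   est = LogisticRegression()
#   est.fit(X, y)
#   preds = est.predict(X_new)

# PyTorch-style: the user owns the training loop.
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

X = torch.randn(32, 4)            # illustrative data
y = torch.randint(0, 2, (32,))    # illustrative labels

for epoch in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()
```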
sklearn being first does not afford it a monopoly on ML object design. Nor should other libraries necessarily seek to emulate what came first (or support pickling...)
Huh, didn't think I'd see the writer of The Northern Caves at the top of HN.
Back to this post: I’ve written some nearest neighbor code and definitely felt some pressure to make the API sklearn compatible. But I don’t think it’s as bad as the post claims in practice.
Highly recommend checking out the poster's other work. It's a lot of fun.
To be clear, nostalgebraist has been doing ML professionally for years, so he's working with these APIs all day, every day. If he has a complaint about them, I expect it to be well-grounded in months-to-years of full-time experience, not just idle speculation that doesn't pan out in practice.
> Sagemaker "Estimators" do not have anything to do with fitting or predicting anything. The SDK is not supplying you with any machine learning code here.
The author is confusing the Sagemaker service with the MXNet deep learning library (which Sagemaker provides access to).
Basically everything they wrote in that section is flat-out incorrect.
You are confusing Sagemaker (the AWS product) with the Sagemaker Python SDK.
The Sagemaker Python SDK provides a usability wrapper around API calls (etc.) that trigger Sagemaker (the product) to do things. In the usage example I screenshotted, these include running a user-provided script "train.py" in a Docker container with MXNet installed.
All the ML happens inside the Docker container, and this process does not use any code from the SDK. The SDK contains non-ML code intended to help the user kick off such jobs. The SDK's Estimator class is purely a helper for using Docker and the Sagemaker API, and is only useful once all the ML code has already been written.
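(For readers who haven't used it, the usage in question looks roughly like the sketch below. Exact parameter names vary by SDK version, older versions used train_instance_count/train_instance_type, and the role ARN and S3 path are placeholders.)

```python
from sagemaker.mxnet import MXNet

# The SDK "Estimator" is job-launching machinery, not a model: it points
# at user-written training code and describes the infrastructure to run
# it on. None of the ML lives in the SDK itself.
estimator = MXNet(
    entry_point="train.py",   # the user-provided script with all the ML code
    role="arn:aws:iam::123456789012:role/SagemakerRole",  # placeholder
    instance_count=1,
    instance_type="ml.p2.xlarge",
    framework_version="1.6.0",
    py_version="py3",
)

# fit() here starts a Sagemaker training job that runs train.py inside
# an MXNet Docker container; it does not itself fit anything.
estimator.fit({"train": "s3://my-bucket/train-data"})  # placeholder S3 path
```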
Yeah, given that the article started off establishing how an Estimator was basically an interface with simple rules about supporting "fit" and "predict" and how it could contain anything or do anything, I thought the argument laid out here would be about how these derivative implementations broke these rules.
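(For reference, that interface amounts to something like this minimal sketch; the class and what it computes are illustrative, not from the article.)

```python
import numpy as np

class MeanEstimator:
    """A 'be anything, do anything' estimator: the contract is just fit/predict."""

    def fit(self, X, y):
        # fit() updates internal state; how it does so is entirely up to
        # the implementation.
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        # predict() uses the fitted state to produce outputs.
        return np.full(len(X), self.mean_)
```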
The rest of the article seems to have lost the plot, though, somehow finding fault with various derivative or concrete implementations of this interface for A) being inextensible implementations rather than transitive interfaces themselves, as though "be anything, do anything" no longer applied; or B) not being perfectly aligned with sklearn estimator details that the author never identified as essential, like not following some sklearn-specific parameter naming rule or not being serializable via pickle. (Seriously, pickle support is often not appropriate for production; why should it be a required pattern? It's not even a requirement of the interface unless you read between the lines, as the author implies you must in order to be at parity.) As other commenters here have outlined, the article assumes that sklearn's contract is absolute, as though other libraries couldn't reinterpret the core principles.
The arguments against TensorFlow's and Sagemaker's interfaces especially stretch quite a bit: what exactly is so offensive about these implementations, given the very rules the author establishes in this article? All "fit" is supposed to do is update internal state, as the author asserts, so what precludes implementations of this interface from using cloud-based compute resources to achieve that end? What about the fact that this command deploys a Docker container to the cloud makes "fit" a lie? And honestly, what does the author have in mind for an estimator implementation that uses cloud resources like GCP TPUs or AWS EC2 yet is somehow more correct or pure than these?
More than anything, the author's dismissal of the value that GCP's and AWS's Estimator implementations bring in eliminating infrastructure management (equating it to "simply" writing Dockerfiles or running Docker containers on the cloud, as if there's no setup involved) implies that they're thoroughly disconnected from the realities of ML devops on the cloud. They're free to run their purist single-core sklearn estimators on their laptops as much as they like, though (unless Dask somehow gets a pass from these arbitrary rules about how estimators can and cannot be used).
> implies that they're thoroughly disconnected from the realities of ML devops on the cloud
FWIW, I deal with "the realities of ML devops on the cloud" nearly every day -- both at work and in hobby projects.
The comments here make me think I failed to get my main intent across in the post. I actually agree with many of the concrete claims you make here, but they have little to do with the arguments I saw myself as making.
The miscommunication was apparently so complete that if I tried to dig into specific points you make here, I'd end up effectively re-writing my post all over again. It was kind of exhausting to write the first time, so I'd prefer not to.
That said, as an example of something I didn't mean to say: I definitely don't think the sklearn API ought to be standard across ML, certainly not the pickling part! It is a well-designed API that's just right for its own limited context, and ought to inspire others to develop other APIs that are similarly well-designed for their contexts.
I only added the comment about Keras and pickle because I had quoted a tweet that literally said Keras was sklearn API compatible, and felt sort of obligated to point out that this was strictly false. Insofar as this relates to the larger point at all, it does so only as evidence that people like Chollet don't have a deep understanding of the thing they say they're inspired by.
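(For anyone curious about the incompatibility being referenced, a rough sketch follows. sklearn estimators are designed to round-trip through pickle; Keras models historically raised errors when pickled, which is why the library's own save/load machinery is the recommended path. Exact behavior depends on the Keras/TensorFlow version.)

```python
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression

# sklearn estimators round-trip through pickle by design.
X, y = np.random.randn(20, 3), np.random.randint(0, 2, 20)
est = LogisticRegression().fit(X, y)
restored = pickle.loads(pickle.dumps(est))  # works

# Keras models, at least historically, did not support this; the
# supported path is the library's own serialization:
#   model.save("model.h5")
#   model = keras.models.load_model("model.h5")
```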