
Very interesting article. I've often railed against putting your thumb on the scale (or, even worse, second-guessing) machine learning models by applying too many so-called "business rules," especially post hoc rules. If the model doesn't learn on its own what you consider to be the obvious structure of the data, then either you've chosen the completely wrong model and it won't be able to learn non-obvious truths either, OR your expectations were wrong and the "obvious" structure isn't real. Indeed, the model discovering, entirely on its own, the same structure as a human analyst is often the first evidence we see that the model works! In any case, it does you no good to force it to fit your preconceptions with post hoc adjustments. Either fix your preconceptions (if they are mistaken) or switch to a model that naturally agrees with you.

Sutton takes an even more extreme point of view, suggesting that most human feature engineering is similarly a waste of time. It's hard to argue with if you know the history: some of the best computer vision algorithms use exactly two mathematical operations, convolution (which itself only requires addition and multiplication) and the max(a,b) function. (This is true because both ReLU and MaxPool can be implemented with max(), and because a fully connected layer is a special case of a convolution.) A similar story played out in speech recognition, where human-designed features like phonemes and MFCCs have given way to end-to-end learning. Indeed, even general-purpose fully connected neural networks started to work much better once the biologically motivated sigmoid() and tanh() were replaced with the much simpler ReLU function, which is just ReLU(x) = max(x, 0). What really made the difference was leveraging GPUs, using more data, automating hyperparameter selection, and so on.
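
To make the "just convolution and max" point concrete, here's a minimal NumPy sketch (function names, shapes, and the toy input are my own, not from the article) showing that both ReLU and max-pooling reduce to the max() operation:

    import numpy as np

    def relu(x):
        # ReLU(x) = max(x, 0), applied elementwise
        return np.maximum(x, 0)

    def max_pool_1d(x, window=2):
        # max-pooling is just max() taken over each window of the input
        return np.array([x[i:i + window].max()
                         for i in range(0, len(x) - window + 1, window)])

    x = np.array([-1.0, 3.0, 0.5, -2.0])
    print(relu(x))         # [0.  3.  0.5 0. ]
    print(max_pool_1d(x))  # [3.  0.5]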

I'm not sure if there's really a lesson there, or if this trend will hold indefinitely, and I'm not sure why the lesson would be "bitter" even if it holds. Certainly opinions are mixed. On the one hand, many researchers such as Andrew Ng are big proponents of end-to-end learning; on the other hand, no one can currently conceive of training a self-driving car that way. But avoiding domain-specific, human-engineered features may be a viable guiding philosophy for making big, across-the-board advances in machine learning.




> Sutton takes an even more extreme point of view, suggesting that most human feature engineering is similarly a waste of time.

In fact, wasn't there an article posted here recently saying they'd had good results feeding learned features into traditional, non-NN machine learning models?
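
I don't remember which article that was, but the general recipe is simple: use a pretrained network purely as a feature extractor and hand its embeddings to a classical model. A rough sketch of that idea (assuming torchvision and scikit-learn; the dummy data and model choices are mine, not from whatever article that was):

    import torch
    import torchvision.models as models
    from sklearn.linear_model import LogisticRegression

    # Pretrained CNN with its final classifier removed, so it outputs 512-d embeddings
    backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    backbone.fc = torch.nn.Identity()
    backbone.eval()

    def embed(images):  # images: float tensor of shape (N, 3, 224, 224)
        with torch.no_grad():
            return backbone(images).numpy()

    # Dummy data just to keep the sketch self-contained; real images/labels go here
    X = torch.randn(8, 3, 224, 224)
    y = [0, 1] * 4

    # The learned features feed a plain, non-NN classifier
    clf = LogisticRegression(max_iter=1000).fit(embed(X), y)
    print(clf.predict(embed(X)))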





