Max pooling tests if a feature occurs anywhere in a certain area, rather than being sensitive to the exact location.
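For concreteness, here's a tiny numpy sketch (the 2x2 window and the toy 4x4 feature map are just illustrative choices, not anything from a real network): the pooled output is identical whether the feature fires in one corner of a window or another.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a 2D array whose sides are even."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Toy "feature detector" outputs: zero everywhere except where the feature fired.
a = np.zeros((4, 4))
a[0, 0] = 1.0   # feature fires in the top-left corner of the first 2x2 window

b = np.zeros((4, 4))
b[1, 1] = 1.0   # same feature, shifted to a different spot in the same window

# Both pooled maps are identical: pooling reports that the feature occurred
# somewhere in the window, not where exactly.
print(np.array_equal(max_pool_2x2(a), max_pool_2x2(b)))   # True
```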
ReLUs fit combinations of piecewise-linear functions, whereas sigmoids are more nonlinear and can be harder to optimize. Sigmoids were originally continuous approximations of binary threshold functions.
All these things can approximate each other. Neurons can approximate the max function, and ReLUs can approximate sigmoids. So there really isn't much to fret over.
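A quick numerical sketch of both points, the mutual approximation and the "smoothed threshold" origin (the ramp width and steepness constants are arbitrary picks on my part):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# max expressed exactly with a ReLU: max(a, b) = a + relu(b - a)
a, b = 0.3, 1.7
print(a + relu(b - a), max(a, b))         # 1.7 1.7

# a piecewise-linear ramp built from two ReLUs tracks a sigmoid's shape
def hard_sigmoid(x):
    return relu(x + 0.5) - relu(x - 0.5)  # 0 below -0.5, 1 above 0.5, linear in between

xs = np.linspace(-3.0, 3.0, 7)
print(hard_sigmoid(xs))                   # 0 0 0 0.5 1 1 1
print(np.round(sigmoid(4.0 * xs), 2))     # roughly the same S-shape

# and a steep sigmoid recovers the binary threshold it originally smoothed out
print(np.round(sigmoid(50.0 * xs), 2))    # ~0 for x < 0, 0.5 at 0, ~1 for x > 0
```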
It's like asking for a theory of which programming language is better. In practice they will have different advantages in different domains, but they are all Turing complete.
> It's like asking for a theory of which programming language is better.
There's nothing wrong with ignoring programming-language theory and just deciding on one, seat-of-the-pants style. But that's because programming as it exists now is a static "art form" with only marginal progress expected.
However, assuming deep learning currently works inexplicably well and one aims to explain that success scientifically, one would want an explanation that guides one's approach to extending the process.
I've done a bit of applied math, where knowing which kind of function to pull out of one's toolbox for which situation was the really-smart-people's purview, a fairly well-guarded piece of folk knowledge, actually. I'm used to the "little bit of this, little bit of that" kind of explanation for which functions to use when and why. If one weighs the trade-offs long enough, I assume one can intuitively figure out what to do.
But if we're aiming to advance fundamentally beyond the state of the art, we would aim to quantify these advantages and disadvantages, to automate one more layer. So here we really should know and have a "real" theory.
Do you think no one is trying? Should researchers just ignore all results until the underlying theory is found? What if we don't find it for another 50 years? I find it incredibly hard to be critical in this situation.
> Max pooling tests if a feature occurs anywhere in a certain area, rather than being sensitive to the exact location.
From Geoff Hinton's AMA on Reddit: "The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster."
Hinton doesn't like that pooling loses track of exactly where features are located and just tests whether a feature occurs somewhere in an area.
The basic effect of pooling is to decrease the resolution, so the representation is more tractable to operate on. Without pooling you are stuck with a huge resolution at each layer.
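A quick sketch of that downsampling effect (the feature-map size here is an arbitrary choice): 2x2 pooling with stride 2 halves each spatial dimension, leaving a quarter as many positions for the next layer to process.

```python
import numpy as np

# An arbitrary example feature map: (batch, channels, height, width).
feature_map = np.random.rand(1, 64, 224, 224)

# 2x2 max pooling with stride 2: each spatial dimension is halved, so the next
# layer sees a quarter as many spatial positions.
n, c, h, w = feature_map.shape
pooled = feature_map.reshape(n, c, h // 2, 2, w // 2, 2).max(axis=(3, 5))

print(feature_map.shape)   # (1, 64, 224, 224)
print(pooled.shape)        # (1, 64, 112, 112)
```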