1. The typical deep neural network tutorial introduces deep networks as compositions of nonlinearities and affine transforms.
2. In fact, a deep network with ReLU activations simplifies to a piecewise-affine map: a combination of affine transformations, each supported on its own region of the input space. But why would affine transformations be useful?
3. After recent discussions on Twitter, it occurred to me that the reason they work is that these affine pieces are, in effect, first-order Taylor approximations of a suitable analytic function.
4. What is really cool about this is that, by this logic, partial derivatives, i.e. Jacobians, are computational primitives for both inference and learning (the first sketch after this list makes this concrete).
5. I think this also provides insight into how deep networks approximate functions: they approximate the intrinsic geometry of a relation using piecewise-linear functions.
This works because a suitable polynomial approximation exists (for a continuous target on a compact set, by the Stone-Weierstrass theorem) and all polynomials are locally Lipschitz, so a fine enough piecewise-linear mesh can track them (see the second sketch below).
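A minimal JAX sketch of points 2-4 (the tiny network, its random weights, and the perturbation scale are all illustrative choices of mine, not anything from the discussion above): inside a single activation region, a ReLU network coincides exactly with its own first-order Taylor expansion, so the Jacobian at a point is literally the linear part of the affine piece the network computes there.

```python
import jax
import jax.numpy as jnp

def relu_mlp(params, x):
    """Tiny ReLU network: two hidden layers, 2-dimensional output."""
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = jax.nn.relu(W1 @ x + b1)
    h2 = jax.nn.relu(W2 @ h1 + b2)
    return W3 @ h2 + b3

# Illustrative random weights (3 -> 8 -> 8 -> 2).
k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
params = [
    (jax.random.normal(k1, (8, 3)), jnp.zeros(8)),
    (jax.random.normal(k2, (8, 8)), jnp.zeros(8)),
    (jax.random.normal(k3, (2, 8)), jnp.zeros(2)),
]

f = lambda x: relu_mlp(params, x)
x0 = jnp.array([0.3, -0.1, 0.7])
J = jax.jacobian(f)(x0)                 # linear part of the affine piece at x0

# First-order Taylor expansion of the network around x0.
taylor = lambda x: f(x0) + J @ (x - x0)

# A nearby point that almost certainly shares x0's activation pattern.
x = x0 + 1e-3 * jax.random.normal(jax.random.PRNGKey(1), (3,))
print(jnp.max(jnp.abs(f(x) - taylor(x))))   # ~0: the expansion is exact locally
```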
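And a second sketch for point 5 (the target function and knot count are again my own illustrative choices): a sum of ReLU "hinges" is exactly a 1-D piecewise-linear interpolant, so a one-hidden-layer ReLU net can trace the graph of an analytic target, with the error shrinking as the knots are refined.

```python
import jax.numpy as jnp
from jax.nn import relu

target = jnp.sin
knots = jnp.linspace(-jnp.pi, jnp.pi, 17)      # 16 linear pieces
vals = target(knots)

# Slope of each linear piece, and the change in slope at each interior knot.
slopes = jnp.diff(vals) / jnp.diff(knots)
slope_jumps = jnp.diff(slopes)                  # one hinge coefficient per interior knot

def relu_interpolant(x):
    """Piecewise-linear interpolant written as an affine term plus ReLU hinges."""
    y = vals[0] + slopes[0] * (x - knots[0])    # first affine piece
    # Each hinge bends the line at an interior knot by the slope jump there.
    hinges = slope_jumps * relu(x[:, None] - knots[1:-1][None, :])
    return y + hinges.sum(axis=-1)

x = jnp.linspace(-jnp.pi, jnp.pi, 1000)
err = jnp.max(jnp.abs(relu_interpolant(x) - target(x)))
print(err)   # small, and shrinks roughly quadratically as knots are added
```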