I had a really good time adapting Karpathy's blog post to Python myself, but it didn't give me sufficient understanding, so I continued with [1], then [2], and finally deciphering [3].
Thanks partycoder! Points taken; will make some changes.
Regarding numerical gradients: I named them that way to differentiate them from analytical gradients, which leverage formulas from calculus. The "numerical" ones are computed as (f(x+h)-f(x))/h every time.
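For concreteness, here's a minimal sketch of that forward-difference estimate (the function name and the test function are just illustrative), compared against the analytical derivative:

    import math

    def numerical_gradient(f, x, h=1e-5):
        # Forward-difference estimate of df/dx at x: (f(x+h) - f(x)) / h
        return (f(x + h) - f(x)) / h

    # Analytical derivative of sin is cos, so the two should roughly agree
    print(numerical_gradient(math.sin, 1.0))  # ~0.54030
    print(math.cos(1.0))                      # 0.54030...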
Sadly, this CamelCase style is too entrenched in academia and is very often present in books that use Python but are written by academics, not professional Python developers.
Yeah! Another way of seeing it is that the derivative is a small (infinitesimal) perturbation around a region of interest:
Any input that isn't maximal is some finite distance away from the maximum, so any small enough perturbation of it won't change the maximum (thus it has zero derivative). If we perturb the entry that is maximal, though, the maximum changes proportionally to it (with proportionality constant 1), so the derivative is one for the maximal entry [0] and zero for all the others.
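You can sanity-check that claim with the same forward-difference trick as above (a small NumPy sketch; the function names are my own):

    import numpy as np

    def max_grad(x):
        # Claimed derivative of max(x): 1 at the argmax, 0 elsewhere
        g = np.zeros_like(x, dtype=float)
        g[np.argmax(x)] = 1.0
        return g

    def numerical_max_grad(x, h=1e-6):
        # Perturb each entry in turn and see how much max(x) moves
        g = np.zeros_like(x, dtype=float)
        for i in range(len(x)):
            xp = x.copy()
            xp[i] += h
            g[i] = (xp.max() - x.max()) / h
        return g

    x = np.array([0.3, 2.0, -1.5])
    print(max_grad(x))            # [0. 1. 0.]
    print(numerical_max_grad(x))  # ~[0. 1. 0.]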
---
[0] If there is more than one maximal entry, then any convex combination of weights over the maximal entries is a valid "derivative-like" operator (i.e. a subgradient).
[1] https://mattmazur.com/2015/03/17/a-step-by-step-backpropagat...
[2] http://peterroelants.github.io/posts/neural_network_implemen...
[3] https://iamtrask.github.io/2015/07/12/basic-python-network/