It works because well, its actually pretty primitive at its core. Whole learning process is actually pretty brutal. Doing millions of interations w/ random (and semi-random) adjustments.
Honestly once you understand maximum-likelihood estimation, empirical risk minimization, automatic differentiation, and stochastic gradient descent, it's not that much of a surprise it works.