Hacker News

It seems reasonable that when you decompose the representation of the target function you are trying to learn into more layers, each layer can be "more gently" modulated to emulate the target function as observed through the training data. Gradient-based optimization might therefore more readily find good approximations to the target function. Once the structure has been captured by finely slicing it into more layers, it can be compressed down into a shallower representation.
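The compression step the comment describes resembles distillation: fit a shallow "student" to the outputs of an already-trained deeper model. A minimal NumPy sketch under my own assumptions (the "teacher" here is just an analytic stand-in for a trained deep network, and the training setup is illustrative, not anyone's published method):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "teacher": in practice this would be a deep network that was
# easier to train; here an analytic function plays that role.
def teacher(x):
    return np.sin(3 * x) * np.exp(-x ** 2)

# Inputs and the teacher's outputs, which serve as the student's targets.
X = rng.uniform(-2, 2, size=(256, 1))
y = teacher(X)

# Shallow "student": a single hidden layer of tanh units.
H = 32
lr = 0.05
W1 = rng.normal(0, 1.0, (1, H)); b1 = np.zeros(H)
W2 = rng.normal(0, 0.1, (H, 1)); b2 = np.zeros(1)

def student_mse():
    pred = np.tanh(X @ W1 + b1) @ W2 + b2
    return float(np.mean((pred - y) ** 2))

mse_before = student_mse()

# Full-batch gradient descent on the squared error between the student's
# predictions and the teacher's outputs (the distillation step).
for step in range(2000):
    A = np.tanh(X @ W1 + b1)
    err = (A @ W2 + b2) - y
    gW2 = A.T @ err / len(X); gb2 = err.mean(0)
    dA = (err @ W2.T) * (1 - A ** 2)   # backprop through tanh
    gW1 = X.T @ dA / len(X); gb1 = dA.mean(0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

mse_after = student_mse()
print(mse_before, mse_after)
```

The student never sees the original training labels, only the teacher's function values, which is the sense in which the deep model's structure is "compressed down" into a shallower one.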



