
I'd say the benefits come down to how much information the individual features carry. When you have individually uninformative features like raw pixel colors or word identities, there's nothing for traditional methods to work with: you have to do feature engineering and pruning before decision trees or linear classifiers have a chance.
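To make that concrete, here's a minimal sketch (scikit-learn is my choice here, not something from the original point): raw word identities mean nothing to a linear classifier until you engineer them into something like bag-of-words counts.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy corpus and labels (pets vs. finance), purely illustrative.
    docs = ["the cat sat", "the dog barked", "stocks fell sharply", "markets rallied"]
    labels = [0, 0, 1, 1]

    # The feature-engineering step (CountVectorizer) is what gives the
    # logistic regression something to work with; on bare word identities
    # it has nothing.
    clf = make_pipeline(CountVectorizer(), LogisticRegression())
    clf.fit(docs, labels)
    print(clf.predict(["the cat barked"]))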

Most of the wins under the "deep learning" umbrella come from extracting meaning out of homogeneous, low-level features like "the pixel at x-2,y+1 has red=123" or "the word at n+1 is 'king'". That's why latent-variable embeddings like word2vec came out of the DL world even though they're not actually deep.
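word2vec itself is just a shallow model that turns word identities into dense vectors. A quick sketch using gensim (my choice of library, and a toy corpus; a real run would use a large text dump):

    from gensim.models import Word2Vec

    sentences = [["the", "king", "rules", "the", "land"],
                 ["the", "queen", "rules", "the", "land"],
                 ["dogs", "chase", "cats"]]

    # One embedding layer plus a softmax-style output -- shallow, not deep.
    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

    vec = model.wv["king"]                     # dense latent representation of a word id
    print(model.wv.most_similar("king", topn=2))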

When you do have highly informative features to add to a deep network, it's often better to feed them through a separate linear/logistic path alongside the deep part, as in the TensorFlow wide & deep tutorials.
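Roughly like this, in Keras functional-API form (the tutorials use their own estimator-style API; this is just my sketch of the same idea, with n_wide, seq_len and vocab_size as placeholder sizes):

    import tensorflow as tf

    n_wide, seq_len, vocab_size = 10, 20, 5000   # placeholder dimensions

    # "Wide" input: a handful of highly informative, hand-crafted features.
    wide_in = tf.keras.Input(shape=(n_wide,), name="wide_features")
    # "Deep" input: homogeneous low-level features, e.g. token ids.
    deep_in = tf.keras.Input(shape=(seq_len,), dtype="int32", name="token_ids")

    emb = tf.keras.layers.Embedding(vocab_size, 64)(deep_in)
    deep = tf.keras.layers.GlobalAveragePooling1D()(emb)
    deep = tf.keras.layers.Dense(64, activation="relu")(deep)

    # The wide features skip the deep stack and go straight into the final
    # logistic output, so they effectively get their own linear model.
    both = tf.keras.layers.concatenate([wide_in, deep])
    out = tf.keras.layers.Dense(1, activation="sigmoid")(both)

    model = tf.keras.Model([wide_in, deep_in], out)
    model.compile(optimizer="adam", loss="binary_crossentropy")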



