If you're looking for a slightly unconventional entry point, I recommend the seminal text 'Elements of Information Theory' by Cover and Thomas (it should be fine to skip chapters like Network Information Theory and the Gaussian channel), paired with David MacKay's 'Information Theory, Inference, and Learning Algorithms'. Both seem to be available online:
http://www.cs-114.org/wp-content/uploads/2015/01/Elements_of...
http://www.inference.org.uk/itprnn/book.pdf
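To give a flavor of what's in those two texts, here's a minimal sketch (my own illustration, not from either book) of Shannon entropy, the quantity both build everything on:

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution given as a list of probabilities."""
    return -sum(q * math.log2(q) for q in p if q > 0)

# A fair coin carries 1 bit of information per flip; a biased coin carries less,
# which is exactly why its outcomes are more compressible.
print(entropy([0.5, 0.5]))  # 1.0
print(entropy([0.9, 0.1]))  # ~0.469
```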
They cover the fundamentals of what optimal inference looks like and why current methods work: in a very abstract way via Kolmogorov complexity and its theorems, and more concretely in MacKay's text. A good theoretical companion, though a little more applied, is the 'Learning From Data' course (also available for free):
https://work.caltech.edu/telecourse.html
Excellent lecturer and material (to get a glimpse, see lecture 6: 'Theory of Generalization -- how an infinite model can learn from a finite sample').
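As a toy illustration of the concentration phenomenon that lecture builds on (my own sketch, not course material): the gap between a sample average and the true mean shrinks roughly like 1/sqrt(n), which is what lets a finite sample say something about an infinite hypothesis class once its growth function is under control:

```python
import random

def generalization_gap(n, trials=2000, p=0.6):
    """Average |sample mean - true mean| over many size-n samples of a Bernoulli(p) coin."""
    random.seed(0)  # fixed seed so the simulation is reproducible
    total = 0.0
    for _ in range(trials):
        mean = sum(random.random() < p for _ in range(n)) / n
        total += abs(mean - p)
    return total / trials

# The gap shrinks roughly like 1/sqrt(n), as Hoeffding-style bounds predict.
for n in (10, 100, 1000):
    print(n, round(generalization_gap(n), 3))
```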
Afterward I would move on to modern developments (deep learning, or whatever interests you); by then you'll be well equipped.