A common type of example involves relatively small or uninformative datasets. Say you flip a coin a few times and only get heads. Your maximum likelihood (frequentist) estimate is that the coin will always land heads. In a Bayesian setting, if you have a (say, uniform) prior on the probability that the coin lands heads, your maximum a posteriori estimate will still leave a non-zero probability of tails, though that probability will keep shrinking if you continue to see only heads.
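For concreteness, a rough sketch with made-up numbers (5 flips, a Beta(1, 1), i.e. uniform, prior, and using the posterior mean as the Bayesian point estimate):

    # Toy numbers: 5 flips, all heads, Beta(1, 1) prior on P(heads).
    n_heads, n_tails = 5, 0
    n = n_heads + n_tails

    mle = n_heads / n                                    # 1.0: "the coin always lands heads"

    alpha, beta = 1, 1                                   # uniform Beta prior
    post_mean = (alpha + n_heads) / (alpha + beta + n)   # 6/7 ~ 0.857, so P(tails) stays non-zero

    print(mle, post_mean)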
The above example is contrived, but makes more sense in the case of language modelling. Since a bag-of-words vector, containing, say, counts of words seen in a document, is typically sparse (most documents only contain a small portion of the full vocabulary), a frequentist estimate of word probabilities will say that certain words can never occur, simply because it has never seen them. The Bayesian estimate will still assign some non-zero chance of seeing those words.
Practically speaking, this leads to the idea of "smoothing" in tf-idf (term frequency-inverse document frequency) vectors, by adding 1 to document frequencies. You don't need Bayesian statistics to do this, but maybe you never would have thought of it otherwise!
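As a rough sketch of what that looks like on the idf side (toy document frequencies; I believe scikit-learn's TfidfVectorizer exposes this as its smooth_idf option):

    import numpy as np

    n_docs = 4
    df = np.array([4, 2, 1, 0])   # toy document frequencies; the last term was never seen

    # Unsmoothed idf would divide by zero for the unseen term:
    # idf = np.log(n_docs / df)
    # Adding 1 to the document frequencies (and to the document count) avoids that:
    idf_smoothed = np.log((1 + n_docs) / (1 + df))

    print(idf_smoothed)

(The exact idf formula varies between libraries; the point is just the +1.)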
>Say you flip a coin a few times and only get heads. Your maximum likelihood (frequentist) estimate is that the coin will always land heads. In a Bayesian setting, if you have a (say, uniform) prior on the probability that the coin lands heads, your maximum a posteriori estimate will still leave a non-zero probability of tails, though that probability will keep shrinking if you continue to see only heads.
Not quite. If you have a uniform prior, there will be no difference between MAP and MLE.
>From the vantage point of Bayesian inference, MLE is a special case of maximum a posteriori estimation (MAP) that assumes a uniform prior distribution of the parameters.
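To see why: a flat prior just rescales the likelihood, so the argmax can't move. A quick grid check with toy counts:

    import numpy as np

    theta = np.linspace(0.01, 0.99, 99)       # candidate values of P(heads)
    n_heads, n_tails = 5, 0                   # toy data: all heads
    likelihood = theta**n_heads * (1 - theta)**n_tails

    prior = np.ones_like(theta)               # uniform prior: constant over the grid
    posterior = likelihood * prior            # unnormalised; normalising doesn't move the argmax

    print(theta[np.argmax(likelihood)], theta[np.argmax(posterior)])   # same theta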
Help me see how this requires a _thorough_ study of Bayesian methods.
What you describe seems to be (additive/Laplace) smoothing [1], a fairly basic concept. And I don't see how this is specifically Bayesian.
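It really is only a couple of lines (a sketch; the function name, alpha, and toy counts below are all made up):

    import numpy as np

    def additive_smoothing(counts, alpha=1.0):
        # p(w) = (count(w) + alpha) / (N + alpha * V); alpha = 1 is Laplace smoothing
        counts = np.asarray(counts, dtype=float)
        return (counts + alpha) / (counts.sum() + alpha * len(counts))

    counts = [10, 5, 1, 0]                        # toy counts; the last word was never seen
    print(additive_smoothing(counts, alpha=0.0))  # raw MLE: the unseen word gets probability 0
    print(additive_smoothing(counts, alpha=1.0))  # smoothed: every word gets probability > 0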
Then, one common critique of "Bayesian stats work well with small samples" is that with small data (weak signal), you get your prior back, so you haven't learned anything, and the only thing you're left with is your "bias", erm, prior.
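Concretely, with a Beta-Bernoulli coin and toy numbers: a couple of observations barely move an opinionated prior, so the "estimate" is mostly the prior.

    alpha, beta = 10, 10      # a fairly opinionated prior centred on 0.5
    heads, tails = 2, 0       # tiny sample

    prior_mean = alpha / (alpha + beta)                            # 0.5
    post_mean = (alpha + heads) / (alpha + beta + heads + tails)   # 12/22 ~ 0.545

    print(prior_mean, post_mean)   # with this little data, you mostly get the prior back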
I am still keen to learn why one needs a thorough study of Bayesian methods for practical machine learning tasks (meaning, on top of a bit of statistics or data science curriculum plus on-the-job experience).