A common type of example involves relatively small or uninformative datasets. Say you flip a coin a few times and only get heads. Your maximum likelihood (frequentist) estimate is that the coin will always land heads. In a Bayesian setting, if you have a (say, uniform) prior on the probability that the coin lands heads, your maximum a posteriori estimate will still leave a non-zero probability of tails, though that probability will keep shrinking if you continue to see only heads.
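For concreteness, a rough sketch with made-up numbers (5 flips, a Beta(1, 1), i.e. uniform, prior, and using the posterior mean as the Bayesian point estimate):

    # Toy numbers: 5 flips, all heads, Beta(1, 1) prior on P(heads).
    n_heads, n_tails = 5, 0
    n = n_heads + n_tails

    mle = n_heads / n                                    # 1.0: "the coin always lands heads"

    alpha, beta = 1, 1                                   # uniform Beta prior
    post_mean = (alpha + n_heads) / (alpha + beta + n)   # 6/7 ~ 0.857, so P(tails) stays non-zero

    print(mle, post_mean)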
The above example is contrived, but makes more sense in the case of language modelling. Since a bag-of-words vector, containing, say, counts of words seen in a document, is typically sparse (most documents only contain a small portion of the full vocabulary), a frequentist estimate of word probabilities will say that certain words can never occur, simply because it has never seen them. The Bayesian estimate will still assign some non-zero chance of seeing those words.
Practically speaking, this leads to the idea of "smoothing" in tf-idf (term frequency-inverse document frequency) vectors, by adding 1 to document frequencies. You don't need Bayesian statistics to do this, but maybe you never would have thought of it otherwise!
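As a rough sketch of what that looks like on the idf side (toy document frequencies; I believe scikit-learn's TfidfVectorizer exposes this as its smooth_idf option):

    import numpy as np

    n_docs = 4
    df = np.array([4, 2, 1, 0])   # toy document frequencies; the last term was never seen

    # Unsmoothed idf would divide by zero for the unseen term:
    # idf = np.log(n_docs / df)
    # Adding 1 to the document frequencies (and to the document count) avoids that:
    idf_smoothed = np.log((1 + n_docs) / (1 + df))

    print(idf_smoothed)

(The exact idf formula varies between libraries; the point is just the +1.)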
>Say you flip a coin a few times and only get heads. Your maximum likelihood (frequentist) estimate is that the coin will always land heads. In a Bayesian setting, if you have a (say, uniform) prior on the probability that the coin lands heads, your maximum a posteriori estimate will still leave a non-zero probability of tails, though that probability will keep shrinking if you continue to see only heads.
Not quite. If you have a uniform prior, there will be no difference between MAP and MLE.
>From the vantage point of Bayesian inference, MLE is a special case of maximum a posteriori estimation (MAP) that assumes a uniform prior distribution of the parameters.
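To see why: a flat prior just rescales the likelihood, so the argmax can't move. A quick grid check with toy counts:

    import numpy as np

    theta = np.linspace(0.01, 0.99, 99)       # candidate values of P(heads)
    n_heads, n_tails = 5, 0                   # toy data: all heads
    likelihood = theta**n_heads * (1 - theta)**n_tails

    prior = np.ones_like(theta)               # uniform prior: constant over the grid
    posterior = likelihood * prior            # unnormalised; normalising doesn't move the argmax

    print(theta[np.argmax(likelihood)], theta[np.argmax(posterior)])   # same theta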
Help me see how this requires a _thorough_ study of Bayesian methods.
What you describe seems to be (additive/Laplace) smoothing [1], a fairly basic concept. And I don't see how this is specifically Bayesian.
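It really is only a couple of lines (a sketch; the function name, alpha, and toy counts below are all made up):

    import numpy as np

    def additive_smoothing(counts, alpha=1.0):
        # p(w) = (count(w) + alpha) / (N + alpha * V); alpha = 1 is Laplace smoothing
        counts = np.asarray(counts, dtype=float)
        return (counts + alpha) / (counts.sum() + alpha * len(counts))

    counts = [10, 5, 1, 0]                        # toy counts; the last word was never seen
    print(additive_smoothing(counts, alpha=0.0))  # raw MLE: the unseen word gets probability 0
    print(additive_smoothing(counts, alpha=1.0))  # smoothed: every word gets probability > 0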
Then, one common critique of "Bayesian stats work well with small samples" is that with small data (weak signal), you get your prior back, so you haven't learned anything, and the only thing you're left with is your "bias", erm, prior.
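Concretely, with a Beta-Bernoulli coin and toy numbers: a couple of observations barely move an opinionated prior, so the "estimate" is mostly the prior.

    alpha, beta = 10, 10      # a fairly opinionated prior centred on 0.5
    heads, tails = 2, 0       # tiny sample

    prior_mean = alpha / (alpha + beta)                            # 0.5
    post_mean = (alpha + heads) / (alpha + beta + heads + tails)   # 12/22 ~ 0.545

    print(prior_mean, post_mean)   # with this little data, you mostly get the prior back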
I am still keen to learn why one needs a thorough study of Bayesian methods for practical machine learning tasks (meaning, on top of a bit of statistics or data science curriculum plus on-the-job experience).