
> By Bayesian methods I mean methods which also describe everything with statistical distributions including hyperparameters, and which solve for a distribution (or some hard-to-get feature of it like its mean or confidence intervals), instead of just a maximum probability estimate where you can make shortcuts.

Ah, I see, yeah. If you’re talking about hyperpriors and marginalization and MCMC, I totally agree; those are really valuable in science, and that’s the “full Bayesian solution.” But there are many caveats, one big one being that it’s very easy to be overly confident in the results of some unwieldy hierarchical model and ignore the fact that the priors (or hyperpriors) are oftentimes fudge factors, because it’s so damn hard to articulate your prior belief as a bona fide distribution. A lot of times you run into a “garbage in, garbage out” problem, and it might not be obvious right away, since we’re probably busy patting ourselves on the back and popping champagne because we got our MCMC chains to converge.
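
To make the “full Bayesian solution” concrete, here’s a rough sketch of a hierarchical regression with a hyperprior on the prior scale. PyMC and every number below are my own choices for illustration, not anything from this thread:

    import numpy as np
    import pymc as pm

    # toy data: y = X w + noise
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=100)

    with pm.Model():
        tau = pm.HalfNormal("tau", sigma=1.0)           # hyperprior on the prior scale (a "fudge factor")
        w = pm.Normal("w", mu=0.0, sigma=tau, shape=5)  # Gaussian prior on the weights
        noise = pm.HalfNormal("noise", sigma=1.0)       # observation noise scale
        pm.Normal("y", mu=pm.math.dot(X, w), sigma=noise, observed=y)
        idata = pm.sample(1000, tune=1000)              # MCMC over w, tau, noise jointly

And note that the chains converging says nothing about whether those HalfNormal choices were a faithful statement of prior belief, which is exactly the garbage-in, garbage-out risk.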

> For example, ridge regression, where you put a Gaussian prior on the coefficients and then guess at their variance as a hyperparameter for a quadratic penalty (with cross-validation or something), is MAP but not actually "Bayesian". Bayesians also put a prior distribution on the hyperparameter itself, making life extra difficult.

Yeah, right; nothing wrong with cross-validation in a lot of cases. We could go with the “full Bayesian solution” and marginalize the hyperparameter out under some hyperprior, but that might well not give you anything more than cross-validation does (and might give you something worse if you have a bad hyperprior).
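
For the record, the MAP reading of ridge is tiny; a numpy sketch where sigma, tau, and the data are all just placeholders I made up:

    import numpy as np

    # Model: y ~ N(X w, sigma^2 I), prior: w ~ N(0, tau^2 I).
    # Maximizing the posterior over w is penalized least squares with
    # lam = sigma**2 / tau**2, where tau is exactly the guessed hyperparameter.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=100)

    sigma, tau = 0.5, 1.0
    lam = sigma**2 / tau**2
    w_map = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

Cross-validating lam directly is just tuning that same ratio without ever writing the prior down.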

> Even in deep learning it's simple to add a penalty term on the weights to the loss, so you could call it a MAP estimate. That's implemented everywhere.

Agreed; my only argument is that it’s helpful to know where all of these penalty terms come from in a general sense rather than treating each one as some ad hoc solution.
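
The “where it comes from” part is just that an L2 penalty is the negative log of a Gaussian prior added to the negative log-likelihood; a sketch, not tied to any particular framework:

    import numpy as np

    def penalized_loss(w, nll, lam):
        """Negative log posterior up to a constant: nll(w) is the data term
        (negative log-likelihood) and lam * ||w||^2 is the -log of an implied
        Gaussian prior with variance 1 / (2 * lam) on each weight."""
        return nll(w) + lam * np.sum(w**2)

Minimizing that is the MAP estimate under that implied prior, which is the sense in which “weight decay” already has a Bayesian reading.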

> For that matter you can claim any neural net classifier with a sigmoid or softmax at the final layer is doing logistic regression, with the earlier layers just adding a representation learning stage to it.

Agreed, and that’s exactly the sort of valuable insight that’s helpful to have (though maybe this one isn’t really an insight from Bayesian stats specifically).
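
Restating that point in code (nothing here beyond what you said): freeze the body of the network, and the head is literally multinomial logistic regression on the learned features:

    import numpy as np

    def softmax(logits):
        z = logits - logits.max(axis=1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    # features = body(x): whatever representation the earlier layers learned.
    # This head is exactly multinomial logistic regression on those features.
    def head_probs(features, W, b):
        return softmax(features @ W + b)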

> As I said, cheers to statistics; it's just that stopping at MAP estimates will cover almost everything. If one goes looking for a Bayesian methods text or review article, it will go well beyond that.

Sure. If you know MAP and what it means, then you know what a posterior means, you know that every model you fit and every loss function you use comes from one, and you know that if you care about more than the mode or mean of the posterior there are lots of tools for that. I’ve never really needed MCMC in the work I do, and I don’t think it would bring much additional insight into what we’re doing, but sometimes we do care about uncertainty, and a Laplace approximation or variational inference can do the job just fine.
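
Laplace in particular is about as cheap as it gets: find the MAP, then use the curvature there as a Gaussian covariance. A rough sketch for logistic regression with an L2 penalty (all the specifics are mine, purely illustrative):

    import numpy as np
    from scipy.optimize import minimize

    def neg_log_posterior(w, X, y, lam):
        # logistic regression negative log-likelihood plus Gaussian prior term
        z = X @ w
        return np.sum(np.logaddexp(0.0, z) - y * z) + lam * np.sum(w**2)

    def laplace_fit(X, y, lam=1.0):
        res = minimize(neg_log_posterior, np.zeros(X.shape[1]), args=(X, y, lam))
        w_map = res.x
        p = 1.0 / (1.0 + np.exp(-(X @ w_map)))
        # Hessian of the negative log posterior at the MAP
        H = X.T @ (X * (p * (1 - p))[:, None]) + 2 * lam * np.eye(X.shape[1])
        cov = np.linalg.inv(H)          # Gaussian approximation: w ~ N(w_map, cov)
        return w_map, cov

The covariance gives you error bars on the weights (and, pushed through the model, on predictions) without ever running a chain.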

> By scaling problems I basically mean the fact that Monte Carlo doesn't work in large numbers of dimensions, where the parameters of the joint distribution of everything are vast.

While I’d agree that MCMC and its variants are often not a good idea because they require a lot of time and computation, occasionally they are (say, you want insights from your data to use internally and you really want to explore the implications of many different assumptions), and HMC et cetera can deal with very high-dimensional settings (millions of parameters and up). But generally I agree: if you’re talking about “scalable solutions,” that usually means MCMC is not really what you want to be doing.
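
The reason HMC scales at all is that the gradient of the log density does the work; the machinery itself is tiny. A bare-bones leapfrog sketch (standard textbook HMC; the step size and trajectory length here are arbitrary):

    import numpy as np

    def hmc_step(q, log_prob, grad_log_prob, eps=0.1, n_leapfrog=20, rng=np.random):
        """One HMC step with an identity mass matrix; q must be a float array."""
        p = rng.normal(size=q.shape)                   # resample momentum
        current_h = -log_prob(q) + 0.5 * p @ p

        q_new, p_new = q.copy(), p.copy()
        p_new += 0.5 * eps * grad_log_prob(q_new)      # half step for momentum
        for _ in range(n_leapfrog - 1):
            q_new += eps * p_new                       # full step for position
            p_new += eps * grad_log_prob(q_new)        # full step for momentum
        q_new += eps * p_new
        p_new += 0.5 * eps * grad_log_prob(q_new)      # final half step

        proposed_h = -log_prob(q_new) + 0.5 * p_new @ p_new
        if rng.uniform() < np.exp(current_h - proposed_h):   # Metropolis accept
            return q_new
        return q

The gradient-guided trajectories are what let it move in very high dimensions where random-walk Metropolis would just crawl; the cost is a gradient evaluation per leapfrog step, which is exactly why it stops looking “scalable” when each gradient is expensive.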



