I still think the other comment was mistaken. At a given moment just before drawing a sample from some random variable X, that random variable can have many different possible outcomes. What do I believe X is? I believe X is every possibility in X's sample space -- I simultaneously believe X is many different things, and my degree of belief in each is governed by the probability of that outcome being the one I draw. In this sense, my cumulative belief about X is an average of all those different beliefs. That's not the same thing as saying I have to carry around just one belief equal to the posterior mean (or any other point estimate).
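
To make the distinction concrete, here is a toy illustration (the numbers are made up, just for the sake of the example):

    # A loaded die: my belief about X is the whole distribution over its
    # sample space, not the single number given by the posterior mean.
    import numpy as np

    outcomes = np.array([1, 2, 3, 4, 5, 6])
    beliefs  = np.array([0.05, 0.05, 0.10, 0.10, 0.30, 0.40])  # degrees of belief

    point_estimate = float(outcomes @ beliefs)   # 4.75, not even a possible outcome
    print("posterior-mean-style summary:", point_estimate)
    print("full belief state:", dict(zip(outcomes.tolist(), beliefs.tolist())))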

> idea averaging is "knife + spoon = spoon with sharp edge". Bayesian posterior is "50% knife is best, 50% spoon is best => evidence => 5% knife is best, 95% spoon is best".

There's a lot wrong with the quote above from your comment. First, you're conflating two very different things. One is whether two ideas are mutually exclusive (e.g. the product is fundamentally non-functional if it is a spoon with a sharp edge, so blending the two ideas is not in the set of feasible solutions). I already granted in my first comment that genuine mutual exclusivity can be a reason to avoid idea averaging -- but it's overstated: true mutual exclusivity like that doesn't come up nearly as often as people argue it does.

Second, and more importantly, you could break up your budget and pursue the knife solution and the spoon solution independently. What fraction of your budget should go to the knife? That is, what if 'knife' and 'spoon' are two distinct models you are considering, and instead of doing crappy model selection to pick just one of them, you want to produce a model that is a superposition of the two -- what metaprior probability would you assign to the 'knife' model being the one that succeeds, versus the 'spoon' model?
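
A minimal sketch of what I mean by splitting the budget according to a metaprior (the 40/60 split below is made up):

    # Hypothetical metaprior over which concept ultimately succeeds.
    budget = 100_000.0
    metaprior = {"knife": 0.4, "spoon": 0.6}   # made-up degrees of belief

    # Pursue both in proportion to belief instead of picking a single winner.
    allocation = {idea: p * budget for idea, p in metaprior.items()}
    print(allocation)   # {'knife': 40000.0, 'spoon': 60000.0}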

Maybe something more concrete would be helpful: think of a Gaussian Mixture Model. You have many different models and you're not choosing just one. You blend them (people call this model averaging), using each component model's prior probability to govern the overall outcome.
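
For instance, here is a bare-bones sketch of a two-component mixture (weights and parameters are made up), where the overall density is literally a weighted average of the component densities:

    import numpy as np
    from scipy.stats import norm

    weights = np.array([0.3, 0.7])     # component prior probabilities (made up)
    means   = np.array([-2.0, 3.0])
    sds     = np.array([1.0, 0.5])

    def mixture_pdf(x):
        # average the component densities, weighted by their probabilities
        return sum(w * norm.pdf(x, m, s) for w, m, s in zip(weights, means, sds))

    print([round(float(mixture_pdf(x)), 4) for x in np.linspace(-6.0, 6.0, 5)])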

Maybe we are just talking past one another, but the way you keep emphasizing that "the posterior is a distribution" makes me think you continue to misunderstand me.

Another way to put it is that you are averaging over a space of distributions -- sort of like with a Dirichlet distribution. The thing you end up with, the result of computing that average, is itself a probability distribution: an average over a bunch of other probability distributions.
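
Concretely (a sketch with arbitrary Dirichlet parameters and mixing weights):

    import numpy as np

    rng = np.random.default_rng(0)
    # Five categorical distributions over three outcomes, drawn from a Dirichlet.
    component_dists = rng.dirichlet(alpha=[2.0, 1.0, 1.0], size=5)
    mixing_weights  = np.array([0.1, 0.2, 0.3, 0.2, 0.2])

    averaged = mixing_weights @ component_dists
    print(averaged, averaged.sum())   # still sums to 1, i.e. it is a distribution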




In short, when the posterior is arrived at as the average of a bunch of other distributions, as in Bayesian Model Averaging, it is absolutely fine to say that the posterior is itself an average. So a posterior can be both a distribution and an average.
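
In symbols (my notation): p(theta | D) = sum_k p(M_k | D) * p(theta | M_k, D), a weighted average of per-model posteriors. A toy sketch of that bookkeeping, with made-up model weights and Beta posteriors for a coin's bias:

    import numpy as np
    from scipy.stats import beta

    p_model = {"fair-ish": 0.4, "biased": 0.6}              # made-up model posterior
    post    = {"fair-ish": beta(20, 20), "biased": beta(30, 10)}

    thetas = np.linspace(0.01, 0.99, 5)
    bma_posterior = sum(p_model[m] * post[m].pdf(thetas) for m in p_model)
    print(bma_posterior)   # a density over theta, built as an average of densities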

This is why I don't agree with your claim that "The posterior is not remotely like idea averaging." It's exactly like averaging when you're talking about model averaging, and I think model averaging is a very useful analogue to idea averaging in cases where there is no genuine mutual exclusivity between ideas.


You are misunderstanding both BMA and Gaussian mixtures. The posterior on the world state is a distribution over Models x Model Params.

To make predictions about the world, you compute an expected value (an average) over Models. The explicit assumption of both GMMs and BMA is that only one gaussian or only one model is correct - you just don't know which one, and therefore need to average over your uncertainty to take into account all possibilities.
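
Concretely (toy numbers, just to show the bookkeeping): p(y | D) = sum_k p(M_k | D) * p(y | M_k, D).

    import numpy as np

    p_model_given_data = np.array([0.2, 0.8])                 # made-up model posterior
    p_rain_given_model = np.array([0.9, 0.3])                 # each model's prediction
    p_rain = float(p_model_given_data @ p_rain_given_model)   # 0.2*0.9 + 0.8*0.3 = 0.42
    print(p_rain)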

Idea averaging (as described in the article) is about averaging over world states, not models.


> The explicit assumption of both GMMs and BMA is that only one gaussian or only one model is correct - you just don't know which one, and therefore need to average over your uncertainty to take into account all possibilities.

This is so wrong I don't even know where to start. In fact, this is basically the notion of frequentism! One of the most fundamental ideas of Bayesian reasoning is that there is no one true set of parameters, nor is there one true model. There is only your state of knowledge about the space of all possible parameter sets or all possible models. I'm very surprised to see you, of all people, claiming this. Even a cursory Google search of BMA fundamentals disconfirms what you are saying, e.g. [1]

[1] http://www.stat.colostate.edu/~jah/papers/statsci.pdf

> Madigan and Raftery (1994) note that averaging over all the models in this fashion provides better average predictive ability, as measured by a logarithmic scoring rule, than using any single model Mj, conditional on M. Considerable empirical evidence now exists to support this theoretical claim...

and so on.


> This is so wrong I don't even know where to start. In fact, this is basically the notion of frequentism!

Frequentism doesn't even allow you to represent your belief with a probability distribution.

> One of the most fundamental ideas of Bayesian reasoning is that there is no one true set of parameters, nor is there one true model. There is only your state of knowledge about the space of all possible parameter sets or all possible models.

Bayesian reasoning says there is one true set of parameters (or one true model); you just don't know which one it is. The posterior distribution allows you to represent relative degrees of belief and figure out which model or parameter values are more likely to be true.

Assuming you gather enough data, your posterior distribution will eventually approximate a delta function centered around that one true model. This is also what happens when you do BMA or gaussian mixtures.
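
A quick simulation of that concentration (the two-model coin example below is made up):

    import numpy as np

    rng = np.random.default_rng(1)
    models = {"A": 0.5, "B": 0.7}        # two candidate coin biases, uniform prior
    true_bias = 0.7                      # data actually comes from model B

    for n in [10, 100, 1000]:
        heads = int((rng.random(n) < true_bias).sum())
        tails = n - heads
        # log posterior up to a shared constant: log prior + binomial log likelihood
        logs = {m: heads * np.log(p) + tails * np.log(1.0 - p) for m, p in models.items()}
        z = np.logaddexp(logs["A"], logs["B"])
        print(n, {m: round(float(np.exp(v - z)), 4) for m, v in logs.items()})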

The fact that model averaging provides better predictive ability doesn't contradict what I said.


> Assuming you gather enough data, your posterior distribution will eventually approximate a delta function centered around that one true model. This is also what happens when you do BMA or gaussian mixtures.

What happens when your data is actually distributed according to a mixture of gaussians (cf. Pearson's crabs)?


In a situation like that, you have a space of models M and a space of populations P. The state space is M x P, and a model in M generates the population P (probabilistically). In Pearson's crab example, some of the models say an individual crab is drawn from one of several gaussians.

The generative model (first choose a gaussian, then draw from it) is itself a specific model. As you gather a sufficiently large sample from the population P, you'll eventually converge to a delta function on the space of models M.

So in Pearson's crab example, given enough data, you'll eventually converge either to the two-gaussian crab-species model, or to a three-gaussian crab-species model, or to a single-species Weibull model (assuming you put all three of those models into your prior).
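
A rough illustration of that convergence, using BIC as a cheap stand-in for the model evidence (toy data from two well-separated gaussians; the two-component model wins more and more decisively as n grows):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    for n in [50, 500, 5000]:
        x = np.concatenate([rng.normal(-2, 1, n // 2), rng.normal(3, 1, n // 2)])
        X = x.reshape(-1, 1)
        # Lower BIC = better; compare a 1-gaussian model against a 2-gaussian model.
        bic = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
               for k in (1, 2)}
        print(n, {k: round(v) for k, v in bic.items()})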



