I think you are mistaken. Yes, a posterior is a probability distribution, which is exactly a weighting of each possible outcome in accordance with its probability. I didn't say anything about collapsing the distribution down to a single point estimate like the posterior mean, the MAP estimate, or any other single statistic. I am saying that I hold in my head a bunch of beliefs all at the same time (thus they are "blended" or "averaged" together), each in accordance with the amount of credibility it deserves. I think it's totally fine to speak about this as a type of uncollapsed "averaging" process, and indeed when you do something like hierarchical modeling, where you supply a metaprior distinguishing between two alternative models, we widely talk about such things as model averaging even in the statistics literature. It seems like a rather misguided nitpick to insist that the English word "average" cannot be invoked unless it specifically coincides with the statistics word "mean".
Further, you're simply wrong about Bayesianism being about averages of utilities. The best account of Bayesian probability as a mapping of plausibilities to the unit interval is in Jaynes' The Logic of Science, but there is also a brief account of it in David Mumford's essay The Dawning of the Age of Stochasticity and e.g. in the introduction of Bayesian estimation supersedes the t-test by Kruschke [1] (where he explicitly describes it in terms of reallocation of credibility). Further, a la Jaynes, I think the right way to understand probability at all is in a mind-projection fallacy sense of the term: it describes your state of ignorance about the uncertain item.
It's not about utilities of actions -- utilities can be modeled with probabilities if the utility of certain actions is uncertain, but that is different than the base concept of probability being about which outcomes are more valued.
No, it's you who are mistaken. The posterior is not remotely like idea averaging.
Idea averaging is about taking multiple ideas and merging them into one. The Bayesian posterior is merely assigning probabilities to each individual idea. I.e., idea averaging is "knife + spoon = spoon with sharp edge". Bayesian posterior is "50% knife is best, 50% spoon is best => evidence => 5% knife is best, 95% spoon is best".
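That kind of update is easy to make concrete. Here is a minimal sketch of reallocating credibility between the two hypotheses with Bayes' rule; the likelihood numbers are invented purely so the 50/50 prior comes out as the 5%/95% posterior in the example:

```python
# Two mutually exclusive hypotheses, equally credible a priori.
prior = {"knife": 0.5, "spoon": 0.5}

# Hypothetical likelihoods: how probable the observed evidence is
# under each hypothesis (made-up numbers for illustration).
likelihood = {"knife": 0.02, "spoon": 0.38}

# Bayes' rule: posterior is proportional to prior x likelihood, then normalize.
unnorm = {h: prior[h] * likelihood[h] for h in prior}
z = sum(unnorm.values())
posterior = {h: unnorm[h] / z for h in unnorm}

print(posterior)  # {'knife': 0.05, 'spoon': 0.95}
```

Note that nothing here merges the knife and the spoon into one object; the update only shifts credibility between the two intact hypotheses.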
The latter is what the article is saying we should do; don't build a new idea incorporating everyone's contributions, instead just figure out which individual idea is the best and do it.
I still think the other comment was mistaken. At a given moment just before drawing a sample from some random variable X, that random variable can have a lot of different possible outcomes. What do I believe X is? I believe X is every different possibility in X's sample space. I simultaneously believe X is a lot of different things. But the degree of belief for each thing is governed by the probability for that outcome being the one I draw. In this sense, my cumulative belief about X is an average of all those different beliefs. That's not the same thing as saying that I have to carry around just one belief that is equal to the posterior mean (or any other point estimate).
> idea averaging is "knife + spoon = spoon with sharp edge". Bayesian posterior is "50% knife is best, 50% spoon is best => evidence => 5% knife is best, 95% spoon is best".
There's a lot wrong with the above quote from your comment. First, you're confounding two very different things. One thing is whether two ideas are mutually exclusive (e.g. the product is fundamentally non-functional if it is a spoon with sharp edge, so blending the two ideas is not in the set of feasible solutions). I already granted in my first comment that genuine mutual exclusivity can be a reason to avoid idea averaging -- but it's just overstated: true mutual exclusivity like that doesn't happen as much as people argue it does.
Secondly, and more importantly, you could break up your budget to pursue the knife solution and the spoon solution independently. What fraction of your budget should you allocate to the knife part? That is, what if 'knife' and 'spoon' are two distinct models that you are considering, and instead of doing crappy model selection to choose merely one of them, you want to produce a model that is a superposition of the two -- what metaprior probability should you choose that the 'knife' model will be the one that succeeds, versus the 'spoon' model?
Maybe something more concrete would be helpful, just think of a Gaussian Mixture Model. You have many different models and you're not choosing just one. You are blending them (and people call this model averaging) by using whatever each different component model's prior probability is to govern what the overall outcome is.
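To make the blending literal, here is a small sketch of a two-component Gaussian mixture density (component weights and parameters are invented for illustration): the overall density is a weighted average of the component densities, and the averaged object is still a full distribution.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a single Gaussian component."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Two hypothetical components with mixture weights (the prior
# probability of each component); numbers are invented for illustration.
weights = [0.3, 0.7]
params = [(-1.0, 0.5), (2.0, 1.0)]  # (mean, std dev) per component

def mixture_pdf(x):
    """The blend: a weighted average of the component densities."""
    return sum(w * normal_pdf(x, mu, s) for w, (mu, s) in zip(weights, params))

# The blend is itself a proper density: it integrates to ~1.
grid = [i * 0.01 for i in range(-1000, 1001)]
area = sum(mixture_pdf(x) * 0.01 for x in grid)
print(round(area, 3))
```

The point of the sketch is that averaging the components does not collapse anything to a point: `mixture_pdf` is still a distribution over the whole outcome space.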
Maybe we are just talking past one another, but I really think the way you keep emphasizing the idea "posterior is a distribution" makes me think you continue to misunderstand me.
Another way to put it is that you are averaging over a space of distributions -- sort of like the Dirichlet distribution. The thing you end up with, the thing that is the result of computing an average, is itself a probability distribution, that comes as an average over a bunch of other probability distributions.
In short, when the posterior is arrived at as the average of a bunch of other distributions, as in Bayesian Model Averaging, it is absolutely fine to say that the posterior is itself an average (it is the average of a bunch of other distributions). So a posterior can be both a distribution and also an average.
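A minimal sketch of that, with all numbers invented: two models each produce a predictive distribution over the same discrete outcomes, and Bayesian Model Averaging weights them by the posterior model probabilities. The result of the average has the type "distribution", not "point estimate".

```python
# Two hypothetical models' predictive distributions over the same
# three discrete outcomes (numbers invented for illustration).
pred_knife = [0.7, 0.2, 0.1]
pred_spoon = [0.1, 0.3, 0.6]

# Posterior model probabilities after seeing the evidence.
model_post = {"knife": 0.05, "spoon": 0.95}

# Bayesian model averaging: the averaged object is itself a distribution.
bma = [model_post["knife"] * pk + model_post["spoon"] * ps
       for pk, ps in zip(pred_knife, pred_spoon)]

print(bma)       # still a full distribution over outcomes
print(sum(bma))  # and it still sums to 1
```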
This is why I don't agree with your claim, "The posterior is not remotely like idea averaging." It's exactly like averaging when you're talking about model averaging and I think model averaging is a very useful and good analogue to idea averaging in cases where there is not genuine mutual exclusivity between ideas.
You are misunderstanding both BMA and Gaussian mixtures. The posterior on the world state is a distribution over Models x Model Params.
To make predictions about the world, you compute an expected value (average) over Models. The explicit assumption of both GM and BMA is that only one gaussian or only one model is correct - you just don't know which one, and therefore need to average over your uncertainty to take into account all possibilities.
Idea averaging (as described in the article) is about averaging over world states not models.
> The explicit assumption of both GM and BMA is that only one gaussian or only one model is correct - you just don't know which one, and therefore need to average over your uncertainty to take into account all possibilities.
This is so wrong I don't even know where to start. In fact, this is basically the notion of frequentism! One of the very most fundamental ideas of Bayesian reasoning is that there is no one true set of parameters nor is there one true model. There is only your state of knowledge about the space of all possible parameter sets or all possible models. I'm very surprised to see you, of all people, claiming this. Even a cursory Google search of BMA fundamentals disconfirms what you are saying, e.g. [1]
> Madigan and Raftery (1994) note that averaging over all the models in this fashion provides better average predictive ability, as measured by a logarithmic scoring rule, than using any single model Mj, conditional on M. Considerable empirical evidence now exists to support this theoretical claim...
> This is so wrong I don't even know where to start. In fact, this is basically the notion of frequentism!
Frequentism doesn't even allow you to represent your belief with a probability distribution.
> One of the very most fundamental ideas of Bayesian reasoning is that there is no one true set of parameters nor is there one true model. There is only your state of knowledge about the space of all possible parameter sets or all possible models.
Bayesian reasoning says there is one true set of parameters/model, you just don't know which one it is. The posterior distribution allows you to represent relative degrees of belief and figure out which model/parameter is more likely to be true.
Assuming you gather enough data, your posterior distribution will eventually approximate a delta function centered around that one true model. This is also what happens when you do BMA or gaussian mixtures.
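That concentration is easy to demonstrate with a toy example. Here is a sketch with two candidate coin models and a fixed illustrative data set of 80 heads and 20 tails (numbers invented; with data actually sampled from p = 0.8 the same thing happens in the long run):

```python
import math

# Two candidate coin models; suppose the data really came from p = 0.8.
models = {"fair": 0.5, "biased": 0.8}
log_post = {m: math.log(0.5) for m in models}  # equal prior credibility

# A fixed illustrative data set: 80 heads (1) and 20 tails (0).
data = [1] * 80 + [0] * 20

# Accumulate log-likelihood for each model.
for x in data:
    for m, p in models.items():
        log_post[m] += math.log(p if x == 1 else 1 - p)

# Normalize to get posterior model probabilities.
z = math.log(sum(math.exp(v) for v in log_post.values()))
post = {m: math.exp(v - z) for m, v in log_post.items()}

print(post)  # nearly all mass sits on the "biased" model
```

After 100 observations the posterior over models is already close to a delta function on the better-fitting model.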
The fact that model averaging provides better predictive ability doesn't contradict what I said.
> Assuming you gather enough data, your posterior distribution will eventually approximate a delta function centered around that one true model. This is also what happens when you do BMA or gaussian mixtures.
What happens when your data is actually distributed according to a mixture of gaussians (cf. Pearson's crabs)?
In a situation like that, you have a space of models M and a space of populations P. The state space is M x P, and the model M generates the population P (probabilistically). In Pearson's crab example, there are some models where an individual crab can come from one or more gaussians.
The generative model (first choose gaussian, then draw from it) is the specific model. As you gather a sufficiently large sample from the population P, eventually you'll converge to a delta-function on the space of M.
So in Pearson's crab example, given enough data, you'll eventually either converge to the single model of 2 gaussian crab species, or you'll converge to a 3 gaussian crab species model, or you'll converge to a weibull single species model (assuming you put all 3 of those models into your prior).
I read the original (andrewxhill's) post as making a point analogous to the "fork in the road" example: it's fine to maintain a full posterior on ideas, but if you need to choose, you should put in the effort to choose the best idea, not blindly try to merge them into a single idea that might combine the worst of both worlds. The latter would be like searching the forest between the two roads instead of just choosing a branch. Under this reading the point seems quite sensible to me.
You seem to have read the post quite differently, in a way that causes it to seem totally wrong.
I actually do think it's just a fact that "average" in both colloquial and mathematical usage means something like "to collapse multiple values to a single typical value" (even Bayesian model averaging collapses a distribution over probabilities into a single probability). But even if it were genuinely ambiguous what andrewxhill meant, you generally get a lot more out of a post by choosing the reading that allows it to be insightful over the reading that causes it to be nonsensical.
First, model averaging is a bit different from how you describe it (at least on my reading, though maybe you mean "single probability" differently than I read it):
> even Bayesian model averaging collapses a distribution over probabilities into a single probability
As I mentioned in the child comment to the other reply to this comment, model averaging creates an average value, but the type of that value is a distribution. That is, you have a distribution over distributions (each coming from a different model), and the effect of averaging does not reduce you down to a point estimate; rather, it reduces you down to just one distribution.
This is why it's totally fair to say a posterior can be an average. It is the average of a bunch of other distributions. I think if I had said it that way in my first comment, it would have removed some confusion.
But it is important, because the criticism that "a posterior isn't an average" is very wrong. A posterior most certainly can be the average of some other stuff, if that other stuff was itself a bunch of distributions -- and that's exactly what I am trying to talk about.
But to your other point, about the 'sensible' vs. 'not sensible' readings, I mostly agree. However, the problem is: who gets to decide when two ideas fall into the "knife-vs-spoon-clearly-exclusive" category, and when it's grayer than that, so the choice is not so black and white and there is no need to over-commit to just one approach?
The reason the OP post strikes me as problematic is that it seems like a matter of opinion, or in the worst case a matter of bureaucratic/dictatorial mandate, as to when ideas are of the type that can be averaged vs. when they are not.
I'd generally like people to be more humble about it and tend to believe that conventional wisdom and model averaging are better, at least as a first heuristic, than deeply committing to just one thing. That way there might be less urgency to rush into the claim that some debate is "knife-vs-spoon".
I sort of see the whole "knife-vs-spoon" thing as a kind of Godwin's law of brainstorming. Once you invoke the "knife-vs-spoon-so-we-can't-average" claim, it's like game over and all useful intellectual discussion dies and everyone just either picks Team Knife or Team Spoon and then the political battles start. Unless it's really mutually exclusive, I'd rather that doesn't happen.
[1] http://www.indiana.edu/~kruschke/BEST/BEST.pdf