I still think the other comment was mistaken. At a given moment just before drawing a sample from some random variable X, that random variable can have many different possible outcomes. What do I believe X is? I believe X is every possibility in X's sample space -- I simultaneously believe X is many different things, and my degree of belief in each is governed by the probability of that outcome being the one I draw. In this sense, my cumulative belief about X is an average of all those different beliefs. That's not the same thing as saying I have to carry around just one belief equal to the posterior mean (or any other point estimate).
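
To make the distinction concrete, here is a toy illustration (the numbers are made up, just for the sake of the example):

    # A loaded die: my belief about X is the whole distribution over its
    # sample space, not the single number given by the posterior mean.
    import numpy as np

    outcomes = np.array([1, 2, 3, 4, 5, 6])
    beliefs  = np.array([0.05, 0.05, 0.10, 0.10, 0.30, 0.40])  # degrees of belief

    point_estimate = float(outcomes @ beliefs)   # 4.75, not even a possible outcome
    print("posterior-mean-style summary:", point_estimate)
    print("full belief state:", dict(zip(outcomes.tolist(), beliefs.tolist())))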

> idea averaging is "knife + spoon = spoon with sharp edge". Bayesian posterior is "50% knife is best, 50% spoon is best => evidence => 5% knife is best, 95% spoon is best".

There's a lot wrong with the quote above from your comment. First, you're conflating two very different things. One is whether two ideas are mutually exclusive (e.g. the product is fundamentally non-functional if it is a spoon with a sharp edge, so blending the two ideas is not in the set of feasible solutions). I already granted in my first comment that genuine mutual exclusivity can be a reason to avoid idea averaging -- but it's overstated: true mutual exclusivity like that doesn't come up nearly as often as people argue it does.

Second, and more importantly, you could break up your budget and pursue the knife solution and the spoon solution independently. What fraction of your budget should go to the knife? That is, what if 'knife' and 'spoon' are two distinct models you are considering, and instead of doing crappy model selection to pick just one of them, you want to produce a model that is a superposition of the two -- what metaprior probability would you assign to the 'knife' model being the one that succeeds, versus the 'spoon' model?
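
A minimal sketch of what I mean by splitting the budget according to a metaprior (the 40/60 split below is made up):

    # Hypothetical metaprior over which concept ultimately succeeds.
    budget = 100_000.0
    metaprior = {"knife": 0.4, "spoon": 0.6}   # made-up degrees of belief

    # Pursue both in proportion to belief instead of picking a single winner.
    allocation = {idea: p * budget for idea, p in metaprior.items()}
    print(allocation)   # {'knife': 40000.0, 'spoon': 60000.0}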

Maybe something more concrete would be helpful: think of a Gaussian Mixture Model. You have many different models and you're not choosing just one. You blend them (people call this model averaging), using each component model's prior probability to govern the overall outcome.
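
For instance, here is a bare-bones sketch of a two-component mixture (weights and parameters are made up), where the overall density is literally a weighted average of the component densities:

    import numpy as np
    from scipy.stats import norm

    weights = np.array([0.3, 0.7])     # component prior probabilities (made up)
    means   = np.array([-2.0, 3.0])
    sds     = np.array([1.0, 0.5])

    def mixture_pdf(x):
        # average the component densities, weighted by their probabilities
        return sum(w * norm.pdf(x, m, s) for w, m, s in zip(weights, means, sds))

    print([round(float(mixture_pdf(x)), 4) for x in np.linspace(-6.0, 6.0, 5)])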

Maybe we are just talking past one another, but the way you keep emphasizing that "the posterior is a distribution" makes me think you continue to misunderstand me.

Another way to put it is that you are averaging over a space of distributions -- sort of like with a Dirichlet distribution. The thing you end up with, the result of computing that average, is itself a probability distribution: an average over a bunch of other probability distributions.
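
Concretely (a sketch with arbitrary Dirichlet parameters and mixing weights):

    import numpy as np

    rng = np.random.default_rng(0)
    # Five categorical distributions over three outcomes, drawn from a Dirichlet.
    component_dists = rng.dirichlet(alpha=[2.0, 1.0, 1.0], size=5)
    mixing_weights  = np.array([0.1, 0.2, 0.3, 0.2, 0.2])

    averaged = mixing_weights @ component_dists
    print(averaged, averaged.sum())   # still sums to 1, i.e. it is a distribution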




In short, when the posterior is arrived at as the average of a bunch of other distributions, as in Bayesian Model Averaging, it is absolutely fine to say that the posterior is itself an average. So a posterior can be both a distribution and an average.
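
In symbols (my notation): p(theta | D) = sum_k p(M_k | D) * p(theta | M_k, D), a weighted average of per-model posteriors. A toy sketch of that bookkeeping, with made-up model weights and Beta posteriors for a coin's bias:

    import numpy as np
    from scipy.stats import beta

    p_model = {"fair-ish": 0.4, "biased": 0.6}              # made-up model posterior
    post    = {"fair-ish": beta(20, 20), "biased": beta(30, 10)}

    thetas = np.linspace(0.01, 0.99, 5)
    bma_posterior = sum(p_model[m] * post[m].pdf(thetas) for m in p_model)
    print(bma_posterior)   # a density over theta, built as an average of densities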

This is why I don't agree with your claim that "The posterior is not remotely like idea averaging." It's exactly like averaging when you're talking about model averaging, and I think model averaging is a very useful analogue to idea averaging in cases where there is no genuine mutual exclusivity between ideas.


You are misunderstanding both BMA and Gaussian mixtures. The posterior on the world state is a distribution over Models x Model Params.

To make predictions about the world, you compute an expected value (an average) over Models. The explicit assumption of both GMMs and BMA is that only one gaussian or only one model is correct - you just don't know which one, and therefore need to average over your uncertainty to take into account all possibilities.
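
Concretely (toy numbers, just to show the bookkeeping): p(y | D) = sum_k p(M_k | D) * p(y | M_k, D).

    import numpy as np

    p_model_given_data = np.array([0.2, 0.8])                 # made-up model posterior
    p_rain_given_model = np.array([0.9, 0.3])                 # each model's prediction
    p_rain = float(p_model_given_data @ p_rain_given_model)   # 0.2*0.9 + 0.8*0.3 = 0.42
    print(p_rain)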

Idea averaging (as described in the article) is about averaging over world states, not models.


> The explicit assumption of both GMMs and BMA is that only one gaussian or only one model is correct - you just don't know which one, and therefore need to average over your uncertainty to take into account all possibilities.

This is so wrong I don't even know where to start. In fact, this is basically the notion of frequentism! One of the most fundamental ideas of Bayesian reasoning is that there is no one true set of parameters, nor is there one true model. There is only your state of knowledge about the space of all possible parameter sets or all possible models. I'm very surprised to see you, of all people, claiming this. Even a cursory Google search of BMA fundamentals disconfirms what you are saying, e.g. [1]

[1] http://www.stat.colostate.edu/~jah/papers/statsci.pdf

> Madigan and Raftery (1994) note that averaging over all the models in this fashion provides better average predictive ability, as measured by a logarithmic scoring rule, than using any single model Mj, conditional on M. Considerable empirical evidence now exists to support this theoretical claim...

and so on.


> This is so wrong I don't even know where to start. In fact, this is basically the notion of frequentism!

Frequentism doesn't even allow you to represent your belief with a probability distribution.

> One of the most fundamental ideas of Bayesian reasoning is that there is no one true set of parameters, nor is there one true model. There is only your state of knowledge about the space of all possible parameter sets or all possible models.

Bayesian reasoning says there is one true set of parameters (or one true model); you just don't know which one it is. The posterior distribution allows you to represent relative degrees of belief and figure out which model or parameter values are more likely to be true.

Assuming you gather enough data, your posterior distribution will eventually approximate a delta function centered around that one true model. This is also what happens when you do BMA or gaussian mixtures.
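
A quick simulation of that concentration (the two-model coin example below is made up):

    import numpy as np

    rng = np.random.default_rng(1)
    models = {"A": 0.5, "B": 0.7}        # two candidate coin biases, uniform prior
    true_bias = 0.7                      # data actually comes from model B

    for n in [10, 100, 1000]:
        heads = int((rng.random(n) < true_bias).sum())
        tails = n - heads
        # log posterior up to a shared constant: log prior + binomial log likelihood
        logs = {m: heads * np.log(p) + tails * np.log(1.0 - p) for m, p in models.items()}
        z = np.logaddexp(logs["A"], logs["B"])
        print(n, {m: round(float(np.exp(v - z)), 4) for m, v in logs.items()})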

The fact that model averaging provides better predictive ability doesn't contradict what I said.


> Assuming you gather enough data, your posterior distribution will eventually approximate a delta function centered around that one true model. This is also what happens when you do BMA or gaussian mixtures.

What happens when your data is actually distributed according to a mixture of gaussians (cf. Pearson's crabs)?


In a situation like that, you have a space of models M and a space of populations P. The state space is M x P, and a model in M generates the population P (probabilistically). In Pearson's crab example, some of the models say an individual crab is drawn from one of several gaussians.

The generative model (first choose a gaussian, then draw from it) is itself a specific model. As you gather a sufficiently large sample from the population P, you'll eventually converge to a delta function on the space of models M.

So in Pearson's crab example, given enough data, you'll eventually converge either to the two-gaussian crab-species model, or to a three-gaussian crab-species model, or to a single-species Weibull model (assuming you put all three of those models into your prior).
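
A rough illustration of that convergence, using BIC as a cheap stand-in for the model evidence (toy data from two well-separated gaussians; the two-component model wins more and more decisively as n grows):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    for n in [50, 500, 5000]:
        x = np.concatenate([rng.normal(-2, 1, n // 2), rng.normal(3, 1, n // 2)])
        X = x.reshape(-1, 1)
        # Lower BIC = better; compare a 1-gaussian model against a 2-gaussian model.
        bic = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
               for k in (1, 2)}
        print(n, {k: round(v) for k, v in bic.items()})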



