> although the additional evaluations near the mode would strictly add more cost.
No, one evaluation near the mode is more efficient than one evaluation at the boundary (because the density is higher).
I agree that the region immediately around the mode is naturally "avoided" (because the volume is small). But your paper makes it look like we increase efficiency by explicitly avoiding it and concentrating somewhere else. That's what I found confusing.
"No, one evaluation near the mode is more efficient than one evaluation at the boundary (because the density is higher)."
Incorrect in general. Firstly, one evaluation anywhere does not yield any reasonably accurate estimate of expectations. We'll always need an ensemble of evaluations, in which case the relevant question really is "how should I distribute these evaluations". And for any scalable algorithm none of them will be near the mode.
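For concreteness, this is just the standard Monte Carlo form of the problem (nothing beyond what is already stated above is assumed): we estimate

$$
\mathbb{E}_{\pi}[f] \approx \frac{1}{N} \sum_{n=1}^{N} f(x_n),
$$

and the question is where the evaluation points $x_n$ should be placed so that the estimate is accurate for as small an $N$ as possible.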
This is often hard to grok because people implicitly fall back to the example of a Gaussian density, where the mode and the Hessian at the mode fully characterize the density, which can then be used to compute integrals analytically. But for a general target distribution we do not have any of that structure and instead have to consider general computational strategies.
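As a reminder of the Gaussian structure being invoked here (standard results, nothing specific to this discussion): if the target is

$$
\pi(x) = \mathcal{N}(x \mid \mu, \Sigma), \qquad
-\nabla^2 \log \pi(x)\,\big|_{x = \mu} = \Sigma^{-1},
$$

then the mode $\mu$ and the Hessian at the mode determine the entire distribution, and expectations such as $\mathbb{E}[x] = \mu$ and $\mathrm{Cov}[x] = \Sigma$ follow in closed form without evaluating the density anywhere else.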
"I agree that the region immediately around the mode is naturally "avoided" (because the volume is small). But your paper makes it look like we increase efficiency by explicitely avoiding it and concentrating somewhere else. That's what I found confusing."
In high-dimensions the probability mass of any well-behaved probability distribution concentrates in (or _around_ if you want to acknowledge the fuzziness) the typical set. Hence accurate estimation of expectations requires quantifying the typical set. Any evaluation outside of the typical set is wasted because it offers increasingly negligible contributions to the integrals -- a few additional evaluations may not drive the cost up appreciably but they will still be wasted.
So it's not that the only thing that matters is avoiding the mode. Rather what matters is that, contrary to many people's intuitions, the neighborhood around the mode does not inform expectation values, and so exploring that neighborhood is insufficient for estimating expectations. That then motivates the question of which neighborhoods do matter, which is answered by concentration of measure and the existence of the typical set.
And in practice, we don't actually do any of this explicitly. Instead we construct algorithms that somehow quantify probability mass (MCMC, VB, etc.), and they will implicitly avoid the mode and end up working with the typical set.
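To make the density-versus-volume tradeoff behind the typical set concrete, here is a minimal sketch (assuming NumPy; the dimension and grid are illustrative choices, not taken from the discussion above). It compares the standard Gaussian density along a ray with the mass of a thin spherical shell at the same radius, which picks up an extra r^(D-1) volume factor:

```python
import numpy as np

D = 50                                # illustrative dimension
r = np.linspace(0.01, 12.0, 2000)     # radii away from the mode at the origin

# Log density of a standard Gaussian along a ray (up to an additive constant).
log_density = -0.5 * r**2

# Log of (density * shell surface area at radius r), i.e. where the mass actually sits.
log_shell_mass = (D - 1) * np.log(r) - 0.5 * r**2

print("density is largest at r    =", r[np.argmax(log_density)])     # at the mode (r -> 0)
print("shell mass is largest at r =", r[np.argmax(log_shell_mass)])  # near sqrt(D - 1) ~ 7
```

The density itself peaks at the mode, but the shell mass, which is what the integrals accumulate, peaks out near r ~ sqrt(D), i.e. in the typical set.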
> We'll always need an ensemble of evaluations, in which case the relevant question really is "how should I distribute these evaluations". And for any scalable algorithm none of them will be near the mode.
Why not distribute them according to the probability distribution? If I sample one million points and one happens to be near the mode, that doesn't incur any additional cost. This is no worse than sampling one million points, none of which happens to be near the mode.
> Any evaluation outside of the typical set is wasted because it offers increasingly negligible contributions to the integrals
This is not a reason to exclude the mode from the typical set, because one evaluation at the mode offers a larger contribution to the integrals than one evaluation elsewhere.
> Instead we construct algorithms that somehow quantify probability mass (MCMC, VB, etc.), and they will implicitly avoid the mode and end up working with the typical set.
The algorithms don't have to avoid the mode, they just have to sample from it fairly (not much, but more than from any other region of the same size in the rest of the typical set).
May I suggest that you try sampling from a high-dimensional distribution and see how many samples end up near the mode? For example, try a 50-dimensional IID unit Gaussian and check how often r = sqrt(x_1^2 + ... + x_50^2) < 0.25 * sqrt(D), with D = 50. You can also work this out analytically -- it's the classic example of concentration of measure.
By construction, samples from a distribution will concentrate in neighborhoods of high probability mass, and hence in the typical set.
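Here is a minimal sketch of the experiment suggested above (assuming NumPy and SciPy; the seed and sample size are arbitrary). It counts how many of a million draws from a 50-dimensional unit Gaussian fall within 0.25 * sqrt(D) of the mode, and compares with the exact chi-squared calculation:

```python
import numpy as np
from scipy.stats import chi2

D = 50                  # dimension
N = 1_000_000           # number of draws (~400 MB of float64 draws; shrink N if memory is tight)
rng = np.random.default_rng(0)

# Draw N IID samples from a D-dimensional unit Gaussian and measure their
# distance from the mode at the origin.
x = rng.standard_normal((N, D))
r = np.linalg.norm(x, axis=1)

threshold = 0.25 * np.sqrt(D)
frac_near_mode = np.mean(r < threshold)

# Analytic answer: r^2 follows a chi-squared distribution with D degrees of freedom.
p_near_mode = chi2.cdf(threshold**2, df=D)

print(f"Monte Carlo fraction with r < 0.25*sqrt(D): {frac_near_mode:.3g}")
print(f"Exact probability:                          {p_near_mode:.3g}")   # ~1e-21
print(f"Samples actually concentrate around r ~ {r.mean():.2f}, "
      f"while sqrt(D) = {np.sqrt(D):.2f}")
```

Given the exact probability of roughly 1e-21, with a million draws per run you would expect to repeat this experiment on the order of 10^15 times before a single sample landed that close to the mode.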
It was you who said that it was better to exclude the mode from the typical set because "the additional evaluations near the mode would strictly add more cost". Now you tell me that evaluations near the mode are extremely unlikely. Something that, believe it or not, I understand. Maybe you would have liked it better if I had talked about one sample in one trillion being near the mode. And in that case that evaluation wouldn't be wasted because its contribution to the computed integral would be more important than if I had sampled by chance another point in the typical set with lower probability density.
Excluding the region with the highest probability density from the typical set is a bit like saying that the population of New York is concentrated outside NYC because most people live elsewhere.