> We'll always need an ensemble of evaluations, in which case the relevant question really is "how should I distribute these evaluations". And for any scalable algorithm none of them will be near the mode.
Why not distribute them according to the probability distribution? If I sample one million points and one happens to be near the mode, that doesn't incur any additional cost. It is no worse than sampling one million points none of which happens to be near the mode.
> Any evaluation outside of the typical set is wasted because it offers increasingly negligible contributions to the integrals
This is not a reason to exclude the mode from the typical set, because one evaluation at the mode offers a larger contribution to the integrals than one evaluation elsewhere.
> Instead we construct algorithms that somehow quantify probability mass (MCMC, VB, etc) and they will implicitly avoiding the mode and end up working with the typical set.
The algorithms don't have to avoid the mode, they just have to sample from it fairly (not much, but more often than from any other region of the same size elsewhere in the typical set).
May I suggest that you try sampling from a high-dimensional distribution and see how many samples end up near the mode? For example, try a 50-dimensional IID unit Gaussian and check how often r = sqrt(x_1^2 + ... + x_50^2) < 0.25 * sqrt(D). You can also work this out analytically -- it's the classic example of concentration of measure.
By construction, samples from a distribution will concentrate in neighborhoods of high probability mass, and hence in the typical set.
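A minimal sketch of that experiment in Python (assuming NumPy/SciPy; the sample count, seed, and printed comparisons are arbitrary choices, not part of the original suggestion):

```python
# Sketch: how often does a draw from a 50-dimensional IID unit Gaussian
# land within 0.25 * sqrt(D) of the mode at the origin?
import numpy as np
from scipy.stats import chi2

D = 50                      # dimension
n = 100_000                 # number of IID draws (arbitrary)
rng = np.random.default_rng(0)

x = rng.standard_normal((n, D))   # samples from a D-dimensional unit Gaussian
r = np.linalg.norm(x, axis=1)     # distance of each sample from the mode

threshold = 0.25 * np.sqrt(D)
print("empirical fraction with r < 0.25*sqrt(D):", np.mean(r < threshold))

# Analytic check: r^2 follows a chi-squared distribution with D degrees of
# freedom, so P(r < t) = P(chi2_D < t^2) -- vanishingly small here.
print("analytic probability:", chi2.cdf(threshold**2, df=D))

# For comparison, the samples concentrate at radii near sqrt(D).
print("mean radius:", r.mean(), " sqrt(D):", np.sqrt(D))
```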
It was you who said that it was better to exclude the mode from the typical set because "the additional evaluations near the mode would strictly add more cost". Now you tell me that evaluations near the mode are extremely unlikely, something that, believe it or not, I understand. Maybe you would have preferred it if I had talked about one sample in one trillion being near the mode. Even then, that evaluation wouldn't be wasted, because its contribution to the computed integral would be larger than if I had sampled by chance another point in the typical set with lower probability density.
Excluding the region with highest probability density from the typical set is a bit like saying that the population of New York is concentrated outside NYC because most people live elsewhere.