
In the context of bandits, how does this perform compared to Thompson Sampling?



For bandit learning, MWUA is called EXP3, and it's close to optimal (off from the information-theoretic lower bound by a factor of log n, I believe). Variants like EXP4 and EXP4.P keep improving this bound.

The difference is that it's optimizing for a different mathematical model, one which assumes no stochasticity: the rewards can be arbitrary, chosen adversarially with full knowledge of the algorithm. This "full knowledge" guarantee is what makes MWUA all about "adaptability" (and it's what makes the linear programming example work so spectacularly in the post: the roles are flipped, and we as the algorithm designers get to be MWUA's adversaries!)
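For a concrete sense of the mechanics, here is a minimal EXP3 sketch (Python; the names and the choice of gamma are illustrative, and reward_fn stands in for whatever feedback the environment gives):

    import math
    import random

    def exp3(K, T, reward_fn, gamma=0.1):
        """Run EXP3 for T rounds; reward_fn(t, arm) returns a reward in [0, 1]."""
        weights = [1.0] * K
        total_reward = 0.0
        for t in range(T):
            total_w = sum(weights)
            # Mix the normalized weights with uniform exploration.
            probs = [(1 - gamma) * w / total_w + gamma / K for w in weights]
            arm = random.choices(range(K), weights=probs)[0]
            reward = reward_fn(t, arm)
            total_reward += reward
            # Importance-weighted estimate: dividing by the probability of the
            # played arm keeps the reward estimates unbiased, since the other
            # arms' rewards are never observed.
            estimated = reward / probs[arm]
            # Multiplicative update: only the played arm's weight changes.
            weights[arm] *= math.exp(gamma * estimated / K)
        return total_reward

The importance weighting is what lets the multiplicative update carry over from the full-information setting to the bandit setting, where only one reward per round is revealed.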

I don't have a direct empirical comparison of EXP3 and Thompson sampling, but I am delighted to say my blog post [1] comes up first when you google "Exp3".

[1]: https://jeremykun.com/2013/11/08/adversarial-bandits-and-the...


Seems closely related.

The update of the distributions in a Thompson sampler can be thought of as a multiplicative update, treating the old distribution as a prior. Those updates become relatively smaller with time, though, as the distributions become better known. So TS has a more complicated update rule.

Then there's the decision rule. I think of TS as approximating the probability that a given arm is the best, by drawing a sample from each distribution. MWUA, by contrast, effectively just takes a softmax over the weights. That seems close in spirit, and is probably more open to analysis than TS, whose analysis often gets complicated.
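For comparison, a minimal Thompson sampling sketch for Bernoulli arms (Python; names are illustrative, pull stands in for the environment, and the posteriors are Beta distributions starting from a uniform prior):

    import random

    def thompson_bernoulli(K, T, pull):
        """Run Thompson sampling for T rounds; pull(arm) returns 0 or 1."""
        successes = [0] * K
        failures = [0] * K
        for _ in range(T):
            # Sample once from each arm's Beta posterior and play the argmax;
            # this approximates "the probability that this arm is the best".
            samples = [random.betavariate(successes[i] + 1, failures[i] + 1)
                       for i in range(K)]
            arm = max(range(K), key=lambda i: samples[i])
            if pull(arm):
                successes[arm] += 1
            else:
                failures[arm] += 1
        return successes, failures

The contrast with EXP3 shows up in the update: TS updates posterior parameters, so later observations move the distributions relatively less, whereas EXP3 multiplies its weights by the same kind of factor every round.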



