Yes, I am aware; I did not mean random search as in random actions, but rather random search over policy parameters with improved heuristics.
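To make that concrete, what I have in mind is roughly the sketch below: perturb the policy parameters directly and keep whatever improves the return. This is just an illustration (the function names are mine, and it assumes a Gym-style environment with the classic reset/step API), not the exact algorithm from any particular paper.

    import numpy as np

    def rollout(env, theta, horizon=1000):
        # total reward of the deterministic linear policy a = theta @ s
        s, total = env.reset(), 0.0
        for _ in range(horizon):
            s, r, done, _ = env.step(theta @ s)
            total += r
            if done:
                break
        return total

    def random_search(env, obs_dim, act_dim, iters=500, sigma=0.1):
        theta = np.zeros((act_dim, obs_dim))
        best = rollout(env, theta)
        for _ in range(iters):
            candidate = theta + sigma * np.random.randn(*theta.shape)
            score = rollout(env, candidate)
            if score > best:  # hill-climbing heuristic: only keep improvements
                theta, best = candidate, score
        return theta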
The point being that the bells and whistles of PPO and other relatively complicated algorithms (e.g. Q-PROP), namely the specific clipped objective, the subsampling, and a baseline that is (in my experience) very difficult to tune using the same objective, do not significantly improve over plain gradient descent.
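For reference, the clipped objective I mean is the surrogate from the PPO paper; a minimal NumPy version looks like this (logp_new/logp_old are per-action log-probabilities under the current and old policies, adv the advantage estimates; those argument names are mine):

    import numpy as np

    def ppo_clip_objective(logp_new, logp_old, adv, eps=0.2):
        ratio = np.exp(logp_new - logp_old)             # likelihood ratio r_t
        unclipped = ratio * adv
        clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
        return np.mean(np.minimum(unclipped, clipped))  # maximized w.r.t. the policy parameters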
And I think Ben Recht's argument [0] expands on that a bit in terms of what we are actually doing with policy gradient; he isn't using a likelihood-ratio objective like PPO does, but it is conceptually similar enough for the argument to hold.
So I think it comes down to two questions: how much do 'modern' policy gradient methods improve on REINFORCE, and how much better is REINFORCE really than random search? The answer so far seemed to be: not that much, and I am trying to get a sense of whether that intuition is wrong.
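For comparison, vanilla REINFORCE is essentially just the score-function estimator below. This is a sketch assuming a linear Gaussian policy a ~ N(theta @ s, sigma^2 I), the classic Gym step API, and no baseline; the helper names are made up.

    import numpy as np

    def reinforce_update(env, theta, sigma=0.1, lr=1e-3, horizon=1000, gamma=0.99):
        # one update: grad log pi(a|s) weighted by the discounted return-to-go
        s, trajectory = env.reset(), []
        for _ in range(horizon):
            mean = theta @ s
            a = mean + sigma * np.random.randn(*mean.shape)
            s_next, r, done, _ = env.step(a)
            trajectory.append((s, a, mean, r))
            s = s_next
            if done:
                break
        G, returns = 0.0, []
        for (_, _, _, r) in reversed(trajectory):      # discounted returns-to-go
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        grad = np.zeros_like(theta)
        for (s_t, a_t, mean_t, _), G_t in zip(trajectory, returns):
            # grad_theta log N(a | theta s, sigma^2 I) = ((a - theta s) / sigma^2) s^T
            grad += np.outer((a_t - mean_t) / sigma**2, s_t) * G_t
        return theta + lr * grad                       # gradient ascent on the return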
When optimizing high-dimensional policies, the gap in sample complexity between PPO (and policy gradient methods in general) and ES / random search is pretty big.
If you compare the Atari results from the PPO and ES papers from OpenAI, PPO after 25M frames is better than ES after 1B frames. In these two papers, the policy parametrization is roughly the same, except that ES uses virtual batchnorm. For DOTA, with a much bigger policy, I'd expect the gap between ES and PPO to be much bigger than for Atari.
My takeaway from [0] and Rajeswaran's earlier paper is that one can solve the MuJoCo tasks with linear policies after appropriate preprocessing, so we shouldn't take those benchmarks too seriously. That paper doesn't do an apples-to-apples comparison between ES and PG methods on sample complexity, though.
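To be clear about what "linear policy after appropriate preprocessing" looks like there, it's roughly the sketch below (illustrative only; the class and attribute names are mine, and the normalization is a stand-in for whatever preprocessing or features those papers actually use):

    import numpy as np

    class LinearPolicy:
        def __init__(self, obs_dim, act_dim):
            self.W = np.zeros((act_dim, obs_dim))
            self.mean = np.zeros(obs_dim)   # running observation statistics
            self.var = np.ones(obs_dim)

        def act(self, obs):
            # normalize the observation, then apply a single linear map
            return self.W @ ((obs - self.mean) / np.sqrt(self.var + 1e-8))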
All of that said, there's not enough careful analysis comparing different policy optimization methods.
[0] http://www.argmin.net/2018/02/20/reinforce/