
> We’ve increased the reaction time of OpenAI Five from 80ms to 200ms. This reaction time is much closer to human level, though we haven’t seen evidence of changes in gameplay as OpenAI Five’s strength comes more from teamwork and coordination than reflexes.

This new constraint is interesting. The Super Smash Bros. Melee AI paper noted that they had to keep reaction times at superhuman levels in order for the model to converge (though Dota is a rather different game from Melee): https://arxiv.org/abs/1702.06230


The author of that paper has since built an agent that has human-level reaction time and is competitive with professional players: http://youtube.com/vladfi1


Yes, but the fact that PPO can learn very long-range strategies, given enough computation, when most would expect it to diverge or fail to learn at all was already demonstrated by the original 5v5 Dota bot. That's probably what's happening here too: it can handle the long-range learning and so does fine with human-level reaction times, while the SSBM AI is stuck learning short-term strategies that rely heavily on simple, fast reactive policies.


There's nothing about PPO that helps it learn long-range strategies. It primarily lets you take multiple gradient steps on a single batch of experience, so you converge faster.

In fact, for a single step with no policy lag, it's equivalent to a standard policy gradient update.
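To make that equivalence concrete, here's a minimal sketch of PPO's clipped surrogate objective (function and parameter names are mine, not from any particular codebase). With no policy lag, the importance ratio pi_new/pi_old is exactly 1 at the first gradient step, the clip is inactive, and the objective reduces to the standard policy-gradient surrogate ratio * advantage:

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """PPO clipped surrogate objective (to be maximized).

    ratio: pi_new(a|s) / pi_old(a|s) for the sampled action.
    advantage: estimated advantage of that action.
    """
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)

# ratio == 1.0 (no policy lag): the clip does nothing, so this is just
# the vanilla policy-gradient surrogate, advantage * 1.0.
no_lag = ppo_clip_loss(1.0, 2.0)       # equals the advantage, 2.0

# ratio drifts outside [1-eps, 1+eps]: the clipped term caps the
# objective, which is where PPO differs from a plain PG update.
clipped = ppo_clip_loss(1.5, 2.0)      # capped at 1.2 * 2.0 = 2.4
```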

DeepMind was also able to train a CTF agent with human-level reaction time: https://deepmind.com/blog/capture-the-flag/

I suspect the difference that allows you to train with reaction time is an RNN or compensating for the lag some other way. I'm testing that out right now with my own SSBM bot: https://www.twitch.tv/vomjom
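One simple way to impose a human-like reaction time during training is to buffer observations so the agent always acts on frames from a few steps ago; an RNN can then learn to compensate for the lag. This is an illustrative sketch only (the wrapper, class names, and toy environment are my own, not from the bot linked above):

```python
from collections import deque

class DelayedObservation:
    """Wrap a gym-style env so the agent sees observations `delay`
    steps late, simulating a fixed reaction time."""

    def __init__(self, env, delay):
        self.env = env
        self.delay = delay
        self.buffer = deque()

    def reset(self):
        obs = self.env.reset()
        # Pre-fill so the first `delay` actions all see the initial frame.
        self.buffer = deque([obs] * (self.delay + 1))
        return self.buffer.popleft()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.buffer.append(obs)          # newest frame goes in...
        return self.buffer.popleft(), reward, done, info  # ...stale frame comes out

# Toy usage: an "environment" whose observation is just the step count.
class CounterEnv:
    def __init__(self):
        self.t = 0
    def reset(self):
        self.t = 0
        return 0
    def step(self, action):
        self.t += 1
        return self.t, 0.0, False, {}

env = DelayedObservation(CounterEnv(), delay=2)
first = env.reset()            # agent sees frame 0
delayed, *_ = env.step(None)   # true frame is 1; agent still sees frame 0
```

A recurrent policy trained inside such a wrapper can, in principle, learn to predict the current state from the stale observation stream, which is one hypothesis for why some agents tolerate added delay and others don't.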


> There's nothing about PPO that helps it learn long-range strategies.

Exactly, which is why it's so surprising that it learned them anyway, despite that and despite discount rates that assign almost no value to rewards more than a minute or so out.
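As rough arithmetic on why a discount rate caps the planning horizon (the gamma and action rate below are illustrative, not OpenAI's published numbers): the effective horizon of a discount factor gamma is on the order of 1 / (1 - gamma) steps.

```python
def horizon_steps(gamma):
    """Effective planning horizon, in environment steps, of discount gamma."""
    return 1.0 / (1.0 - gamma)

def horizon_seconds(gamma, actions_per_second):
    """Same horizon converted to wall-clock time at a given action rate."""
    return horizon_steps(gamma) / actions_per_second

# gamma = 0.99 -> ~100 steps; at, say, 7.5 actions/sec that's ~13 seconds.
# gamma = 0.998 -> ~500 steps, or roughly a minute at the same rate.
```

Any strategy whose payoff arrives beyond that horizon contributes almost nothing to the gradient, which is what makes long-range play emerging anyway so striking.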

> DeepMind was also able to train a CTF agent with human-level reaction time: https://deepmind.com/blog/capture-the-flag/

Note that the CTF agent is far more complex, featuring multi-level RL and evolved losses, and even a DNC inside the agents.

