There's nothing about PPO that helps it learn long-range strategies. Its main benefit is letting you take multiple gradient steps on a single batch of experience, so you converge faster per sample.
In fact, for a single step with no policy lag, it's equivalent to a standard policy gradient update.
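Here's a quick numerical sketch of that equivalence (my own toy setup, not from any particular codebase): with no policy lag, the importance ratio is 1, the clip is inactive, and the gradient of PPO's clipped surrogate matches the vanilla policy-gradient loss exactly.

```python
import jax
import jax.numpy as jnp

# Hypothetical toy example: a categorical policy over 3 actions,
# parameterized directly by logits. No lag means logits == logits_old.
logits_old = jnp.array([0.2, -0.5, 1.0])
action = 2
advantage = 1.7
eps = 0.2  # PPO clip range

def log_prob(logits, a):
    return jax.nn.log_softmax(logits)[a]

def ppo_loss(logits):
    # Clipped surrogate objective from the PPO paper (negated for minimization).
    ratio = jnp.exp(log_prob(logits, action) - log_prob(logits_old, action))
    clipped = jnp.clip(ratio, 1.0 - eps, 1.0 + eps)
    return -jnp.minimum(ratio * advantage, clipped * advantage)

def pg_loss(logits):
    # Vanilla policy-gradient (REINFORCE-style) loss with the same advantage.
    return -log_prob(logits, action) * advantage

# Evaluate both gradients at the old policy (ratio == 1, clip inactive).
g_ppo = jax.grad(ppo_loss)(logits_old)
g_pg = jax.grad(pg_loss)(logits_old)
print(jnp.allclose(g_ppo, g_pg))
```

The two gradients coincide only at the first step; once you take more epochs on the batch, the ratio drifts from 1 and the clipping starts doing work.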
I suspect the difference that allows you to train with reaction time is an RNN, or some other way of compensating for the lag. I'm testing that out right now with my own SSBM bot: https://www.twitch.tv/vomjom
DeepMind was also able to train a CTF agent with human-level reaction time: https://deepmind.com/blog/capture-the-flag/