
> There's nothing about PPO that helps it learn long-range strategies.

Exactly, which is why it's so surprising that it learned them anyway, despite that and despite discount rates that give almost no weight to rewards more than a minute or so out.
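To make the discounting point concrete, here's a rough sketch of how a discount factor translates into a time horizon. The values of `GAMMA` and `STEPS_PER_SECOND` are illustrative assumptions I picked for the arithmetic, not figures from the thread, and the function names are made up for this example.

```python
import math


def discount_half_life_seconds(gamma: float, steps_per_second: float) -> float:
    """Seconds until a future reward's weight gamma**n falls to 0.5."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must be strictly between 0 and 1")
    # Solve gamma**n = 0.5 for n (number of agent steps).
    steps = math.log(0.5) / math.log(gamma)
    return steps / steps_per_second


def discount_weight(gamma: float, steps_per_second: float, seconds: float) -> float:
    """Weight applied to a reward that arrives `seconds` in the future."""
    return gamma ** (steps_per_second * seconds)


# Assumed values, for illustration only.
GAMMA = 0.998            # hypothetical per-step discount factor
STEPS_PER_SECOND = 7.5   # hypothetical agent decision rate

if __name__ == "__main__":
    half_life = discount_half_life_seconds(GAMMA, STEPS_PER_SECOND)
    print(f"half-life: {half_life:.1f} s")
    for t in (60, 300, 600):
        w = discount_weight(GAMMA, STEPS_PER_SECOND, t)
        print(f"weight of a reward {t} s away: {w:.4f}")
```

Under these assumed numbers the half-life comes out well under a minute, and a reward five minutes away is weighted at roughly 1% of an immediate one, so the objective itself gives very little direct incentive for long-range planning.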

> DeepMind was also able to train a CTF agent with human-level reaction time: https://deepmind.com/blog/capture-the-flag/

Note that the CTF agent is way more complex: it features multilevel RL, evolutionary losses, and even a DNC (Differentiable Neural Computer) in the agents.



