OpenAI Five Benchmark (blog.openai.com)
209 points by gdb on July 18, 2018 | hide | past | favorite | 77 comments



I'm surprised they felt confident enough to lift so many of the restrictions this soon. A lot of them seemed like big game-changers I wouldn't expect the AI to exploit so easily, especially the introduction of Roshan and invisibility. They mention implementing some randomization, but taking Roshan seems like it would require quite a big commitment of resources, where the AI wouldn't even realize the benefits immediately (namely the Aegis, and the XP and gold it would have to weigh against the loss from farming/taking towers).

The introduction of the other heroes also comes as a surprise; I wouldn't have expected them to have the AI utilizing new abilities yet. They don't mention how the heroes are picked, other than the AI getting a random draft of them (does the AI pick its own composition?)


I wonder if the reward mechanisms factor in denying resources from the enemy team. A lot of the time it makes sense to take Roshan just so the enemy doesn't get the opportunity to.


Yes, the reward for a team includes a term that is the negative of the enemy team's reward. This was mentioned in the previous blog post. Here it is: https://gist.github.com/dfarhi/66ec9d760ae0c49a5c492c9fae939...
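For intuition, the zero-sum shaping described there might look something like this (a sketch; the function name and exact weighting are my assumptions, not the published reward code):

```python
# Hedged sketch of a zero-sum team reward: each team's shaped reward
# subtracts the opposing team's raw reward, so denying the enemy gold/XP
# is worth exactly as much as earning it yourself.
def zero_sum_rewards(radiant_raw, dire_raw):
    return radiant_raw - dire_raw, dire_raw - radiant_raw

r, d = zero_sum_rewards(3.0, 1.0)
assert r == 2.0 and d == -2.0  # the two shaped rewards always sum to zero
```

Under a reward like that, taking Roshan purely to deny it would fall out of the objective automatically.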


I think that'd be an even harder behavior for the AI to arrive at, though, since it would have to recognize those benefits for the other team and their origin, and then figure out that taking Roshan denies them. That seems like a step beyond discovering the benefit of taking Roshan for itself.


The same argument applies to pushing and defending towers, getting rax, etc. Assuming the devs did not hardcode rewards for those objectives, the AI surely already has to 'understand' events that impact the game in the long term.


I think you're right about it being more about denying the enemy team. IIRC, in the last update they said that the bots tend to prioritize denying creeps and taking objectives over getting perfect last hits in lane.


> We’ve increased the reaction time of OpenAI Five from 80ms to 200ms. This reaction time is much closer to human level, though we haven’t seen evidence of changes in gameplay as OpenAI Five’s strength comes more from teamwork and coordination than reflexes.

This new constraint is interesting. The Super Smash Bros. Melee AI paper noted that they had to keep reaction times at superhuman levels in order for the model to converge (though Dota is quite different from Melee): https://arxiv.org/abs/1702.06230


The author of that paper has since built an agent that has human-level reaction time and is competitive with professional players: http://youtube.com/vladfi1


Yes, but the fact that PPO can learn very long-range strategies with enough computation, when most would expect it to diverge or fail to learn at all, was already demonstrated by the original 5v5 Dota bot. That's probably the same thing here: it can handle the long-range learning and so does fine at human-level APM, while the SSBM AI is stuck learning short-term strategies which rely heavily on simple, fast reactive policies.


There's nothing about PPO that helps it learn long-range strategies. It primarily lets you take multiple gradient steps on a single batch of experience so you can converge faster.

In fact, for a single step with no policy lag, it's equivalent to a standard policy gradient update.
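To make the single-step claim concrete, here's a minimal numpy sketch of PPO's clipped surrogate: with no policy lag the probability ratio is exactly 1, the clip is inactive, and the objective (and hence its gradient) matches a vanilla policy-gradient update.

```python
import numpy as np

def ppo_surrogate(logp_new, logp_old, adv, clip_eps=0.2):
    """PPO clipped surrogate objective (per-sample, to be maximized)."""
    ratio = np.exp(logp_new - logp_old)
    return np.minimum(ratio * adv,
                      np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * adv)

# With logp_new == logp_old (no policy lag), ratio == 1 everywhere, the
# clip never triggers, and the surrogate reduces to the plain advantage
# term — i.e. a standard policy gradient update.
adv = np.array([1.5, -0.7, 0.3])
logp = np.array([-1.0, -2.0, -0.5])
assert np.allclose(ppo_surrogate(logp, logp, adv), adv)
```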

DeepMind was also able to train a CTF agent with human-level reaction time: https://deepmind.com/blog/capture-the-flag/

I suspect the difference that allows you to train with reaction time is an RNN or compensating for the lag some other way. I'm testing that out right now with my own SSBM bot: https://www.twitch.tv/vomjom


> There's nothing about PPO that helps it learn long-range strategies.

Exactly. Which is why it's so surprising that it did anyway, despite that and despite discount rates which don't assign any value past a minute or so.
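As a rough illustration of why the discount rate bounds the useful horizon (the gamma and step rate below are made-up example numbers, not OpenAI's published values):

```python
# Illustrative only: effective planning horizon is roughly 1/(1 - gamma)
# steps; convert to seconds at an assumed observation rate.
def horizon_seconds(gamma, steps_per_second):
    return 1.0 / (1.0 - gamma) / steps_per_second

# At ~7.5 observations/sec, gamma = 0.99 only "sees" ~13 seconds ahead,
# yet the bots exhibit behaviors whose payoff arrives minutes later.
print(round(horizon_seconds(0.99, 7.5), 1))  # ~13.3
```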

> DeepMind was also able to train a CTF agent with human-level reaction time: https://deepmind.com/blog/capture-the-flag/

Note that the CTF agent is way more complex, featuring multilevel RL and evolutionary losses, and even DNC in the agents.


> Because our training system Rapid is very general, we were able to teach OpenAI Five many complex skills since June simply by integrating new features and randomizations. Many people pointed out that wards and Roshan were particularly important to include — and now we’ve done so. We’ve also increased the hero pool to 18 heroes. Many commenters thought these improvements would take another year.

The linked commenters thought that getting to "real Dota" (more than 100 heroes, Captains Mode instead of random, ...) would take another year. So I don't think it's fair to make that statement.

Edit: Don't get me wrong, I think the improvements are very nice, but pointing at people and saying "these people thought we would need a year, we did it in under a month!" is not something you should do if you didn't actually do what the linked people described.


The 5-invulnerable-couriers restriction seems like something that will have a huge influence on how the early game is played, and something the humans won't have any experience taking advantage of.


I don't think it's that complicated to adapt to... everyone should just be pretty much constantly ferrying out regen and harass more aggressively.


People play with 5 invulnerable couriers in Turbo mode. That's essentially what they do: they get their items ASAP.


And bottles are currently disabled so you can't use the most abusive strat that 5 couriers would allow.


I don't think you can bottle-ferry anymore.


You can.



You can't. It was changed sometime this year.

Courier isn't that important. It's being phased out across the recent patches. And there is a popular Dota mod with 5 fast invulnerable couriers.


Courier is important. Otherwise you're forced to go back for healing/items, which loses you XP and gold, and ultimately the game.

Yes there is Turbo, no it's not comparable to regular gameplay.


Regular dota players learn to play without the courier anyway, because someone always feeds it away or uses it to ferry themselves a magic stick.

So maybe playing without a courier at all would be more representative of the pub experience ;)


Must be 1k you are talking about.


Note: the human opponent team is drawn from players who are vastly better than the majority of Dota 2 players, but still vastly worse than the top-tier pro teams.

So next time Elon tweets "OpenAI beats the human players in 5v5",

you know that the game is not broken by AI yet (unlike Go, which AI has indeed broken).


I think from the trajectory it’s pretty obvious who will be on top in five years. Whether the intercept is now, a month from now, or half a year away doesn’t matter all that much.


If I'm understanding correctly, those five players are for the match in early August.

They'll still have a match against top pros at the International in late August.


Does anyone know how the random drafting works? If it's truly random, i.e. randomly picking 5 heroes out of the 18 (CM, DP, ES, Gyro, Lich, Lion, Necro, QoP, Razor, Riki, Nevermore, Slark, Sniper, Sven, Tide, Viper, or WD), then it's much less about teamwork. What if they end up with Razor, QoP, Nevermore, DP, Gyro? The problem is that in Dota, almost every hero has a clear position in the game, much like soccer. Having both teams randomly pick 5 heroes would probably ruin the game and make it really difficult for the humans (think covariate shift), whereas the bot is probably trained on exactly this distribution.


From [1], the teams alternately pick heroes from the pool. In this case the pool is not random, but fixed to the 18 heroes, so calling this Random Draft is confusing. It would be nice if they could confirm that the humans get a choice in which heroes they play; otherwise the larger hero pool is meaningless.

[1]: https://dota2.gamepedia.com/Game_modes#Random_Draft


I assume they're referring to the existing game mode called Random Draft. The pool is usually 50 heroes instead of 18, but you'd run it the same way. Ordered picks like Captains Mode, except no bans.


Sorry, is it really necessary to say nevermore instead of "sf"? I understand some of us have been playing since WC3 days and are used to the old names (wisp, necrolyte, nevermore) but it's actually longer to pronounce AND more confusing nowadays.


That's the Dota equivalent of the "man of culture".


"POTM of the moon" is where it's at


More confusing? No. Ingame voice lines also use the name on occasion.


How could you argue that using non-Dota 2 names isn't more confusing?


Because it is not a non-dota2 name.


That's called All Random.


Yes, but no one plays seriously in All Random mode. Supports are called supports for a reason: they're strong early game without a lot of items. Some have early ganking and counter-ganking abilities; some have healing or harassing abilities, or really good early stats. Carries are stronger mid and late game because they're naturally weaker early game. Because of this, you need to assign farm priority and set up ganks and rotations in the early game. All of these strategies are gone if you play All Random. Basically it becomes a 2k pub trash game, and no one above 4k MMR actually practices All Random daily. Unless you're in SEA, where people just first-pick carries :) :) :)


This will be very interesting to see, despite the vast differences in ruleset that make this much, much less complex than the actual game.

My prediction is that we're very, very far away from AI that can beat the top teams in a 5v5. Amateur teams can easily be beaten simply on the strength of the mechanics (which are very strong on the AI side, beating even pros), but the strategy and coordination of the top teams are out of this world.


Dare to put a time on that prediction? 1 year, 5?


I would say 18 months from August 5 when they do the stream they will still be unable to beat a professional team playing the full game with no restrictions.

Right now they're missing so many key parts of the game: illusions, summons, Bottle, Courier, and most of the heroes. The 18 they have chosen are all fairly straightforward and make drafting simple. I want to see an AI playing Huskar, Io, and Nature's Prophet. Better yet, I want to see an AI that can draft and ban.


Exactly what I think. The way the pros exploit those heroes requires a lot of logical deduction, not just game-sense intuition and tree search (which current AI methods are strong at). If we can combine all three of those, I think we will be very close to AGI.


I really hope OpenAI taught the bots to type "cyka" and "go mid"


Don't forget safelane pos 1 feeding at minute 6 and typing "gg mid no gank"


Aggressive tipping is also of the utmost importance


Fewer restrictions is of course better, but I'm still not impressed without an actual Captain's Mode all-heroes draft. In addition to inverse_pi's comments about all-heroes being vital because heroes have different roles to play, the draft is both an important part of the game and, I would think, one of the most difficult for an AI: it involves bluffing, mind games, online strategy adjustment in response to an opponent's actions, and awareness of the current meta.

The draft isn't everything, and it's possible that a sufficiently talented AI could always lose the draft and still win the game, but that would be a pretty boring outcome from the perspective of contributing to AI knowledge (just as it's possible, though unlikely in Dota, that sufficiently good micro could overwhelm any disadvantage in strategy and tactics if the AI can play at 2000 APM: it would "win", but only in a very boring sense).


It can't really have 2000 APM because it observes 450 frames per minute.
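The arithmetic: 450 observations per minute works out to one observation roughly every 133ms, which (assuming at most one action per observation) also caps the bot at 450 APM.

```python
# 450 observations per minute -> gap between observations in milliseconds,
# which is also a hard APM ceiling at one action per observation.
obs_per_minute = 450
ms_between_obs = 60_000 / obs_per_minute
print(round(ms_between_obs, 1))  # ~133.3 ms
print(obs_per_minute)            # max APM: 450
```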


I chose that number sort of arbitrarily to just mean "very very fast", but I see your point.


Furthermore, they are limiting the reaction time to 200ms, to match good humans (I suspect some pros are actually faster than that) and remove any advantage there. So it doesn't have a meaningful advantage over pros in any mechanical/reaction time sense; it's truly just trying to play the game more intelligently from what I can tell.


Still has the advantage of simultaneously observing thousands of game variables from the API at a glance.


Ah yeah good point


With that reaction time it's trying to pass a Dota Turing test.


Based on the chart showing the effect of "We're still fixing bugs" in their last blog post, it looks like they should have the skills 'buffer' to handle significantly better teams than those they have faced so far.

https://blog.openai.com/openai-five/

"We’re still fixing bugs. The chart shows a training run of the code that defeated amateur players, compared to a version where we simply fixed a number of bugs, ..."

Looking at the chart and the fact that they are confident enough to lift several restrictions, I'd bet on OpenAI Five winning against at least some of the professional teams at The International. It's even possible they will beat most teams there.


I am embarrassed to say that I am confused about what the outputs are from the "Rapid" RL training system. Do you end up with an executable that then drives the game inputs/api? Does it produce a "bot script" that is used by the game to drive the logic? I understand that thousands of CPUs/GPUs are used for the training, but then what is actually playing the game at the end of the day?


(I work on the Dota team at OpenAI.)

The output is a trained neural network!


Hi gdb, next week I am giving a presentation on your awesome Dota work to the local data science community in Vancouver, BC. I have reviewed the info your team has released so far and I have a few questions:

- I saw no mention of CNNs, is it true CNNs are not used even for the 8x8 terrain grid input?

- do you have any comments about rapid+PPO vs say impala+vtrace? Would the ability to use more off-policy data be very helpful here?

- any comments on how you selected the reward constants?

- was the teamwork/tau something your team came up with or was this a known approach?

- the attention keys are most interesting, can you comment on why they don't flow through the LSTM? Does it make it easier for the network to quickly change unit attention, or is there some other reason?

- any comment on the choice of single-layer LSTM vs multilayer, ostensibly for operating on longer timescales?

- does this result mean that HRL is less critical than some people thought?

- any comment on magnitude of compute, like in the post from may?

Thank you for sharing your fascinating work!


Could you go into some more detail on the actual engineering mechanics? Does each bot have an instance of the neural net model that it runs on a separate PC? How often do you feed game state into the net? What's the output of the network (a bunch of movement/item/spell commands) that's fed back in through the game driver?


Oh, good question, I didn't think of that either. Is there one NN that consumes the state for each of the bot players and then returns the "next action" for that bot, or is there a separate NN for each bot? And does that NN run on the LAN machine, or is the LAN machine just running the game code and a Python agent mediating between the game code and the NN?


I think OP wants to know how the neural network actually plays the game. I think in this case the dota client has an api for bots that it can use?


Yes, there's a bot API.

We dump state from the bot API each tick and send it over gRPC to a Python agent, which formats the state into a tuple of Numpy arrays. That tuple is passed into 5 neural networks (one per agent), each of which returns a tuple of Numpy arrays. Each tuple is decoded into a semantic action, which is then returned to the game via gRPC.
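A toy sketch of that per-tick loop, for anyone trying to picture it. Every name here (encode_state, Policy, decode_action) and the tiny observation/action spaces are invented for illustration; this is not OpenAI's actual code or bot API.

```python
import numpy as np

def encode_state(game_state):
    # Flatten the dict of game variables into one observation vector
    # (the real system produces a tuple of many arrays).
    return np.array([game_state["gold"], game_state["hp"]], dtype=np.float32)

class Policy:
    """Stand-in for one hero's trained network: observation -> action logits."""
    def __init__(self, n_actions, seed):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(size=(2, n_actions))

    def forward(self, obs):
        return obs @ self.w  # one "action head" of logits

def decode_action(logits, names=("move", "attack", "cast")):
    # Turn raw network output back into a semantic action for the game.
    return names[int(np.argmax(logits))]

def step(game_state, policies):
    """One tick: state -> arrays -> per-hero networks -> semantic actions."""
    obs = encode_state(game_state)
    return [decode_action(p.forward(obs)) for p in policies]

heroes = [Policy(3, seed=i) for i in range(5)]  # one network per hero
print(step({"gold": 600.0, "hp": 0.8}, heroes))
```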


Is the entire system one agent, which is then replicated across 5 bot instances, or do you have a specific network per hero?


Does the entire NN run on the game laptop or is it passing the tuple back to OpenAI for processing?


NNs don't need much resources to run once trained. It's just a bunch of matrix multiplications.


So are 3D engines. It all depends on the size of the operands. Constant factors are important.


Are you seeing this? How can you let this go unchallenged?

https://twitter.com/eternalenvy1991/status/10196414446030520...


What is the purpose of having deep learning run on games like AlphaGo and DOTA2, instead of having them train on more general or real world tasks? Is it a constraint on the amount of data, since in video games you can easily generate more?


The data generation is indeed one key aspect of it. To train a reinforcement learning model such as this one, you do need an insane amount of data (they wrote somewhere that the model played the equivalent of 180 years of Dota per day).

Overall, games are a good playground to test ideas and verify assumptions. The next step to transfer this type of knowledge to real world problems would be to build a simulator, train on it using ungodly amounts of computing resources, and then fine-tune the final model on the real world thing. This has been done for robot control tasks in the past. But first, you have to develop and prove that the base learning algorithm works -- and games are nice for that.

This here is also a good showcase of collaboration learned by RL agents, and beating pro teams in an esport where prize pools range in the millions of dollars is an amazing way to convince people.
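Back-of-the-envelope on the "180 years per day" figure mentioned above: generating that much gameplay per wall-clock day implies an aggregate speedup of tens of thousands of times real time across all the parallel games.

```python
# How much faster than real time must the parallel games run, in aggregate,
# to produce 180 years of play per day?
years_per_day = 180
speedup = years_per_day * 365.25  # game-days generated per wall-clock day
print(round(speedup))             # ~65,745x real time
```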


- You can't have thousands of years of real world tasks for low cost

- Clearly defined goal


Training RL agents in the real world is expensive and thus not parallelizable. The current focus on games and VR simulations of robots is exactly because of this reason. The RL agents are much more "sample inefficient" than humans, meaning they need more experiences to learn a skill.

And we humans (and animals) have a huge environment with billions of agents and millions of years of evolution behind us, which allows us to come preloaded with good instincts. They are trying to replicate that process in a few months.


How do 5 DotA players coordinate? They share information via voice?

How does a deep learning algorithm coordinate between 5 heroes? I assume it's not 5 bots communicating over some channel, but one bot acting on 5 heroes?


Surprisingly, it's 5 completely separate bots:

"OpenAI Five does not contain an explicit communication channel between the heroes’ neural networks. Teamwork is controlled by a hyperparameter we dubbed “team spirit”"

- https://blog.openai.com/openai-five/
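A minimal sketch of the weighted-average idea behind "team spirit" (the exact published formula may differ; this is just to show how one hyperparameter can interpolate between selfish and fully shared rewards):

```python
import numpy as np

def team_spirit_rewards(own_rewards, tau):
    """Blend each hero's own reward with the team mean.

    tau=0: purely selfish heroes; tau=1: everyone optimizes the team mean.
    """
    own = np.asarray(own_rewards, dtype=float)
    return (1 - tau) * own + tau * own.mean()

r = [1.0, 0.0, 0.0, 0.0, -1.0]
assert np.allclose(team_spirit_rewards(r, 0.0), r)          # selfish
assert np.allclose(team_spirit_rewards(r, 1.0), [0.0] * 5)  # fully shared
```

With no communication channel, a shared reward like this is the only pressure toward coordinated behavior.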


So the bots do not communicate directly?


The bots presumably "learned" that the other heroes act the way that the bot would have acted were they in that position (i.e. all my allies run the same algorithm, so I can predict what they would do)


That seems like a pretty serious disadvantage


How does counter-picking work if it's no longer mirror match? Was this a separate model?


According to the article the bots are only doing random picks from a pool of 18 heroes for now. I imagine pick/ban will come with later iterations.


They will apparently be playing Random Draft mode, which normally means that it works similarly to All Pick, except for the fact that the players pick from a pool of 50 random heroes. How this will work with the 18-hero pool is something I don't know.


It's not that clear. They call it "Random Draft", but this picking mode in Dota 2 doesn't mean "doing random picks". "Doing random picks" is the "All Random" mode.



