OpenAI Five (blog.openai.com)
646 points by gdb on June 25, 2018 | 124 comments



Disclosure: I work on Google Cloud (and vaguely helped with this).

For me, one of the most amazing things about this work is that a small group of people (admittedly well funded) can show up and do what used to be the purview of only giant corporations.

The 256 P100 optimizers are less than $400/hr. You can rent 128,000 preemptible vCPUs for another $1280/hr. Toss in some more support GPUs and we're at maybe $2500/hr all in. That sounds like a lot, until you realize that some of these results ran for just a weekend.
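For a rough sense of scale, here's that arithmetic as a tiny script (the exact rates and the 48-hour "weekend" are illustrative assumptions, not official pricing):

  # Back-of-the-envelope cost for a weekend-long run, using the rates quoted above.
  # All numbers are illustrative assumptions, not official GCP pricing.
  gpu_optimizers_per_hr = 400       # ~256 P100 optimizers
  preemptible_vcpus_per_hr = 1280   # ~128,000 preemptible vCPUs
  support_gpus_per_hr = 820         # extra rollout/eval GPUs to round out ~$2,500/hr

  hourly = gpu_optimizers_per_hr + preemptible_vcpus_per_hr + support_gpus_per_hr
  weekend_hours = 48
  print(f"~${hourly:,}/hr -> ~${hourly * weekend_hours:,} for a weekend run")
  # ~$2,500/hr -> ~$120,000 for a weekend run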

In days past, researchers would never have had access to this kind of computing unless they worked for a national lab. Now it's just a budgetary decision. We're getting closer to a (more) level playing field, and this is a wonderful example.


I just want to comment that while this is true in principle, it's also slightly misleading because it does not account for how much tuning and testing is necessary before one gets to this result.

Determining the scale needed, fiddling with the state/action/reward model, massively parallel hyper-parameter tuning.

I may be overestimating, but I would reckon that with hyper-parameter tuning and all, the total was easily in the 7-8 figure range at retail cost.

This is slightly frustrating in an academic environment when people tout results for just a few days of training (even with much smaller resources, say 16 GPUs and 512 CPUs) when the cost of getting there is just not practical, especially for timing reasons. E.g. if an experiment runs 5 days, it doesn't matter that it doesn't use large-scale resources, because realistically you need hundreds of runs to evaluate a new technique and get it to the point of publishing the result, so you can only do that on a reasonable time scale if you actually have at least 10x the resources needed for a single run.

Sorry, slightly off topic, but it's becoming a more and more salient point from the point of view of academic RL users.


I hear you. I would say that this work is tantamount to what would normally be a giant NSF grant.

Depending on your institution, this is precisely why we (and other providers) give out credits though. Similar to Intel/NVIDIA/Dell donating hardware historically, we understand we need to help support academia.


Yes, thank you for that by the way, did not want to diminish your efforts. Just wanted to point out that papers are often misleading about how many resources are needed to get to the point of running the result. I have received significant amounts of money from Google, full disclosure.


That's so awesome. Thanks for the exchange you two had. I love seeing the technology permeate through its different causeways to become a useful and tangible product for more and more people. It's a thing of beauty to watch unfold each and every time, to me.


This is a very good point. While the final model might be a weekend of training, getting there takes a lot more iterations/work.


>> Toss in some more support GPUs and we're at maybe $2500/hr all in.

Amazing, indeed. That's only 5/8 of my entire travelling allowance, from my PhD studentship.

Hey, I'd even have some pocket money left over to go to a conference or two!


This is more than many academic positions pay (or cost the uni) in a year, especially in Europe. This is an absurd amount of money/resources, and more of a sign that this part of academia is not about outsmarting but outspending the "competition".


(I too work for Google Cloud)

I agree. One of the most amazing things about watching this project unfold is just how quickly it went from 0 to 100 with minimal overhead. It's amazing to watch companies and individuals push the boundaries of what is possible with just the push of a button.


Agree 100%, pay as you go compute has helped us tremendously. A large amount of our time is spent analysing results and interpreting models and the ability to power up and train a new topology without the huge cap-ex is the reason my company is still alive!


I agree that $2,500 x 48 hrs is probably a reasonable cost to pay for these kinds of sweet results. But it is a bit prohibitively expensive for an ML hobbyist to try to replicate in their own free time. I wonder if there is some way to do this w/o all the expensive compute. Pre-trained models are one step towards this, but so much of the learning (for the hobbyist) comes from struggling to get your RL model off the ground in the first place.


It'd be interesting to see in the graphs (when the OpenAI team gets to them) how good you get at X hours in. Because if you're pretty good at X=4, that's still amazing.

Edit: I guess https://blog.openai.com/content/images/2018/06/bug-compariso... is approximately indicative (you currently need about 3 days to beat humans).


Transfer learning is about the best we can do right now. Using a fully trained ResNet / Xception and then tacking on your own layers at the end is within reach for hobbyists with just a single GPU on their desktop. There's still a decent amount of learning for the user even with pre-trained models.
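A minimal sketch of that in PyTorch (the 10-class head and hyperparameters are placeholders, not a recommendation):

  import torch
  import torch.nn as nn
  from torchvision import models

  # Load an ImageNet-pretrained ResNet-50 and freeze its weights.
  backbone = models.resnet50(pretrained=True)
  for param in backbone.parameters():
      param.requires_grad = False

  # Replace the final fully-connected layer with a new head for your own task.
  backbone.fc = nn.Linear(backbone.fc.in_features, 10)  # 10 classes is a placeholder

  # Only the new head is trained, which comfortably fits on a single desktop GPU.
  optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)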


+1, this is what I do for my at-home (non-work) experiments using word embeddings and RNNs for generative text summarization. Using transfer learning makes this affordable as a hobby project.


Quoting from the original article.

> This logic takes milliseconds per tick to execute, versus nanoseconds for Chess or Go engines.

So this is the game engine itself taking up the CPUs. Maybe the Dota code can be optimized 2x for self-play?!

IIRC AlphaZero was about 10x more efficient than AlphaGo Zero due to algorithmic improvements.

So overall, $100K for the final training run, which maybe can go down to $10K for a different domain of similar complexity.


Interesting question! I assume in the bot/headless mode it's pretty optimized to skip the parts needed only for rendering, but you still need to do enough physics and other state updates.

Best case, I'd assume at least a few ms per tick, because games tend to become as complex as possible while still fitting in 30 fps (33 ms per frame, much of which is rendering, but a lot still happens regardless of producing pixels).


> Maybe the DoTA code can be optimized x2 for self play?!

Please don't. Every time they change something, several other things break.

Ok, just kidding.

But their fix logs really make it look like the game logic is built by piling hack on top of hack with no automated testing. Everything seems to rest on playtesting.


Does the approach work at low scale though? Like, would this project only bear fruit when run at large scale?

Getting budgetary approval isn't easy for everyone. Especially with an unproven process. And even then, there could be a mistake in the pipeline. All that money down the drain.


Good question! RL (and ML generally) definitely works better as you add more scale, but I still feel that this particular work is roughly "grand challenge" level. You shouldn't expect to just try this out as your first foray :).

I will note this paragraph from the post:

> RL researchers (including ourselves) have generally believed that long time horizons would require fundamentally new advances, such as hierarchical reinforcement learning. Our results suggest that we haven’t been giving today’s algorithms enough credit — at least when they’re run at sufficient scale and with a reasonable way of exploring.

which is mostly about the challenge of longer time horizons (and therefore LSTM related). If your problem is different / has a smaller space, I think this is soon going to be very approachable. For instance, we recently demonstrated training ResNet-50 for $7.50.

There certainly exist a set of problems for which RL shouldn't cost you more than the value you get out of it, and for which you can demonstrate enough likelihood of success. RL itself though is still at the bleeding edge of ML research, so I don't consider it unusual that it's unproven.


Depends on the function you're approximating.


Great work! Having access to this scale of computing for so cheap really is amazing


So as someone working in reinforcement learning who has used PPO a fair bit, I find this quite disappointing from an algorithmic perspective.

The resources used for this are almost absurd, and my suspicion is, especially considering [0], that this comes down to an incredibly expensive random search in the policy space. Or rather, I would want to see a fair bit of analysis to be convinced otherwise.

Especially given all the work in intrinsic motivation, hierarchical learning, subtask learning, etc., the sort of intermediate summary of most of these papers from 2015-2018 is that so many of these newer heuristics are too brittle/difficult to make work, so we resort to something only slightly better than brute force.

https://arxiv.org/abs/1803.07055


(I work at OpenAI on the Dota team.)

Dota is far too complex for random search (and if that weren't true, it would say something about human capability...). See our gameplay reel for an example of some of the combos that our system learns: https://www.youtube.com/watch?v=UZHTNBMAfAA&feature=youtu.be. Our system learns to generalize behaviors in a sophisticated way.

What I personally find most interesting here is that we see qualitatively different behavior from PPO at large scale. Many of the issues people pointed to as fundamental limitations of RL are not truly fundamental, and are just entering the realm of practical with modern hardware.

We are very encouraged by the algorithmic implication of this result — in fact, it mirrors closely the story of deep learning (existing algorithms at large scale solve otherwise unsolvable problems). If you have a very hard problem for which you have a simulator, our results imply there is a real, practical path towards solving it. This still needs to be proven out in real-world domains, but it will be very interesting to see the full ramifications of this finding.


Thank you for taking the time to respond, I appreciate it.

Well, I guess my question about the expense comes down to wondering about the sample efficiency, i.e. are there not many games that share large, similar state trajectories that can be re-used? Are you using any off-policy corrections, e.g. IMPALA-style?

Or is that just a source of noise that is too difficult to deal with, and/or is the state space so large and diverse that that many samples are really needed? Maybe my intuition is just way off; it just feels like a very, very large sample size.

Reminds me slightly of the first version of the non-hierarchical TensorFlow device placement work, which needed a fair number of samples, and the large sample-efficiency improvement in the subsequent hierarchical placer. So I recognise there is large value in knowing the limits of a non-hierarchical model now, and subsequent models should rapidly improve sample efficiency by doing similar task decomposition?


The best way we know to think of it is in terms of variance of the gradient.

In a hard environment, your gradients will be very noisy — but effectively no more than linear in the duration you are optimizing over, provided that you have a reasonable solution for exploration. As you scale your batch size, you can decrease your variance linearly. So you can use good ol' gradient descent if you can scale up linearly in the hardness of the problem.

This is a handwavy argument admittedly, but seems to match what we are seeing in practice.
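A toy illustration of the batch-size half of that argument (pure simulation with made-up noise, nothing Dota-specific): averaging B independent noisy gradient estimates cuts the variance roughly by a factor of B.

  import numpy as np

  rng = np.random.default_rng(0)
  true_grad = 1.0
  noise_std = 10.0   # stand-in for a "hard", high-variance environment

  def batch_gradient(batch_size):
      # Average batch_size independent noisy estimates of the true gradient.
      samples = true_grad + noise_std * rng.standard_normal(batch_size)
      return samples.mean()

  for batch_size in (1, 100, 10_000):
      estimates = [batch_gradient(batch_size) for _ in range(1_000)]
      print(batch_size, np.var(estimates))  # variance shrinks ~linearly with batch size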

Simulators are nice because it is possible to take lots of samples from them — but there's a limit to how many samples can be taken from the real world. In order to decrease the number of samples needed from the environment, we expect that ideas related to model-based RL — where you spend a huge number of neural network flops to learn a model of the environment — will be the way to go. As a community, we are just starting to get fast enough computers to test out ideas there.


Yo, this probably isn't the type of HN comment you're used to, but I just wanted to say thanks for enriching the dota community. I know that's not really why you're doing any of this, but as someone who's deeply involved with the community, people get super hyped about what you guys have been doing.

They also understand all of the nuances, similar to HN. Last year when you guys beat Arteezy, everyone grokked that 5v5 was a completely different and immensely difficult problem in comparison. There's a lot of talent floating around /r/dota2, amidst all the memes and silliness. And for whatever reason, the community loves programming stories, so people really listen and pay attention.

https://imgur.com/Lh29WuC

So yeah, we're all rooting for you. Regardless of how it turns out this year, it's one of the coolest things to happen to the dota 2 scene period! Many of us grew up with the game, so it's wild to see our little mod suddenly be a decisive factor in the battle for worldwide AI dominance.

Also 1v1 me scrub


Agreed! Can't wait to not have to play Dota 2 with humans :p


> Also 1v1 me scrub

I wanted to play SF against the bot so badly - even knowing I'd get absolutely destroyed over and over again


EDIT (I work at OpenAI and wrote the statement about the variance of the gradient being linear): Here's a more precise statement: the variance is exponential in the "difficulty" of the exploration problem. The harder the exploration, the worse the gradient. So while it is correct that things become easy if you assume that exploration is easy, the more correct way of interpreting our result is that the combination of self-play and our shaped reward made the gradient variance manageable at the scale of compute that we've used.


> In order to decrease the number of samples needed from the environment, we expect that ideas related to model-based RL — where you spend a huge number of neural network flops to learn a model of the environment — will be the way to go.

Will those models be introspectable / transferable? One thing I'm curious about is how AIs learn about novel actions / scenarios which are "fatal" in the real world. Humans generally spend a lot of time being taught these things (rather than finding out for themselves, obviously) and eventually come up with a fairly good set of rules about how not to die in stupid ways.


Transferability depends on the way the model is set up, and sits on a sliding scale.

Introspectable: given that you can ask the models unlimited "what if" questions, we should be able to get a lot of insight into how the models work internally. And you can often design them to be introspectable at some performance or complexity cost (if that's what you meant by introspectable).


Can you clarify why variance only scales linearly in the duration you are optimizing over? I would have expected it to be exponential, since the size of the space you are searching is exponential in the duration.


Re variance, the argument is not entirely bullet proof, but it goes like this: we know that the variance of the gradient of ES grows linearly with the dimensionality of the action space. Therefore, the variance of the policy gradient (before backprop through the neural net) should similarly be linear in the dimensionality of the combined action space, which is linear in the time horizon. And since backprop through a well-scaled neural net doesn't change the gradient norm too much, the absolute gradient variance of the policy gradient should be linear in time horizon also.

This argument is likely accurate in the case where exploration is adequately addressed (for example, with a well chosen reward function, self play, or some kind of an exploration bonus). However, if exploration is truly hard, then it may be possible for the variance of the gradient to be huge relative to the norm of the gradient (which would be exponentially small), even though the absolute variance of the gradient is still linear in the time horizon.
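Condensed into symbols, the claim reads roughly like this (my paraphrase of the argument above, not a rigorous derivation):

  % Variance of the gradient estimate grows linearly with action dimensionality,
  % which itself grows linearly with the time horizon T:
  \operatorname{Var}[\hat{g}] \;\propto\; \dim(\text{combined action space}) \;\propto\; T
  % With hard exploration the true gradient norm can be exponentially small, so the
  % relative noise blows up even though the absolute variance is only linear in T:
  \frac{\operatorname{Var}[\hat{g}]}{\lVert \nabla J \rVert^{2}} \;\gg\; 1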



That makes sense, thanks for clarifying!


> Dota is far too complex for random search

Why? We know that random search is smart enough to find a solution if given arbitrarily large computation. So it's not obvious that random search isn't smart enough for Dota with the computational budget you used. Maybe random search would work with 2x your resources? Maybe something slightly smarter than random search (simulated annealing) would work with 2x your resources?

> and if that weren't true, it would say something about human capability

No it would not. A human learning a game by playing a few thousand games is a very different problem than a bot using random search over billions of games. The policy space remains large, and the human is not doing a dumb search, because the human does not have billions of games to work with.

> See our gameplay reel for an example of some of the combos that our system learns

> Our system learns to generalize behaviors in a sophisticated way.

You're underestimating random search. It's ironic, because you guys did the ES paper.


> If you have a very hard problem for which you have a simulator, our results imply there is a real, practical path towards solving it.

Are there that many domains for which this is relevant?

Game AI seems to be the most obvious case and, on a tangent, I did find it kind of interesting that DeepMind was founded to make AI plug and play for commercial games.

But unless Sim-to-Real can be made to work it seems pretty narrow. So it sort of seems like exchanging one research problem (sample-efficient RL) for another.

Not to say these results aren't cool and interesting, but I'm not sold on the idea that this is really practical yet.


Simulation to real learning seems to be slowly and steadily improving? Eg as seen in https://ai.googleblog.com/2017/10/closing-simulation-to-real...

Transfer learning, which seems more widely researched, has also been making progress at least in the visual domain.


There seems to be a bunch of work in this area, but I have no idea how you measure progress; it's not like you can do evaluations on a shared task.

And it's clearly not solved yet either - a 76% grab success rate doesn't really seem good enough to actually use, and that was with 100k real runs.

I don't really know how to compare the difficulty of sim-to-real transfer research to sample efficient RL research, and it's good to have both research directions as viable, but neither seems solved, so I'm not really convinced that "just scaling up PPO" is that practical.

I'm hoping gdb will be able to tell me I'm missing something though.


>> Our system learns to generalize behaviors in a sophisticated way.

Could you elaborate? One of the criticisms of RL and statistical machine learning in general is that models generalise extremely poorly, unless provided with unrealistic amounts of training data.


Why Dota and not something like adverse-weather helicopter flying which is more "useful"?


If I had to guess, I would say that Dota is a very complex environment, possibly akin to real-world complexity, that is simulatable to the point that the simulation and the real game work identically. The real world isn't nearly as clean; however, as we get better and better at these "toy" examples, we could likely learn more efficiently on real-world problems.


I think the "simple random search" algorithm in the paper you linked is not so simple -- it's basically using numerical gradient descent with a few bells and whistles invented by the reinforcement learning community in the past few decades. So perhaps it would be more fair to say that gradient descent (not random search) has proven to be a pretty solid foundation for model-free reinforcement learning.


Yes, I am aware; I did not mean random search as in random actions, but random search with improved heuristics to find a policy.

The point being that the bells and whistles of PPO and other relatively complicated algorithms (e.g. Q-Prop), namely the specific clipped objective, subsampling, and a (in my experience) very difficult-to-tune baseline using the same objective, do not significantly improve over gradient descent.

And I think Ben Recht's arguments [0] expand on that a bit in terms of what we are actually doing with the policy gradient (not using a likelihood-ratio model like in PPO), but still conceptually similar enough for the argument to hold.

So I think it comes down to two questions: How much do 'modern' policy gradient models improve on REINFORCE, and how much better is REINFORCE really than random search? The answer thus far seemed to be: not that much better, and I am trying to get a sense of if this was a wrong intuition.

[0] http://www.argmin.net/2018/02/20/reinforce/


When optimizing high-dimensional policies, the gap in sample complexity between PPO (and policy gradient methods in general) and ES / random search is pretty big. If you compare the Atari results from the PPO and ES papers from OpenAI, PPO after 25M frames is better than ES after 1B frames. In these two papers, the policy parametrization is roughly the same, except that ES uses virtual batchnorm. For DOTA, with a much bigger policy, I'd expect the gap between ES and PPO to be much bigger than for Atari.

My takeaway from [0] and Rajeswaran's earlier paper is that one can solve the MuJoCo tasks with linear policies after appropriate preprocessing, so we shouldn't take them too seriously. That paper doesn't do an apples-to-apples comparison between ES and PG methods on sample complexity.

All of that said, there's not enough careful analysis comparing different policy optimization methods.

(Disclaimer: I am an author of PPO)


This article (like pretty much all from OpenAI) is really well done. I love the format and supporting material - makes it waay more digestible and fun to read in comparison to something from arxiv. The video breakdowns really drive the results home.


To be fair, there is very little technical content... I don't think they could repackage this content into an arxiv-style paper if they wanted to.


Good point - but I think that the difference is valuable. If that is the average person's first touch point with the content, then it would do a better job of making it accessible than a technical paper. Agreed that a follow-up detailed post or paper would be awesome!


Far too many hyperlinks though. Who clicks on hyperlinks for words like "defeat", "complex", "train" and "move"? Seems like if I link them then they'll link me and we'll all get higher ranking search results. Maybe I'm the only one who gets annoyed by this.


It is essentially the same frequency of links that you'd see on any Wikipedia article. In a field where there is an enormous amount of jargon, it is probably a good thing that they clearly define as much as possible.


I agree. We have to deal with both the standard ML jargon, and lots of Dota terms in this article. How am I supposed to know what "Creep blocking" is?


It's blocking creeps, obviously!

Which, in turn, requires you to understand the concept of what a creep is, and how blocking them contributes to creep equilibrium (and what creep equilibrium is) and how the various states of equilibrium contribute to gameplay, and how/why/when you want to manipulate that (for example, you want to block creeps at some early points in the game so your opponent has to attack uphill, but between those particular points in time you want to push your creeps in deeper to ensure you have time to complete other objectives). :)

Obviously, you don't need to know anything above, but once you start diving into the depth of things OpenAI (and human players) deal with every game, it gets pretty insane that a bot can learn at such a high level so quickly.


Agreed. If the choice is between fewer hyperlinks with a higher bar of entry with jargon and a ton of hyperlinks that let a reader dig deeper where necessary, I'd choose the latter 100/100 times.


This is a really interesting writeup, especially if you know a bit more about how Dota works.

That it managed to learn creep blocking from scratch was really surprising for me. To creep block you need to go out of your way to stand in front of the creeps and consciously keep doing so until they reach their destination. Creep blocking just a bit is almost imperceptible and you need to do it all the way to get a big reward out of it.

I also wonder if their reward function directly rewarded good lane equilibrium or if that came indirectly from the other reward functions


They link to a short description of the reward function in the blog: https://gist.github.com/dfarhi/66ec9d760ae0c49a5c492c9fae939...


It's not really "from scratch". The bots are rewarded for the number of creeps they block, so it's not impossible that they would find some behavior to influence this score.


That was true for their original 1v1 bot, but in the latest blog post they mention bots can learn it on their own if left to train longer.


That's not rigorously supported. It's just an anecdote they mention off-hand. The final version of the bot does use the creep block reward.


To be clear:

- The 1v1 bot played at The International used a special creep block reward (and a big if statement separating that part of the agent from the self-play trained part). It trained for two weeks.

- A 2v2 bot discovered creep blocking on its own, no special reward. It trained for four weeks.

- OpenAI Five does not have a creep blocking reward, but neither (to our knowledge) does it creep block currently. Trained for 19 days!


I see. Thanks! So it manages to win lanes without even creep blocking? That's quite good. Any chance you could share the last hits @ 10 mins for the games it has played (for both bots and humans)? I think that's a crucial number to judge how OpenAI Five is winning its games.


I believe the article said that Blitz rated the bot last-hitting at about average for humans, although he might over-rate what an average human player last hits like.


Yeah, he might be overestimating 2.5k mmr players, and there's also something to be said about the consistency by which the bot last hits. A human player would have a high variance of last-hit performance, while the bot will probably guarantee a minimum amount, thus ensuring a minimum set of items needed for the mid-game transition.

But my larger point is, the early game doesn't have a lot of strategic elements in it. You have to last hit, not die, harass opponent, get items. You can play it by the book pretty much. The challenge in early game is to be able to handle 5 different things at the same time. So there's never really a question of what to do, but doing it does require mechanical prowess, which we know bots can easily be better at, than humans.

The team composition chosen is very early game snowball oriented. So is the bot winning simply due to mechanical superiority and early game advantage? Access to last hits @ 10 mins, gold and net worth graphs would allow us to answer that question.


They are using preemptible CPUs/GPUs on Google Compute Engine for model training? Interesting. The big pro of that is cost efficiency, which isn't something I expected OpenAI to be optimizing. :P

How does training RL with preemptible VMs work when they can shut down at any time with no warning? A PM of that project asked me the same question a while ago (https://news.ycombinator.com/item?id=14728476) and I'm not sure model checkpointing works as well for RL. (maybe after each episode?)


(I work at OpenAI on the Dota team.)

Cost efficiency is always important, regardless of your total resources.

The preemptibles are just used for the rollouts — i.e. to run copies of the model and the game. The training and parameter storage is not done with preemptibles.
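A sketch of why that split is preemption-tolerant (the function and queue names here are hypothetical, not OpenAI's actual infrastructure):

  def rollout_worker(get_latest_weights, play_one_rollout, trajectory_queue):
      # Preemptible worker: pull fresh weights, play one game, push the result.
      # If the VM is preempted, at most the in-flight rollout is lost; the
      # non-preemptible optimizers and parameter storage are unaffected.
      while True:
          weights = get_latest_weights()          # from non-preemptible storage
          trajectory = play_one_rollout(weights)  # run the game with the current policy
          trajectory_queue.put(trajectory)        # consumed by the GPU optimizers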


If these (or other similar) experiments show the viability of this network architecture, the cost could be decreased a lot with the development of even more specialized hardware.

Also one could look at the cost of the custom development of bots and AIs using other more specialized techniques: sure, it might require more processing power to train this network, but it will not require as much specialized human interaction to adapt this network to a different task. In which case, the human labor cost is decreased significantly, even if initial processing costs are higher. So in a way you guys do actually optimize cost efficiency.


You should probably introduce yourself as an OpenAI co-founder :)


Disclosure: I work on Google Cloud (and with OpenAI), though I'm not a PM :).

As gdb said below, the GPUs doing the training aren't preemptible. Just the workers running the game (which don't need GPUs).

I'm surprised you felt cost isn't interesting. While OpenAI has lots of cash, that doesn't mean they shouldn't do 3-5x more computing for the same budget. The 256 "optimizers" cost less than $400/hr, while if you were using regular cores the 128k workers would be over $6k/hr. So using preemptible is just the responsible choice :).

There's lots of low hanging fruit in any of these setups, and OpenAI is executing towards a deadline, so they need to be optimizing for their human time. That said, I did just encourage the team to consider checkpointing the DOTA state on preemption though, to try to eke out even more utilization. Similarly, being tighter on the custom shapes is another 5-10% "easily".

Don't forget, they're hiring!


>OpenAI Five does not contain an explicit communication channel between the heroes’ neural networks. Teamwork is controlled by a hyperparameter we dubbed “team spirit”. Team spirit ranges from 0 to 1, putting a weight on how much each of OpenAI Five’s heroes should care about its individual reward function versus the average of the team’s reward functions. We anneal its value from 0 to 1 over training.

A bit disappointing, it would be very cool to see what kind of communication they'd develop.


Would be interesting to see if, when one agent declines to help another several times, the other one decides against helping him when he calls. The logical explanation would then be that the agent comes to value his own life more than his comrade's (because he is helping, and his comrade has refused several times). The human explanation would be that he refuses to help out of spite. It could even lead to those two agents "hating" each other, though it would be more like cold calculation.


yeah, putting an LSTM hidden state into a communication channel seems to be the most expressive way to build up teamwork.


How would you build that?


I wanted to add the observation that all the heroes in the restricted pool are ranged: Necrophos, Sniper, Viper, Crystal Maiden, and Lich.

Since playing a lane as a ranged hero is very different from playing the same lane as a melee hero, I wonder whether the AI has learned to play melee heroes yet.


Not only are they ranged, but this lineup is very snowball-oriented, i.e. the optimal play style with this kind of lineup is to gain a small advantage in the early game and then keep pushing towers together aggressively. The middle-to-late game doesn't really matter. Whoever wins the early game wins the game. And we do know that bots are going to be good at early game last hitting.


The article states the bots are actually rather mediocre at last hitting.


I've played DotA for over 10 years so this development is quite relevant to me. So excited to see this next month!

Although it's extremely impressive, all the restrictions will definitely make this less appealing to the audience (shown in the Reddit thread comments).


Thanks! The restrictions are a WIP, and will be significantly lifted even by our July match.


> Partially-observed state. Units and buildings can only see the area around them. The rest of the map is covered in a fog...

Actually, this is true on multiple levels. There is fog of war, but then there is the fact that a human player can only look at a given window of the game at a time, and has to pan the window to see the area away from their character. (The mini-map shows some level of detail for the rest of the map, but isn't high resolution and doesn't show everything that might be of interest.) Also, you can only issue orders on what is directly visible to you, so if you pan away from your character that restricts what you can do.

Is OpenAI Five modeling this aspect of the game? Otherwise it's still "cheating" in some sense vs how a human would be forced to play.


I'm pretty sure they are not. From https://blog.openai.com/openai-five#differencesversushumans:

>OpenAI Five is given access to the same information as humans, but instantly sees data like positions, healths, and item inventories that humans have to check manually. Our method isn’t fundamentally tied to observing state, but just rendering pixels from the game would require thousands of GPUs.


While this is a cool result, I wonder if the focus on games rather than real-world tasks is a mistake. It was a sign of past AI hype cycles when researchers focused their attention on artificial worlds - SHRDLU in 1970, Deep Blue for chess in the late 1990s. We may look back in retrospect and say that the attention DeepMind got for winning Go signaled a similar peak. The problem is that it's too hard to measure progress when your results don't have economic importance. It's more clear that the progress in image processing was important because it resulted in self-driving cars.


Firstly, research into Chess AI has had a surprising amount of beneficial spin-off, even if we don't call the result "AI".

Secondly, while it's still a simplification and abstraction, DotA's ruleset is orders-of-magnitude more similar to operating in the real world than Chess's is.

Thirdly, I'd argue that the adversarial nature of games makes it _easier_ to track progress, and to ensure that measure of progress is honest.

There's a lot of ways you can define "progress" in self-driving cars. Passengers killed per year in self-driving vs. human-driven cars? Passengers killed per passenger-mile? Average travel time per passenger-mile in a city? etc.

With games, you either win, or you don't.


Another benefit of showing off progress with games is it allows the everyday reader to follow and understand it as well. It works great from a public-awareness standpoint, especially when an AI can beat a human (e.g. Garry Kasparov vs Deep Blue). Awareness is a good thing in the space.


Will one agent control all 5 players, or will each agent control a single player?

One of the hard challenges of Dota is whether or not to "trust" your teammate to do the right action. E.g. one can aggressively go for a kill knowing that their support will back them up... but one can also aggressively go for a kill while their support lets them die, and then the whole team starts blaming and tilting because the dps "threw". It's a fine balance. From personal experience, it seems like in lower leagues it's better to always assume that you're by yourself, whereas in higher leagues you can start expecting more team plays.

Another example: many players will often use their ultimate abilities at the same time, "wasting" them. It would be easy for an agent controlling all 5 players to avoid this... but how would an individual agent know whether or not to use their ult? Are the agents able to communicate with each other? If so, is there a cap on how fast they can do it? E.g. over voice, it takes a few seconds to give orders.


Seems it's five individual agents with no communication, just a reward function that shifts towards team-based rewards:

"OpenAI Five does not contain an explicit communication channel between the heroes’ neural networks. Teamwork is controlled by a hyperparameter we dubbed “team spirit”. Team spirit ranges from 0 to 1, putting a weight on how much each of OpenAI Five’s heroes should care about its individual reward function versus the average of the team’s reward functions. We anneal its value from 0 to 1 over training."

So pretty much like pubs.
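A minimal sketch of what that blending could look like (the function and values are guesses from the quoted description, not OpenAI's code):

  import numpy as np

  def blended_rewards(individual_rewards, team_spirit):
      # team_spirit = 0 -> each hero is purely selfish;
      # team_spirit = 1 -> each hero optimizes the team-average reward.
      # OpenAI anneal team_spirit from 0 to 1 over training.
      individual_rewards = np.asarray(individual_rewards, dtype=float)
      team_average = individual_rewards.mean()
      return (1 - team_spirit) * individual_rewards + team_spirit * team_average

  print(blended_rewards([1.0, 0.0, 0.0, 0.0, -0.5], team_spirit=0.3))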


It would be pretty interesting to see one or two of the bots playing with humans on their team.


The Dota AI system has 3 'levels' to it - team level, mode level and action level. The mode/action level can choose to ignore or respect the team level as it sees fit. [1] Additionally, they say under the "Coordination" section:

OpenAI Five does not contain an explicit communication channel between the heroes’ neural networks. Teamwork is controlled by a hyperparameter we dubbed “team spirit”. Team spirit ranges from 0 to 1, putting a weight on how much each of OpenAI Five’s heroes should care about its individual reward function versus the average of the team’s reward functions. We anneal its value from 0 to 1 over training.

To me, that reads as 5 individual agents, one for each character.

[1]: https://developer.valvesoftware.com/wiki/Dota_Bot_Scripting#...


They can't mess up the ults because it's very improbable that they'll cast them on the same tick.

They can't mess up the help in the teamfights because they see the intentions of each other by the way the heroes move.

That's why Blitz is saying that the bots are perfect in a teamfight.


I think this is quite impressive. I'm a bit confused about the section saying that "binary rewards can give good performance". Is it saying that binary rewards (instead of continuous rewards) work fine, but end-of-rollout rewards (instead of intermediate rewards such as kills) work poorly?


Binary rewards (the win/loss score at the end of the rollout) scored a "good" 70.

With sparse rewards (kills, health, etc.), it scored a better 80 and learned much faster.

Normally, "reward engineering" uses human knowledge to give more continuous, richer rewards. This was not used here.


Perhaps we are looking at a different graph, but in the one I am looking at, blue is "sparse" (plateaus at 70) and orange is "dense" (very quickly hits 80). I believe "dense" means they are doing reward engineering.


The "sparse blue graph" is just the binary win loss outcome - learns ok-ish but slow

The "dense orange graph" - uses more dense rewards - kills, health - and learns better. I referred to this as a "sparse reward" - since it is still a fairly lean and sparse function.

But this is just my opinion. Also note this is for the older 1v1 agent.

The current reward function is even more detailed, and they blend and anneal the 5 agent score, so i dunno...

https://gist.github.com/dfarhi/66ec9d760ae0c49a5c492c9fae939...
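To pin down the terminology, a toy sketch of the two reward styles being compared (the signals and weights are made up for illustration; the real values are in the gist above):

  def binary_reward(game_over, won):
      # "Sparse": no signal until the game ends, then +1 / -1 for a win / loss.
      if not game_over:
          return 0.0
      return 1.0 if won else -1.0

  def shaped_reward(tick_stats):
      # "Dense": small per-tick signals such as kills, deaths, last hits.
      return (0.5 * tick_stats["kills"]
              - 0.5 * tick_stats["deaths"]
              + 0.1 * tick_stats["last_hits"]
              + 0.01 * tick_stats["net_worth_delta"])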


I want to see this datapoint on their AI and Compute chart: https://blog.openai.com/ai-and-compute/


>Each of OpenAI Five’s networks contain a single-layer, 1024-unit LSTM that sees the current game state (extracted from Valve’s Bot API)

This will likely dramatically simplify the problem vs. what the DeepMind/Blizzard framework does for StarCraft II, which provides a game state representation closer to what a human player would actually see. I would guess that the action API is also much more "bot-friendly" in this case, i.e., it does not need to perform low-level actions such as box-selecting units.
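For readers who haven't seen this kind of setup, a bare-bones version of such a policy in PyTorch (the observation size and action count are placeholders; the real model is far larger and more structured):

  import torch
  import torch.nn as nn

  class TinyDotaPolicy(nn.Module):
      # A single-layer, 1024-unit LSTM over a flat game-state vector,
      # with a linear head scoring a fixed set of discrete actions.
      def __init__(self, obs_dim=512, num_actions=1000, hidden=1024):
          super().__init__()
          self.lstm = nn.LSTM(obs_dim, hidden, num_layers=1, batch_first=True)
          self.action_head = nn.Linear(hidden, num_actions)

      def forward(self, obs_seq, state=None):
          features, state = self.lstm(obs_seq, state)
          return self.action_head(features), state

  policy = TinyDotaPolicy()
  logits, state = policy(torch.zeros(1, 1, 512))  # one observation step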


Definitely reduces the problem excessively, even inside the game itself they have a big list of restrictions for items and heroes.

It makes sense to solve this easier problem first as there will be more headlines faster


The problem they're trying to solve is also not how to recognise actions from pixels, it's how to outstrategise and outexecute players at the game. Conceptual rather than mechanical advantage.


Wow, very excited about this. I don't know too much about RL, but for me the "170,000 possible actions per hero" seems far too large an output space to be feasible. What happens if the bot wants to do an invalid action? Nothing, or some penalty for selecting something invalid?


OpenAI is cover up research AI for the CIA. The main goal will be to kill innocent folks with this type of AI research. These folks are working for CIA without noticing the involvement of The Spy Agency. They are ostensibly private institutions and businesses which are in fact financed and controlled by the CIA. From behind their commercial and sometimes non-profit covers, the agency is able to carry out a multitude of clandestine activities—usually covert-action operations. Many of the firms are legally incorporated in Delaware because of that state's lenient regulation of corporations, but the CIA has not hesitated to use other states when it found them more convenient. The NSA/CIA's best-known proprietaries are Amazon, facebook, Microsoft, Palantir, OpenAI (cover up research AI via non-profit) and Google.... Good luck with working inside a military research without decoding the source of funding.


Are those 180 years of games "seeded" by real games, or was it entirely self play?

Also, how does this system cope with gameplay changes that arise when the game is patched? It's no news to any experienced Dota player that even small changes can have a major impact on the metagame and on winning strategy. Would it need to be re-trained every patch?


> Are those 180 years of games "seeded" by real games, or was it entirely self play?

The writeup implies that it's entirely self-play.

> Also, how does this system cope with gameplay changes that arise when the game is patched?

From the sound of it, they don't. Since it's a policy gradient method, which learns only from the last set of samples, hypothetically, they could simply swap out the DoTA binary on the fly in parallel and let it automatically update itself by continued training. (The difference between optimal pre/post-patch is a lot smaller than the difference between a random policy and an optimal policy...)


Makes sense on both counts.

The fact that it's not seeded at all is very interesting. A lot of Dota expertise derives from knowing what the opponent is going to do at a particular time. I remember many comments from experienced Go players that AlphaGo made moves that no human player would make, so I wonder if that will appear in this case as well.


They do discuss some current differences in playstyle toward the bottom, like faster openings and more use of support heroes, which the self-play has invented (along with rediscovering standard tactics). So it's at least a little different.

Whether these are better is hard to say. It's not superhuman, after all, unlike AlphaGo, so it's not presumptively right, and you can't doublecheck by doing a very deep tree evaluation (because DoTA doesn't lend itself to tree exploration - far too many actions and long-range).


What are the 170,000 discrete actions?

Rough guesses for available actions:

  32 (directions for movement)
  + 10 (spell/item activations) * 20 (potential targets: heroes + nearby creeps)
  + 15 (attack commands: 5 enemy heroes and ~10 nearby creeps)

Which still leaves... approximately 170,000 actions unaccounted for.


You can attempt to attack / move to / cast many spells on arbitrary pixels on the map. The bots are shown casting spells on targets that aren't visible in the demo. The number of available targets probably blows up the count.


I figured this much as well, and I think this also begins to explain why some of the restrictions exist, and how difficult it would be to generalize this to the entirety of the Dota action space. I'm assuming they were pretty smart at defining and limiting the possible action space to get down to 170K. For example, restricting the hero pool down to 5 heroes which only have a reasonably small number of options in a reasonably small radius around them (I think Sniper's Q ability might lead to the highest number of discretized actions among their chosen hero pool), banning Boots of Travel (though I suppose this shouldn't add too many actions since you have to TP to a friendly unit of which there are not that many, so maybe this doesn't pose a problem with respect to the action space size, but it does have strategic implications), etc.

For a hero like invoker who can cast Sunstrike anywhere on the map at any time, would you try to come up with domain heuristics (only consider locations in the map near enemy heroes), or deal with an explosion of possible actions (and this applies to a ton of different hero mechanics that are not in scope here)?


If the goal of the project is generalization, you likely want to shy away from opinionated heuristics like the former you mention.

In the development for the Magic the Gathering AI (Duels), one of the restrictions is "don't cast harmful spells on your targets" even though for some edge cases this is actually the optimal thing to do. They traded a smaller search space at the expense of optimality.


I see. It seems like training a model for a hero with only targeted spells would be much faster than training a model for a hero which can cast spells at arbitrary map locations. I don't play DoTA, so not sure how many such heroes exist, or if a team comp of only heroes with targeted abilities would even be viable


Bot Crystal Maiden was killing all right by casting non-targeted spells into the fog (where the enemy ran to hide).

Sniper and Viper have non-targeted abilities that they were using to zone the enemy in a teamfight.


It sounds like 170,000 is every possible combination of actions that might ever be valid. They stated that usually around 1000 are valid at any point in time.

Based on the examples under the "Model structure" section, I'm guessing they are counting all combinations of spell and target location, including locations on the ground for ground-targetable spells? That could add up quick... e.g. 10 spells * 20 target units * 9x9 grid of locations around each = around 16,000 possibilities.


Well, can one take more than one action per frame? Then it's quite multiplicative.


Any thoughts from the Dota team on how drafting heroes will work by the time we get to TI? Am also curious if you've seen more experimental drafts in early results that aren't as popular in the pro scene.


The OpenAI bots are still very limited. Current set of restrictions:

- Mirror match of Necrophos, Sniper, Viper, Crystal Maiden, and Lich

- No warding

- No Roshan

- No invisibility (consumables and relevant items)

- No summons/illusions

- No Divine Rapier, Bottle, Quelling Blade, Boots of Travel, Tome of Knowledge, Infused Raindrop

- 5 invulnerable couriers, no exploiting them by scouting or tanking

- No Scan


Any thoughts from the Dota team on handling a world map which is not bounded in size?

In my projects, the "world" size can change (unlike Go, Chess where the board size is fixed).

Is the DoTA board size fixed?

I guess the LSTM encodes the board history as seen by the agent. But this probably slows the learning.

Some people suggested auto-encoder to compress the world, and then feed it to a regular CNN.

Any comments would be welcome.


Played DotA. The map is fixed size.


I'm a Legend dota2 player and also a Machine Learning researcher and I'm fascinated by this result. The main message I take away is, we might already have powerful enough methods (in terms of learning capabilities), and we're limited by hardware (this also makes me a little sad). My thoughts,

1) "At the beginning of each training game, we randomly "assign" each hero to some subset of lanes and penalize it for straying from those lanes until a randomly-chosen time in the game...." Combining this with "team spirit" (weighted combined reward - networth, k/d/a). They were able to learn early game movement for position 4 (farming priority position). For roaming position, identifying which lane to start out with, what timing should I leave the lane to have the biggest impact, how should I gank other lanes are very difficult. I'm very surprised that very complex reasoning can be learned from this simple setup.

2) Sacrificing the safe lane to control the enemy's jungle requires overcoming a local minimum (considering the rewards), and successfully assigning credit over a very, very long horizon. I'm very surprised they were able to achieve this with PPO + LSTM. However, one asterisk here: if we look at the draft - Sniper, Lich, CM, Viper, Necro - this draft is very versatile, with Viper and Necro able to play any lane. This draft is also very strong in the laning phase and mid game. Whoever wins Sniper's lane, and the laning phase in general, is probably going to win. So this makes it a little bit less of a local optimum. (In contrast to having some safe-lane heroes that require a lot of farm).

3) "Deviated from current playstyle in a few areas, such as giving support heroes (which usually do not take priority for resources) lots of early experience and gold." Support heroes are strong early game and doesn't require a lot items to be useful in combat. Especially with this draft, CM with enough exp (or a blink, or good positioning) can solo kill almost any hero. So it's not too surprising if CM takes some farm early game, especially when Viper and Necro are naturally strong and doesn't need too much of farm (they still do, but not as much as sniper). This observation is quite interesting, but maybe not something completely new as it might sound like.

4) "Pushed the transitions from early- to mid-game faster than its opponents. It did this by: (1) setting up successful ganks (when players move around the map to ambush an enemy hero — see animation) when players overextended in their lane, and (2) by grouping up to take towers before the opponents could organize a counterplay." I'm a little bit skeptical of this observation. I think with this draft, whoever wins the laning phase will be able to take next objectives much faster. And winning the laning phase is really 1v1 skill since both Lich and CM are not really roaming heroes. If you just look at their winning games and draw conclusion, it will be biased.

5) This draft is also very low mobility. All 5 heroes - Sniper, Lich, CM, Necro, Viper - share the weakness of low movement speed (except for maybe Lich). Also, none of these heroes can go at Sniper in the mid/late game, so if you have better positioning + reaction time, you'll probably win.

Overall, I think this is a great step and a great achievement (with some caveats I noted above). As far as next steps, I would love to see if they can train a meta-learned agent so they don't have to train from scratch for a new draft. I would love to see them learn item building and courier usage instead of using scripts. I would also love to see them learn drafting (which can be simply phrased as a supervised problem). I'm pretty excited about this project; hopefully they release a white paper with some more details so we can try to replicate it.


This feels like Ender's Game without Ender.


Quite a good read! Impressive results, it seems. I still think it would be much more useful to research learning complex things without absurd compute, sample inefficiency, and various hacks, e.g. reward shaping (which, let's be honest, this seems to have a lot of), but these are still interesting results.


What are other killer applications of deep learning besides CV and game playing?


What's the estimated cost of a project like this?


Without considering salaries, you can look up the costs for their compute at https://cloud.google.com/compute/pricing (128,000 CPUs and 256 GPUs). I think they mention training for 2 months in the video.


Another commenter near the top (with some experience) estimated ~$2,500/hour. That's 60 grand a day to use hundreds of thousands of cores to learn to play computer games, roughly $1.8 million for 30 days of active learning. It's cool, but it does seem a little bit greedy - that is still expensive as heck, yo. You need a big ol' bank to fund you. Dropping $60k/day on compute doesn't fly for many smaller companies, if you ask me.


As our understanding of "AI" gets better, it'll cost less and less and will start to be affordable for smaller players; but the initial R&D always cost a lot.


The live 5v5 match at TI should be great to watch.


>OpenAI Five plays 180 years worth of games against itself every day.

Human players do it in a fraction of their much smaller lifespans.


On the other hand, humans couldn't play 180 years even if they wanted to.

