The fact that they only used self play with no outside input here is really interesting. I wonder if this system produced more new styles of play. While I am not that familiar with Go, I know in some of the other articles they talk about things like Chinese starts that are specific to certain cultures. I wonder if the fact that it had no outside reinforcement made it produce movements that we have already seen that are somehow inherent to the game, or if it produced many more new moves that were a result of it learning without any possibility of cultural interference. According to the article it did invent some unconventional and creative moves, but I also wonder how much it rediscovered.
I also wonder how much its style of play would change if it were retrained, due to the random start that it is given. Maybe that would produce something like seeds for procedurally generated worlds in games. Like if they could find a seed for Chinese or Japanese players, or ones with more aggressive styles. This is some pretty cool work and may open up even more doors for pure reinforcement learning.
I don't think it's an overstatement to say that, since playing Lee Sedol in 2016, AlphaGo has completely revolutionized professional and amateur go. It's certainly not unprecedented — the last major revolution happened in the early 20th century (often called the 'Shin Fuseki' era [0]) — but AlphaGo has demonstrably surpassed any previous high-water mark.
> I wonder if this system produced more new styles of play.
Absolutely. One such innovation has been the use of early 3-3 invasions [1]. There are many more, and indeed AlphaGo's games are still being analyzed by professional players. Michael Redmond, a 9-dan professional, has been working with the American Go Association on one such series [2].
> I wonder if the fact that it had no outside reinforcement made it produce movements that we have already seen that are somehow inherent to the game...
Interestingly, yes. Strong players have commented that AlphaGo seems to agree with things that players like Go Seigen [3] have suggested in the past, but that were never fully developed or understood [4].
I have only skimmed the paper but one thing I don't see any discussion of is whether komi (the compensation given to white for going second) is correct.
They do say the rules used for all games, including self-play, set komi consistently to 7.5.
If the strongest AI was consistently winning predominantly with one color it would be an indication that komi isn't fair for the best play.
Of the 20 games released for the strongest play it appears white won 14 times and black 6. I don't think that is enough to be conclusive but maybe komi is too high.
I wonder if different "correct" play at the strongest levels would be learned with a 6.5 komi.
You can only change komi by full point increments. There is a .5 to break ties, but a komi of 7.5 is identical to one of 7.4.
From a theoretical standpoint, any non-integer komi should lead to one player winning 100% of the time. So even if the actual win ratio is 14:6 at komi=7.5 that might still be the best value.
If you had an estimate of the real difference, you could switch to breaking ties randomly. Black wins 60% of the ties, white wins 40%. There will be a ratio at which each side should win 50% of the time.
I agree that with perfect play it would come down to a tie, broken 50/50 for each side. But it is still interesting to ask for a better estimate for practical play.
Michael Redmond mentions this in the AlphaGo vs AlphaGo review series he's doing with the AGA. AlphaGo self-play games are with 7.5 komi under Chinese rules, and apparently, DeepMind has stated that black vs white wins is almost exactly 50/50. IIRC Redmond mentioned that white (?) only had some sub 1% advantage in the entire self-play corpus.
I don't remember where I read it but in some earlier versions of AlphaGo they tried a komi of 6.5 and black ended up winning more often. That indicates the correct komi value is 7, but since Go doesn't have ties, you have to pick which side you want to favor to break the tie. (White seems reasonable.)
Well in Japanese rules the komi is 6.5, so that's the alternative that tends to come up. With some quick searching I found a transcript from one of the games where DeepMind said 7.5 slightly favors white but they didn't say anything about 6.5 or 5.5, while a random comment from r/baduk claims that pro game analysis shows 6.5 slightly favors black and 7.5 slightly favors white.
The correct komi number has puzzled Go players for centuries; now we might finally have a chance to figure out the right answer (although not without some reservations). Over the last 5 decades, komi has consistently been raised to keep the game more level between white and black (black makes the first move, so has the advantage). Historically, there was no komi, and people kept things even by always playing an even number of games, with each player switching sides after each game.
For whatever reason, that's no longer feasible in modern pro play (not to mention that it could result in no winner if each player wins half the games), so komi was introduced: at first at 5.5, and it steadily climbed to 7.5 at present. In pro play, even a change of 1 is considered a big deal, so going from 5.5 to 7.5 is hardly trivial.
Now with AlphaGo playing "perfect" games against itself, we might finally be able to put to rest the debate over the correct komi (the Japanese Go associations have for decades kept meticulous records of every professional game, in order to find the correct komi).
There is a big "but" though. The correct komi at AlphaGo Zero's level might not be the correct komi for human-level players (AlphaGo is estimated to be 2-3 handicap stones above top human play; that is a bigger gap than the one between the average pro player and the best amateurs).
Indeed, the change from 5.5 komi to 7.5 komi also had a lot to do with the change in play style rather than simply zooming in on the "correct" komi number. In the 70s and 80s, the predominant play style was more conservative, and 5.5 might well have been the correct komi for the time (defined as resulting in a 50:50 chance of winning for either side). As play style shifted to become more aggressive and confrontational (actually fueled somewhat by the introduction of komi), it was discovered that komi needed to be raised to keep the chances of winning at 50:50.
To make an analogy, suppose one is playing a casino game of chance that gives the house a slight advantage (similar to the first-mover advantage for black in go). If one only makes small bets, the house will end up winning only a small amount. In other words, the player needs to be compensated by a small amount to make the game "fair".
If however one makes big bets (i.e. more aggressive game play), then the compensation needs to be bigger too, to make the game "fair", even if the underlying probabilities have not changed.
Following this logic, while 7.5 komi is fair for AlphaGo vs. AlphaGo games, it might not be the right number for human games. I suspect it might be smaller for humans... if only we could calibrate AlphaGo to the average human level and generate millions of self-play games...
With respect to your very interesting comment (I genuinely appreciate your input), you appear to have misunderstood the comment you were replying to.
You've commented on the differences in the style of play that AlphaGo introduced, but the post you were replying to (by aeleos) was going a step further and hypothesising about the potential for a newer, completely 'non-human' style that AlphaGo Zero may have created.
Your comments definitely contribute to the discussion but it was bugging me that there appeared to be a tangent forming about AlphaGo that was overlooking AlphaGo Zero which would be the more interesting area to explore.
Yes, aeleos was interested in that, and so am I, and it seems to be what this entire thread _should be about_. kndyry steered back towards AlphaGo. I'm not sure this merits any further dissection.
> To assess the merits of self-play reinforcement learning, compared to learning from human data, we trained a second neural network (using the same architecture) to predict expert moves in the KGS Server data set; this achieved state-of-the-art prediction accuracy compared to previous work [12, 30–33] (see Extended Data Tables 1 and 2 for current and previous results, respectively). Supervised learning achieved a better initial performance, and was better at predicting human professional moves (Fig. 3). Notably, although supervised learning achieved higher move prediction accuracy, the self-learned player performed much better overall, defeating the human-trained player within the first 24 h of training. This suggests that AlphaGo Zero may be learning a strategy that is qualitatively different to human play.
That is really interesting. Given a neural network that solely exists to play Go, one that is influenced by the human mind is limited compared to the exact same set of neurons that doesn't have that influence.
EDIT: changed a set of neurons to neural network per andbberger's comments
Please don't refer to it as 'a set of neurons' - it only serves to fuel the (IMO) absolutely ridiculous AI winter fearmongering, and is also just a bad description. Neural nets are linear algebra blackboxes, the connections to biology are tenuous at best.
Sorry to be that guy, but the AI hype is getting out of hand. COSYNE this year was packed with papers comparing deep learning to the brain... it drives me nutty. Convnets can be reasonably put into analogy with the visual system... because they were inspired by it. But that's about it.
To address your actual comment: I would argue that this is not really interesting or surprising (at least to the ML practitioner), it is very well known that neural nets are incredibly sensitive to initialization. Think of it like this: as training progresses, parameters of neural nets move along manifolds in parameter space, but they can get nudged off of the "right" manifold and will never be able to recover.
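To see what I mean about initialization, here's a throwaway toy (a numpy sketch, nothing to do with AlphaGo or any real architecture): plain gradient descent on a bumpy 1D loss ends up in a different minimum depending purely on where it starts.

```python
import numpy as np

# Toy non-convex loss with two minima, roughly at x ~ -1.3 (the better one) and x ~ +1.1.
def loss(x):
    return x**4 - 3 * x**2 + x

def grad(x):
    return 4 * x**3 - 6 * x + 1

def descend(x0, lr=0.01, steps=2000):
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

rng = np.random.default_rng(0)
for x0 in rng.uniform(-2, 2, size=5):
    x = descend(x0)
    print(f"start {x0:+.2f} -> ends at {x:+.2f} (loss {loss(x):+.2f})")
```

Same optimizer, same loss, different random start, different answer. Scale that up to millions of parameters on a wildly non-convex landscape and it's no shock that a human-initialized net and a randomly-initialized net end up in different places.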
Sorry for the rant, the AI hype is just getting really out of hand recently.
Machine learning is specifically not magic. Blackboxes are not useful. Convnets work so well because they build the symmetries of natural scenes directly into the model - natural scenes are translation invariant (as well as a couple of other symmetries), anything that models them sure as hell better have those symmetries too, or you're just crippling your model with extra superfluous parameters.
I changed my comment to neural network since a set of neurons is somewhat wrong, but I don't really agree that there isn't much of a connection between this and biology. There might not be much of a connection between how they currently work and how our brains work, but the whole point of machine learning and neural networks is to improve computers' performance on the things we are good at. And while it was originally loosely modeled on the brain and might work differently now, that doesn't mean people can't compare it to the brain. It would be wrong to say it is exactly like the brain, but I don't think there is anything wrong with comparing and contrasting the two. If our goal is to improve performance and we are the benchmark, then why shouldn't we compare them?
What I found interesting was mainly that it was us who nudged it into the "wrong" manifold in the parameter space you talked about, especially given how old and complicated Go is. The sheer amount of human brain power that has been put into getting good at the game wasn't able to find certain aspects of it, and in 60 hours of training a neural network was able to.
I'm not saying there is nothing of value to be obtained by investigating connections between ML and the brain. That's how I got into ML in the first place, doing theoretical neuro research.
We absolutely should and do look to the brain for inspiration.
I'm taking issue with the rather ham-fisted series of papers that have come out in recent years aggressively pushing the agenda of connections between ML and neuro that just aren't there.
Are you sure that humans have done more net compute on Go than DeepMind just did? The Go game tree is _enormous_, and humans are biased. We don't invent strategies from scratch; we use heuristics handed down to us from the pros (who in turn were handed down the heuristics from their mentors).
To me, it's not so interesting or surprising that the human initialized net performed worse. We just built the same biases and heuristics we have into that net.
As far as we know the brain is just a "linear algebra blackbox". It's an uninteresting reduction since linear algebra can describe almost everything. Yes NNs aren't magic, but neither is the brain. Likely they use similar principles. Hinton has a theory about how real neurons might be implementing a variation of backpropagation and there are a number of other theories.
>As far as we know the brain is just a "linear algebra blackbox"...Likely they use similar principles.
I'm not an expert, but my impression is that this is not really a reasonable claim, unless you're only considering very small function-like subsystems of the brain (e.g. visual cortex). Neural nets (of the nonrecurrent sort) are strict feed-forward function approximators, whereas the brain appears to be a big mess of feedback loops that is capable of (sloppily, and with much grumbling) modeling any algorithm you could want, and, importantly, adding small recursions/loops to the architecture as needed rather than a) unrolling them all into nonrecursive operations (like a feedforward net) or b) building them all into one central singly-nested loop (like an RNN).
The brain definitely seems to be using something backprop-like (in that it identifies pathways responsible for negative outcomes and penalizes them). But brains also seem to make efficiency improvements really aggressively (see: muscle memory, chunking, and other markers of proficiency), even in the absence of any external reward signal, which seems like something we don't really have a good analogue for in ANNs.
There are some parts of the brain we have no clue about. Episodic memory or our higher level ability to reason. But most of the brain is just low level pattern matching just like what NNs do.
The constraints you mention aren't deal breakers. We can make RNNs without maintaining a global state and fully unrolling the loop. See synthetic gradients for instance. NNs can do unsupervised learning as well, through things like autoencoders.
> It's an uninteresting reduction since linear algebra can describe almost everything.
The question is whether it can do so efficiently. As far as I know, alternating applications of affine transforms and non-linearities are not so useful for some computations that are known to occur in the brain such as routing, spatio-temporal clustering, frequency filtering, high-dimensional temporal states per neuron etc.
If he changes his opinion, which I understand in this case to be about models of the brain, and each iteration improves the model, then that is perfectly fine. It would be bad if someone did not change their view in the face of inconsistent evidence.
For political opinions sure, but if he's changing his opinions so often ...
When you're a big scientific figure, I think that you have some extra responsibility to the public to only say things you're very confident about. Or otherwise very clearly communicate your uncertainty!!
Agreed. If we announced that A* search is superhuman at finding best routes, most technorati wouldn't bat an eye. Technically it is probably accurate to say that the results here show that neural networks can find good heuristics for MCTS search through unsupervised training in the game of Go. According to the DeepMind authors:
"These search probabilities
usually select much stronger moves than the raw move probabilities of the neural network; MCTS may therefore be viewed as a powerful policy improvement operator. Self-play with search – using the improved MCTS-based policy to select each move, then using the game winner as a sample of the value – may be viewed as a powerful policy evaluation operator. The main idea of our reinforcement learning algorithm is to use these search operators repeatedly in a policy iteration procedure ..."
The fact that this reinforcement training is unsupervised from the very beginning is quite exciting and may lead to better heuristics for other kinds of combinatorial optimization problems.
Fully observable and we still have no idea what the hell it's doing.
Makes neuroscience seem kinda bleak doesn't it?
There has been a lot of great work lately building up a theory of how these things work, but it is very much still in the early stage. Jascha Sohl-Dickstein in particular has been doing some great work on this.
We don't even have answers to the most basic questions.
For instance (pedagogically), how the hell is it possible to train these things at all? They have ridiculously non-convex loss landscapes and we optimize in the dumbest conceivable way, first-order stochastic gradient descent. This should not work. But it does, all too often.
Not a great example because there are easy hand wavy arguments as to why it should work, but as far as proofs go...
The hand wavy argument goes as follows:
- we're in like a 10000 dimensional space, so for the stationary point we're at to be a true local minimum, every one of those 10000 dimensions would have to go uphill in both directions. It's overwhelmingly likely that there's at least one way out (rough numbers sketched below)
- there are many many different ways to set the params of the net for each function. Permutation is a simple example.
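Rough numbers for that first point, under the (purely hand-wavy) toy assumption that each dimension independently has a 50/50 chance of curving upward at a stationary point:

```python
import math

# Toy independence model only: P(true local minimum) = 0.5**d if each of d
# dimensions independently curves upward with probability 0.5.
for d in (10, 100, 10_000):
    log10_p = d * math.log10(0.5)
    print(f"d = {d:>6}: P(local minimum) ~ 10^{log10_p:.0f}")
```

So in very high dimensions almost every stationary point you hit is a saddle with an escape route, which is part of why dumb first-order SGD gets away with it.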
The tools we have developed so far are limited, but that's different from "there's no way". Many academics are working hard right now to better understand deep neural networks.
Yes, it turns out you can find meaningful information. etiam provided this https://arxiv.org/pdf/1312.6034.pdf The main issue is making sure what you are looking for is actually what the network is doing. You have to correctly interpret and visualize a jumble of numbers, which usually requires a hypothesis about how it worked in the first place. But assuming both go well you can gain meaningful information.
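For reference, the core trick in that paper (gradients of the class score with respect to the input) is only a few lines. A hedged PyTorch-style sketch, where `model` and `image` are placeholders for whatever classifier and input tensor you already have:

```python
import torch

def saliency_map(model, image, target_class):
    # Gradient-based saliency in the spirit of Simonyan et al. 2013 (the link above).
    model.eval()
    x = image.clone().unsqueeze(0).requires_grad_(True)  # shape (1, C, H, W)
    score = model(x)[0, target_class]                    # raw class score
    score.backward()                                     # d(score)/d(input pixels)
    return x.grad.abs().squeeze(0).max(dim=0).values     # (H, W) importance map
```

The hard part, as you say, is the interpretation step afterwards: a bright pixel tells you the score is locally sensitive there, not that your hypothesis about what the network "looks for" is right.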
> things like Chinese starts that are specific to certain cultures
While it's true that there are national styles of play, the Chinese opening is not called that because it's really popular among Chinese people. It's called that because a particular Chinese pro helped popularize it, even though it was invented by a Japanese amateur.
See https://en.wikipedia.org/wiki/Chinese_opening for some more info. FWIW, I (a caucasian American) use this opening all the time. It's just a generally good opening if you like a certain style of play.
> talk about things like Chinese starts that are specific to certain cultures
Came here to make this point.
It's Chinese Opening, not Chinese start – similarly recall that you have the French Defense / Italian Defense / Scandinavian Defense among chess opening variations and none of these implies that that opening variation is specific to that culture or nation.
I'm not 100% sure I agree. It values probability of victory because that's its goal. For humans, aiming only for probability of victory might not be as good, because we're much worse at estimating probabilities. So aiming to maintain a large margin at all times is conceivably the best proxy that we can use in practice.
Agreed. I know I'm winning by 4 points but I have no idea about my probability of winning. However if I'm winning I know that I should play low risk moves and refrain from starting complicated fights. That increases the probability of winning. IMHO the exact value is out of reach for human beings.
It would be interesting to play human go, assisted by a go computer that doesn't say anything about moves, but rather just spits out, for each player, their current likelihood of victory if all further moves by both players were "what it would do."
That way, each player could know, at all times, (one major factor that goes into) their probability of winning. They'd still have to mentally adjust it for the likelihood of them and their opponent making an error, and how that can be controlled by making intimidating moves, etc. But it could lead to much tighter control on the abstract flow of the game.
It'd almost be like the computer was the general, issuing strategy, picking battles; and the human player the tactician, fighting those battles.
There is a computer go program (maybe Crazy Stone?) that analyzes a game record and annotates it with the winning percentage for every move.
Knowing that the opponent's winning probability changed from 52% to 57% is interesting only because it hints at a mistake. In the case of such a large change the program suggests the move it would have played.
I saw an annotated game record and there were no variations: I remember a suggested move that made me wonder "why!?".
Another benefit of seeing the value of the winning probability is an assessment of who's ahead. However, that's already possible with the score estimation that programs and go servers provide. Sometimes it's crude, sometimes it's good, but it's the score, not the winning probability, that humans can estimate when playing. The best probability estimate I can make is: if the score is close and the game is still complicated, it's 50-50; if the score is close but the game is almost over, it's 95-5 for whoever's ahead. If the score is not close, the player with more points will probably win.
The Go community learned that the margin of victory is meaningless a long time ago. The most famous game of Honinbo Dosaku, a famous Go player from the late 1600s, is arguably a game where he gave a handicap to his opponent and lost by one point. Lee Chang-Ho, who was the reigning champion in the late 90s, had a style that consistently tried to win by small margins.
AlphaGo now appears to be better than humans in all aspects of gameplay, and it is better at calculating very thin margins of probability than a human can be. This is not unique to any individual aspect of its gameplay; against humans it can also win by huge margins depending on what mistakes the human makes.
I think if AlphaGo foresaw it was losing by one point, it would start playing reckless moves, as it did against Lee Sedol in the only match it lost to him.
Possibly a dumb q, but is ‘self play’ in any way related to ‘adversarial’ learning? I don’t see it mentioned in the article, but it reminds me of the principle.
In some ways it is, but the main difference is that adversarial learning (usually) produces a second neural network whose purpose is to exploit weaknesses in the first. Reinforcement learning does not produce a second neural network to beat the first; it uses what it learned solely to improve the original.
As a side note, the main application I have seen in adversarial learning research is photo recognition, but I guess you could have an adversarial network exist to help improve an object recognition network. At that point it would probably become something between adversarial and reinforcement learning. However, game-based reinforcement learning doesn't require a second specific network as the adversary; it can easily just be paired against itself.
It isn't a dumb question, they are very similar in some ways. They mainly differ in what exactly the goal of the opponent is. In this case, it is to help improve itself, however in typical adversarial situations it is solely to exploit (become its adversary).
Yes, this is not an AGI. But the hockey-stick takeoff from defeating some players, to defeating an undefeated world champion, to defeating the version of itself that beat the world champion 100% of the time is nuts. If this happens in other domains, like finance, health, paper clip collection, the word singularity is really well chosen--we can't see past this.
While this is promising, there's a long way to go between this and the other things you mentioned. Go is very well-defined, has an unequivocal objective scoring system that can be run very quickly, and can be simulated in such a way that the system can go through many, many iterations very quickly.
There's no way to train an AI like this for, say, health: We cannot simulate the human body to the level of detail that's required, and we definitely aren't going to be able to do it at the speed required for a system like this for a very long time. Producing a definitive, objective score for a paper clip collection is very difficult if not impossible.
AlphaGo/DeepMind represents a very strong approach to a certain set of well-defined problems, but most of the problems required for a general AI aren't well-defined.
> Do you care to give an example? Are they more or less well defined than find-the-cat-in-the-picture problem?
You mean like go over and feed the neighbor's cat while they're on vacation?
How about instead, being able to clean any arbitrary building?
Go isn't remotely similar to the real world. It's a board game. A challenging one, sure, and AlphaGo is quite a feat, but it's not exactly translatable to open ended tasks with variable environments and ill-specified rules (maybe the neighbor expects you to know to water the plants and feed the goldfish as well).
At this point, there is no evidence that the limiting factor in these cases is AI/software.
The limiting factor with the neighbor's cat is the robotics of having a robust body and arm attachment. We know that the scope of current AI can:
1) Identify a request to feed a cat
2) Identify the cat, cat food and cat's bowl from camera data
3) Navigate an open space like a house
Being able to clean an arbitrary building is also more the challenge of building the robot than the AI identifying garbage on a floor or how to sweep something.
It is not clear there are hard theoretical limits on an AI any more. There are economic limits based on the cost of a programmer's attention. There are lots of hardware limits (including processor power).
In my opinion the deepest and most difficult aspect of this example is the notion of 'clean' which will be different across contexts. Abstractions of this kind are not even close to understood in the human semantic system, and in fact are still minimally researched. (I expect much of the progress on this to come from robotics, in fact.)
I remember seeing a demonstration by a deep learning guy of a commercially available robot cleaning a house under remote control. You are seriously underestimating the difficulty of developing software to solve these problems in an integrated way.
This. It is a lot like the business guy thinking it is trivial to program a 'SaaS business' because he has a high level idea in his mind. Like all things programming the devil is in the detail.
The hardware is certainly good enough to assist a person with a disability living in a ranch house with typical household tasks, as demonstrated by human-in-the-loop operation.
We have rockets that can go to orbit, and we have submersibles that can visit the ocean floor. That does not mean the rocket-submarine problem is solved; doing both together is not the same problem as doing both separately.
The difference is a go AI can play billions of games and a simple 20 line C program can check, for each game, who won.
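It really is about that simple. Here's a hedged Python sketch of a Chinese-rules area scorer, assuming both players have passed and dead stones have already been removed (which is the genuinely fiddly part in practice):

```python
def score(board, size=19, komi=7.5):
    # board: dict {(x, y): 'B' or 'W'} for stones; empty points are simply absent.
    counts = {'B': 0, 'W': 0}
    for colour in board.values():                 # area scoring: stones count...
        counts[colour] += 1
    seen = set()
    for x in range(size):
        for y in range(size):
            if (x, y) in board or (x, y) in seen:
                continue
            region, borders, stack = set(), set(), [(x, y)]
            while stack:                          # flood-fill one empty region
                px, py = stack.pop()
                if (px, py) in region:
                    continue
                region.add((px, py))
                for nx, ny in ((px + 1, py), (px - 1, py), (px, py + 1), (px, py - 1)):
                    if 0 <= nx < size and 0 <= ny < size:
                        if (nx, ny) in board:
                            borders.add(board[(nx, ny)])
                        else:
                            stack.append((nx, ny))
            seen |= region
            if len(borders) == 1:                 # ...plus territory touching only one colour
                counts[borders.pop()] += len(region)
    margin = counts['B'] - counts['W'] - komi
    return ('B' if margin > 0 else 'W'), margin
```

That's the whole referee. The expensive part of self-play is generating the games, not judging them.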
For "cat in the picture", every picture must have the cat first identified by a person, so the training set is much smaller, and Google can't throw GPUs at the problem.
The absolute value of any Go board position is well-defined, and MCTS provides good computationally tractable approximations that get better as the rest of the system improves but already start better than random.
Check the Nature paper (and I think this is one of the biggest take-aways from AlphaGo Zero):
"Finally, it uses a simpler tree search that relies upon this single neural network to evaluate positions and sample moves, without performing any Monte Carlo rollouts."
In this new version, MCTS is not even used to evaluate a position! Speaking as a Go player, the ability for the neural network to accurately evaluate a position without "reading" ahead is phenomenal (again, read the Nature paper last page for details).
You don't even need to produce an AGI for this kind of intelligence to be frightening.
At some point, a military is going to develop autonomous weapons that are vastly superior to human beings on the battle field, with no risk of losing human lives, and there is going to be a blitzkrieg sort of situation as the relative power of nations shifts dramatically.
If we have two such countries we could have massive drone and cyberwars being fought faster than people even can comprehend what's happening.
Right now most countries insist on maintaining human control over the machinery of death. But that will only last for as long as autonomous death machines don't dominate the battlefield.
It's a fun challenge right now to build a machine that can win in Starcraft, but it's really a hop skip and a jump from there to winning actual wars.
In that case you just nuke the shit out of everybody or create an army of autonomous suicide bombers with nukes and biological and chemical weapons of all kinds. Once all humans are extinct, harmony on earth will be restored and everyone will live happily ever after.
I'm not sure a robot soldier is scarier than nukes. Generally speaking, if they are just single-task robots performing functions in dangerous situations, that seems like an improvement over risking human lives.
The core technique of AlphaGo is using tree search as a "policy improvement operator". Tree search doesn't work on most real-world tasks: the "game state" is too complex, there are too many choices, it's hard to predict the full effect of any choice you might make, and there often isn't even a "win" or "lose" state which would let you stop your self-play.
MCTS means "Monte-Carlo Tree Search". It's the core of the algorithm. The big difference is that it doesn't use rollouts, or random play: it chooses where to expand the tree based only on the neural network.
That's not what Monte Carlo Tree search is. The new version is still one neural network + MCTS. There's no way to store enough information to judge the efficiency of every possible move in a neural network, therefore a second algorithm to simulate outcomes is necessary.
If you read the paper, they do in fact still use Monte Carlo tree search. They just simplify its usage in conjunction with reducing the number of neural networks to one.
Tree search is also used during play. In the paper, they pit the pure neural net against other versions of the algorithm -- it ends up slightly worse than the version that played Fan Hui, at about 3000 ELO.
> AlphaGo Zero does not use “rollouts” - fast, random games used by other Go programs to predict which player will win from the current board position. Instead, it relies on its high quality neural networks to evaluate positions.
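To make the "no rollouts" point concrete, here's a hedged sketch of the two steps involved (a PUCT-style selection rule plus a network evaluation at the leaf; `net` is a stand-in that returns move priors and a value for a position, not DeepMind's actual interface):

```python
import math

class Node:
    def __init__(self, prior):
        self.prior, self.visits, self.value_sum = prior, 0, 0.0
        self.children = {}                        # move -> Node

    def q(self):                                  # mean value of this subtree so far
        return self.value_sum / self.visits if self.visits else 0.0

def select_child(node, c_puct=1.5):
    # Pick the child that balances its running value against the network's prior.
    total = sum(child.visits for child in node.children.values())
    def puct(child):
        return child.q() + c_puct * child.prior * math.sqrt(total + 1) / (1 + child.visits)
    return max(node.children.items(), key=lambda kv: puct(kv[1]))

def expand_and_evaluate(node, position, net):
    priors, value = net(position)                 # the network replaces the random rollout
    for move, p in priors.items():
        node.children[move] = Node(p)
    return value                                  # backed up the visited path instead of a playout result
```

Older engines would have played a fast random game from the leaf and backed up its result; here the value head's single number plays that role.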
This is an interesting question to ask in these "how far away is AGI" discussions:
I was once at a conference where there was a panel full of famous AI luminaries, and most of the luminaries were nodding and agreeing with each other that of course AGI was very far off, except for two famous AI luminaries who stayed quiet and let others take the microphone.
I got up in Q&A and said, “Okay, you’ve all told us that progress won’t be all that fast. But let’s be more concrete and specific. I’d like to know what’s the least impressive accomplishment that you are very confident cannot be done in the next two years.”
There was a silence.
Eventually, two people on the panel ventured replies, spoken in a rather more tentative tone than they’d been using to pronounce that AGI was decades out. They named “A robot puts away the dishes from a dishwasher without breaking them”, and Winograd schemas. Specifically, “I feel quite confident that the Winograd schemas—where we recently had a result that was in the 50, 60% range—in the next two years, we will not get 80, 90% on that regardless of the techniques people use.”
I spent an hour of my life that I'll never get back reading Yudkowsky's overly-long article and I believe I can summarise it thusly:
"We don't know how AGI will arise; we don't know when; we don't know why; we don't know anything at all about it and we won't know anything about it until it's too late to do anything anyway; we must act now!!"
The question is: if we don't know anything about this unknowable threat, how can we protect ourselves against it? In fact, since we're starting from 0 information, anything we do has equal chances of backfiring and bringing forth AGI as it has of actually preventing it. Yudkowsky is calling for random action, without direction and without reason.
Besides, if Yudkowsky is none the wiser about AGI than anyone else, then how is he so sure that AGI _will_ happen, as he insists it will?
Yudkowsky is fumbling around in the dark like everyone else in AI. Except he (and a few others) has decided that it's a good strategy, under the circumstances, to raise a hell of a racket. "It's dark!" he yells. "Beware of the darkness!" Yeah OK, friend. It's dark, we can all tell. Why don't you pipe down and let us find the damn light?
Sorry but I don't really see Yudkowsky's contributions as "fundamental research into AI safety". More like navel-gazing without any practical implications. At best, listening to him is just a waste of time. At worst, AGI is a real imminent threat and having people like him generating useless noise will make it harder for legitimate concerns to be heard, when the time comes.
Yes, I did and it's very bad form to go around asking people if they read the article. Try to remember that different people form different opinions from similar information.
Well, then you should have noticed what the article was about, which was not to detail a research program about AI safety.
Different articles can address different aspects of a problem without being accused of advocating "random action". That's just ridiculous.
>The question is: if we don't know anything about this unknowable threat, how can we protect ourselves against it? In fact, since we're starting from 0 information, anything we do has equal chances of backfiring and bringing forth AGI as it has of actually preventing it. Yudkowsky is calling for random action, without direction and without reason.
Are you sure you read the essay? That's literally the question he answers.
At any rate, we do have more than '0 information', and if you make an honest effort to think of what to do you can likely come up with better than 'random actions' for helping (as many have).
>> Are you sure you read the essay? That's literally the question he answers.
My reading of the article is that he keeps calling for action without specifying what that action should be and trying to justify it by saying he can't know what AGI would look like (so he can't really say what we can do to prevent it).
>> if you make an honest effort to think of what to do you can likely come up with better than 'random actions' for helping (as many have).
Sure. If my research gets up one day and starts self-improving at exponential rates I'll make sure to reach for th
... yeah, before reading that link my position was "Wow, that's super neat, but Go is a pretty well-defined game," and after reading it I remembered that my position maybe a year or two ago was "Chess is a well-defined game that's beatable by AI techniques but Go is acknowledged to be much harder and require actual intelligence to play and won't be solved for a long while" and now I'm worried. Thanks for posting that.
Go is still a well defined game within a limited space that doesn't change, and rules that don't change. It's just harder than Chess, but that doesn't make it similar to tons of real world tasks humans are better at.
That's probably true, but that's very much not what people were saying about Go a couple years ago. There were a lot of people talking about how there isn't a straightforward evaluation function of the quality of a given state of the board, how things need to be planned in advance, how there's much more combinatorial explosion than in chess, etc., to the point where it's a qualitatively different game.
For me, as someone who accepted and believed these claims about Go being qualitatively different, realizing that no, it's not qualitatively different (or that maybe it is, but not in a way that impedes state-of-the-art AI research) is increasing my skepticism in other claims that board games in general are qualitatively different from other tasks that AIs might get good at.
(If you didn't buy into these claims, then I commend you on your reasoning skills, carry on.)
About those claims: this is from Russell and Norvig, 3rd ed. (from 2003, so a way back):
Go is a deterministic game, but the large branching factor makes it challenging. The key issues and early literature in computer Go are summarized by Bouzy and Cazenave (2001) and Müller (2002). Up to 1997 there were no competent Go programs. Now the best programs play most of their moves at the master level; the only problem is that over the course of a game they usually make at least one serious blunder that allows a strong opponent to win. Whereas alpha-beta search reigns in most games, many recent Go programs have adopted Monte Carlo methods based on the UCT (upper confidence bounds on trees) scheme (Kocsis and Szepesvari, 2006). The strongest Go program as of 2009 is Gelly and Silver's MoGo (Wang and Gelly, 2007; Gelly and Silver, 2008). In August 2008, MoGo scored a surprising win against top professional Myungwan Kim, albeit with MoGo receiving a handicap of nine stones (about the equivalent of a queen handicap in chess). Kim estimated MoGo's strength at 2-3 dan, the low end of advanced amateur. For this match, MoGo was run on an 800-processor 15 teraflop supercomputer (1000 times Deep Blue). A few weeks later, MoGo, with only a five-stone handicap, won against a 6-dan professional. In the 9 x 9 form of Go, MoGo is at approximately the 1-dan professional level. Rapid advances are likely as experimentation continues with new forms of Monte Carlo search. The Computer Go Newsletter, published by the Computer Go Association, describes current developments.
There's no word about how Go is qualitatively different to other games, but maybe the referenced sources say something along those lines. Personally, I took a Masters course in AI two years ago, before AlphaGo, and I remember one professor saying that the last holdout where humans can still beat computers in board games was GO, but I don't quite remember him saying anything about a qualitative difference. Still, I can recall hearing about the idea that Go needs intuition or something like that, except I've no idea where I've heard that. I guess it might come from the popular press.
I guess this will sound a bit like the perennial excuse that "if it works, it's not AI" but my opinion about Go is that humans just weren't that good at it, after all. We may have thought that we have something special that makes us particularly good at Go, better than machines- but AlphaGo[Zero] has shown that, in the end, we just have no idea what it means to be really good at it (which, btw, is a damn good explanation of why it took us so long to make AI to beat us at it).
That, to my mind, is a much bigger and much more useful achievement than making a good AI game player. We can learn something from an insight into what we are capable of.
s/2003/2009/, I think, but the point stands. (Also I think I have the second edition at home and now I want to check what it says about Go.)
> my opinion about Go is that humans just weren't that good at it, after all. We may have thought that we have something special that makes us particularly good at Go, better than machines- but AlphaGo[Zero] has shown that, in the end, we just have no idea what it means to be really good at it (which, btw, is a damn good explanation of why it took us so long to make AI to beat us at it).
> the last holdout where humans can still beat computers in board games was GO
False, because nobody ever bothered to study modern boardgames rigorously.
Modern boardgames have small decision trees but very difficult evaluation functions. (Exactly opposite from computational games like Go.)
Modern boardgames can probably be solved by pure brute force calculation of all branches of the tree, but nobody knows if things like neural networks are any good for playing them.
In AI, "board games" generally means classical board games (nim, chess, backgammon, go etc) and "card games" means classical card games (bridge, poker, etc). Russel & Norvig also discuss some less well-known games, like kriegspiel (wargame) if memory serves, but those are all classical at least in the sense that they are, well, quite old.
I've seen some AI research in more modern board games actually. I've read a couple of papers discussing the use of Monte Carlo Tree Search to solve creature combat in Magic: the Gathering and my own degree and Master's dissertation were about M:tG (my Master's was in AI and my degree dissertation was an AI system also).
I don't know that much about modern board games, besides collectible card games, but for CCGs in particular the game trees are not small. I once calculated the time complexity of traversing a full M:tG game tree as O(b^m * n^m) = 2.272461391808129337799800881135e+5564 (where b is the branching factor, m is the average number of moves in a game, and n is the number of possible deck permutations for a 60 card deck, taking into account cards included multiple times). And mine was probably a very conservative estimate.
Also, to my knowledge, neural nets have not been used for Magic-playing AI (or any other CCG-playing AI). What has been used is MCTS, on its own, without much success. The best AI I've seen incorporates some domain knowledge, in the form of card-specific strategies (how to play a given card).
There are some difficulties in using ANNs to make an M:tG AI. Primarily, the fact that a truly competent player should be able to pick up a card it's never seen before and play it correctly (or decide whether to include it in a deck, if the goal is to also address deck-building). For this, the AI player will need to have at least some understanding of M:tG's language (ability text). It is my understanding that other modern games equally require understanding some game context outside of the main rules, which complicates the traditional tactic of generating all possible moves, pruning some and choosing the best.
In any case what I meant to say is that people in AI have indeed considered other games besides the classical ones- but when we talk about "games" in AI we do mean the classics.
> but when we talk about "games" in AI we do mean the classics
Only because of inertia. There's nothing inherently special about "classics". Eventually somebody will branch out once Go and poker are mined out of paper and article opportunity.
Once we do then maybe some new, interesting algorithms will be found.
In principle, every game can be solved by storing all possible game states in a database. Where brute-force storing is impractical due to size concerns, compression tricks have to be used.
E.g., Go is a simple game because at the end, every one of the fixed number of board spaces is either +1, -1 or 0. Add them up and you know if you won. This means that every move is either "correct" or "incorrect"; the problem of classifying multidimensional objects into two classes is a problem that we're pretty good at now, and things like neural networks get the job done.
A slightly more complex game like Agricola has no "correct" and "incorrect" moves because it's not zero-sum; you can make an "incorrect" move and still win as long as your opponent is forced to make a relatively more "incorrect" move.
Not sure how much of a difference that makes, but what's certain is that by (effectively) solving Go we've only scratched the surface. It's not the end of research, only the beginning.
Sure. Research in game playing AI doesn't end with Go, or any other game. We may see more research in modern board games, now that we're slowly running out of the classics.
I think you're underestimating the amount of work and determination it took to get to where we are today, though (I mean your comment about "inertia"). Classic board games have the advantage of a long history and of being well understood (the uncertainty about optimal strategies in Go notwithstanding). Additionally, for at least some of them like chess, there are rich databases of entire games that can be used outright, without the AI player having to generate-and-test them in the process of training or playing.
The same is not true for modern games. On the one hand, modern board games like Agricola (or, dunno, Settlers or Carcassonne etc) don't have such an extensive and multi-national following as the classics so it's much harder to find a lot of data to train on (which is obviously important for machine-learning AI players). I had that problem when considering an M:tG AI trained with machine learning: I would have liked to find play-by-play data on professional games but there just isn't any (or where there is it's not enough, or it's not in any standardised format).
Finally, classic board games have a cultural significance that modern board games don't quite match, despite the huge popularity of CCGs like M:tG or Pokemon, or Eurogame hits like Settlers. Go, chess and backgammon in particular have tremendous historical significance in their respective areas of the world: chess in Eastern Europe, backgammon in the Middle East, Go in SE Asia. People go to special academies to learn them, master players are widely recognised, etc. You don't get that level of interest with modern board games, so there's less research interest in them, also.
People in game playing AI have been trying for a very long time to crack some games like Go and, recently, poker (not quite cracked yet). They didn't sit around twiddling their thumbs all those years, neither did they choose classical board games over modern ones just because they didn't have the imagination to think of the latter. In AI research, as in all research, you have to make progress before you can make more progress.
> Go is acknowledged to be much harder and require actual intelligence to play
No, Go is a much less intelligent[1] game. It has a huge decision tree and requires massive amounts of computation to play, but walking trees and counting is exactly what computers do well and what humans do poorly.
[1] 'Intelligence' here means exactly that which differentiates humans from calculators: the ability to infer new rules from old ones.
The smoke is when things like the same simulated robot that learned to run around like a mentally challenged person also learns to simulate throwing and can read very basic language.
It will seem quite stupid and inept at first. So people will dismiss it. But when they have a system with general inputs and outputs that can acquire multiple different skills, that will be an AGI, and we can grow its skills and knowledge past human level.
The hockey stick is lying horizontally though, instead of vertically. If it took 3 days to go from 0 to beating the top player in the world, I wouldn't have expected it to take 21 days to beat the next version. I guess something happens at the top levels of Go that makes training much harder.
On another note, I didn't look at the details closely but it seems AlphaGo Zero needed much less compute training time than Alpha Go Master. Could getting rid of any human inputs really make it that much more efficient? That implies it will be able to have an impact in many different areas, which is a bit scary...
(Updated - it took 3 days to beat the top player in the world.)
This type of curve is what I would expect out of machine learning. At first there is rapid improvement as it learns the easy lessons. The rate then slows down as further incremental improvements have less impact.
What is, perhaps, surprising is that human play happens to be relatively close to the asymptote. Although this could be explained by AlphaGo being the first system to beat humans: if its peak performance were orders of magnitude higher than humans', a weaker program would have already beaten us.
The horizontal hockey stick makes sense to me in terms of learning. Each added layer of understanding of a complex system could mean a potentially exponentially increasing difficulty.
I'm sure it's naive to jump to sci-fi conclusions just yet, but I admit it's equal parts fascinating and terrifying. The general message of the posts is that human knowledge is cute but not required to find new insights. Define the measure of success and momma AI will find the answer. At this point, the path to AGI is about who first defines its goals right and that seems... doable? Even scarier: We think the holy grail of AI is simulating a human being. The AI of the future might chuckle at that notion.
Wait for Alpha StarCraft for some real panic. So far RL-based methods have had limited success outside of simple games (not to say Go is simple, but rather that the presentation and control parts of the format are).
I'd like to see a StarCraft player AI that wins using a mere 1/10th of the effective actions per minute (EPM) of world class players.
To me it seems beating another player while using fewer actions indicates superior skill, understanding and/or intellect.
Not sure I agree with this fully. Certainly many actions used in a typical SC game are redundant, but there are reasons for it. Lag for one. If there's a possibility of lag or dropped packets, spamming a command will help nullify this problem.
The other is the entire reason for high APM, the stop/start problem. Pro players keep high APM so that when they actually need high EPM their muscle memory is already at full tilt. If you slow down your APM during lulls in the action it becomes harder to suddenly increase it when a fight happens.
Certainly that's an entirely human condition that a machine wouldn't need to worry about. But I'm not sure it means lack of skill.
You expressed my exact thoughts and I was about to link to the same insightful article. I guess my comment could've been shortened as a silent upvote, but I commented anyway.
Games are a joke compared with real life. The number of variables and rules is well defined in games, while in real life it is not. That is why AGI is not coming anytime soon.
> Previous versions of AlphaGo initially trained on thousands of human amateur and professional games to learn how to play Go. AlphaGo Zero skips this step and learns to play simply by playing games against itself, starting from completely random play.
So technically this version has lost every game it's ever won.
Temporal difference learning was previously considered weak at 'tactical' games, i.e. ones with game states that require long chains of precise moves to improve position (like many checkmate scenarios in chess).
For anyone more familiar with this technique, is it clear how the MCTS/checkpoint system overcomes this? How sensitive is the system to the tuning params for those parts of the algorithm? Like, is Go a particularly good candidate because the ~400 play positions result in a (relatively) small tree search requirement? (I kinda can't believe I'm saying that Go has 'a small search tree'!)
We use TD learning for the AI in our game Race for the Galaxy, so it's neat to hear about possible avenues for improvement!
After digging a bit deeper into the paper, it seems a key part of the new scheme is that the NN is trained to help guide a deep/sparse tree search (as opposed to TD-Gammon's fully exhaustive 2-ply search). It's somewhat surprising to me that the simple win/loss signal is strong enough to train this very 'intermediate step' in the algorithm - a spectacular result! It raises the question of what other heuristic-based algorithms would be improved by replacing a hand-rolled non-optimal heuristic function with a NN.
It's estimating the probability of winning from the position based on what it has already seen. So basically it's a giant conditional probability distribution. Is it mistaken to interpret this as a bayesian network?
Wow, that was a really deep and enjoyable Wikipedia rabbit hole journey. I hadn't heard of Temporal Difference before (though I was familiar with Q-learning).
It was interesting to note that TD-Gammon improved with expert designed features. I wonder if this was simply related to the technology of the field as it stood over 20 years ago or some underlying categorization or complexity associated with the games themselves (backgammon being more favorable to human comprehension than Go in this case).
> Even though TD-Gammon discovered insightful features on its own, Tesauro wondered if its play could be improved by using hand-designed features like Neurogammon's. Indeed, the self-training TD-Gammon with expert-designed features soon surpassed all previous computer backgammon programs. It stopped improving after about 1,500,000 games (self-play) using 80 hidden units.
For others: Richard Sutton, one of the pioneers of TD makes his Reinforcement Learning: An Introduction textbook available for free on his website: http://incompleteideas.net/sutton/ (MIT Press also links to it)
Yeah it's not clear to me why temporal difference learning all of a sudden works so well here? Is it the case that nobody had really tried it for learning a policy for Go with a strong NN architecture? In the Methods they mention TD learning for value functions but I don't see anything about policies.
edit: OK, they're calling it policy iteration as opposed to TD learning. I guess I don't get the difference.
TD learning is, in some sense, a component of policy iteration. TD learning is about learning the value function for a given policy. In policy iteration you use a value function to decide how to update the policy for which the value function was estimated, and you iterate between the "learn value" and "update policy" steps.
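A minimal tabular sketch of the distinction (toy code, nothing AlphaGo-specific; `step(s, a) -> (reward, next_state)` and the state/action sets are assumed environment hooks, not any particular library):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0):
    # TD(0): nudge V(s) toward the bootstrapped target r + gamma * V(s'),
    # learning the value function of whatever policy generated the transition.
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

def greedy_improvement(V, actions, step, gamma=1.0):
    # Policy improvement: act greedily with respect to the current value estimate.
    def policy(s):
        def lookahead(a):
            r, s_next = step(s, a)
            return r + gamma * V[s_next]
        return max(actions, key=lookahead)
    return policy
```

Policy iteration alternates those two steps until nothing changes. AlphaGo Zero's twist, as described elsewhere in the thread, is that the "improvement" step is a full MCTS search rather than a one-step greedy max.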
It's my opinion that TD-Gammon succeeded back in the 1990s because backgammon is a one-dimensional board. It didn't need the convolutional techniques of the Go neural nets to gain insight into the game and could thus be handled by a traditional neural net.
I meant that tongue-in-cheek based on the "by playing games against itself" during training. Nonetheless, thanks for clarifying that in case it's unclear for others (and for the SGFs).
I remember reading about Blondie24, a program that learned to play checkers at a high level without human input. It was based on neural network and genetic algorithm technology. From the Wikipedia entry: "The significance of the Blondie24 program is that its ability to play checkers did not rely on any human expertise of the game. Rather, it came solely from the total points earned by each player and the evolutionary process itself." [1].
In addition to numerous journal articles, the creators wrote a lay-person book on their creation: Blondie24: playing at the edge of AI, by David B. Fogel [2].
"It uses one neural network rather than two."
and
"AlphaGo Zero only uses the black and white stones from the Go board as its input, whereas previous versions of AlphaGo included a small number of hand-engineered features."
This is amazing! The technology they came up with must be super generic.
Also, unsupervised. Also, no rollouts. They got rid of a lot of complexity. At this point it looks like a reasonable challenge to write a superhuman Go AI in 500 lines of unobfuscated python.
I was wondering about this: can we study AlphaGo Zero and other nets created in the same way for similarities, extract and study them? Or are we limited to observing the behavior and learning from that?
I will be interested to see what kind of algorithms they have used to allow AlphaGo to learn from its own moves. Are these pretty generic algos, or are these very customized and specific ones that only apply to AlphaGo and the game of Go?
They have a new reinforcement learning algorithm that should be generically applicable to anything where a long sequence of moves results in a specifically gradable outcome.
> The neural network in AlphaGo Zero is trained from games of self-play by a novel reinforcement learning algorithm. In each position s, an MCTS search is executed, guided by the neural network fθ. The MCTS search outputs probabilities π of playing each move. These search probabilities usually select much stronger moves than the raw move probabilities p of the neural network fθ(s); MCTS may therefore be viewed as a powerful policy improvement operator. Self-play with search—using the improved MCTS-based policy to select each move, then using the game winner z as a sample of the value—may be viewed as a powerful policy evaluation operator. The main idea of our reinforcement learning algorithm is to use these search operators repeatedly in a policy iteration procedure: the neural network's parameters are updated to make the move probabilities and value (p, v) = fθ(s) more closely match the improved search probabilities and self-play winner (π, z); these new parameters are used in the next iteration of self-play to make the search even stronger.
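A rough sketch of that loop in Python, assuming a `game` object for the rules, a network `net`, and a `run_mcts` function returning visit-count probabilities (all names are mine); the point is just the shape of "search as policy improvement, self-play outcome as policy evaluation":

```python
import numpy as np

def self_play_games(game, net, run_mcts, num_games=100):
    """Generate (state, pi, z) training examples by playing the current network
    against itself, using MCTS visit counts pi as the improved policy target."""
    examples = []
    for _ in range(num_games):
        state, history = game.initial_state(), []
        while game.winner(state) is None:
            pi = run_mcts(state, net)               # search probabilities, stronger than raw net output
            history.append((state, pi))
            move = np.random.choice(len(pi), p=pi)  # sample a move from the search policy
            state = game.next_state(state, move)
        z = game.winner(state)                      # +1 / -1; sign-flipping per player to move omitted here
        examples.extend((s, p, z) for s, p in history)
    return examples

# Training then just pulls (p, v) = net(s) toward (pi, z) on these examples
# and repeats: stronger net -> stronger search -> better targets.
```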
> They have a new reinforcement learning algorithm that should be generically applicable to anything where a long sequence of moves results in a specifically gradable outcome.
Statements like these always make me wonder why certain obvious things weren't tried. If it's so generic, why wasn't it tried on Chess? Or was it tried, failed to impress and thus didn't make it into the press release?
This is a big problem with all these public discussions on AI. Almost no one speaks about algorithm failures. I haven't seen a single research paper that said "oh, and we also tried the algorithm in domain X and it totally sucked".
The conventional wisdom for Chess engines is that aggressive pruning doesn't work well. Chess is much more tactical than Go, selective algorithms tend to lead to some crucial tactic being missed, and the greater the search depth, the more likely that is.
Modern Chess engines are designed to brute-force the search tree as efficiently as possible. I will go out on a limb here and say they would wipe the floor with AlphaGo, because AlphaGo's hardware would be more of a liability than an asset against a CPU.
Until I see AlphaGo Zero defeating Stockfish 100-0, and with the same algorithm defeating the best Go AI and killing the Atari games including Montezuma's Revenge, I call this hype bullshit.
Give me your results on OpenAI gym in a variety of different styles of games including GTA and WoW. I will believe you if a generic unsupervised algorithm running on a single machine is absolutely destroying the best players.
Just like Lee Se-dol is a Go grandmaster, beats Garry Kasparov at chess and can also get a perfect score in Pac-Man, right? I mean, if you can't do all of those things then are you even a human-level intelligence?
This just illustrates that surpassing "human level" performance is a silly and arbitrary benchmark, because there is no such thing as general human level performance. But I bet Kasparov would be pretty good at Go, and Sedol would be pretty good at chess.
Universality is the real hard problem of AI. In the long run, a mediocre AI that does a lot of different things is far more useful than most targeted "superhuman" AIs. Most domains simply don't require better-than-human performance, but could still reap tremendous benefits from automation.
Agreed. It's great that we have domain-specific approaches that can beat humans in their domain (and that we're learning how to make these approaches more generic so that, with re-training, they can adapt to new domains), but the real "oh snap" moment will be when we build something that's barely-adequate but widely adaptable. Something with the adaptability of a corvid or an octopus, say. If we get to that level, it'll mean we've discovered the "universal glue" that joins specialist networks together into a conscious entity.
You forgot to add "running on 20 watts of power". It's not reasonable to require it to run on a single machine, when brain performance is estimated to be more than 10 petaflops.
Doesn't this sound very much like how a human learns to play the game? MCTS ~ play/experience (move probabilities); self-play with search ~ study/analysis (move evaluation); repetition and iteration to build intuition (NN parameters).
From the paper: "it uses a single neural network, rather
than separate policy and value networks ... it uses a simpler tree search that relies upon this
single neural network to evaluate positions and sample moves, without performing any MonteCarlo
rollouts. To achieve these results, we introduce a new reinforcement learning algorithm that
incorporates lookahead search inside the training loop, resulting in rapid improvement and precise
and stable learning."
It also presumes that one can simulate the world at low cost. In AlphaGo Zero it takes 0.4 s for 1,600 node expansions, but in this case the cost of simulating the world is negligible. Anyway, assuming you need that many node expansions to get decent-quality updates, that puts a rather tight limit on the cost of simulating the world.
DM has already done a bunch of work on 'deep models' of environments to plan over. Use them and you have 'model-predictive control' and planning, and this tree extension to policy gradients would work as well (probably). It could be pretty interesting to see what would happen if you tried that sort of hybrid on ALE.
I guess deep world models are still severely riddled by all sorts of problems: vanishing gradients, BPTT being O(T), poor generalization ability of NNs (which likely is due to the lack of attractor state associative recall, as well as concept composability), lack of probabilistic message passing to deal with uncertainty, and perhaps some priors about the world are necessary to make learning tractable (such as spatial maps and fine-tuning for time scales that contain interesting information).
I'm wondering whether, once one of these algorithms comes along and has been perfected, it is going to "burn in" the domain it was built for as the target of problem reductions, similar to 8086 assembly or the QWERTY keyboard living on today despite being ancient relics.
For example, after this result it seems that if you can reduce your problem domain onto Go (or a similarly structured game) you now have a way to create a superhuman solver. It may just be easier to do that than to try to figure out how to design and tune a new network.
I could imagine waking up in 10 years being confused at why all software efforts in the AI space are focused on just figuring out clever ways to map real problems onto a hodgepodge of seemingly random "toy" domains like Go and Chess and Starcraft. Hell, maybe the Starcraft bot will immortalize Starcraft in a way the game never would have been able to if it becomes a good reduction target for a lot of domains.
It kind of reminds me of how SVMs were "abused" by twisting non-linear domains into them via kernel methods, or of how problems get reduced onto 3-SAT so a SAT solver can be thrown at them, or how ImageNet's weights are being re-purposed for other image-oriented prediction tasks.
In many domains mapping the problem to a tree search already gives you a superhuman solver or at least a passable solver. Problem mapping is what most of modern AI research is about. That's how the field was redefined in recent years. Just like Vladimir Vapnik says[1], it's becoming more engineering than science. (And sometimes more software alchemy than engineering.)
Looks like the performance improvement comes from two key ingredients:
1) Using Residual networks instead of normal convolutional layers
2) Using a smarter policy training loss that uses the full information from a MCTS at each move. In the previous version, I believe they just ran the policy network to the end of the game and used a very weak {0, 1} reinforcement signal over all of the moves played. Here, it looks like they use each run of MCTS to provide a fully supervised signal over all moves it explores.
How is it different to apply the loss on each actual move at the end of the game VS on each rollout (which is itself a tiny game)?
Does it help reinforce learning towards the end game as shorter rollouts are needed? Is the more accurate information then propagated to earlier moves as well?
I think the difference is that under 1/0 policy gradient loss, it gets feedback only on the actual chosen move. Under MCTS-rollouts-each-move, it gets feedback on every move on the board whether its value estimate was slightly too high or low plus the ultimate outcome of the 1 move it did make.
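For reference, the combined objective in the paper is a squared error on the value plus a cross-entropy between the network's move probabilities and the MCTS visit counts, with L2 regularisation. A numpy sketch (variable names are mine):

```python
import numpy as np

def alphago_zero_loss(p, v, pi, z, weights=None, c=1e-4):
    """(z - v)^2  -  pi . log(p)  +  c * ||theta||^2

    p  : network move probabilities for the position
    v  : network value estimate (scalar in [-1, 1])
    pi : MCTS search probabilities (the improved policy target)
    z  : self-play game outcome from the current player's perspective
    """
    value_loss = (z - v) ** 2
    policy_loss = -np.sum(pi * np.log(p + 1e-12))   # cross-entropy toward the search policy
    l2 = c * np.sum(weights ** 2) if weights is not None else 0.0
    return value_loss + policy_loss + l2
```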
"AlphaGo Zero is the program described in this paper. It learns from self-play reinforcement
learning, starting from random initial weights, without using rollouts, with no human supervision,
and using only the raw board history as input features. It uses just a single machine
in the Google Cloud with 4 TPUs (AlphaGo Zero could also be distributed but we chose to
use the simplest possible search algorithm)."
I remember reading ages ago in Scientific American about a much more interesting (and useful) AI application of this technique.
Genetic algorithms were used to evolve new, more efficient variants of existing electronic circuits. I dug it up - it was: https://www.scientificamerican.com/magazine/sa/2003/02-01/#a...
Article "Evolving inventions". I have no idea if there is an open-access version anywhere.
As far as I remember, that approach led to some patents, because some of the inventions were better than existing solutions. One of the examples in the article was a low-pass filter (I don't remember if the AI version was actually better or worse than the human-made one).
The essential element of this approach was that in electronics (as in Go) there exists a well-defined set of rules, which allows researchers to build a simulation engine with an optimization/evaluation function that the AI targets by itself, without supervision. It's great to see that this approach is still alive, although in my humble opinion, the application in electronics is much more interesting than Go.
Very impressive, the original implementation relied a lot on feature engineering.
I'm surprised they're able to prevent a self-play equilibrium with such a simple loss function.
It's sort of like they are using auxiliary outputs, but instead of using them to fit features, they are fitting to multiple ways of arriving at 'best play': predicting the value (SL) and predicting the probability of the best outcome (RL). In principle they're doing the same thing, but in practice it seems like they make up for each other's shortcomings (e.g. the self-play equilibrium problem with RL).
> If similar techniques can be applied to other structured problems, such as protein folding, reducing energy consumption or searching for revolutionary new materials,
Protein folding sounds like a nice idea for their next challenge.
When things will start getting interesting is when we figure out how to get move simulation and search into the network itself, rather than programming that on the outside. As far as I know, no-one has even the faintest idea of how to do that. We have an existence proof that this should be possible.
The networks are great at perception and snap-prediction. Anything a human can do in 200ms is fair game. And with clever engineering, we can make magic happen by iterating or integrating those things.
But it's after that first 200ms that humans get really intelligent. When we can come up with an architecture that lets the networks themselves start simulating possibilities, backtracking, deciding when to answer now or to think more -- when the network owns the loop -- then it will get interesting.
> We have an existence proof that this should be possible.
Not guaranteed. The human brain has diffusion signalling (i.e. neurotransmitters passing out of the synaptic cleft, into a neighbouring one, and activating a receptor on some other spatially local axon as a result). And one of those signalling molecules is thought to represent, in its intensity, a confidence-interval bias adjustment (i.e. a pruning bias factor for MCTS). So the brain's MCTS-equivalent process may rely on some extra-graphical properties of the brain-as-embodied-meat-thing.
"Neighbouring" is defined in terms of embedding in a metric space and inverse-cube diffusion, rather than anything to do with graph connectivity.
Also, these signals pile up in the synaptic cleft until they’re picked up, so it’s not just about instantaneous transmissivity as if these were radio signals.
But also also, other stuff like monoamine oxidase is floating about in its own diffusion patterns, cleaning up these signals.
It’s basically like a “scent” communication embodied-actor model, but a very complex one where things like redox reactions with the atmosphere occur.
Oh, and there are "secondary messengers": signals that trigger other signals that, among other things, inhibit the release of the original signal when received back at the sender, such that a dynamic equilibrium state is reached between the two signal types.
It would be very interesting to see whether it could handle the much more advanced and highly tuned engines that exist for chess, a game with considerably more complicated rules.
It'd be particularly useful to have a chess bot that can play badly in the same way a human does.
The problem with the current chess bots is that they play badly, badly. They choose a terrible random mistake to make every few moves, while some of their other moves are brilliant. They cannot accurately mimic beginner or intermediate level players.
This seems like something DeepMind could create, given the incentive. They were able to train AlphaGo to predict human moves in Go at a very high accuracy (obviously not with AlphaGo Zero, but the inferior human-predictive version is how they determined that AGZ is playing qualitatively differently).
I have some idea how it MIGHT work, but it would be a very boring solution involving 'learning' Stockfish's parameters and HOPING to find improvements to something like integrating time management and search/pruning into it.
I wouldn't bet on it though. SMP is notoriously hard to make work with alpha-beta search, and there are a lot of clever tricks involved (which are probably still not perfect). Maybe with ASICs you could make it stronger, but then it wouldn't be as fair a comparison.
Well, all top engines did some kind of search on parameters, not sure if you can find much improvement there.
I'm talking about something similar to what's described in the paper: a 100% self-learned solution without human heuristics, based on NNs. That could bring totally new ideas into chess.
Shogi is probably the closest historical game in terms of complexity to Go. Some of the larger variants might exceed Go's complexity if played with drops, though that's not normally done. And Go played on a 9x9 board (like standard Shogi) has a substantially lower state space complexity (and almost certainly lower by other measures as well.)
But shogi is much more obscure outside of Japan than go or chess, so it gets less interest, especially in the large-board variants.
I think that the existence of highly optimized chess AI makes it interesting from two angles:
1) Generalization: Can one make an AI using the same approach that can play both chess and Go at superhuman levels?
2) Efficiency: Can these newer methods also match or outperform in terms of compute/energy cost?
But maybe not sexy enough, or we just don't hear about it as much.
That makes it even more interesting. I think it would be very notable and significant if a neural network with MCTS and self-play reinforcement learning could surpass Stockfish, which has superhuman strength but was developed with an utterly different approach involving lots of human guidance and grandmaster input.
Giraffe attempted this (with more standard tree search than MCTS and with only a value function rather than a combined policy/value network), but only reached IM level -- certainly impressive, but nowhere close to Stockfish.
Demis Hassabis was asked this in a Q&A after a talk he gave, and according to him someone did this (bootstrapped a chess engine from self-play) successfully while still being a student, and was hired by them subsequently.
I didn't see the talk, but I'm guessing he was referring to the Giraffe engine done by Matthew Lai (https://arxiv.org/abs/1509.01549). The main thing there is that he only learns an evaluation function, not a policy. Giraffe still uses classical alpha-beta search over the full action space. AFAIK nobody has learned a decent policy network for chess, probably because 1) it's super tactical, and 2) nobody cares that much because alpha-beta is so strong
Minimax with Alpha-Beta pruning works in Chess because the search tree is way smaller. The reason all this "Monte-Carlo Tree Search + Neural Nets" machinery is being used in Go is that Minimax + Alpha-Beta pruning DOESN'T work in Go.
That's still 10 times as much energy as a human body or 100 times as much as a human brain. But yeah, it's not like they're throwing a datacenter at this.
Is AlphaGo Zero the first Go program without special code to read ladders? I'm curious how a pure neural net can read them, given how non-local they are.
The concept of locality is nothing but a human weakness in Go, the best AI must read the whole board with every move.
EDIT: From the paper: "Surprisingly, shicho (“ladder” capture sequences that may span the whole board) – one of the first elements of Go knowledge learned by humans – were only understood by AlphaGo Zero much later in training." I'm surprised by the author's use of the word "Surprisingly" here.
AlphaGo is still based around layers of 3×3 local convolutions.
That represents a strong assumption about locality in the network design. I would expect AlphaGo to perform poorly on the game "Go with the vertices randomly permuted".
>Surprisingly, shicho (‘ladder’ capture sequences that may span the whole board)—one of the first elements of Go knowledge learned by humans—were only understood by AlphaGo Zero much later in training. [0]
The catch is that this isn't quite zero human knowledge, since the tree search algorithm is a human discovery, and not one that came easily to humans. It also massively cuts down on the search space for an appropriate policy function.
That means that this setup isn't necessarily general. How applicable is MCTS to games with asymmetric information, a la Starcraft? What about games that can't quite be modeled with an alternating turn-based game tree like bughouse?
There's a Dota 2 bot by OpenAI that played games against itself and managed to beat a lot of pros in the scene. It's still SF mid only, no runes, and some restricted items, but it shows that there is also potential for Starcraft.
That's not quite what they're talking about WRT zero human knowledge.
The problem is that there's no intrinsic scoring system for Go, nothing specific to maximize, so it's difficult to tell a computer whether a given outcome is "good" or "bad". So early versions of AlphaGo used a collection of human-played Go games to get an idea of what constitutes "good" and what is "bad", so it can then train its model to predict whether a move will make things better or worse.
This new system forgoes that step, and instead has the model play itself starting at random and looking for patterns that end up winning games. It's as if you gave the rules to the game of Go to a culture that's never heard of it before, and they evolved their own play style entirely in isolation.
Their result is a model that is better than the one that was developed with human influence, and that's the interesting bit.
I understand that the paper means that they didn't train it on expert input. The significance of the research is that this is a more general way to construct a game AI. The question I am posing is how far we have to go on that front.
Yes. It'll be interesting to see if their starcraft project uses the same algorithms or not. Note that the link merely describes software that could be used for feature engineering. It doesn't describe what NN architecture or tree search algorithms deep mind is using.
> What about games that can't quite be modeled with an alternating turn-based game tree like bughouse?
Train a network which predicts future state of the game, given current state and input. Train a network which generates sensible inputs, given current state. Use MCTS.
Bughouse, starcraft, and other important games need to be modeled as simultaneous-decision games. Plain-vanilla MCTS is designed for alternating-decision games.
To see why this is important, consider why min-max (which MCTS approximates) actually works. At any given point, the equilibrium strategy for the player to move is the move that maximizes their payoff, and the utility for each move can be found recursively.
In simultaneous decision games, calculating the equilibrium strategy (which may even be a mixed strategy) is more complicated. See http://mlanctot.info/files/papers/cig14-smmctsggp.pdf for various ways in which MCTS can be extended to simultaneous-decision games.
It'll be interesting to see if DeepMind picks up a search algorithm someone else has researched, or if they come up with something entirely new.
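A toy illustration of the simultaneous-move problem, separate from anything in the paper: in matching pennies no pure strategy is stable, so a search that commits to a single "best" reply is exploitable; even naive fictitious play recovers the 50/50 mixed equilibrium.

```python
import numpy as np

# Matching pennies: both players move at once; the row player wins (+1) if the
# coins match and loses (-1) otherwise. There is no pure-strategy equilibrium.
payoff = np.array([[1, -1],
                   [-1, 1]])

counts_row, counts_col = np.ones(2), np.ones(2)
for _ in range(10000):
    # Fictitious play: each side best-responds to the opponent's empirical mixture.
    row = np.argmax(payoff @ (counts_col / counts_col.sum()))
    col = np.argmin((counts_row / counts_row.sum()) @ payoff)
    counts_row[row] += 1
    counts_col[col] += 1

print(counts_row / counts_row.sum(), counts_col / counts_col.sum())  # both approach [0.5, 0.5]
```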
And he would have still said that deep learning lacks any sort of common sense understanding that's necessary to get close to human level intelligence.
I'd certainly be surprised if I ever woke up from being cryopreserved. Which isn't to say that I'd object to the process if I had the disposable income and an understanding/cooperative family support structure, which I do not.
I wonder if it would be more popular if the cost was reduced to something similar to a regular funeral. It seems it might be a more cheery send off even if the chances of it working are questionable.
You can already buy an insurance plan that will pay for it in some states, and that's reasonably priced. In my case religious family members would never let it go down though even if the finances were solved.
One idea that occurs to me is to now evolve the Go game itself in a direction that adds more challenges for an AI to solve, and then solve those problems. How about being able to handle different and randomized board shapes? How about being allowed to name one move the opponent cannot make when you play a stone? It would be interesting to keep track of which variations the algorithm handles well automatically, and which it falls flat on, etc.
Like Arimaa, some other games were (at least partially) designed to be hard for computers: Havannah [1] and Octi [2]. Havannah has since been defeated by the machines. Octi remains unchallenged, but that is probably due to its obscurity.
This is such an impressive result, and so general, I bet many people (including me) wish they knew exactly how to duplicate this result. It would be great if they created an online course that explained all algorithms in detail right up to the creation of AlphaGo Zero itself. The paper gives the impression that it shouldn't be too hard for them to create such a course.
General AI relative to an individual human, or billions of humans? The sum total of human beings, or organizations of humans is superhuman relative to an individual. We've had superhuman organizations for millennia. I'm not sure how much general AI will be different, other than the large scale automation of jobs which would happen.
As Rodney Brooks pointed out, all technology happens within a context, not a vacuum. A general AI will come to exist in a world with a lot of other superhuman capabilities already in existence.
One of the more interesting things the success of "starting with zero" suggests is that the idea that some mystical "human consciousness" is the end goal for AI might be laughable in the long term. AI might just casually bypass human consciousness, say "oh, hi!" and wave us goodbye a day later. Also, a factor of 7 billion "happens" in computer science.
This is getting rather creepy to think of, even if it's still science fiction. At this point, I could see a computer that out-thinks humanity within decades. What would it think? What would we even do with its findings? Would we understand it? Would it understand itself? Would it know how to manipulate us?
If you multiply X by the amount of time it takes, on average, for a human to make a move... How many human lifetimes did Zero take to get to superhuman?
Amazing results, though I am somewhat frightened by how generic this model is and how it achieved such amazing results. I can't help but think that these same techniques can be used to learn how humans react in certain situations and how they can, very subtly, be nudged to think in a certain way - one that fits the agenda of whatever party is behind it.
With the mass surveillance that is Google, it's quite doable to test for human reactions to certain things. They've got the tools to execute a certain plan and evaluate its effectiveness. Of course it can also go in a benevolent way: like, what kind of policy will benefit the most people? (semantics of 'benefiting' aside)
I at least certainly hope these kinds of generic algorithms will be used to generate effective, meaningful policies that truly help people. Still a far-away future but one that gets closer by the day.
I'd only worry if it can outperform humans when there are no rules per se. That is, if I put a queen down on the Go board, start knocking off stones, move three times a turn, then take a lighter and burn the Go board, and the AI responds by decapitating me.
Ha! I do wonder about using a board game where the rules periodically change in simple ways at random. A human could easily adapt to the rule changes while playing and adjust their strategy accordingly. Would a Deep Learning algorithm be able to do this?
If we keep the board and pieces digital, then the board could change shape, the pieces could change color indicating a random association with a rule change, and what not.
What's fascinating (and admittedly somewhat worrying) about self-play is that an agent can accidentally become adept at tasks other than the intended one via transfer learning. The "wrestling spiders" in OpenAI's demo quickly mastered the art of sumo wrestling. And whatever skills they learned in resisting an opposing force to stay standing on a platform were immediately applicable to myriad different domains - in this case, being subjected to hurricane-force winds and not, as any normal spider would be, hurled into the sky!
It's more difficult to see how Go playing skills can translate to other domains. But for tasks in robotics, cybersecurity or fintech the power of self-play trained transfer learning becomes more apparent.
It is clear that these "self-play" scenarios depend on simulation - unless there is an appropriate stage for self play to take place on, there can be no play. The question is - how do we stand with simulation for robotics, self driving cars, etc.
My bet is that simulation is going to be the crowning jewel in the AI field, replacing static datasets and supervised learning with "dynamic datasets" and rewards. It would help with data sparsity as well (where can you find an image of a donkey riding an elephant for the new ImageNet? - but you can sim that or any possible combination).
Not to mention that humans have fallen head over heels for simulation as well - VR headsets and games in general. I see a great future for simulation with both AI and humans. It will be our common learning/playing/research sandbox.
Would be nice if there was an open-source attempt at an AlphaGo clone on a 9x9 board, so it could be run on commodity hardware and maybe trained in more reasonable time. It would also be interesting to see if a human would still win on a 190x190 or some arbitrary-size board against an appropriately trained AlphaGo Zero.
Is this evidence of a broader leap forward in machine learning, or are these advancements domain-specific? In other words, could these innovations be applied to other fields and applications?
I haven't read the paper yet but "AlphaGo Zero only uses the black and white stones from the Go board as its input, whereas previous versions of AlphaGo included a small number of hand-engineered features."
I think the fact that it's no longer using Monte Carlo tree search is a huge step forward in the generalizability of the technique. But go is still
- a perfect information game
- with a relatively small input size (vs. arbitrary computer vision)
- cheap to simulate
- discrete action space
- deterministic
This isn't to take away from the magnitude of the achievement, but the nature of the problem itself makes the result less applicable to many tasks we might want to use RL for.
Math research shares those qualities, except small input size if you include the body of all the already-known theorems as an input. I don't know if we'll see much smarter proof assistants soon, but it doesn't seem absurd to me as a possible development.
Actually, just converting all the already-known theorems into a form that can be computationally verified (not just convince a skilled human) would be an interesting starting point. This would really help the metamath project, and perhaps make peer review of mathematical research papers easier:
It still uses MCTS as its search algorithm. It no longer uses random rollouts as part of the evaluation, though. (Previously it was rollouts/2 + value_network/2)
Random rollouts are what the MC in MCTS stands for.
Without that, it is simply a tree search.
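In leaf-evaluation terms, the change described just above is roughly this (a sketch; `value_net` and `rollout` are placeholder callables I've named, and lam=0.5 is the earlier half-and-half mixing):

```python
def evaluate_leaf(state, value_net, rollout=None, lam=0.5):
    """Earlier AlphaGo versions mixed a fast rollout result with the value network;
    AlphaGo Zero drops the rollout entirely and trusts the value head alone."""
    v = value_net(state)
    if rollout is None:          # AlphaGo Zero style: no Monte-Carlo rollouts
        return v
    z = rollout(state)           # fast policy plays the position out, returns +1 / -1
    return (1 - lam) * v + lam * z
```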
Excerpt from the paper:
> [AlphaGo Zero] uses a simpler tree search that relies upon this single neural network to evaluate positions and sample moves, without performing any Monte-Carlo rollouts.
Does this mean it learns what to search? I wonder why they thought it was a good idea. I thought the whole point of MC was that pruning algorithms like the ones in chess wouldn't work for a larger search space.
The policy network is a function from board states to a scoring of moves. The policy network with the greedy heuristic, i.e. pick the highest-rated move with no explicit lookahead, plays at a high amateur level.
This was... unexpectedly good.
It effectively reduces the branching factor of Go from the number of moves available, to the number of moves actually worth considering.
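Concretely, selection inside the tree weights every candidate move by the policy network's prior, so moves the network considers implausible are rarely visited at all. A sketch of that PUCT-style rule (the array names and the +1 under the square root are my choices, not the paper's exact constants):

```python
import numpy as np

def select_move(Q, N, P, c_puct=1.0):
    """Pick the move maximizing Q + U, where U boosts moves the policy head
    likes (high prior P) that haven't been visited much yet (low count N).

    Q : mean action value per move
    N : visit count per move
    P : prior probability per move from the policy network
    """
    U = c_puct * P * np.sqrt(N.sum() + 1) / (1.0 + N)
    return int(np.argmax(Q + U))
```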
Extended Data Figure 2 contains something really cool: how long it takes for AG0 to discover a joseki, and how long it takes it to later discard it as non-joseki. So you could, in theory, evaluate a human joseki by plotting AG0's probability of playing it against training time (or against its Elo rating). What's also cool is that the differences that cause it to stop playing a joseki must be really minute, but it can see them anyway.
I think that we cannot fully understand the implications of that in our present/near future. The use of the technology will make the difference. There is a small book on Amazon called something like "AlphaGo Zero - 10 prophecies for the world". Interesting point of view on the use of it.
> The system starts off with a neural network that knows nothing about the game of Go. It then plays games against itself
I might have missed this, but: Where are the actual rules of Go encoded? Mustn't there be some enumeration of what constitutes "capturing," how the win condition of the game is calculated, correct?
If you give a list of "possible next states" from any given state, that should encompass all the rules. If there's a rule that's preventing you from doing something, it's not a possible state.
By nothing they mean hints about what constitutes "good Go strategy". But it implicitly knows all of the rules of go.
Surely if they are really starting with "zero", then all the AI is given is the arrangement of stones on the board (which starts empty) with the opportunity to select a position for its next stone after its opponent has placed one, until the game is over. (Let's assume that there is another piece of software responsible for determining when the game has finished, and who has won). As such, the only "rules" the AI needs are that it can only place one stone at a time, only in an empty position, and only when it is not the opponent's turn.
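Put differently, everything the learner is allowed to know about Go can sit behind an interface like this (names are mine; only legal successors and the final outcome are exposed, nothing about strategy):

```python
from typing import Iterable, Optional

class GoRules:
    """The only game knowledge exposed to the learning system, as the comments
    above describe it: legal successors and the final outcome, no strategy hints."""

    def initial_state(self):
        """An empty board, black to move."""

    def legal_moves(self, state) -> Iterable[int]:
        """Empty points that are legal to play (ko and suicide rules enforced here)."""

    def next_state(self, state, move):
        """Board after placing the stone; captures are resolved inside this method."""

    def winner(self, state) -> Optional[int]:
        """None while the game is running, otherwise +1 (black) or -1 (white) after scoring."""
```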
To start with "less than zero", though, it would be interesting to see them give the AI a 3D simulation of a room with a simulated Go board and a simulated stone, and give the AI a fixed amount of time for it to have its turn. Just by using the pixel data from a simulated camera, it could learn to use a simulated arm to place the simulated stone on the board in a legal position. The reward function would just have to say, at the end of each allotted time period, whether a legal move had been made or not, and the AI could bootstrap up from that.
As is, no. There are too many possible actions and too little time between decisions.
I wouldn't discount it entirely though, some sort of clustering of actions may be able to reduce continuous action spaces to a manageable branching factor.
On the bright side, it means that over the several thousand years of humans playing Go, we were actually going "in the right direction" in terms of optimal strategy, despite not having reduced the game down into provably optimal mathematical theorems.
Is there anywhere to see the games? I'm curious if the AI is superior to humans or just human trained AI. It'd also be interesting to see the source, but that is apparently not being released for some reason.
Theoretically, yes. But anyone who has tried to implement scientific papers will tell you it's far harder. The papers often lack critical details and implementation hacks: all the little rough edges that go into making a production system work. They also lack context in many cases, so you spend more time reverse engineering the paper than figuring out how to make it work.
I would love to know how this adversarial training doesn't end up overfitting. Or, put another way, I'd love to see another piece of software (or even a human being?) exploit Alpha Go's overfitted strategy.
Deep reinforcement learning is interesting and has plenty of potential. But highlighting AlphaGo as an example of reinforcement learning is like undermining the concepts of reinforcement learning.
This is quite an achievement, that it plays so accurately. Does this have implications for strategy? No two games are alike, yet we are able to learn from our experience. Human players must somehow be comparing local positions from various games and deciding (sometimes wrongly) that they can be played similarly. Can it recommend not just individual positions, but general modes of play for classes of positions?
Serious question: could it be possible, even in theory, to translate a neural network to a high-level programming language? So we can see what it's doing?
How do they evaluate that AlphaGo Zero is better than the previous AlphaGos? By playing them against each other? Or playing AlphaGo Zero against humans?
But what if the other AlphaGos are blind to some tactic that humans could take advantage of? Not saying this is likely given how the Lee Sedol match and others went but I'm curious how they come up with the rankings.
At the end of the paper they describe how they come up with Elo ratings, and that in order to avoid bias from self-play only, they include the results of the AlphaGos versus Fan Hui, Lee Sedol, etc.
I love what DeepMind is doing, but find their publication choices bizarre. Sure Nature has the occasional ML / NN paper, but it's not a top journal for AI / ML / CS and it makes getting hold of papers awkward and it doesn't seem in the spirit of most CS research.