There is rampant misunderstanding of some parts of this article; allow me to help :)
The "no-search" chess engine uses search (Stockfish) in in two ways:
1. To score positions in the training data. This is only training data; no search is performed when actually playing.
2. To play moves when the position has many options with a 99% win rate. This is to prevent pathological behavior in already won positions, and is not "meaningful" in the objective of grandmaster-level play.
In one sense, I can understand why they would choose to use Stockfish in mate-in-N positions. The fact that the model can't distinguish between mate in 5 and mate in 3 is an implementation detail. Since the vast majority of positions are not known to be wins or draws, it's still an interesting finding.
However, in reality every position is, with perfect play, either a win (for White or Black) or a draw. One reason they gave for why Stockfish is needed to finish the game is that their evaluation function is imperfect, which is itself a notable result.
Is this in comparison to some other evaluation function which is perfect? I agree that every position is, with certainty, a win, draw, or loss under perfect play, but no engine's evaluation function is close to that level.
I do suspect that this pathological behavior could be trained out with additional fine tuning, but likely not without slightly diminishing the model's overall ability.
It comes down to: What is the evaluation for? For a human using an engine to analyze, it is about getting to more win-likely positions. And for an engine, it really is the same; plus to guide the search. Having a perfect trinary win/draw/loss would certainly be the _truth_ about a position in some objective way, but it would almost certainly not be the optimal way to win chess games against a set opponent. 1. e4 and 1. h3 are almost certainly both draws with perfect play from both sides, but the former is much more likely to net a win, especially for a human using the engine.
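For reference, engines' centipawn scores are commonly mapped to exactly this kind of win-likelihood; a minimal sketch of the Lichess-style conversion (the fitted constant is Lichess's, so treat the exact number as an assumption):

```python
import math

def win_percent(centipawns):
    """Logistic map from a centipawn evaluation to a win percentage for
    the side to move. The 0.00368208 constant is Lichess's fit, used
    here as an assumption for illustration."""
    return 100 / (1 + math.exp(-0.00368208 * centipawns))
```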
Neural networks are universal approximators. Choose a function and get enough data from it and you can approximate it very closely and maybe exactly. If initially creating function F required algorithm Y ("search" or whatever), you can do your approximation to F and then say "Look F without Y" and for all we know, the approximation might be doing things internally that are actually nearly identical to the initial F.
Is the model more efficient than Stockfish? I think Stockfish runs on a regular CPU, and I'd guess this "270M parameter transformer model" requires a GPU, but I can't find any reference to efficiency in the paper.
Also found in the paper: "While our largest model achieves very good performance, it does not completely close the gap to Stockfish 16". It's actually inferior, but they still think it's an interesting exercise. But that's the thing: it's primarily an exercise, like calculating pi to a billion decimal places or overclocking a gaming laptop.
Well I think it’s interesting to the extent that it optimizes the solution for a different piece of hardware, the TPU. Their results are also applicable to GPUs. Since the problem is highly parallelizable, we might expect a viable model to quickly approximate a more accurate evaluation, and perhaps even make up for it in volume.
Only in the same way that DNA is a cheat for constructing organisms. After all, DNA is a universal recipe for organisms. Study enough DNA sequences and splice together a sequence of your own, and you can theoretically design any mythical creature you can think of.
But if someone grew an actual dragon in a lab by splicing together DNA fragments, that would still be a major feat. Similarly, training a neural net to play grandmaster level chess is simple in theory but extremely difficult in practice.
It doesn't get the actual optimal Q-values from Stockfish (computing those exactly would take effectively unbounded compute); in fact it gets estimates from polling Stockfish for only 50ms per position.
So you're estimating from data a function which is itself not necessarily optimal. Moreover, the point is more: how far can we get using a really generic transformer architecture that is not tuned to the domain-specific details of the problem, the way Stockfish is?
No, arbitrarily wide neural networks are approximators of Borel measurable functions. Big difference between that and "any function". RNNs are Turing complete, though.
"Aready won position" or "99% win rate" is statistics given by Stockfish (or professional chess player). It is weird to assume that the same statement is true for the trained LLM since we are assessing the LLM itself. If it is using during the game then it is searching, thus the title doesn't reflect the actual work.
It's quite clear from the article that the 99% is the model's predicted win rate for a position, not its evaluation by Stockfish (which doesn't return evaluations in those terms).
It's true that this is a relatively large deficiency in practice: how strong would a player be if he played the middlegame at grandmaster strength but couldn't reliably mate with king and rook?
The authors overcame the practical problem by just punting to Stockfish in these few cases. However, I think it's clearly solvable with LLM methods too. Their model performs poorly because of an artifact in the training process where mate-in-one is valued as highly as mate-in-fifteen. Train another instance of the model purely on checkmate patterns - it can probably be done with many fewer parameters - and punt to that instead.
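A minimal sketch of one way that labeling artifact could be removed (the function name and gamma are my inventions, not the paper's):

```python
def mate_value(k, gamma=0.99):
    """Instead of mapping every mate-in-k to the same maximal value bin,
    decay the target slightly with distance to mate so that mate-in-1
    outranks mate-in-15. gamma is an assumed discount factor."""
    return gamma ** k  # mate-in-1 -> 0.99, mate-in-15 -> ~0.86
```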
Human players have this concept of progress. I couldn't give a good succinct description of exactly what that entails, but basically if you are trading off pieces that's progress, if your king is breaking through the defensive formation of the pawn endgame that's progress. If you are pushing your passed pawn up the board that's progress. If you are slowly constricting the other king that's progress.
When we have a won position we want to progress and convert it to an actual win.
I think the operational definition I would use for progress is a prediction of how many more moves the game will last. A neural network can be used for that.
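A minimal sketch of that idea, assuming a hypothetical 768-dimensional board encoding (12 piece planes x 64 squares):

```python
import torch
import torch.nn as nn

# Small regression net predicting how many half-moves remain in the game;
# the architecture and input size are assumptions for illustration.
moves_left_net = nn.Sequential(
    nn.Linear(768, 256), nn.ReLU(),
    nn.Linear(256, 1),  # target: remaining half-moves until game end
)

def progress(board_tensor):
    """Fewer predicted remaining moves = more progress toward conversion."""
    with torch.no_grad():
        return -moves_left_net(board_tensor).item()
```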
>> 1. To score positions in the training data. This is only training data, no search is performed when actually playing.
That's like saying you can have eggs without chickens, because when you make an omelette you don't add chickens. It's completely meaningless and a big fat lie to boot.
The truth is that the system created by DeepMind consists of two components: a search-based system used to annotate a dataset of moves and a neural-net based system that generates moves similar to the ones in the dataset. DeepMind arbitrarily draw the boundary of the system around the neural net component and pretend that because the search is external to the neural net, the neural net doesn't need the search.
And yet, without the search there is no dataset, and without the dataset there is no model. They didn't train their system by self-play and they certainly didn't hire an army of low-paid workers to annotate moves for them. They generated training moves with a search-based system and learned to reproduce them. They used chickens to make eggs.
Their approach depends entirely on there being a powerful chess search engine and they wouldn't be able to create their system without it as a main component. Their "without search" claim is just a marketing term.
Neither of those has been shown to produce equivalent training data, no.
They should do one of those instead of using search before they claim it’s possible to not use search.
Or to borrow your analogy, you’ll need to show me a duck egg to prove you can make omelettes without chickens. Making an omelette from chicken eggs and claiming hypothetically some mystery other animal could have done it is nonsense.
The point -- which I don't think you got -- is that extremely generic ingredients like high-quality data (which is the role of Stockfish here) and very deep transformer-type neural networks are enough to nearly match the performance of ad-hoc, non-generalisable techniques like game-tree search algorithms.
This has two possible applications: 1. There's far less need to invent techniques like MCTS in the first place. 2. A single AI might be able to play grandmaster level chess by accident.
The catch is you need high quality data in large amounts.
I did get the point, and I'm commenting that the point is missing the point. There is nothing new in learning that a large neural net can approximate the output of a classical system; this has been done many times before. The real point is that DeepMind built a system that is half search and pretend it's no-search. You cannot get the "high-quality data" without a classical system, not in chess.
Btw, just to be a bit more constructive (not by much): the proper term for what DeepMind did is "neuro-symbolic AI". But DeepMind shunned the term even for AlphaGo, a system comprised of a couple of neural nets and Monte-Carlo Tree Search.
The whole thing is just political: DeepMind use neural nets, GOFAI is dead and that's the way to AI. That's their story and they're sticking with it.
It's more like saying you can make omelette without killing chickens, even though chickens were clearly involved at some point. So I see your point, that this doesn't allow grandmaster level chess play with no search at any point, but I also think it's fair to say that this approach allows you to use search to build an agent which can play grandmaster-level chess without, itself, using search.
> That's like saying you can have eggs without chickens, because when you make an omelette you don't add chickens.
I just took it in the same way as saying that being a vegetarian is generally better for animal welfare: you're not harming chickens as directly by eating an omelette as you would by eating their wings.
> That's like saying you can have eggs without chickens, because when you make an omelette you don't add chickens. It's completely meaningless and a big fat lie to boot.
It's like saying ChatGPT isn't a human brain.
It was trained with human brains. But it isn't a human brain.
A quote from discord: "apparently alpha-zero has been replicated in open source as leela-zero, and then leela-zero got a bunch of improvements so it's far ahead of alpha-zero. but leela-zero was barely mentioned at all in the paper; it was only dismissed in the introduction and not compared in the benchmarks. in the stockfish discord they are saying that leela zero can already do everything in this paper including using the transformer architecture."
Leela Zero was an amazing project improving on AlphaZero, showing the feasibility of large-scale training with contributed cycles, and snatching the TCEC crown in Season 16.
It forced Stockfish to up its game, essentially by adopting neural techniques itself (though of a different type; Stockfish uses NNUE).
How much of this "grandmaster-level" play is an artifact of low time controls? I notice they only achieve GM ELO in Blitz against humans, achieve significantly worse ELO against bots, and do not provide the "Lichess Blitz ELO" of any of their benchmark approaches.
All of it? Inference time is determined by the number of parameters, which can't vary with the time control; for longer time controls you'd need a larger network. For those who don't know chess: the quality of high-level play in five-minutes-each games is much lower than in 90-minutes-each games.
I wonder if one could make a neural net play human-like, at various levels, for instance by training smaller or larger nets. And by human-like, I don't mean Elo level, but more like a Turing test: "does this feel like playing against a human?"
I wonder how many time-annotated chess play logs are out there. (Between humans, I mean.)
I suppose varying the neural net size wouldn't be the best way of doing that; very small nets can have very "unhuman-like" behaviour. I'm not an expert on reinforcement learning, but for other fields in deep learning that's typically the case.
I think that, to simulate weaker human-like players, it would be better to just increase the temperature: don't always select the best move, at every step just select one of the top 10, randomly, in proportion to some function of the model-predicted probability of each being "the best" move (e.g. a power of the probability: very large powers always give the best move, i.e. the strongest player, and powers close to 0 choose nearly uniformly at random, i.e. the weakest player). The only thing I'm not certain about is whether, if you train the original network well enough, stupid blunders (the kind a very bad human player like me would make) are scored so low that there's no way this algorithm picks them up - the only way to know would be to try.
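A minimal sketch of that sampling rule (the function and parameter names are mine, not from any paper):

```python
import numpy as np

def pick_move(moves, probs, k=10, power=4.0, seed=None):
    """Sample among the top-k moves, weighted by a power of the model's
    probability. power -> infinity approaches always playing the best
    move (strongest); power -> 0 approaches uniform over the top k
    (weakest)."""
    rng = np.random.default_rng(seed)
    top = np.argsort(probs)[::-1][:k]  # indices of the k most likely moves
    w = probs[top] ** power            # sharpen (or flatten) the distribution
    w /= w.sum()                       # renormalize to a probability
    return moves[top[rng.choice(len(top), p=w)]]
```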
> don't always select the best move, at every step just select one of the top 10
Engines already do this when you turn down their skill level. It does not lead to human-like play.
The problem is that bad (or just non-expert) human players don't make completely random mistakes. They tend to make very specific types of mistakes. For example, they may miss certain types of tactics, or underestimate king safety, or forget about hanging pieces.
In order to make a bot that feels like a human, you need to somehow capture the specific weaknesses that human players have.
Maia chess did this. It works OK for low to mid Elo levels (around 50% move-matching accuracy). But their project also didn't use any search; it just directly predicted the move. Humans actually do perform search, so a more accurate model at higher Elos will probably need to do something like that. However, humans don't do a complete game-tree search like Stockfish, and we don't do full game rollouts like lc0 either.
While its performance against humans is very impressive indeed, its performance against engines is somewhat less so:
> Our agent's aggressive style is highly successful against human opponents and achieves a grandmaster-level Lichess Elo of 2895. However, we ran another instance of the bot and allowed other engines to play it. Its estimated Elo was far lower, i.e., 2299. Its aggressive playing style does not work as well against engines that are adept at tactical calculations, particularly when there is a tactical refutation to a suboptimal move.
I like this even more now that I've read that. It sounds like it makes this agent a much more human-like player than the perfect-calculator traditional chess engines. It may end up being more fun for humans to play against if it's strong but has holes in its play.
Not really. Mikhail Tal was easily one of the strongest calculators in chess history. Definitely the strongest in his time besides maybe Fischer.
The idea that Tal mostly made dubious sacrifices is largely a myth, heavily based on a joke he himself made. In actual fact he always did deep calculation and knew that no easy refutation existed, and that he had a draw by perpetual check in hand (until beaten by Ding a few years ago, Tal actually held the record streak of unbeaten games in classical chess). He was making calculated risks, knowing his opponents would be unlikely to out-calculate him. He also had a very deep understanding of positional play; he just had a very different style of expressing it, relying more on positional knowledge to create sharp positions centered around material imbalance.
Well, "without explicit search" would probably be more accurate.
They do note that in the paper, though:
>Since transformers may learn to roll out iterative computation (which arises in search) across layers, deeper networks may hold the potential for deeper unrolls.
We don’t know if it’s using implicit search either. While it would be interesting if the network was doing some internal search, it’s also possible it has just memorized the evaluations from 10M games and is performing some function of the similarity of the input to those previously seen.
Even if it's "implicit" I'm not sure if that matters that much. The point is that the model doesn't explicitly search anything, it just applies the learned transformation. If the weights of the learned transformation encode a sort of precomputed search and interpolation over the dataset, from an algorithmic perspective this still isn't search (it doesn't enumerate board states or state-action transitions).
>performing some function of the similarity of the input to those previously seen.
This is indeed what transformers do. But obviously it learns some sort of interpolation/extrapolation which lets it do well on board states/games outside the training set.
I agree. From a practical perspective what matters is that it's distilling Stockfish's search to a certain extent, which could have computational efficiencies. "Without search" just means we're not doing anything like minimax or MCTS.
If the Transformer was 'just' memorizing, you would expect width scaling to work much better than depth scaling (because width enables memorization much more efficiently), and you also wouldn't expect depth to run into problems, because it's not like memorization is that complex - but it does suggest that it's learning some more complicated algorithm which has issues with vanishing gradients & learning multiple serial steps, and the obvious complicated algorithm to be learning in this context would be an implicit search akin to the MuZero RNN (which, incidentally, doesn't need any symbolic solver like Stockfish to learn superhuman chess from scratch by self-play).
>We don’t know if it’s using implicit search either.
Sure
>it’s also possible it has just memorized the evaluations from 10M games and is performing some function of the similarity of the input to those previously seen.
That's not possible. The set of possible chess games is incredibly large, and it is incredibly easy to play a game that has diverged from the training data. A model that had just memorized all the evaluations would break within ten or so moves, tops, much less withstand robust evaluations.
If it could reliably win a mate in N position without inexplicably blundering, I would be more inclined to buy your search hypothesis. But it doesn’t, which is one of the reasons the authors gave for finishing with stockfish. So whatever it’s doing is clearly lossy which an actual search would not be.
Neural nets memorize all sorts of things. They memorize ad clicks in high dimensional state spaces. Transformers trained on the whole internet can often reproduce entire texts. It’s lossy, but it’s still memorizing.
That seems like the simplest explanation for what’s happening here. There’s some sort of lossy memorization, not a search. The fact that the thing it has memorized is the result of a search doesn’t matter.
>If it could reliably win a mate in N position without inexplicably blundering, I would be more inclined to buy your search hypothesis.
I don't have a "search hypothesis". I don't know what strategy the model employs to play. I was simply pointing out that limited search learned by the transformer is not out of the question. Stockfish finishing is not necessary to play chess well above the level a memorization hypothesis makes any sense. This is not the first LLM chess machine.
>Neural nets memorize all sorts of things. They memorize ad clicks in high dimensional state spaces. Transformers trained on the whole internet can often reproduce entire texts. It’s lossy, but it’s still memorizing.
Intelligent things memorize. Humans memorize a lot. I never said the model hasn't memorized a fair few things; many human chess grandmasters memorize openings. What I'm saying is that it's not playing games via memorization any more than a human is.
>That seems like the simplest explanation for what’s happening here. There’s some sort of lossy memorization, not a search.
The options aren't only lossy memorization or lossless search.
Given that they used position evaluation from (a search chess engine[1]) Stockfish, how is this "without search"?
Edit: looking further than the abstract, this is rather an exploration of scale necessary for a strong engine. Could go without "without search" in the title I guess.
[1]: IIRC, it also uses a Leela-inspired NN for evaluation.
Leela without search supposedly plays around expert level, but I thought the no-search Leela approach ran out of gas around there. "Without search" there means evaluating 1 board position per move. The engine in the paper (per the abstract) uses a big transformer instead of a Leela-style DCNN.
So it's a space-time trade-off then? Store enough searched-and-weighted positions into the model and infer from them. In this way, inference replaces Stockfish's search: less accurate, but much faster, at the memory cost of the model.
Does Stockfish really use a Leela-inspired NN? I thought the NNUE was independently developed and completely different (it's a very tiny network that runs on the CPU).
Yeah, NNUE is a separate invention that, unfortunately, DeepMind often gets undeserved credit for inspiring. It didn't even originate in chess engines but in a shogi version of Stockfish. The architecture is completely different from the nets in Leela or AlphaZero.
Wait, so progress on Stockfish would have happened regardless of AlphaZero? I always thought the newer versions were inspired by it, and got a much improved rating from incorporating it.
Well, NNUE is surprisingly similar to what Stockfish was doing before NNUE. Before, it used what are called piece-square tables. The basic idea (the Stockfish evaluator had a lot more going on in addition, using multiple tables and interpolating between them based on game phase) is to assign some heuristic value to every square, for every piece. So it's just a 6x8x8 array that maps piece positions to values.
To get the evaluation of the whole position, you add up all of these mappings for the pieces on the board with opposite signs for the opposing players.
If you blur your eyes a little, this already looks a lot like a neural net. It's just a big summation of terms, and if you leave in a 0*(whatever value) for every piece that's not present, you've effectively embedded your lookup table into a giant mathematical expression that can be optimised by gradient descent.
The reason computer shogi programmers stumbled on this is that they were experimenting with adding more dimensions to the piece-square table, specifically via indexing by king position as well. So now you have 4 or 5 dimensions, making for a pretty massive array. Hand-tuning all the values becomes less and less feasible, and so I think discovering this idea of rearchitecting it as a neural-net was more or less inevitable.
So NNUE is actually just a pretty natural evolution of what they were doing before.
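For concreteness, a toy version of that pre-NNUE evaluator (the values here are invented for illustration; real tables are carefully tuned, and Stockfish's was far more elaborate):

```python
import numpy as np

PAWN, KNIGHT, BISHOP, ROOK, QUEEN, KING = range(6)

# 6 piece types x 8 ranks x 8 files of heuristic centipawn values.
PST = np.zeros((6, 8, 8))
PST[PAWN, 1:7, :] = 100    # flat pawn value on the playable ranks...
PST[PAWN, 4:6, 3:5] += 20  # ...plus a made-up central-advance bonus

def evaluate(white_pieces, black_pieces):
    """Each list holds (piece_type, rank, file) tuples. The evaluation is
    the big summation described above: table lookups added with opposite
    signs for the two sides."""
    score = sum(PST[p, r, f] for p, r, f in white_pieces)
    return score - sum(PST[p, r, f] for p, r, f in black_pieces)
```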
This is true, but at least for a while (I’m not sure if it’s still the case), Leela data was used (along with data generated from Stockfish self-play) to train Stockfish’s NN.
Slightly off topic, but am I the only one that approaches strategy games by making a "zeroth order approximation"? E.g. find the shortest path to victory under the (obviously faulty) assumption that my opponent does nothing and the board is unchanging except for my moves. Now find my opponent's shortest path to victory under the same assumption. Then evaluate: if we both just ignore each other and try to bum-rush the victory condition, who gets there first?
For most games, if you can see a way to an end state within 3-5 steps under these idealized conditions, there's only so much that an actual opponent can do to make the board deviate from the initial static board state that you used in your assumption. The optimal strategy will always be just a few minor corrections of edit distance from this dumb no-theory-of-mind strategy. You can always be sure that whoever has the longer path to victory has to do something to interfere with the shorter path of their opponent, and there's only ever so many pieces which can interact with that shorter path. Meaning whatever path to victory is currently shortest short circuits the search for potential moves.
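In pseudo-Python, the idea reads roughly like this (everything here, including the unopposed_path helper and the state API, is hypothetical):

```python
def zeroth_order_move(state, me, them, unopposed_path):
    """unopposed_path(state, player) is an assumed helper that finds a
    winning line while pretending the other player never moves."""
    mine = unopposed_path(state, me)
    theirs = unopposed_path(state, them)
    if len(mine) <= len(theirs):
        return mine[0]  # I win the race: follow my own fastest plan
    # Otherwise I must interfere: play a move touching their shorter path.
    for move in state.legal_moves(me):
        if move.interferes_with(theirs):  # hypothetical predicate
            return move
    return mine[0]
```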
On beginner level this might work, but as people get more competitive they begin to realize the benefit of not only playing their own game but also reading the enemy's plan (e.g. scouting in Starcraft/AoE2) to counteract it as much as possible.
Chess against humans is different. Usually, there is no path to victory, only to a draw. People just follow strategic plans that other people told them would be slightly beneficial later on. While following that strategic plan, people mess up, and the first one to realize that the opponent messed up usually wins. Having a material advantage of 2-3 points is usually considered a win already.
I agree with this. My default approach to board games is basically to maximize victory points early. This usually works; when 4 people are playing a new game for the first time, I usually win. This doesn't really work when people know how to play the game specifically, though.
I think this algorithm is better than many other algorithms that people come up with, however.
(As an aside, when I play a card game I sort my cards with a merge sort instead of an insertion sort. People said you would never use these algorithms in real life, but you can if you want to!)
This comment got me thinking about how I sort my cards. I scan the whole hand, then make the biggest changes first (e.g. consolidating suits/card types), then sort the subgroups.
Huh, guess I'm doing a sort of human heuristic version of Quicksort
The interesting thing about sorting cards for a trick taking game is often standard algorithms don't directly apply, because you don't have a fixed end state. I choose the order of the suits based on how the cards are currently laid out, as long as I end up with red and black alternating.
If you programmed this as a chess strategy, it would probably result in an engine that played the Scholar's mate every game. This is actually close to what low-Elo players do in chess, but as you get closer to 800-ish Elo the probability of attempted Scholar's mates drops dramatically (likely because the opening isn't that good).
This is one of the canonical ways people learn chess. It's not that bad of a way to play because it emphasizes thinking about good moves, and it efficiently finds mates in N (when done by a human)
In higher-level play it usually loses to opponents who are aware they're not playing alone; at least that's the case with bots that do in fact stay unaware of their opponent.
A lot of decisions in chess go like "this square would be nice for that piece, how can I get it there?", followed by analyzing what your opponent can do to prevent it, or what counterplay it gives them. So what you are doing makes a lot of sense.
It's called the null-move heuristic [1]. The assumption is that if a move can't win even when the opponent doesn't respond to it, then it is probably a pretty bad move. It's pretty standard in traditional chess search engines, though I don't think AlphaZero's Monte Carlo search uses it.
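For the curious, a bare-bones negamax sketch of the standard engine use (null-move pruning); the board API names here are assumptions, not any specific library:

```python
def search(pos, depth, alpha, beta):
    """Alpha-beta (negamax) with null-move pruning: let the side to move
    'pass'; if a reduced-depth search still beats beta, the real moves
    are even better, so we can prune this node."""
    if depth == 0:
        return evaluate(pos)  # assumed static evaluation function
    R = 2  # conventional null-move depth reduction
    if depth > R and not pos.in_check():
        pos.make_null()       # assumed API: side to move passes
        score = -search(pos, depth - 1 - R, -beta, -beta + 1)
        pos.unmake_null()
        if score >= beta:
            return beta       # null-move cutoff
    for move in pos.legal_moves():
        pos.make(move)
        alpha = max(alpha, -search(pos, depth - 1, -beta, -alpha))
        pos.unmake(move)
        if alpha >= beta:
            break             # ordinary beta cutoff
    return alpha
```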
This would never work in chess, because there is almost always a checkmate in a few moves under the assumption that the opponent doesn't defend. So this engine would just play way too aggressively and over-extend its pieces.
Take Scholar's mate, for example. You can win in 4 moves from the initial position, and it is an easy win if the opponent doesn't defend it, but against someone who knows chess it is a horrible opening, because it is easy to defend and leaves you in a weak position.
I guess this current method is not applicable to having the model explain why a given move was played, as it is not planning more than one move ahead. Very cool, nonetheless.
I think this is an interesting finding from a practical perspective. A function which can reliably approximate Stockfish at a certain depth could replace it, basically "compressing" search to a set depth. And unlike NNUE, which is optimized for CPU, a neural network is highly parallelizable on GPU, meaning you could send all possible future positions (at depth N) through the network and use the results for a primitive tree search.
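A rough sketch of that one-ply batched idea (encode() and the position API are assumed helpers; model is a value net returning a win probability for the side to move):

```python
import torch

def best_move(model, position, device="cuda"):
    """Score every child position in a single GPU batch. Each child is
    evaluated from the opponent's perspective, so we pick the move that
    leaves the opponent worst off."""
    moves = list(position.legal_moves())
    children = [encode(position.play(m)) for m in moves]  # assumed helpers
    batch = torch.stack(children).to(device)
    with torch.no_grad():
        values = model(batch).squeeze(-1)  # one forward pass, all children
    return moves[int(values.argmin())]
```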
The Stockfish installer is ~45 MB. At 16 bits per parameter, the 270M model would be over 500 MB (270 × 10^6 parameters × 2 bytes ≈ 540 MB). The 9M model would be smaller than Stockfish, but you could probably find a smaller chess engine that achieves 2000 Elo.
Dedicated chess computers were hitting 2000 Elo with an 8-bit 6502 running at <10MHz in the late 1980s. The Novag Super Expert had 96KB of ROM, which also included the opening book, so yeah, quite a bit smaller.
The advantage of this approach is that we can run many simultaneous computations on the GPU/TPU. Instead of using maybe a dozen CPU threads, we can approximate the value of a few thousand positions at the same time.
“To prevent some of these situations, we check whether the predicted scores for all top five moves lie above a win percentage of 99% and double-check this condition with Stockfish, and if so, use Stockfish’s top move (out of these) to have consistency in strategy across time-steps.”
> Indecisiveness in the face of overwhelming victory
> If Stockfish detects a mate-in-k (e.g., 3 or 5) it outputs k and not a centipawn score. We map all such outputs to the maximal value bin (i.e., a win percentage of 100%). Similarly, in a very strong position, several actions may end up in the maximum value bin. Thus, across time-steps this can lead to our agent playing somewhat randomly, rather than committing to one plan that finishes the game quickly (the agent has no knowledge of its past moves). This creates the paradoxical situation that our bot, despite being in a position of overwhelming win percentage, fails to take the (virtually) guaranteed win and might draw or even end up losing since small chances of a mistake accumulate with longer games (see Figure 4). To prevent some of these situations, we check whether the predicted scores for all top five moves lie above a win percentage of 99% and double-check this condition with Stockfish, and if so, use Stockfish’s top move (out of these) to have consistency in strategy across time-steps.
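Mechanically, the quoted rule amounts to something like this (the model/stockfish interfaces are assumed, not the paper's actual code):

```python
def choose_move(model, position, stockfish):
    """If the model's top five moves all exceed a 99% predicted win
    probability, defer to Stockfish restricted to those five moves;
    otherwise just play the model's top choice."""
    top5 = model.top_moves(position, k=5)  # assumed: [(move, win_prob), ...]
    if all(p > 0.99 for _, p in top5):
        return stockfish.best_of(position, [m for m, _ in top5])
    return top5[0][0]
```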
That's a crucial part of chess that can't simply be swept under the rug. If I had won all the winning positions I've had over the years I'd be hundreds of points higher rated.
What if a human only used Stockfish in winning positions? Is it cheating? Obviously it is.
The process of converting a completely winning position (typically one with a large material advantage) is a phase change relative to normal play, which is the struggle to achieve such a position. In other words, you are doing something different at that point. For example, I, as a weak FIDE CM (Candidate Master), could not compete with a top grandmaster in a game of chess, but I could finish off a trivial win.
Edit: Recently I brought some ancient (1978) chess software back to life: https://github.com/billforsternz/retro-sargon. These two phases of chess, basically two different games, were quite noticeable with that program, which is chess software stripped back to the bone. Sargon 1978 could play decently well, but it absolutely did not have the technique to convert winning positions (because this is a different challenge to regular chess). For example, it could not in general mate with rook (or even queen) and king against bare king. The technique of squeezing the enemy king into a progressively smaller box was unknown to it.
> If Stockfish detects a mate-in-k (e.g., 3 or 5) it outputs k and not a centipawn score. We map all such outputs to the maximal value bin (i.e., a win percentage of 100%). Similarly, in a very strong position, several actions may end up in the maximum value bin. Thus, across time-steps this can lead to our agent playing somewhat randomly, rather than committing to one plan that finishes the game quickly (the agent has no knowledge of its past moves). This creates the paradoxical situation that our bot, despite being in a position of overwhelming win percentage, fails to take the (virtually) guaranteed win and might draw or even end up losing since small chances of a mistake accumulate with longer games (see Figure 4). To prevent some of these situations, we check whether the predicted scores for all top five moves lie above a win percentage of 99% and double-check this condition with Stockfish, and if so, use Stockfish's top move (out of these) to have consistency in strategy across time-steps.
So they freely admit that their thing will draw or even lose in these positions. It's not merely making the win a little cleaner.
Yes. So how is this irrelevant for qualifying as GM-level play then? Being able to play these positions is a clear prerequisite for even being in the ballpark of GM strength. If you regularly choke in completely winning endgames, you'll never get there.
This is cheating, plain and simple. It would never fly in human play or competitive computer play. And it's most definitely disingenuous research. They made an engine, it plays a certain level, and then they augment it with preexisting software they didn't even write themselves to beef up their claims about it.
> If you regularly choke in completely winning endgames, you'll never get there.
Except we're talking about moves where no human player would choke because they are basically impossible to lose except by playing at random (which is what the bot does).
It makes no sense to try and compare to a human player in the same situation, because no human player could at the same time end up in such a position against a strong opponent and be unable to convert it once there…
It's basically a bug, and what they did is just working around this particular bug in order to have a releasable paper.
That must mean the model has learned some similarity metric for high-level chess, which is very impressive. In chess, one pawn moving one square can be the difference between a won and a lost position, and knowing that usually requires lots of calculation.
I'd be interested to see how well the Elo holds up when the model is quantized.
At INT8, a small transformer like this could have pretty amazing speed and efficiency on an Edge TPU or other very low-power accelerator chip. The question then becomes: is it faster / more efficient than Stockfish 16 on a similarly powered CPU? As we've seen with LLMs, they can be extremely speedy when quantized, with all the stops pulled out on hardware-efficient inference, compared to raw FP16 and naive implementations.
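For a quick experiment, stock PyTorch can do dynamic INT8 quantization of the linear layers; a minimal sketch, assuming a hypothetical saved checkpoint:

```python
import torch

model = torch.load("chess_transformer.pt")  # hypothetical checkpoint name
model_int8 = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
# Then re-run the Elo benchmark with model_int8 to measure the trade-off.
```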
I still want to see some examples of a "mistake" in the training data getting detected or reduced.
For example, somewhere in the training data the string "All cats are red" should get detected when lots of other data in the training set contradicts the statement.
And obviously it doesn't have to be simple logical statements, but also bigger questions like "how come the 2nd world war happened despite X person and Y person being on good speaking terms as evidenced by all these letters in the archives?"
When AI can do that, it should be able to turn our body of knowledge into a much bigger/more useful one by raising questions that arise from data we already have, but never noticed.
I immediately saw this and knew it was BS. It's a search problem; even the human brain does a search. Internally, the model is scanning each position and determining the next probable position. That is a predictive search; you can't just restructure the problem.
Now, arguably it's doing it differently, maybe? But it's still a search.
I don't follow... even if it's trained and doesn't use search, isn't the act of deciding the next move a sort of search anyway, based on its training? I've heard people describe LLMs as extremely broad search: basically attempting to build a world model and then predicting the next world based on that. Is this fundamentally different from search? Am I wrong in my assumptions here?
We know the model is approximating the results of a search. We don’t know whether it is actually searching.
At the most basic level, the model is just giving probabilities for the next moves, or in the case of value approximation, guessing which bucket the value falls into.
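For those unfamiliar with the setup, the value buckets work roughly like this (K is an assumption here, though the paper discretizes win percentages into bins in a similar way):

```python
K = 128  # assumed number of uniform value bins

def to_bin(win_prob):
    """Discretize a win probability in [0, 1] into one of K buckets,
    turning value estimation into a classification problem."""
    return min(int(win_prob * K), K - 1)

def from_bin(b):
    """Recover the bucket midpoint as the value estimate."""
    return (b + 0.5) / K
```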
Text generation without human writers
Image generation without human artists
Video generation without production crews
The rub is that all the training data was produced by lots and lots of search, human writing, human artists, and human production crews. At every step, they curated the results and generated a beautiful latent space that the model was then trained on.
The impressive part about AlphaZero is that it did the Monte-Carlo Tree Search itself, without being trained on human decision-making.
While this ... well, this is just taking credit for all the search and work that was done by Stockfish, or humans, at a massive scale over centuries and then posted nearly for free online. It's then using all that data and generating similar stuff. Whoop de doo. Oh, and it even used cheap labor for the last mile, too.
It's not the same thing as actual search to, for example, automatically derive scientific laws (e.g. Kepler's laws of planetary motion) from raw data fed to it (e.g. observations of planetary positions). An AI doing that can actually model the real space, not the latent space. It can go out and learn without humans or Stockfish massively bootstrapping its knowledge.
I mean, don't get me wrong ... learning a lot about the latent space is what students strive to do in schools and universities, and the AIs are like a very smart student. In fact, they can be huge polymaths and polyglots and therefore uncover a lot of interesting connections and logical deductions from the latent space. They can do so at a huge scale... and I have often said that swarms of AIs will be unstoppable. So at the end of the day, although this isn't very impressive when it comes to the credit of who did the search and curation, AI is going to be extremely impressive with what it can do with the results on the next N levels.
I would very much prefer a model trained on human data that had absorbed a part of human values, than a completely de novo intelligence that bootstrapped itself using experimentation in the physical world.
"We annotate each board in the dataset with action-values provided by the powerful Stockfish 16 engine, leading to roughly 15 billion data points [...] without any domain-specific tweaks or explicit search algorithms."
This used to be a comforting thought whenever computers beat humans in chess, but I think that time has passed. The paper mentions AlphaZero [1], which has beaten AlphaGo, which beat Lee Sedol back in 2016 [2].
My pithy comment probably wasn't enough to express what I meant. :)
I know that computers have already beaten humans at Go. But what's interesting is that in both the chess and Go cases, a lot of real-time compute was necessary to win the games. Now we have a potential way to build the model ahead of time such that the computer during interactive play is much smaller.
This means that we can be much more portable with the solution, and it also means that for online game companies, they can spend a lot less money on gameplay, especially if gameplay is most of their compute.
The "no-search" chess engine uses search (Stockfish) in in two ways:
1. To score positions in the training data. This is only training data, no search is performed when actually playing.
2. To play moves when the position has many options with a 99% win rate. This is to prevent pathological behavior in already won positions, and is not "meaningful" in the objective of grandmaster-level play.
Thus, "without search" is a valid description.