If you're going to suggest something you think an LLM can't do, I think at the very least, as a show of good faith, you should try it out. I've lost count of the number of times people have told me LLMs can't do shit that they very evidently can.
I explicitly say in my response that LLMs could do it. As a show of good faith, you should try reading the entire comment.
Yes, I'm using simple examples to demonstrate a particular difference, because using "real" examples makes getting the point across a lot harder.
You're also just wrong. I did in fact test it, and both GPT 3.5 Turbo and 4o failed. Not only with the rule change, but with the mere task of providing possible moves. I only included the admission that they may succeed as a matter of due diligence: because of the randomness of sampling and the API-specific pre-prompting involved, I can't conclusively rule out that they might get the right answer on some attempt.
> "For chess board r1bk3r/p2pBpNp/n4n2/1p1NP2P/6P1/3P4/P1P1K3/q5b1 (FEN notation), what are the available moves for pawn B5"
I did read your entire comment; that's what prompted my response. From my perspective, your entire premise was based on LLMs failing at simple examples, and yet, despite admitting you thought there was a chance an LLM would succeed at your example, it didn't seem you'd bothered to check.
The argument you are making is based on the fact that the example is simple. If the example were not simple, you would not be able to use it to dismiss LLMs.
I am not surprised that GPT 3.5 and 4o failed; they are both terrible models. GPT-4o is multimodal, but it is far buggier than GPT-4. I tried with Claude 3.5 Sonnet and it got it first try. It was also able to compute the moves when told the rule change.