Well, not everyone. I wasn't the only one to mention this, so I'm surprised it didn't show up in the list of theories, but here's e.g. me, seven days ago (source https://news.ycombinator.com/item?id=42145710):
> At this point, we have to assume anything that becomes a published benchmark is specifically targeted during training.
This is not the same thing as cheating/replacing the LLM output, the theory that's mentioned and debunked in the article. And now the follow-up adds weight to this guess:
> Here’s my best guess for what is happening: ... OpenAI trains its base models on datasets with more/better chess games than those used by open models. ... Meanwhile, in section A.2 of this paper (h/t Gwern) some OpenAI authors mention that GPT-4 was trained on chess games in PGN notation, filtered to only include players with Elo at least 1800.
To me, it makes complete sense that OpenAI would "spike" their training data with data for tasks that people might actually try. There's nothing unethical about this. No dataset is ever truly "neutral", you make choices either way, so why not go out of your way to train the model on potentially useful answers?
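For concreteness, here is a minimal sketch (not OpenAI's actual pipeline) of the kind of Elo-based PGN filtering the quoted paper describes, written with the python-chess library; the file names are placeholders:

```python
# Sketch: keep only PGN games where both players are rated >= 1800,
# the threshold mentioned in the quoted paper. Not OpenAI's pipeline;
# "games.pgn" and "filtered.pgn" are hypothetical file names.
import chess.pgn

MIN_ELO = 1800

def keep(game: chess.pgn.Game) -> bool:
    """True if both players have a parseable Elo of at least MIN_ELO."""
    try:
        white = int(game.headers.get("WhiteElo", 0))
        black = int(game.headers.get("BlackElo", 0))
    except ValueError:  # missing or malformed Elo tags ("?", "", etc.)
        return False
    return white >= MIN_ELO and black >= MIN_ELO

with open("games.pgn") as src, open("filtered.pgn", "w") as dst:
    while (game := chess.pgn.read_game(src)) is not None:
        if keep(game):
            print(game, file=dst, end="\n\n")  # str(game) emits full PGN
```

The point is just that this kind of filter is trivial to apply at scale, so "spiking" the corpus with strong games is a cheap design choice, not an elaborate trick.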
I suggested that they may have trained the model to be good at chess to see if it helped with general intelligence, just as training on math and code seems to improve other aspects of logical thinking. After all, OpenAI has a lot of experience with game-playing AI. https://news.ycombinator.com/item?id=42145215
I think this is a little paranoid. No one is training extremely large expensive LLMs on huge datasets in the hope that a blogger will stumble across poor 1800 Elo performance and tweet about it!
'Chess' is not a standard LLM benchmark worth Goodharting; OA has generally tried to solve problems the right way rather than by shortcuts & cheating, and the GPTs have not heavily overfit on the standard benchmarks or the famous counterexamples that they so easily could have (imagine how trivial it would be to train on, say, 'the strawberry problem'), which would be so much more valuable PR, whereas some other LLM providers do see their scores drop much more in anti-memorization papers; they have a clear research use of their own in that very paper mentioning the dataset; and there is some interest in chess as a model organism of supervision and world-modeling in LLMs, because we have access to oracles (and it's less boring than many things you could analyze), which explains why they would be doing some research (if not a whole lot). Like the bullet chess LLM paper from DeepMind: they aren't doing that as part of a cunning plan to make Gemini cheat on chess skills and help GCP marketing!
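To make the "oracle" point concrete: for chess, exact ground truth is cheap to obtain. Here is a sketch of what such an oracle check could look like, assuming the python-chess library and a locally installed Stockfish binary; the function name, search depth, and engine path are illustrative, not anything from the papers discussed:

```python
# Sketch of the "oracle" idea: python-chess gives exact legality, and a
# local Stockfish binary gives a ground-truth quality signal for any move
# an LLM proposes. Function name, depth, and engine path are illustrative.
import chess
import chess.engine

def score_llm_move(fen: str, san_move: str,
                   engine_path: str = "stockfish") -> int | None:
    """Return an engine evaluation (centipawns, from the mover's point of
    view) of the proposed move, or None if the move is illegal."""
    board = chess.Board(fen)
    try:
        move = board.parse_san(san_move)  # legality oracle: raises on illegal SAN
    except ValueError:
        return None
    board.push(move)

    engine = chess.engine.SimpleEngine.popen_uci(engine_path)
    try:
        info = engine.analyse(board, chess.engine.Limit(depth=12))
    finally:
        engine.quit()
    # The score is relative to the new side to move, so negate it to
    # grade the move that was just played.
    return -info["score"].relative.score(mate_score=100000)
```

Having this kind of exact supervision signal, rather than noisy human labels, is part of what makes chess attractive as a research testbed.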
Yup, I remember reading your comment, and it made the most sense to me.
OpenAI just shifted their training targets: initially they thought chess was cool; maybe tomorrow they'll think Go is cool, or the ability to write poetry. Who knows.
But it seems like the simplest explanation and makes the most sense.
Yes, and I would like this approach to be used in other, more practical areas as well. I mean weighting the training data toward "expert" content over "amateur" content, regardless of the area of expertise.