This is most certainly true. If you look back to my comment and the discussion from the main thread, I have two quotes from the GPT-4 technical report:
> We measure cross-contamination between our evaluation dataset and the pre-training data using substring match. Both evaluation and training data are processed by removing all spaces and symbols keeping only characters (including numbers). For each evaluation example, we randomly select three substrings of 50 characters (or use the entire example if it’s less than 50 characters). A match is identified if any of the three sampled evaluation substrings is a substring of the processed training example. This yields a list of contaminated examples. We discard these and rerun to get uncontaminated scores.
> The RLHF post-training dataset is vastly smaller than the pretraining set and unlikely to have any particular question contaminated. However we did not check explicitly.
These don't do much to build confidence that OpenAI's evals are free of spoilage. Given what we know about dedupe pipelines (even as of early 2023), this is not enough to purge contamination. Exact substring matching has been the de facto method for quite some time, and for just as long we've known it has issues; it's just that 5 years ago those issues weren't as critical as they are today, because performance was much lower back then.
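To make the excerpt concrete, here is a minimal sketch of the check as the report describes it (the function names and the toy example are mine, not OpenAI's). The point is that a lightly paraphrased copy of an eval item sails straight past an exact substring match:

```python
import random
import re

def normalize(text: str) -> str:
    # Keep only alphanumeric characters, as described in the report excerpt.
    return re.sub(r"[^A-Za-z0-9]", "", text)

def sample_substrings(example: str, n: int = 3, length: int = 50) -> list[str]:
    # Use the whole example if it's shorter than the substring length.
    if len(example) <= length:
        return [example]
    starts = random.sample(range(len(example) - length + 1),
                           k=min(n, len(example) - length + 1))
    return [example[s:s + length] for s in starts]

def is_contaminated(eval_example: str, train_doc: str) -> bool:
    # Flag the eval example if ANY sampled substring appears verbatim in the training doc.
    doc = normalize(train_doc)
    return any(sub in doc for sub in sample_substrings(normalize(eval_example)))

# A reworded copy of the eval question slips past the check, which is the concern above.
eval_q = "What is the capital of France? Answer: Paris."
train_doc = "Q: what's the capital city of France?? A - Paris!"
print(is_contaminated(eval_q, train_doc))  # False, despite obvious semantic overlap
```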
I am not that versed on this topic, but I'm curious: what would be the biggest impact of leakage/spoilage on LLM performance? Is it similar to overfitting?
Yes, it'll generally lead to overfitting, and it will look a lot like memorization. And just an fyi, you can have train/test loss curves that never diverge (the usual check) and still overfit; divergence is an obvious signal, but there are many other ways for a model to overfit. As far as I'm aware, all giant LLMs and image generators show signs of overfitting. Note that sometimes this can even be helpful, and obviously these tools are still useful, so it's more about where they break than anything else.
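One way the memorization shows up in practice: feed the model the first part of a benchmark item and see how much of the held-out remainder it reproduces verbatim. A rough sketch of that probe (the `generate` callable here is a stand-in for whatever inference API you use, not a real library call, and the thresholds are arbitrary):

```python
def looks_memorized(generate, benchmark_items, prefix_frac=0.5, min_overlap=0.8):
    """Crude memorization probe: prompt with the first part of each benchmark item
    and measure how much of the held-out remainder comes back verbatim.
    `generate(prompt) -> str` is a placeholder for your inference call."""
    flagged = []
    for item in benchmark_items:
        cut = int(len(item) * prefix_frac)
        prefix, rest = item[:cut], item[cut:]
        completion = generate(prefix)[:len(rest)]
        # Character-level overlap is a blunt instrument, but enough to raise a flag.
        matches = sum(a == b for a, b in zip(completion, rest))
        if rest and matches / len(rest) >= min_overlap:
            flagged.append(item)
    return flagged
```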