Does anyone know how much spoilage is in these datasets? Common Crawl has a lot of websites in it, including Reddit and Stack*. I'm certain there are lots of benchmark questions in those sites, and we want to differentiate recall from problem solving (the two are often confused). I have a deep distrust of large datasets like this, given that a common one, with 60 authors, assumed that writing leetcode-style programs by hand meant they wouldn't appear in the training data (GitHub), and didn't even bother to check. It's really hard to sanitize datasets of this size, and deduplication is a much harder task than many realize.
>To avoid benchmark contamination, we follow Guo et al. (2024) to filter out web pages containing questions or answers from English mathematical benchmarks such as GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021) and Chinese benchmarks such as CMATH (Wei et al., 2023) and AGIEval (Zhong et al., 2023). The filtering criteria are as follows: any text segment containing a 10-gram string that matches exactly with any sub-string from the evaluation benchmarks is removed from our math training corpus. For benchmark texts that are shorter than 10 grams but have at least 3 grams, we employ exact matching to filter out contaminated web pages.
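For anyone skimming, here's roughly what that kind of filter amounts to. This is my own sketch, not their code; `benchmark_texts` and `web_pages` are placeholder variables:

    def ngrams(text, n=10):
        # whitespace tokenization; returns the set of all n-token windows
        toks = text.split()
        return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    banned = set()
    for item in benchmark_texts:   # GSM8K, MATH, CMATH, AGIEval, ...
        banned |= ngrams(item)     # (items shorter than 10 tokens get exact matching instead)

    clean_corpus = [page for page in web_pages if not (ngrams(page) & banned)]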
However, benchmark decontamination is difficult, and n-gram matching is often insufficient. See https://arxiv.org/pdf/2311.04850.pdf for some examples of how this approach can fail.
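To make the failure mode concrete: a light paraphrase of a benchmark item shares no exact 10-gram with the original, so a filter like the sketch above never flags it. (The GSM8K-style question below is just for illustration, reusing the `ngrams` helper from above.)

    original = ("Natalia sold clips to 48 of her friends in April, "
                "and then she sold half as many clips in May.")
    paraphrase = ("In April, Natalia sold clips to 48 friends; "
                  "she then sold half as many in May.")
    print(ngrams(original) & ngrams(paraphrase))   # set() -> no overlap, page is kept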
In general, if a benchmark is available online before a model's dataset is collected, I put very little stock in that model's performance on that benchmark. It's just too hard to know what's a true improvement and what's contamination. That's especially true for a paper like this one, which specifically hunts down MATH-like data.
So I think we're in agreement, and I find very little discussion about this within the community (being a researcher myself). This wouldn't particularly bug me if we were saying that the measurements do not distinguish recall from generalization, but I find that the discussion is always about generalization and AGI, leading to a very confused public.
Unfortunately I'm just not aware of any metric that can adequately quantify meaningful similarity between data points. Curse of dimensionality, I suppose. Personally I try not to lean too hard on benchmark results, not only because of the aforementioned spoilage but because of metric limitations as well. I think our progress has outpaced our ability to properly measure it, and it feels like we've only become more reliant on benchmarks rather than more nuanced in our evaluations (am I alone in this?). I wonder if this will create a stall or plateau (or even a reversal) in practical performance as our measurements become less meaningful while quality increases. I'm in vision, so a good example is how common it is to think that the L2 distance between features from a norm layer of a classification network (even one better than InceptionNet) is an accurate measurement of visual fidelity. Or to think we have such metrics even in special cases (I guess PSNR or SSIM are closest, but those are more accurately described as reconstruction quality).
Btw, I think you might like the second paper I linked. It's a Meta/Stanford paper and mostly deals with vision (LAION), but a bit with C4 as well. The short of it is that they can prune about 40% of LAION and still get good "Zeroshot" ImageNet accuracy. I actually found the results for random pruning quite enlightening, especially around all the toy datasets (Fig A4).
Zeroshot in quotes because it's pretty dubious to call ImageNet out of distribution (same with COCO) when a model is trained on LAION, considering all the classes are in there (at least an abstracted version of each class, since LAION is more specific; i.e. ImageNet _distribution_ ⊂ LAION _distribution_).
Another pet peeve of mine is arxiv links direct to PDF ;)
Some have a lot, and those models are often ignored (except by lay people or hobbyists, which is a different problem), but many serious submissions from serious groups for benchmarks like this check for contamination specifically to avoid the problem you're suggesting. The decontamination process has been outlined by many groups, so you can often check it out.
So, I am an ML researcher. Note that part of my comment is about how difficult it actually is to ensure a lack of spoilage. The second paper I linked is actually a pretty good proof of this, though I'll say I wish they had been more explicit about how random pruning significantly improves results, because that is quite a result in and of itself, given that the datasets they look at are already filtered. Dedupe is fucking hard. So I'm not looking for a handwavy "trust me"; I'm looking for the explicit vetting processes applied to these specific datasets. It's incredibly important to know the limits of your tools, and that includes datasets (as well as metrics).
As my sibling comment notes, their decontamination process was outlined in the paper and you can reproduce it, though it may not be sufficient. That was my initial point: I wasn't giving you handwavy trust claims, I was saying you'd likely find it in the paper.
Sure, but your sibling is essentially in agreement with what I said about the difficulties of deduplication, and this is in direct contention with your initial response to me:
> many serious submissions from serious groups for benchmarks like this check for contamination to specifically avoid the problem you’re suggesting
This is despite my explicitly linking a serious work (HumanEval is still a common dataset to use [0]) and the second paper demonstrating that roughly half of LAION consists of dupes, since they show that hashing and exact dedupe aren't a great way to actually get the desired outcome.
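To be concrete about what "hashing and exact dedupe" misses: exact methods only catch byte-identical copies, whereas near-duplicates have to be caught in something like embedding space. A rough sketch of the latter, with `embed` standing in for whatever encoder you use (this is illustrative, not that paper's actual pipeline):

    import numpy as np

    def near_duplicate_pairs(items, embed, threshold=0.95):
        # cosine similarity over normalized embeddings; anything above the
        # threshold is treated as a semantic duplicate even if the bytes differ
        vecs = np.stack([embed(x) for x in items]).astype(np.float64)
        vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
        sims = vecs @ vecs.T
        n = len(items)
        return [(i, j) for i in range(n) for j in range(i + 1, n)
                if sims[i, j] >= threshold]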
I'm not sure I actually buy the strong claim you make about "serious groups", considering what we find in works like GPT-4:
> We measure cross-contamination between our evaluation dataset and the pre-training data using substring match. Both evaluation and training data are processed by removing all spaces and symbols keeping only characters (including numbers). For each evaluation example, we randomly select three substrings of 50 characters (or use the entire example if it’s less than 50 characters). A match is identified if any of the three sampled evaluation substrings is a substring of the processed training example. This yields a list of contaminated examples. We discard these and rerun to get uncontaminated scores.
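Spelled out, that procedure is roughly the following (my reconstruction of the quoted description, not OpenAI's actual code):

    import random, re

    def normalize(text):
        # drop spaces and symbols, keep only letters and digits
        return re.sub(r"[^A-Za-z0-9]", "", text)

    def flagged_as_contaminated(eval_example, train_example, k=3, length=50):
        ev, tr = normalize(eval_example), normalize(train_example)
        if len(ev) < length:
            return ev in tr                      # whole example if it's short
        starts = [random.randrange(len(ev) - length + 1) for _ in range(k)]
        return any(ev[s:s + length] in tr for s in starts)

Three randomly placed 50-character windows per evaluation example is the entire check.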
This does not seem like a serious decontamination effort, and it should not give confidence that the data is actually properly deduplicated/decontaminated. Especially when it is followed by:
> The RLHF post-training dataset is vastly smaller than the pretraining set and unlikely to have any particular question contaminated. However we did not check explicitly.
Not to mention the explicit reasons given by your sibling and in my original comment. So I hope this clarifies what I mean by "handwavy trust claims", especially when I'm asking for explicit information about how this particular niche tends to do things. I'm left unconvinced that there is any model which doesn't include significant amounts of contamination.
But I'll give credit to the authors of this work, as there is a comment in the main thread (@deepseekfake) that acknowledges the spoilage. Though this is not mentioned in the repo, and there is, as you may notice, no paper linked.
[0] My gripe with HumanEval is that it is mostly an OpenAI work, and the claim is that there is no contamination because they wrote the questions by hand. But you can literally take snippets from the dataset, paste them into GitHub search, limit the search to before 2021, and find results.
Are these exact? No. Are they the same? Yes. Granted, GitHub search isn't very good, but I didn't write the paper and this was only a very quick search. I think we should be unsurprised that we get dupes given the nature of the questions, and anyone who writes a line like that and thinks no one else ever wrote the exact same line is fooling themselves. If you cared about variable names, you'd give them something uniquely weird to ensure this. But if we're talking about semantic similarity (which we should be, considering the augmentations that go into training data), then yeah, I think it is worth questioning someone's legitimacy if they think snippets like these are nowhere to be found on GitHub: "return number % 1.0 ", "return [x for x in strings if substring in x]", "return ' '.join([str(x) for x in range(n + 1)])", "return len(set(string.lower()))", "return len(string)", "for i in reversed(range(n)): if n % i == 0: return i", "return ''.join(strings)". I'm not even really cherry picking, other than looking for short answers so I can easily post here.
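To make the "not exact, but the same" point concrete: exact string dedupe only catches identical bytes, but a trivial identifier-renaming pass already makes snippets like these collide. A quick sketch (mine, purely illustrative):

    import ast

    def normalized(source: str) -> str:
        # rename every function, argument, and variable name consistently,
        # so snippets differing only in naming normalize to the same string
        tree, names = ast.parse(source), {}
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef):
                node.name = names.setdefault(node.name, f"v{len(names)}")
            elif isinstance(node, ast.arg):
                node.arg = names.setdefault(node.arg, f"v{len(names)}")
            elif isinstance(node, ast.Name):
                node.id = names.setdefault(node.id, f"v{len(names)}")
        return ast.unparse(tree)

    a = "def f(string): return len(string)"
    b = "def g(s): return len(s)"
    print(a == b, normalized(a) == normalized(b))   # False True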
It isn't in direct contention; I'm not certain you're reading my original message correctly. I never made any claim about difficulty, though you're welcome to show me where I did; that is your insertion. I said that serious groups do it and make their methodology available. You're welcome to reproduce these things yourself to find fault, which you clearly do.
I will say I didn't intend to debate this point. People do try to solve the problem. It may not be satisfactory or meet your standards, and I can't do much for you there. Sorry and good luck.
Yeah, I'm not going to make claims about what you did or didn't intend to say. Miscommunication happens. Sorry about that.
How I read your original message is that I do not need to worry about contamination because the big labs already account for this and their methods are specified in the papers. I was dissatisfied with this response because I thought I had demonstrated working knowledge of these methods through my mention of the contents of the training data, as well as provided links to two relevant papers: one I was criticizing because its decontamination method is far more naive than one might expect, and the other a work showing a significant amount of contamination.
I understand, and sorry as well. I was being casual with my initial comment, so if I implied that it's a solved problem I didn't mean to. I know I'm replying kind of late to this, but I appreciate your response and us both stepping back to collect ourselves.
- In any way that violates any applicable national or international law or regulation or infringes upon the lawful rights and interests of any third party;
- For military use in any way;
- For the purpose of exploiting, harming or attempting to exploit or harm minors in any way;
- To generate or disseminate verifiably false information and/or content with the purpose of harming others;
- To generate or disseminate inappropriate content subject to applicable regulatory requirements;
- To generate or disseminate personal identifiable information without due authorization or for unreasonable use;
- To defame, disparage or otherwise harass others;
- For fully automated decision making that adversely impacts an individual’s legal rights or otherwise creates or modifies a binding, enforceable obligation;
- For any use intended to or which has the effect of discriminating against or harming individuals or groups based on online or offline social behavior or known or predicted personal or personality characteristics;
- To exploit any of the vulnerabilities of a specific group of persons based on their age, social, physical or mental characteristics, in order to materially distort the behavior of a person pertaining to that group in a manner that causes or is likely to cause that person or another person physical or psychological harm;
- For any use intended to or which has the effect of discriminating against individuals or groups based on legally protected characteristics or categories.
You either wouldn't be liable anyway (you're not responsible for what people use it for) or would still be held liable (you took no measures to prevent malicious use).
Who would qualify as a user that seeks to violate applicable laws and yet is somehow identified as an official part of some legally recognized military? Furthermore, how would anybody know?
As a dumb Army guy, if I were doing military research I would just keep it on my private military internet, which, as far as non-military users are concerned, does not exist.
That's good to know, and better to admit. It earns a lot of respect, at least from me. Recall is still a pretty useful task. I just wish more were less afraid to admit spoilage.
There's a good chance that's also true for GPT-4, given how they train. Without evals that are known to be completely new, it's hard to say that any LLM benchmark results aren't leakage.
This is most certainly true. If you look back at my comment and the discussion from the main thread, I have two quotes from the GPT-4 technical paper:
> We measure cross-contamination between our evaluation dataset and the pre-training data using substring match. Both evaluation and training data are processed by removing all spaces and symbols keeping only characters (including numbers). For each evaluation example, we randomly select three substrings of 50 characters (or use the entire example if it’s less than 50 characters). A match is identified if any of the three sampled evaluation substrings is a substring of the processed training example. This yields a list of contaminated examples. We discard these and rerun to get uncontaminated scores.
> The RLHF post-training dataset is vastly smaller than the pretraining set and unlikely to have any particular question contaminated. However we did not check explicitly.
These do not do much to build confidence that OpenAI is free of spoilage. Given what we know about dedupe (even as of early 2023), this is not enough to purge contamination. Exact string matching has been the de facto method for quite some time, and for quite some time we've known that it has issues. It's just that 5 years ago those issues weren't as critical as they are today, because performance was much lower back then.
I am not that versed on this topic, but I am curious: what would be the biggest impact of leakage/spoilage on LLM performance? Is it similar to overfitting?
Yes, it'll generally lead to overfitting. This will look a lot like memorization, btw. And just as an FYI, you can avoid divergence on the train/test split (the usual check) and still overfit. Divergence is an obvious signal, but there are many ways for a model to overfit. As far as I'm aware, all giant LLMs and image generators show signs of overfitting. But note that sometimes this can be helpful. Obviously these tools are still useful, so it is more about where they break than anything.
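If you want to eyeball the memorization side of this yourself, a common back-of-the-envelope probe is to feed a model the first part of a benchmark item and check whether it reproduces the rest near-verbatim. A rough sketch, where `model_generate` is a placeholder for whatever inference call you have; a hit suggests leakage, it doesn't prove it:

    from difflib import SequenceMatcher

    def looks_memorized(model_generate, benchmark_text, prefix_frac=0.5, threshold=0.9):
        cut = int(len(benchmark_text) * prefix_frac)
        prefix, reference = benchmark_text[:cut], benchmark_text[cut:]
        completion = model_generate(prefix)[:len(reference)]
        # near-verbatim continuation of held-out benchmark text is a red flag
        return SequenceMatcher(None, completion, reference).ratio() >= threshold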
If you're trying to prove the model has reasoning abilities, ask it the question in a language other than English; even better, give it multiple sentences in different languages and tell it to answer the question without first translating the sentences.
That's not a great metric and is going to be incredibly language dependent. For example, the European languages all have a lot of similarities, so it should be unsurprising that a model trained on English can do pretty well on French and German. But if you ask in a language that is fairly disjoint (say Chinese), then you're held back by the lack of data for that language in the dataset (or you run into the exact same issue as before).
It's definitely a metric worth trying, but we must also recognize its limits. Evaluation is quite difficult, and the better our models perform, the more difficult evaluation actually becomes. Anyone saying otherwise is likely trying to sell you something.
I think this is self-conflicting. If the evaluation is proprietary, then it is most certainly not reputable. We'd want open metrics where we can analyze the limitations. Of course, we'd need open data too, but that's exceptionally rare these days. Plus, a metric isn't going to really tell us whether we have spoilage or not. You can get some evidence of spoilage through a trained model, but it is less direct, fuzzier, and tells us more about what information the model was able to memorize than about whether the data was spoiled.
i don’t buy that premise. in practice we’re seeing a lot of evidence that you can’t trust the open evals because of contamination (maybe accidental, though there’s definitely incentive to cheat and move up the leaderboards).
closed/subjective ranking and evaluation has been around since there were critics. yes it’s hard to bootstrap trust, but i can’t see a way around it because the open evals can’t really be trusted either.
I find this argument weird. I'm not saying you can trust the open evals; I'm just saying you can know their limits. With closed evals you're a lot more blind.
These are actually interesting results, in the sense that we see the limits of an LLM's ability to memorize complicated information correctly. Gemini Ultra also reported around 50% accuracy.
https://arxiv.org/abs/2107.03374
https://arxiv.org/abs/2303.09540