
I have spoken to team members, and they all say the results of this and coder are very, very much leakage (no surprise given the result!!)



That's good to know, and even better that they admit it. Earns a lot of respect, at least from me. Recall is still a pretty useful task. I just wish more people were less afraid to admit spoilage.


There's a good chance that's also true for GPT-4, given how they train. Without evals that are known to be completely new, it's hard to say that any LLM benchmark result isn't leakage.


This is most certainly true. If you look back at my comment and the discussion in the main thread, I have two quotes from the GPT-4 technical report:

> We measure cross-contamination between our evaluation dataset and the pre-training data using substring match. Both evaluation and training data are processed by removing all spaces and symbols keeping only characters (including numbers). For each evaluation example, we randomly select three substrings of 50 characters (or use the entire example if it’s less than 50 characters). A match is identified if any of the three sampled evaluation substrings is a substring of the processed training example. This yields a list of contaminated examples. We discard these and rerun to get uncontaminated scores.

> The RLHF post-training dataset is vastly smaller than the pretraining set and unlikely to have any particular question contaminated. However we did not check explicitly.

These don't do much to build confidence that OpenAI is free of spoilage. Given what we knew about deduplication even in early 2023, this is not enough to purge contamination. Exact substring matching has been the de facto method for quite some time, and for quite some time we've known it has issues; it's just that five years ago those issues weren't as critical as they are today, because performance was much lower back then.
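
To make the quoted procedure concrete, here's a rough sketch of that kind of check (my own reconstruction in Python, not OpenAI's actual code). The obvious weakness is that a paraphrased, translated, or lightly reformatted benchmark item sails straight through an exact substring match:

    import random

    def normalize(text):
        # Keep only letters and digits, dropping spaces and symbols,
        # roughly as described in the GPT-4 technical report.
        return "".join(ch for ch in text if ch.isalnum())

    def is_contaminated(eval_example, training_docs, n_probes=3, probe_len=50):
        # Flag the eval example if any of three random 50-character substrings
        # of its normalized text appears verbatim in a normalized training doc.
        ev = normalize(eval_example)
        if len(ev) <= probe_len:
            probes = [ev]
        else:
            n_starts = len(ev) - probe_len + 1
            starts = random.sample(range(n_starts), min(n_probes, n_starts))
            probes = [ev[s:s + probe_len] for s in starts]
        return any(p in normalize(doc) for doc in training_docs for p in probes)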


I am not that versed on this topic, but I am curious: what would be the biggest impact of leakage/spoilage on LLM performance? Is it similar to overfitting?


Yes, it'll generally lead to overfitting, which in practice looks a lot like memorization. And just an FYI: the train/test loss curves can stay close together (the check most people rely on) and the model can still be overfit. Divergence is an obvious signal, but there are many ways for a model to overfit. As far as I'm aware, all the giant LLMs and image generators show signs of overfitting. Note that sometimes this can even be helpful. These tools are clearly still useful, so it's more about where they break than anything else.
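
As a sketch of what "looking for memorization" can mean in practice (my own illustration using the Hugging Face transformers API and a stand-in model, not any particular benchmark's methodology): feed the model the first half of each eval item and count how often it reproduces the second half verbatim.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # stand-in; substitute whichever model you want to probe
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    def verbatim_completion_rate(items):
        # Prompt with the first half of each item and check whether the greedy
        # continuation starts with the held-out second half. High rates suggest
        # the item (or near-copies of it) sat in the training data.
        hits = 0
        for text in items:
            prefix, suffix = text[: len(text) // 2], text[len(text) // 2 :]
            inputs = tok(prefix, return_tensors="pt")
            out = model.generate(
                **inputs,
                max_new_tokens=len(tok(suffix)["input_ids"]),
                do_sample=False,  # greedy decoding: we only care about the argmax path
            )
            completion = tok.decode(out[0][inputs["input_ids"].shape[1]:],
                                    skip_special_tokens=True)
            hits += completion.strip().startswith(suffix.strip()[:40])
        return hits / len(items)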


If you're trying to prove the model has reasoning abilities, ask it the question in a language other than English; even better, give it multiple sentences in different languages and tell it to answer the question without first translating the sentences.
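
Something like this, for instance (a hypothetical sketch using the OpenAI Python client; the example sentences, question, and model name are just placeholders):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Facts are given in three different languages and the model is told to
    # answer without translating first, so plain recall of an English
    # benchmark item shouldn't be enough.
    prompt = (
        "Do not translate the sentences below; answer the question directly.\n"
        "1. Der Zug fährt um 14 Uhr von Gleis 3 ab. (German)\n"
        "2. El viaje dura dos horas y media. (Spanish)\n"
        "3. 終点は中央駅です。 (Japanese)\n"
        "Question: At what time does the train arrive at its final stop, "
        "and what is that stop called?"
    )

    resp = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    print(resp.choices[0].message.content)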


That's not a great metric, and it is going to be incredibly language dependent. For example, the European languages share a lot of similarities, so it should be unsurprising that a model trained on English can do pretty well in French and German. But if you ask it in a language that is fairly disjoint (say, Chinese), then you are held back by the lack of data for that language (or you run into exactly the same issue as before).

It's definitely a metric worth trying, but we also have to recognize its extreme limits. Evaluation is quite difficult, and the better our models perform, the more difficult evaluation becomes. Anyone saying otherwise is likely trying to sell you something.


yeah, that was my first thought on seeing the result too. we need a reputable proprietary eval.


> reputable proprietary eval

I think this is self-conflicting. If the evaluation is proprietary, then it is most certainly not reputable. We'd want open metrics where we can analyze the limitations. Of course, we'd need open data too, but that's exceptionally rare these days. Plus, a metric isn't going to really tell us whether we have spoilage or not. You can get some evidence of spoilage from a trained model, but it's less direct, fuzzier, and tells us more about what information the model was able to memorize than about whether the data was spoiled.


i don’t buy that premise. in practice we’re seeing a lot of evidence that you can’t trust the open evals because of contamination (maybe accidental, though there’s definitely incentive to cheat and move up the leaderboards).

closed/subjective ranking and evaluation has been around for as long as there have been critics. yes, it's hard to bootstrap trust, but i can't see a way around it, because the open evals can't really be trusted either.


I find this argument weird. I'm not saying you can trust the open evals, I'm just saying you can know their limits. With closed evals, you're a lot more blind.


These are actually interesting results, in the sense that we see the limits of an LLM's ability to memorize complicated information correctly. Gemini Ultra also reported around 50% accuracy.


I think the SOTA is GPT-4 + tool use? I heard near 80%.


Yes, tools help push past LLM limitations. GPT-4 without tools is at about 50% too.


So you created this account just to make this comment.


Why haven't they updated the GitHub page?



