Possibly dumb question: How do you ensure there's no data leakage when benchmarking transfer learning techniques? Is that even a problem anymore when the whole point is to learn "common sense" knowledge?
For example, their “Colossal Clean Crawled Corpus” (C4), a dataset consisting of hundreds of gigabytes of clean English text scraped from the web, might contain much of the same information as the benchmark datasets, which I presume are also scraped from the web.
Hi, one of the paper authors here. Indeed this is a good question. A couple of comments:
- Common Crawl overall is a sparse web dump, so it is unlikely that the month we used includes any of the data that appear in any of the test sets.
- In order for the data to be useful to our model, it would have to be in the correct preprocessed format ("mnli: hypothesis: ... premise: ...") with the label in a format our model could extract meaning from. We introduced this preprocessing format, so I don't believe this would ever happen. (See the short sketch after this list.)
- Further, most of these datasets live in .zip files. The Common Crawl dump doesn't unzip zip files.
- C4 is so large that our model sees each example (corresponding to a block of text from a website) roughly once over the entire course of training. Big neural nets trained with SGD are unlikely to memorize something if they only see it once over the course of one million training steps.
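To make the second point concrete, here is a minimal sketch of what that text-to-text serialization looks like. The task prefix and field order mirror the format quoted above ("mnli: hypothesis: ... premise: ..."); the helper name and exact punctuation are illustrative, not the paper's actual preprocessing code. Raw web pages essentially never contain text in this serialized input/target form, which is why leakage via C4 would have to be a very strange coincidence.

```python
# Illustrative sketch of the text-to-text serialization described above.
# Not the paper's actual preprocessing code; prefix/punctuation may differ.

def serialize_mnli(hypothesis: str, premise: str, label: str) -> tuple[str, str]:
    """Turn one MNLI example into an (input text, target text) pair."""
    inputs = f"mnli: hypothesis: {hypothesis} premise: {premise}"
    target = label  # e.g. "entailment", "neutral", or "contradiction"
    return inputs, target

example_in, example_out = serialize_mnli(
    hypothesis="The man is sleeping.",
    premise="A man is at a desk, typing.",
    label="contradiction",
)
# example_in  -> "mnli: hypothesis: The man is sleeping. premise: A man is at a desk, typing."
# example_out -> "contradiction"
```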
However, note that the dataset used to train GPT-2 is about 20x smaller than C4. I'm not 100% sure how many times the training set was repeated over the course of training for GPT-2, but it was likely many times. I stand by my statement (that memorization is unlikely with SGD and no repetition of training data), but I would be happy to be proven wrong.
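For intuition on the "seen roughly once" vs. "repeated many times" distinction, here is a back-of-envelope calculation. All numbers are illustrative placeholders (steps, batch size, sequence length, corpus size), not the exact figures from the T5 or GPT-2 papers; only the orders of magnitude matter.

```python
# Back-of-envelope: how many passes over the pre-training corpus?
# All numbers are illustrative, not the exact T5 / GPT-2 figures.

def epochs_over_corpus(steps: int, batch_size: int, seq_len: int,
                       corpus_tokens: float) -> float:
    """Approximate passes over the corpus = tokens seen / tokens available."""
    tokens_seen = steps * batch_size * seq_len
    return tokens_seen / corpus_tokens

# A C4-scale corpus (order of 10^11 tokens): roughly one pass,
# so each block of text is seen about once.
print(epochs_over_corpus(steps=1_000_000, batch_size=128, seq_len=512,
                         corpus_tokens=1e11))        # ~0.7 passes

# A corpus ~20x smaller with the same training budget: each example
# is repeated many times, which is where memorization becomes more plausible.
print(epochs_over_corpus(steps=1_000_000, batch_size=128, seq_len=512,
                         corpus_tokens=1e11 / 20))   # ~13 passes
```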
I think that this is a good question that I would also like to know the answer to. Additionally, are there other benchmarks or tests where this issue (possibly) presents itself?