
Possibly dumb question: How do you ensure there's no data leakage when benchmarking transfer learning techniques? Is that even a problem anymore when the whole point is to learn "common sense" knowledge?

For example, their “Colossal Clean Crawled Corpus” (C4), a dataset of hundreds of gigabytes of clean English text scraped from the web, might contain much of the same information as the benchmark datasets, which I presume are also scraped from the web.
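One way to check for this kind of overlap (not something from the paper, just a rough sketch; the function names and the 13-gram window are assumptions) is to index long n-grams from the pretraining corpus and flag any test examples that share one:

    def ngrams(text, n=13):
        tokens = text.lower().split()
        return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def build_corpus_index(corpus_docs, n=13):
        # union of all n-grams seen anywhere in the pretraining corpus
        index = set()
        for doc in corpus_docs:
            index |= ngrams(doc, n)
        return index

    def flag_contaminated(test_examples, corpus_index, n=13):
        # test examples sharing any long n-gram with the corpus are suspects
        return [ex for ex in test_examples if ngrams(ex, n) & corpus_index]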




Hi, one of the paper authors here. Indeed this is a good question. A few comments:

- Common Crawl is overall a sparse dump of the web, so it is unlikely that the single month we used includes any of the data that appear in the test sets.

- In order for the data to be useful to our model, it would have to appear in the exact preprocessed format we use ("mnli: hypothesis: ... premise: ..."), with the label in a form our model could extract meaning from. Since we introduced this preprocessing format, I don't believe this would ever happen. (A rough sketch of that formatting is below the list.)

- Further, most of these datasets live in .zip files. The Common Crawl dump doesn't unzip zip files.

- C4 is so large that our model sees each example (corresponding to a block of text from a website) roughly once, at most, over the entire course of training. Big neural nets trained with SGD are unlikely to memorize something if they only see it once over the course of one million training steps.
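For concreteness, here is a minimal sketch of the kind of text-to-text formatting mentioned in the second bullet; only the "mnli: hypothesis: ... premise: ..." pattern comes from the comment above, and the label string is purely illustrative:

    def mnli_to_text(example):
        # pack an MNLI example into a single text-to-text input string
        inputs = "mnli: hypothesis: {} premise: {}".format(
            example["hypothesis"], example["premise"])
        target = example["label"]  # e.g. "entailment"; illustrative only
        return inputs, target

    mnli_to_text({
        "hypothesis": "The cat is asleep.",
        "premise": "A cat is sleeping on the couch.",
        "label": "entailment",
    })
    # -> ("mnli: hypothesis: The cat is asleep. premise: A cat is sleeping on the couch.",
    #     "entailment")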


> Big neural nets trained with SGD are unlikely to memorize something if they only see it once over the course of one million training steps

I am not so sure about that. Have you seen this thread? https://www.reddit.com/r/MachineLearning/comments/dfky70/dis...

Apparently lots of sentence fragments were memorized by GPT-2 (including real-world URLs, entire conversations, usernames/emails, and other PII).


It actually can be more pernicious than that: https://arxiv.org/abs/1802.08232

However, note that the dataset used to train GPT-2 is about 20x smaller than C4. I'm not 100% sure how many times GPT-2's training set was repeated over the course of training, but it was likely many times. I stand by my statement (that memorization is unlikely with SGD and no repetition of training data), but I would be happy to be proven wrong.
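For anyone curious, the linked paper probes this with an exposure-style test: plant a canary string during training, then check how the model ranks the true secret against random alternatives. A hand-wavy sketch (sequence_log_prob is a stand-in for whatever scoring function your model exposes, not a real API):

    import random, string

    def fill_canary(template, secret):
        # template like "my code is ######"; "######" gets replaced by the secret
        return template.replace("######", secret)

    def memorization_rank(sequence_log_prob, template, true_secret, n_candidates=1000):
        true_score = sequence_log_prob(fill_canary(template, true_secret))
        random_scores = [
            sequence_log_prob(fill_canary(
                template, "".join(random.choices(string.digits, k=len(true_secret)))))
            for _ in range(n_candidates)
        ]
        # rank 1 means the model prefers the true secret over every random guess,
        # which is strong evidence of memorization
        return 1 + sum(score > true_score for score in random_scores)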


I think this is a good question, and I would also like to know the answer. Additionally, are there other benchmarks or tests where this issue (possibly) presents itself?


You don't. Even humans frequently leak information like this. It's just a consequence of having incomplete or incompletely analyzed information.


Not dumb at all and probably a major challenge when developing benchmarks.



