Possibly dumb question: How do you ensure there's no data leakage when benchmarking transfer learning techniques? Is that even a problem anymore when the whole point is to learn "common sense" knowledge?
For example, their “Colossal Clean Crawled Corpus” (C4), a dataset consisting of hundreds of gigabytes of clean English text scraped from the web, might contain much of the same information as the benchmark datasets, which I presume are also scraped from the web.
Hi, one of the paper authors here. Indeed this is a good question. A couple of comments:
- Common Crawl overall is a sparse web dump, so it is unlikely that the month we used includes any of the data that appear in any of the test sets.
- In order for the data to be useful to our model, it would have to be in the correct preprocessed format ("mnli: hypothesis: ... premise: ...") with the label in a format our model could extract meaning from. We introduced this preprocessing format, so I don't believe this would ever happen. (See the short sketch after this list.)
- Further, most of these datasets live in .zip files. The Common Crawl dump doesn't unzip zip files.
- C4 is so large that our model sees each example (corresponding to a block of text from a website) roughly once over the entire course of training. Big neural nets trained with SGD are unlikely to memorize something if they only see it once over the course of one million training steps.
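To make the second point concrete, here is a minimal sketch of what that text-to-text serialization looks like. The task prefix and field order mirror the format quoted above ("mnli: hypothesis: ... premise: ..."); the helper name and exact punctuation are illustrative, not the paper's actual preprocessing code. Raw web pages essentially never contain text in this serialized input/target form, which is why leakage via C4 would have to be a very strange coincidence.

```python
# Illustrative sketch of the text-to-text serialization described above.
# Not the paper's actual preprocessing code; prefix/punctuation may differ.

def serialize_mnli(hypothesis: str, premise: str, label: str) -> tuple[str, str]:
    """Turn one MNLI example into an (input text, target text) pair."""
    inputs = f"mnli: hypothesis: {hypothesis} premise: {premise}"
    target = label  # e.g. "entailment", "neutral", or "contradiction"
    return inputs, target

example_in, example_out = serialize_mnli(
    hypothesis="The man is sleeping.",
    premise="A man is at a desk, typing.",
    label="contradiction",
)
# example_in  -> "mnli: hypothesis: The man is sleeping. premise: A man is at a desk, typing."
# example_out -> "contradiction"
```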
However, note that the dataset used to train GPT-2 is about 20x smaller than C4. I'm not 100% sure how many times the training set was repeated over the course of training for GPT-2, but it was likely many times. I stand by my statement (that memorization is unlikely with SGD and no repetition of training data), but I would be happy to be proven wrong.
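For intuition on the "seen roughly once" vs. "repeated many times" distinction, here is a back-of-envelope calculation. All numbers are illustrative placeholders (steps, batch size, sequence length, corpus size), not the exact figures from the T5 or GPT-2 papers; only the orders of magnitude matter.

```python
# Back-of-envelope: how many passes over the pre-training corpus?
# All numbers are illustrative, not the exact T5 / GPT-2 figures.

def epochs_over_corpus(steps: int, batch_size: int, seq_len: int,
                       corpus_tokens: float) -> float:
    """Approximate passes over the corpus = tokens seen / tokens available."""
    tokens_seen = steps * batch_size * seq_len
    return tokens_seen / corpus_tokens

# A C4-scale corpus (order of 10^11 tokens): roughly one pass,
# so each block of text is seen about once.
print(epochs_over_corpus(steps=1_000_000, batch_size=128, seq_len=512,
                         corpus_tokens=1e11))        # ~0.7 passes

# A corpus ~20x smaller with the same training budget: each example
# is repeated many times, which is where memorization becomes more plausible.
print(epochs_over_corpus(steps=1_000_000, batch_size=128, seq_len=512,
                         corpus_tokens=1e11 / 20))   # ~13 passes
```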
I think that this is a good question that I would also like to know the answer to. Additionally, are there other benchmarks or tests where this issue (possibly) presents itself?