> The paper’s repo does minimal processing on the datasets. It turns out that these problems exist in the source Huggingface datasets. The two worst ones can be checked quickly using only Huggingface’s datasets.load_dataset:
I'm really surprised HuggingFace isn't doing filtering/evaluation of the datasets they're presenting. This ought to be a simple check for them.
Is there a feature in HF's datasets platform that makes load_dataset throw an exception if you try to load a known-dubious dataset, unless you explicitly pass a kwarg like allow_dubious=True? If not, that might be a boon for the whole field... it might nip the propagation of false results at the outset.
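Nothing like that exists today as far as I know, but a thin wrapper is roughly what I have in mind. This is just a sketch: the denylist and the allow_dubious kwarg are made up here, not part of the real datasets API.

```python
from datasets import load_dataset

# Hypothetical denylist: dataset id -> reason it was flagged.
DUBIOUS = {
    "some/flagged-dataset": "test split duplicates the train split",
}

def load_dataset_checked(path, *args, allow_dubious=False, **kwargs):
    """Refuse to load a known-dubious dataset unless explicitly overridden."""
    if path in DUBIOUS and not allow_dubious:
        raise ValueError(
            f"{path} is flagged as dubious ({DUBIOUS[path]}); "
            "pass allow_dubious=True to load it anyway."
        )
    return load_dataset(path, *args, **kwargs)
```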
That's a tall order. The cases here are simple and fairly obvious, but that doesn't scale. It can also be problematic if an official dataset contains the error, because then HF has effectively created a different dataset. They host 48,627 datasets. Their goal is not to validate datasets (which is far harder than checking for dupes, and that's not easy btw), but to be like GitHub, so that others (like Ken) can review their peers' work and check for mistakes. Because of that, HF has to allow uploads of arbitrary datasets; they can't be the arbiter of what is good or bad, since that depends on what's being solved. They could probably set a flag (and maybe even compute some statistics!) for datasets under a few gigs in size, but they cannot and should not filter them.
I appreciate there is nuance, and some checks would be computationally expensive, but something like training data and evaluation data being literally identical seems pretty straightforward to check for, and a very cheap quick rejection, as sketched below.
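For that specific case the check really is a few lines. A sketch, assuming the rows are plain text fields; the dataset id is just a placeholder:

```python
from datasets import load_dataset

# Placeholder dataset id; substitute the dataset you want to audit.
ds = load_dataset("user/some-text-dataset")

def row_key(example):
    # Exact-match key over the raw fields; only catches literal duplicates.
    return tuple(str(v) for v in example.values())

train_keys = {row_key(ex) for ex in ds["train"]}
test_keys = {row_key(ex) for ex in ds["test"]}

overlap = train_keys & test_keys
print(f"{len(overlap)} of {len(test_keys)} test rows also appear in train")
if test_keys <= train_keys:
    print("The entire test split is contained in the train split.")
```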
For this case, yes. In the general sense, no. That's what I'm saying.
And to be clear, this is actually a common problem, not an uncommon one. Here's a bit more on why. In general, can you tell me how to identify duplicates in my dataset? Ken's methods only work under certain assumptions. The Filipino test only works because there is an exact match; it would not work if one split were a subset of the other. Kinnews does a bit better, but also assumes precise matches.

It's also important to remember that these are not very large datasets. Filipino is <1 MB and Kinnews is ~5 MB (the configuration used). MNIST is twice that size, and the images aren't as trivially hashable as text rows. So now we've got a double for loop: each test image (10k) needs to be compared against each train image (60k). Granted, both loops are trivially parallelizable, but I wanted an estimate, and it took about 15 minutes (serially) and some 4 GB to compute. You can do much better, but that scale is going to eat you up, and we're only talking about 784 dims. CIFAR-10, which is still small, is 3,072 dims (almost 4x). ImageNet-1k is ~200k dims (over a million train images and 100k test images), a 512x512 image is 786k, and 1024x1024 is 3.1M.
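For concreteness, the brute-force MNIST check looks something like this. A sketch only: numpy vectorizes the inner loop, but the work is still the full 10k x 60k exact-pixel comparison described above.

```python
import numpy as np
from datasets import load_dataset

# Load MNIST from the hub and flatten every image to a 784-dim uint8 vector.
mnist = load_dataset("mnist")
train = np.stack([np.asarray(img, dtype=np.uint8).ravel() for img in mnist["train"]["image"]])
test = np.stack([np.asarray(img, dtype=np.uint8).ravel() for img in mnist["test"]["image"]])

# Naive check: every test image against every train image, exact pixels only.
dupes = 0
for t in test:
    if (train == t).all(axis=1).any():
        dupes += 1

print(f"{dupes} of {len(test)} test images are exact pixel-level copies of a train image")
```

You can do much better on the exact-match case (hash or sort the raw bytes and it's roughly linear), but none of that survives the moment "duplicate" stops meaning "bit-identical".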
So what do you do? A probabilistic method like a Bloom filter? What about semantically similar data points? How do we even define that? It's still an open problem[0]. Is this image[1] the same as this image[2]? What about this one[3]? I grabbed these with clip-retrieval using the query "United Nations logo"[4], which searches the LAION-5B dataset. You can also explore COCO[5], and mind you, people use COCO and ImageNet as "zero-shot" benchmarks for models trained on LAION.
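To give a flavour of the fuzzy-matching side, here's about the simplest perceptual hash there is, an average hash. A sketch, not a recommendation, and the filenames are placeholders: it will call some of those UN-logo variants "the same" and others "different", and there's no principled threshold that gets it right.

```python
import numpy as np
from PIL import Image

def average_hash(path, hash_size=8):
    """Tiny perceptual hash: downscale, grayscale, threshold at the mean."""
    img = Image.open(path).convert("L").resize((hash_size, hash_size))
    pixels = np.asarray(img, dtype=np.float32)
    return (pixels > pixels.mean()).flatten()

def hamming(a, b):
    return int(np.count_nonzero(a != b))

# "Same-looking" images can land anywhere from 0 to ~20 bits apart,
# and unrelated images can collide. Picking the cutoff is the hard part.
h1 = average_hash("un_logo_a.png")
h2 = average_hash("un_logo_b.png")
print(f"Hamming distance: {hamming(h1, h2)} / 64")
```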
These images won't be exact matches, and they aren't easy to filter out. Hell, getting pixel-perfect matches out of the same rendering pipeline is a well-known graphics problem; it's the basis of Canvas Fingerprinting. The silicon lottery plays a big role in those differences, so two people scraping the web on different machines can grab the exact same image and still end up with copies that don't match pixel for pixel.
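You can see the "same image, different bytes" effect without even involving different machines: re-encode a JPEG once and compare pixels. A sketch, with a placeholder filename:

```python
import io
import numpy as np
from PIL import Image

# Placeholder filename: any JPEG pulled off the web will do.
original = Image.open("scraped_image.jpg").convert("RGB")

# Re-encode it once, the way a second scraper, CDN, or thumbnailer might.
buf = io.BytesIO()
original.save(buf, format="JPEG", quality=90)
buf.seek(0)
reencoded = Image.open(buf).convert("RGB")

a = np.asarray(original, dtype=np.int16)
b = np.asarray(reencoded, dtype=np.int16)
diff = np.abs(a - b)
print(f"max per-pixel difference: {diff.max()}, pixels changed: {(diff > 0).mean():.1%}")
```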
I know that this problem looks easy at face value, but what I'm trying to tell you is that it is actually incredibly complex. The devil lives in the details. And like I said, they could vet for exact duplication on small datasets, but that only goes so far. This is a nasty problem, and there's a reason people are so critical of LLMs: you can bet there's test set spoilage. Anyone saying there isn't is either lying or ignorant.
Do not fool yourself into thinking a problem is easier than it is. You'll get burned.