Hacker News

I appreciate there is nuance, and some checks would be computationally expensive, but something like training data and evaluation data being literally identical seems like it would be pretty straightforward to check for and a very simple quick rejection.



For this case, yes. In the general sense, no. That's what I'm saying.

And to be clear, this is actually a common problem, not an uncommon one. Here's a bit more on why. In general, can you tell me how to identify duplicates in my dataset? Ken's methods only work under certain assumptions. The Filipino test only works because there is an exact match; it would fail if one sample were a subset of the other. Kinnews does a bit better, but also assumes precise matches. It's also important to remember that these are not very large datasets: Filipino is <1MB and Kinnews (the one used) is ~5MB. MNIST is twice as large, and the images don't lend themselves to simple hashing. So now we've got a double for loop: each test image (10k) needs to be compared against each train image (60k). Granted, both loops are trivially parallelizable, but I wanted an estimate, and serially it took about 15 minutes and some 4GB of memory. You can do much better, but that scale is going to eat you up. And we're only talking about 784 dims. CIFAR-10, which is still small, is 3072 dims (almost 4x). ImageNet-1k is ~200k dims (over a million train images and 100k test images), a 512x512 image is 786k, and 1024x1024 is 3.1M.
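To make the "exact match only" limitation concrete, here's a rough sketch of the hash-based quick rejection, with random arrays standing in for real MNIST and ten duplicates planted by hand. For strictly byte-identical images this sidesteps the double loop entirely. It also catches nothing else: re-encode, resize, or perturb one pixel and the hashes diverge.

```python
import hashlib

import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for MNIST: 60k "train" and 10k "test" images, 784 dims each.
train = rng.integers(0, 256, size=(60_000, 784), dtype=np.uint8)
test = np.concatenate([
    rng.integers(0, 256, size=(9_990, 784), dtype=np.uint8),
    train[:10],  # plant 10 exact duplicates at the end of the test set
])

def digest(img: np.ndarray) -> str:
    # Hash the raw bytes; this only ever matches *bit-identical* images.
    return hashlib.sha256(img.tobytes()).hexdigest()

train_hashes = {digest(img) for img in train}
dupes = [i for i, img in enumerate(test) if digest(img) in train_hashes]
print(len(dupes))  # -> 10, the planted duplicates and nothing more
```

This runs in seconds where the pairwise comparison took minutes, which is exactly why exact-match vetting is cheap enough to demand — and exactly why it says nothing about the near-duplicate case below.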

So what do you do? A probabilistic method like a Bloom filter? And what about semantically similar data points? How do we even define those? That's still an open problem[0]. Is this image[1] the same as this image[2]? What about this one[3]? I grabbed these with clip-retrieval using the query "United Nations logo"[4], which searches the LAION-5B dataset. You can also explore COCO[5] — and mind you, people use COCO and ImageNet as "zero-shot" benchmarks for models trained on LAION.
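One common heuristic for "probably the same picture" is a perceptual hash. Below is a minimal difference-hash (dHash) sketch in numpy — synthetic grayscale arrays stand in for real images, and the noise level is illustrative. It flags a mildly perturbed copy as near while an unrelated image lands far away, but it is a heuristic, not an answer to the open problem above: logos redrawn at different resolutions, crops, and recolourings can still slip past it.

```python
import numpy as np

def dhash(img: np.ndarray, size: int = 8) -> np.ndarray:
    """Difference hash: a tiny perceptual fingerprint of a grayscale image.

    Block-average down to a (size, size+1) thumbnail, then record one
    bit per cell: is it brighter than its right-hand neighbour?
    """
    h, w = img.shape
    img = img[:h - h % size, :w - w % (size + 1)]  # crop to divisible dims
    rows, cols = img.shape
    small = img.reshape(size, rows // size,
                        size + 1, cols // (size + 1)).mean(axis=(1, 3))
    return (small[:, 1:] > small[:, :-1]).ravel()  # 64 bits for size=8

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    return int(np.count_nonzero(a != b))

rng = np.random.default_rng(1)
original = rng.random((64, 72)) * 255
# A "semantic duplicate": the same picture after mild re-encoding noise.
reencoded = original + rng.normal(0.0, 2.0, original.shape)
unrelated = rng.random((64, 72)) * 255

near = hamming(dhash(original), dhash(reencoded))
far = hamming(dhash(original), dhash(unrelated))
print(near, far)  # near is typically a few bits; far hovers around 32 of 64
```

Even here you have to pick a distance threshold, and every choice trades false duplicates against missed ones — which is the whole problem in miniature.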

These images won't be exact matches, and they aren't easy to filter out. Hell, the fact that pixel values for the "same" render differ across machines is a well-known graphics problem — it's exactly what makes canvas fingerprinting work. The silicon lottery plays a big role in these differences, so just using different machines to scrape the web can leave two people with the exact same image whose pixels don't match.
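You can simulate that mismatch directly. Here injected float noise stands in for decoder/driver/hardware differences (it's a stand-in, not a reproduction of any real pipeline): exact comparison fails while a tolerance check passes, which is why naive equality is the wrong tool for cross-machine dedup.

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((64, 64)).astype(np.float32)

# Simulate the "same" image coming off two different decode/render
# stacks: bit-for-bit, the pixels differ by a hair.
img_other = (img + rng.normal(0.0, 1e-6, img.shape)).astype(np.float32)

exact = np.array_equal(img, img_other)          # exact match fails
close = np.allclose(img, img_other, atol=1e-4)  # tolerance match passes
print(exact, close)  # -> False True
```

And once you reach for a tolerance, you're back to choosing a threshold — too tight and real duplicates leak through, too loose and distinct images get merged.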

I know this problem looks easy at face value, but what I'm trying to tell you is that it is actually incredibly complex. The devil lives in the details. And like I said, they could vet for exact duplicates in small datasets, but that only goes so far. This is a nasty problem, and it's a big reason people are so critical of LLMs: you can bet there's test set spoilage. Anyone saying there isn't is either lying or ignorant.

Do not fool yourself into thinking a problem is easier than it is. You'll get burned.

[0] https://arxiv.org/abs/2303.09540

[1] https://vignette3.wikia.nocookie.net/overwatchfanon/images/b...

[2] https://nevensuboticstiftung.de/cms/wp-content/themes/nss-th...

[3] https://cmea-agmc.ca/sites/default/files/styles/150px_wide_x...

[4] https://rom1504.github.io/clip-retrieval/?back=https%3A%2F%2...

[5] https://cocodataset.org/#explore



