The article has a link[1] to a discussion between the blog author and the paper author that I find revealing.
Perhaps as a reminder, the issue is that the paper’s implementation of their 2-nearest neighbor secretly uses an oracle to break ties, which obviously inflates the accuracy compared to a real-world kNN classifier that has to choose heuristically. To be fair, this could be a weird implementation accident and not malice. But I think it does invalidate the results.
But rather than admit error, the author defends this choice, and does so using (in my opinion) dubious statistical arguments. Which leads me to believe that — at least at this point — they know they made a mistake and just won’t admit it.
They claim that instead of a real-world accuracy, they wanted to find the “max” accuracy that their classifier was statistically capable of. That is, the accuracy you get if the stars happen to align and you get the luckiest possible result. Well, not only is this creative new metric not described in the paper, it’s also not applied to the other algorithms. For example, I think a neural network is capable of achieving a “max” accuracy of 100%, if all the initial weights happen to perfectly encode both the training and test sets. But of course they just use standard training to give the numbers for those algorithms.
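To make the dispute concrete, here is a rough sketch (not the paper's actual code) of the two ways you can score a gzip/NCD 2-nearest-neighbor classifier. The "max" number counts a hit whenever the true label appears among the top two neighbors, i.e. it peeks at the answer to break ties; the real-world number has to pick without peeking. train_set/test_set are assumed to be lists of (text, label) pairs.

    import gzip

    def C(s: str) -> int:
        return len(gzip.compress(s.encode()))

    def ncd(a: str, b: str) -> float:
        ca, cb = C(a), C(b)
        return (C(a + " " + b) - min(ca, cb)) / max(ca, cb)

    def top2_labels(x, train_set):
        nearest = sorted(train_set, key=lambda t: ncd(x, t[0]))[:2]
        return [label for _, label in nearest]

    def oracle_accuracy(test_set, train_set):
        # "Max" accuracy: a hit if the true label appears ANYWHERE in the top 2,
        # i.e. ties are broken by peeking at the test label.
        return sum(y in top2_labels(x, train_set) for x, y in test_set) / len(test_set)

    def knn2_accuracy(test_set, train_set):
        # Real-world 2-NN: break ties without the test label
        # (here, the nearer neighbour simply wins).
        return sum(top2_labels(x, train_set)[0] == y for x, y in test_set) / len(test_set)

The gap between those two numbers on the paper's benchmarks is essentially what the blog post measures.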
> They claim that instead of a real-world accuracy, they wanted to find the “max” accuracy that their classifier was statistically capable of
Yeah, I read this on the GitHub issue a week ago and couldn't believe it. Ideally, their profile(1) should allow them to quickly admit they were wrong on such a simple issue. Pursuit of truth and knowledge, etc.
(1) a young PhD from a prestigious university
> For example, I think a neural network is capable of achieving a “max” accuracy of 100%
Why reach for such powerful tools? f(x) = random(num_classes) achieves a 100% "upper bound" accuracy.
If I were confronted with this kind of nonsense in my data science job, I would lose all respect for the person who produced it and never thereafter trust anything they said without thoroughly vetting it.
There are only two options here: deceptive or hopelessly incompetent.
> Ideally, their profile(1) should allow them to quickly admit they were wrong on such a simple issue.
Academia doesn't have a culture of admitting mistakes. Retracting a paper is typically seen as something shameful rather than as progress made by scrutinizing results.
Combined with pressure to publish and sometimes limited engineering skills, it leads to a volatile mix. There are a lot of published results that are not reproducible when you try (not saying anything new here, see the replication crisis).
In its native environment, the scientist reserves its fiercest attacks for its competitors: by fighting tooth and nail, it can render its environment uninhabitable for nearly all but the most determined adversary. Sometimes, this ensures access to desirable mates, but not always.
Well put. Yes, I mention a similar case towards the end of that exchange: Consider a random-guess classifier. That has a max accuracy of 100%. Clearly, not a useful measure on its own.
In academia, it's better to cling to obviously false justifications to dismiss criticism and keep a paper accepted than to admit fault and potentially be forced to retract.
Retracting is extremely rare in computer science, which is why instead many conferences have started "stamping" papers that have artifacts which provide reproducible results.
A couple of AI hype cycles ago, everyone was abuzz about genetic algorithms. I recall a cautionary tale that was related about someone using FPGAs to do genetic algorithms.
After a while they noticed several disturbing things. One, that the winners had fewer gates than theory thought was necessary to solve the problem. Two, some days the winners didn't work, and three, sometimes the winners didn't work on a different FPGA.
After much study the answer was that the winning candidates were treating the gate logic as analog. Manufacturing flaws or PSU fluctuations would result in the analog aspects behaving differently.
To fix this, they split the fitness test in two passes. All implementations that actually worked got re-run in an emulator, which of course treats the behavior as purely digital. Only if they worked with both did they avoid being culled.
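In rough Python terms, the culling step amounts to something like this; the two predicates are stand-ins for the real FPGA test harness and the digital simulator, which I obviously don't have.

    def cull(population, passes_on_hardware, passes_in_emulator):
        # Keep only candidates that work both on real silicon AND in an ideal,
        # purely digital emulation, so designs that exploit analog quirks of one
        # particular chip get dropped from the gene pool.
        return [c for c in population
                if passes_on_hardware(c) and passes_in_emulator(c)]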
IIRC there was a somewhat famous case where the design involved some gates that were obviously not connected to the rest of the logic, but if they were removed the results changed. IIRC the explanation was that the genetic algorithm had created an oscillator circuit that became part of the program.
There are people in the Research Triangle Park area who know stuff little known in Silicon Valley.
Some contractors that were building NLP solutions for three-letter agencies told me the secret to hyper-precise systems is to start with a classifier that separates easy and hard cases, then create a series of classifiers that can solve the hard cases, and fall back on an oracle for the really hard cases. That’s how the original IBM Watson worked.
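In outline, that kind of cascade looks something like the sketch below; all names and thresholds are illustrative, not the actual Watson pipeline.

    def cascade_predict(x, router, classifiers, human_oracle, confidence_threshold=0.9):
        # Easy cases get the cheap classifier.
        if router.is_easy(x):
            return classifiers[0].predict(x)
        # Hard cases go through increasingly heavyweight classifiers.
        for clf in classifiers[1:]:
            label, confidence = clf.predict_with_confidence(x)
            if confidence >= confidence_threshold:
                return label
        # The really hard cases fall through to a person.
        return human_oracle(x)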
Of course they won’t admit they made a mistake, they’re watching a career-making paper become a retraction (especially with the training data contamination issues).
Text-similarity embeddings aren't very interesting and will correlate with gzip, especially when the test is text similarity and the classes being tested have distinct vocabularies.
The really useful ones are based on SBERT, and measure the likelihood that the answer is contained in the text that was embedded.
ex. from my unit tests:
"what is my safe passcode?" has a strong match with "my lockbox pin is 1234", but a very weak match to 'my jewelry is stored safely in the safe'
Found this fascinating, thanks. I’ve been circling SBERT over the last few weeks (along with a range of other techniques to improve the quality of retrieval). Reading your comment and the linked post and comments has really cemented for me that we’re on the right track.
For me, no. Mainly because "text classification" is a pretty limited application and one I don't plan to spend much time on. For NLP tasks that require a deeper "understanding", I don't see how compression algorithms can help much (at least directly).
One could say that you need to understand something about the artifact you are compressing, but, to be clear, you can compress text without understanding anything about its semantic content, and this is what gzip does. The only understanding needed for that level of compression is that the thing to be compressed is a string in a binary alphabet.
Of course, which is why gzip is a good baseline for "better" compressors that do have semantic understanding.
The whole idea of an autoencoder is conceptual compression. You take a concept (say: human faces) and create a compressor that is so overfit to that concept that when given complete gobbledygook (random seed data) it decompresses that to something with semantic meaning!
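In code the idea is tiny. The sketch below is untrained and purely illustrative (a real face model would be trained on faces), but it shows the shape of it: a learned, domain-overfit compressor and decompressor.

    import torch
    import torch.nn as nn

    latent_dim = 32
    encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 256), nn.ReLU(),
                            nn.Linear(256, latent_dim))
    decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                            nn.Linear(256, 64 * 64))

    image = torch.rand(1, 64, 64)          # stand-in for a face image
    code = encoder(image)                   # "compression": 64*64 pixels -> 32 numbers
    reconstruction = decoder(code)          # "decompression"

    # The parent's point: after training on faces, even a random code...
    random_code = torch.randn(1, latent_dim)
    hallucinated = decoder(random_code)     # ...decodes to something face-like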
It may sound strange out of context, but the most memorable quote I've encountered in any book or any piece of writing anywhere, at least in terms of informing my own understanding of language and the construction of meaning through communication, came in a book on screenwriting by William Goldman. The guy who wrote The Princess Bride, of all things.
The sentence was simply, (and in capitals in the original), "POETRY IS COMPRESSION."
Yes, I agree. That's why I said directly (with regards to compression algorithms used for understanding). Indirectly, yes, compression and intelligence/understanding are closely related.
Thanks for linking to these other results. I found them very interesting. The latter is simply doing set intersection counts to measure distance, and it works well relative to the original technique. Has anyone compared the accuracy of these to naive bayes?
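If I've understood the set-intersection approach correctly, it boils down to something like this (my reconstruction, not the linked author's code):

    def intersection_score(a: str, b: str) -> int:
        # Count shared tokens between two documents.
        return len(set(a.lower().split()) & set(b.lower().split()))

    def predict(x: str, train_set):
        # train_set: list of (text, label) pairs; take the label of the best match.
        _, label = max(train_set, key=lambda t: intersection_score(x, t[0]))
        return label

    # For the naive Bayes comparison, sklearn's CountVectorizer + MultinomialNB over
    # the same tokenization would be the obvious baseline to run side by side.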
I really think the numbers were inflated because of the prolific benchmarkism that goes on in ML. Basically, if you don't beat SOTA, you don't get published. Usually you need SOTA on MULTIPLE datasets. Which is problematic, because plenty of non-SOTA methods are useful (forget novel). Given the results Ken/ks2048 calculated, I am pretty confident the work wouldn't have made it in. BUT I think the results, given the other features, do make the work quite useful! I agree, Ken, that it unfairly boosts their work, but I understand why they're bending over backwards to defend it. I wish people would just admit mistakes, but that risks losing a paper (though it probably wouldn't). This is probably the same reason they didn't think to double-check suspicious results like the Filipino dataset, too (btw, it's not uncommon for datasets to be spoiled, people. Always be suspicious!).
I'm not trying to give them a pass, but we do need to discuss the perverse incentives we've set up that make these kinds of things so prevalent. The work should be good on its own, but good doesn't mean it'll get published in a journal. And frankly, it doesn't matter how many citations your arxiv paper has, people will still say "it isn't peer reviewed" and it won't help you get a job, graduate, or advance in academia. Which I think we should all agree is idiotic, since citations indicate peer review too.
I don't blame them for failing to double check their results.
I blame them for giving obviously incorrect excuses on GitHub when such an obvious mistake is pointed out.
There is no way they could be at the stage they claim to be in their program (having just defended their thesis) and think the excuses they gave on GitHub are reasonable.
Yeah, I fully agree. They should just admit the mistake rather than try to justify it. I was just trying to explain the incentive structure around them that encourages this behavior. Unfortunately no one gives you points for admitting your mistakes (in fact, you risk losing points) and you are unlikely to lose points for doubling down on an error.
> There is no way they could be at the stage they claim to be in their program (having just defended their thesis) and think the excuses they gave on GitHub are reasonable.
Unfortunately it is a very noisy process. I know people from top 3 universities that have good publication records and don't know probabilities from likelihoods. I know students and professors at these universities that think autocasting your model to fp16 reduces your memory by half (from fp32) and are confused when you explain that that's a theoretical (and not practical) lower bound. Just the other day I had someone open an issue on my github (who has a PhD from one of these universities and is currently a professor!) who was expecting me to teach them how to load a pretrained model. This is not uncommon.
> Scientific research typically has been founded on high ethical standards established by researchers in academia and health care research institutions. Scientific fraud, an act of deception or misrepresentation of one's own work, violates these ethical standards.
And according to Ken Schutte:
> this method uses the test label as part of its decision process which is not the standard classification setting and can't be fairly compared to others that don't.
Can anyone make the case that these two descriptions don't overlap? Personally I can't see how the original author can be so blasé about this.
I try to explain in this comment[0]. I agree that this is unethical behavior, but we also need to be aware of what pressures are encouraging it. I also think Ken is romanticizing the standards of science a bit here. This would be great, but it is not what happens in practice. Unfortunately. Mostly unintentionally, but there are intentional cases too.
> The paper’s repo does minimal processing on the datasets. It turns out that these problems exist in the source Huggingface datasets. The two worst ones can be checked quickly using only Huggingface’s datasets.load_dataset:
I'm really surprised HuggingFace isn't doing filtering/evaluation of the datasets they're presenting. This ought to be a simple check for them.
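The check the post describes really is only a few lines. Dataset and column names below are placeholders; the post names the exact datasets it found problems in.

    from datasets import load_dataset

    ds = load_dataset("some_text_classification_dataset")  # placeholder name
    train_texts = set(ds["train"]["text"])                  # assumes a "text" column
    test_texts = set(ds["test"]["text"])

    overlap = train_texts & test_texts
    print(f"{len(overlap)} of {len(test_texts)} unique test texts also appear in train")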
Is there a feature for HF's datasets platform that makes load_dataset throw an exception if you try to load a known-dubious dataset, unless you explicitly provide a kwarg like 'allow_dubious=True'? If not, that might be a boon for the whole field... it might nip the propagation of false results at the outset.
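As far as I know no such flag exists in the datasets library today; about the most you can do is a user-side wrapper along these lines. Everything below, including the registry, is hypothetical.

    from datasets import load_dataset

    KNOWN_DUBIOUS = {"some_dataset_with_train_test_overlap"}  # hypothetical registry

    def load_dataset_checked(name, *args, allow_dubious=False, **kwargs):
        if name in KNOWN_DUBIOUS and not allow_dubious:
            raise ValueError(
                f"{name} is flagged as dubious; pass allow_dubious=True to load it anyway")
        return load_dataset(name, *args, **kwargs)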
That's a tall order. While the cases here are simple and more obvious, they don't scale well. It can also be problematic if an official dataset has the error, as now they've created a different one. They have 48,627 datasets. Their goal is not to validate datasets (which is far more difficult than checking for dupes (not easy, btw)), but to be like GitHub, so that others (like Ken) can review the work of their peers and check for mistakes. Because of this, HF has to allow uploading of arbitrary datasets; they cannot be an arbitrator of what is good or bad, since that depends on what's being solved. They could probably set a flag for datasets (and maybe even some statistics!) that are under a few gigs in size, but they cannot and should not filter them.
I appreciate there is nuance, and some checks would be computationally expensive, but something like training data and evaluation data being literally identical seems pretty straightforward to check for, and it would make for a very simple, quick rejection.
For this case, yes. In the general sense, no. That's what I'm saying.
And to be clear, this is actually a common problem, not an uncommon one. Here's a bit more on why. In general, can you tell me how I can identify duplicates in my dataset? Ken's methods only work under certain assumptions. The Filipino test only works because there is an exact match; it would not work if one set were a subset of the other. Kinnews does a bit better, but also assumes precise matches. It's also important to remember that these are not very large datasets. Filipino is <1MB and Kinnews is ~5MB (the one used). MNIST is twice as large. The images also make it unhashable. So now we have to do a double for loop: each test image (10k) needs to be compared against each train image (60k). Granted, these are both trivially parallelizable loops, but I wanted to get an estimate, and it took about 15 minutes (serially) and some 4GB to compute this. You can do much better, but that scale is going to eat you up. We're only talking about 784 dims. CIFAR-10, which is still small, is 3072 dims (almost 4x). ImageNet-1k is ~200k dims (over a million train images and 100k test images), a 512x512 image is 786k, and 1024x1024 is 3.1M.
So what do you do? A probabilistic method like a Bloom filter? What about semantically similar data points? How do we even define those? That's still an open problem[0]. Is this image[1] and this image[2] the same? What about this one?[3] I grabbed these with clip-retrieval using "United Nations logo"[4], which searches the LAION-5B dataset. But you can also explore COCO[5], and mind you, people use COCO and ImageNet as "zero-shot" benchmarks for models trained on LAION.
These images won't be exact matches and they aren't easy to filter out. Hell, matching images to pixel perfect values is a well known graphics problem and is how people do Canvas Fingerprinting. The silicon lottery plays a big role in this difference and so just using different machines to scrape the web can result in two people grabbing the exact same image with those images not matching.
I know that this problem looks easy at face value, but what I'm trying to tell you is that it is actually incredibly complex. The devil lives in the details. And like I said, they could do vetting for exact duplication and small datasets, but that only goes so far. This is a nasty problem and there's a reason people are so critical of LLMs. Because you bet there's test set spoilage. Anyone saying there isn't is either lying or ignorant.
Do not fool yourself into thinking a problem is easier than it is. You'll get burned.
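For completeness, the cheap exact-match case really is only a few lines (hash the raw bytes); the point above is that everything beyond exact equality, near-duplicates, re-encoded images, and semantic duplicates, is where this stops working and pairwise or ANN search begins.

    import hashlib

    def fingerprint(raw: bytes) -> str:
        return hashlib.sha256(raw).hexdigest()

    def exact_overlap(train_items, test_items):
        # train_items / test_items: iterables of raw bytes (encoded text or image buffers)
        train_hashes = {fingerprint(b) for b in train_items}
        return sum(fingerprint(b) in train_hashes for b in test_items)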
It might do you good to google the minimum description length (MDL) principle. MDL classifiers have been used for image classification, so this is not a new invention. Our research group used minimum description length for generating new test material; the archive test suite that found several CVEs in gzip itself was built that way too.
There is this sense of deflation of effort in tech right now (always?) where, if you can just wait a moment longer to start coding, you can adopt something else and save yourself from the rat race.
Even though it's fundamentally flawed and the author failed to own up to it, I like that paper because it's simple and intriguing. So I'm not mad that the author will keep the CV entry.
<offtopic probably>Haven't read the article, but nowadays there's no reason to use either GZIP or bzip2 when ZSTD is available. It's just so much better than both; I've no idea why people haven't replaced everything with ZSTD, except for XZ/7-Zip, which can provide much higher compression ratios at the cost of very slow compression and insane RAM requirements (a 3840MB dictionary with at least two threads).</offtopic probably>
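If anyone wants to check the claim on their own data, a quick size comparison is a few lines; the file path is a placeholder, and zstd support comes from the third-party "zstandard" package (pip install zstandard).

    import gzip, bz2
    import zstandard

    data = open("some_file.txt", "rb").read()  # placeholder path
    print("gzip  :", len(gzip.compress(data, 9)))
    print("bzip2 :", len(bz2.compress(data, 9)))
    print("zstd  :", len(zstandard.ZstdCompressor(level=19).compress(data)))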