The article has a link[1] to a discussion between the blog author and the paper author that I find revealing.
Perhaps as a reminder, the issue is that the paper’s implementation of their 2-nearest neighbor secretly uses an oracle to break ties, which obviously inflates the accuracy compared to a real-world kNN classifier that has to choose heuristically. To be fair, this could be a weird implementation accident and not malice. But I think it does invalidate the results.
But rather than admit error, the author defends this choice, and does so using (in my opinion) dubious statistical arguments. Which leads me to believe that — at least at this point — they know they made a mistake and just won’t admit it.
They claim that instead of a real-world accuracy, they wanted to find the “max” accuracy that their classifier was statistically capable of. That is, the accuracy you get if the stars happen to align and you get the luckiest possible result. Well, not only is this creative new metric not described in the paper, it’s also not applied to the other algorithms. For example, I think a neural network is capable of achieving a “max” accuracy of 100%, if all the initial weights happen to perfectly encode both the training and test sets. But of course they just use standard training to give the numbers for those algorithms.
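To make the dispute concrete, here is a rough sketch (not the paper's actual code) of the two ways you can score a gzip/NCD 2-nearest-neighbor classifier. The "max" number counts a hit whenever the true label appears among the top two neighbors, i.e. it peeks at the answer to break ties; the real-world number has to pick without peeking. train_set/test_set are assumed to be lists of (text, label) pairs.

    import gzip

    def C(s: str) -> int:
        return len(gzip.compress(s.encode()))

    def ncd(a: str, b: str) -> float:
        ca, cb = C(a), C(b)
        return (C(a + " " + b) - min(ca, cb)) / max(ca, cb)

    def top2_labels(x, train_set):
        nearest = sorted(train_set, key=lambda t: ncd(x, t[0]))[:2]
        return [label for _, label in nearest]

    def oracle_accuracy(test_set, train_set):
        # "Max" accuracy: a hit if the true label appears ANYWHERE in the top 2,
        # i.e. ties are broken by peeking at the test label.
        return sum(y in top2_labels(x, train_set) for x, y in test_set) / len(test_set)

    def knn2_accuracy(test_set, train_set):
        # Real-world 2-NN: break ties without the test label
        # (here, the nearer neighbour simply wins).
        return sum(top2_labels(x, train_set)[0] == y for x, y in test_set) / len(test_set)

The gap between those two numbers on the paper's benchmarks is essentially what the blog post measures.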
> They claim that instead of a real-world accuracy, they wanted to find the “max” accuracy that their classifier was statistically capable of
Yeah, I read this on the GitHub issue a week ago and couldn't believe it. Ideally, their profile(1) should allow them to quickly admit they were wrong on such a simple issue. Pursuit of truth and knowledge, etc.
(1) a young PhD from a prestigious university
> For example, I think a neural network is capable of achieving a “max” accuracy of 100%
Why reach for such powerful tools? f(x) = random(num_classes) achieves a 100% "upper bound" accuracy.
If I were confronted with this kind of nonsense in my data science job, I would lose all respect for the person who produced it and never thereafter trust anything they said without thoroughly vetting it.
There are only two options here: deceptive or hopelessly incompetent.
> Ideally, their profile(1) should allow them to quickly admit they were wrong on such a simple issue.
Academia doesn't have a culture of admitting mistakes. Retracting a paper is typically seen as something shameful rather than as progress made by scrutinizing results.
Combined with pressure to publish and sometimes limited engineering skills, it leads to a volatile mix. There are a lot of published results that are not reproducible when you try (not saying anything new here, see the replication crisis).
In its native environment, the scientist reserves its fiercest attacks for its competitors: by fighting tooth and nail, it can render its environment uninhabitable for nearly all but the most determined adversary. Sometimes, this ensures access to desirable mates, but not always.
Well put. Yes, I mention a similar case towards the end of that exchange: Consider a random-guess classifier. That has a max accuracy of 100%. Clearly, not a useful measure on its own.
In academia, it's better to cling to obviously false justifications to dismiss criticism and keep a paper accepted than to admit fault and potentially be forced to retract.
Retracting is extremely rare in computer science, which is why instead many conferences have started "stamping" papers that have artifacts which provide reproducible results.
A couple of AI hype cycles ago, everyone was abuzz about genetic algorithms. I recall a cautionary tale that was related about someone using FPGAs to do genetic algorithms.
After a while they noticed several disturbing things. One, that the winners had fewer gates than theory thought was necessary to solve the problem. Two, some days the winners didn't work, and three, sometimes the winners didn't work on a different FPGA.
After much study the answer was that the winning candidates were treating the gate logic as analog. Manufacturing flaws or PSU fluctuations would result in the analog aspects behaving differently.
To fix this, they split the fitness test in two passes. All implementations that actually worked got re-run in an emulator, which of course treats the behavior as purely digital. Only if they worked with both did they avoid being culled.
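In rough Python terms, the culling step amounts to something like this; the two predicates are stand-ins for the real FPGA test harness and the digital simulator, which I obviously don't have.

    def cull(population, passes_on_hardware, passes_in_emulator):
        # Keep only candidates that work both on real silicon AND in an ideal,
        # purely digital emulation, so designs that exploit analog quirks of one
        # particular chip get dropped from the gene pool.
        return [c for c in population
                if passes_on_hardware(c) and passes_in_emulator(c)]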
IIRC there was a somewhat famous case where the design involved some gates that were obviously not connected to the rest of the logic, but if they were removed the results changed. IIRC the explanation was that the genetic algorithm had created an oscillator circuit that became part of the program.
There are people in the Research Triangle Park area who know stuff little known in Silicon Valley.
Some contractors that were building NLP solutions for three-letter agencies told me the secret to hyper-precise systems is to start with a classifier that separates easy and hard cases, then create a series of classifiers that can solve the hard cases, and fall back on an oracle for the really hard cases. That’s how the original IBM Watson worked.
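In outline, that kind of cascade looks something like the sketch below; all names and thresholds are illustrative, not the actual Watson pipeline.

    def cascade_predict(x, router, classifiers, human_oracle, confidence_threshold=0.9):
        # Easy cases get the cheap classifier.
        if router.is_easy(x):
            return classifiers[0].predict(x)
        # Hard cases go through increasingly heavyweight classifiers.
        for clf in classifiers[1:]:
            label, confidence = clf.predict_with_confidence(x)
            if confidence >= confidence_threshold:
                return label
        # The really hard cases fall through to a person.
        return human_oracle(x)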
Of course they won’t admit they made a mistake, they’re watching a career-making paper become a retraction (especially with the training data contamination issues).
Text-similarity embeddings aren't very interesting and will correlate with gzip, especially when the test is text similarity and the classes being tested have distinct vocabularies.
The really useful ones are based on SBERT, and measure the likelihood that the answer is contained in the text that was embedded.
ex. from my unit tests:
"what is my safe passcode?" has a strong match with "my lockbox pin is 1234", but a very weak match to 'my jewelry is stored safely in the safe'
Found this fascinating, thanks. I’ve been circling SBERT over the last few weeks (along with a range of other techniques to improve the quality of retrieval). Reading your comment and the linked post and comments has really cemented for me that we’re on the right track.
For me, no. Mainly because "text classification" is a pretty limited application and one I don't plan to spend much time on. For NLP tasks that require a deeper "understanding", I don't see how compression algorithms can help much (at least directly).
One could say that you need to understand something about the artifact you are compressing, but, to be clear, you can compress text without understanding anything about its semantic content, and this is what gzip does. The only understanding needed for that level of compression is that the thing to be compressed is a string in a binary alphabet.
Of course, which is why gzip is a good baseline for "better" compressors that do have semantic understanding.
The whole idea of an autoencoder is conceptual compression. You take a concept (say: human faces) and create a compressor that is so overfit to that concept that when given complete gobbledygook (random seed data) it decompresses that to something with semantic meaning!
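In code the idea is tiny. The sketch below is untrained and purely illustrative (a real face model would be trained on faces), but it shows the shape of it: a learned, domain-overfit compressor and decompressor.

    import torch
    import torch.nn as nn

    latent_dim = 32
    encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 256), nn.ReLU(),
                            nn.Linear(256, latent_dim))
    decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                            nn.Linear(256, 64 * 64))

    image = torch.rand(1, 64, 64)          # stand-in for a face image
    code = encoder(image)                   # "compression": 64*64 pixels -> 32 numbers
    reconstruction = decoder(code)          # "decompression"

    # The parent's point: after training on faces, even a random code...
    random_code = torch.randn(1, latent_dim)
    hallucinated = decoder(random_code)     # ...decodes to something face-like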
It may sound strange out of context, but the most memorable quote I've encountered in any book or any piece of writing anywhere, at least in terms of informing my own understanding of language and the construction of meaning through communication, came in a book on screenwriting by William Goldman. The guy who wrote The Princess Bride, of all things.
The sentence was simply, (and in capitals in the original), "POETRY IS COMPRESSION."
Yes, I agree. That's why I said directly (with regards to compression algorithms used for understanding). Indirectly, yes, compression and intelligence/understanding are closely related.
Thanks for linking to these other results. I found them very interesting. The latter is simply doing set intersection counts to measure distance, and it works well relative to the original technique. Has anyone compared the accuracy of these to naive bayes?
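If I've understood the set-intersection approach correctly, it boils down to something like this (my reconstruction, not the linked author's code):

    def intersection_score(a: str, b: str) -> int:
        # Count shared tokens between two documents.
        return len(set(a.lower().split()) & set(b.lower().split()))

    def predict(x: str, train_set):
        # train_set: list of (text, label) pairs; take the label of the best match.
        _, label = max(train_set, key=lambda t: intersection_score(x, t[0]))
        return label

    # For the naive Bayes comparison, sklearn's CountVectorizer + MultinomialNB over
    # the same tokenization would be the obvious baseline to run side by side.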
I really think the numbers were inflated because of the prolific benchmarkism that goes on in ML. Basically, if you don't beat SOTA, you don't get published. Usually you need SOTA on MULTIPLE datasets. Which is problematic, because plenty of non-SOTA methods are useful (forget novel). Given the results Ken/ks2048 calculated, I am pretty confident the work wouldn't have made it in. BUT I think the results, given the other features, do make the work quite useful! I agree, Ken, that it unfairly boosts their work, but I understand why they're bending over backwards to defend it. I wish people would just admit mistakes, but that risks losing a paper (though it probably wouldn't). This is probably the same reason they didn't think to double-check suspicious results like the Filipino dataset, too (btw, it's not uncommon for datasets to be spoiled, people. Always be suspicious!).
I'm not trying to give them a pass, but we do need to discuss the perverse incentives we've set up that make these kinds of things so prevalent. The work should be good on its own, but good doesn't mean it'll get published in a journal. And frankly, it doesn't matter how many citations your arxiv paper has, people will still say "it isn't peer reviewed" and it won't help you get a job, graduate, or advance in academia. Which I think we should all agree is idiotic, since citations indicate peer review too.
I don't blame them for failing to double check their results.
I blame them for giving obviously incorrect excuses on GitHub when such an obvious mistake is pointed out.
There is no way they could be at the stage they claim to be in their program (having just defended their thesis) and think the excuses they gave on GitHub are reasonable.
Yeah, I fully agree. They should just admit the mistake rather than try to justify it. I was just trying to explain the incentive structure around them that encourages this behavior. Unfortunately no one gives you points for admitting your mistakes (in fact, you risk losing points) and you are unlikely to lose points for doubling down on an error.
> There is no way they could be at the stage they claim to be in their program (having just defended their thesis) and think the excuses they gave on GitHub are reasonable.
Unfortunately it is a very noisy process. I know people from top 3 universities that have good publication records and don't know probabilities from likelihoods. I know students and professors at these universities that think autocasting your model to fp16 reduces your memory by half (from fp32) and are confused when you explain that that's a theoretical (and not practical) lower bound. Just the other day I had someone open an issue on my github (who has a PhD from one of these universities and is currently a professor!) who was expecting me to teach them how to load a pretrained model. This is not uncommon.
> Scientific research typically has been founded on high ethical standards established by researchers in academia and health care research institutions. Scientific fraud, an act of deception or misrepresentation of one's own work, violates these ethical standards.
And according to Ken Schutte:
> this method uses the test label as part of its decision process which is not the standard classification setting and can't be fairly compared to others that don't.
Can anyone make the case that these two descriptions don't overlap? Personally I can't see how the original author can be so blasé about this.
I try to explain in this comment[0]. I agree that this is unethical behavior, but we also need to be aware of what pressures are encouraging it. I also think Ken is romanticizing the standards of science a bit here. This would be great, but it is not what happens in practice. Unfortunately. Mostly unintentionally, but there are intentional cases too.
> The paper’s repo does minimal processing on the datasets. It turns out that these problems exist in the source Huggingface datasets. The two worst ones can be checked quickly using only Huggingface’s datasets.load_dataset:
I'm really surprised HuggingFace isn't doing filtering/evaluation of the datasets they're presenting. This ought to be a simple check for them.
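The check the post describes really is only a few lines. Dataset and column names below are placeholders; the post names the exact datasets it found problems in.

    from datasets import load_dataset

    ds = load_dataset("some_text_classification_dataset")  # placeholder name
    train_texts = set(ds["train"]["text"])                  # assumes a "text" column
    test_texts = set(ds["test"]["text"])

    overlap = train_texts & test_texts
    print(f"{len(overlap)} of {len(test_texts)} unique test texts also appear in train")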
Is there a feature for HF's datasets platform that makes load_dataset throw an exception if you try to load a known-dubious dataset, unless you explicitly provide a kwarg like 'allow_dubious=True'? If not, that might be a boon for the whole field... it might nip the propagation of false results at the outset.
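As far as I know no such flag exists in the datasets library today; about the most you can do is a user-side wrapper along these lines. Everything below, including the registry, is hypothetical.

    from datasets import load_dataset

    KNOWN_DUBIOUS = {"some_dataset_with_train_test_overlap"}  # hypothetical registry

    def load_dataset_checked(name, *args, allow_dubious=False, **kwargs):
        if name in KNOWN_DUBIOUS and not allow_dubious:
            raise ValueError(
                f"{name} is flagged as dubious; pass allow_dubious=True to load it anyway")
        return load_dataset(name, *args, **kwargs)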
That's a tall order. While the cases here are simple and more obvious, they don't scale well. It can also be problematic if an official dataset has the error, as now they've created a different one. They have 48,627 datasets. Their goal is not to validate datasets (which is far more difficult than checking for dupes (not easy, btw)), but to be like GitHub, so that others (like Ken) can review the work of their peers and check for mistakes. Because of this, HF has to allow uploading of arbitrary datasets; they cannot be an arbitrator of what is good or bad, since that depends on what's being solved. They could probably set a flag for datasets (and maybe even some statistics!) that are under a few gigs in size, but they cannot and should not filter them.
I appreciate there is nuance, and some checks would be computationally expensive, but something like training data and evaluation data being literally identical seems pretty straightforward to check for, and it would make for a very simple, quick rejection.
For this case, yes. In the general sense, no. That's what I'm saying.
And to be clear, this is actually a common problem, not an uncommon one. Here's a bit more on why. In general, can you tell me how I can identify duplicates in my dataset? Ken's methods only work under certain assumptions. The Filipino test only works because there is an exact match; it would not work if one set were a subset of the other. Kinnews does a bit better, but also assumes precise matches. It's also important to remember that these are not very large datasets. Filipino is <1MB and Kinnews is ~5MB (the one used). MNIST is twice as large. The images also make it unhashable. So now we have to do a double for loop: each test image (10k) needs to be compared against each train image (60k). Granted, these are both trivially parallelizable loops, but I wanted to get an estimate, and it took about 15 minutes (serially) and some 4GB to compute this. You can do much better, but that scale is going to eat you up. We're only talking about 784 dims. CIFAR-10, which is still small, is 3072 dims (almost 4x). ImageNet-1k is ~200k dims (over a million train images and 100k test images), a 512x512 image is 786k, and 1024x1024 is 3.1M.
So what do you do? A probabilistic method like a Bloom filter? What about semantically similar data points? How do we even define those? That's still an open problem[0]. Is this image[1] and this image[2] the same? What about this one?[3] I grabbed these with clip-retrieval using "United Nations logo"[4], which searches the LAION-5B dataset. But you can also explore COCO[5], and mind you, people use COCO and ImageNet as "zero-shot" benchmarks for models trained on LAION.
These images won't be exact matches and they aren't easy to filter out. Hell, matching images to pixel perfect values is a well known graphics problem and is how people do Canvas Fingerprinting. The silicon lottery plays a big role in this difference and so just using different machines to scrape the web can result in two people grabbing the exact same image with those images not matching.
I know that this problem looks easy at face value, but what I'm trying to tell you is that it is actually incredibly complex. The devil lives in the details. And like I said, they could do vetting for exact duplication and small datasets, but that only goes so far. This is a nasty problem and there's a reason people are so critical of LLMs. Because you bet there's test set spoilage. Anyone saying there isn't is either lying or ignorant.
Do not fool yourself into thinking a problem is easier than it is. You'll get burned.
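For completeness, the cheap exact-match case really is only a few lines (hash the raw bytes); the point above is that everything beyond exact equality, near-duplicates, re-encoded images, and semantic duplicates, is where this stops working and pairwise or ANN search begins.

    import hashlib

    def fingerprint(raw: bytes) -> str:
        return hashlib.sha256(raw).hexdigest()

    def exact_overlap(train_items, test_items):
        # train_items / test_items: iterables of raw bytes (encoded text or image buffers)
        train_hashes = {fingerprint(b) for b in train_items}
        return sum(fingerprint(b) in train_hashes for b in test_items)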
It might do you good to google the minimum description length (MDL) principle. MDL classifiers have been used for image classification, so this is not a new invention. Our research group used minimum description length for generating new test material; the archive test suite that found several CVEs in gzip itself was built that way too.
There is this sense of deflation of effort in tech right now (always?) where, if you can just wait a moment longer to start coding, you can adopt something else and save yourself from the rat race.
Even though it's fundamentally flawed and the author failed to own up to it, I like that paper because it's simple and intriguing. So I'm not mad that the author will keep the CV entry.
<offtopic probably>Haven't read the article, but nowadays there's no reason to use either GZIP or bzip2 when ZSTD is available. It's just so much better than both; I've no idea why people haven't replaced everything with ZSTD, except for XZ/7-Zip, which can provide much higher compression ratios at the cost of very slow compression and insane RAM requirements (a 3840MB dictionary with at least two threads).</offtopic probably>
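If anyone wants to check the claim on their own data, a quick size comparison is a few lines; the file path is a placeholder, and zstd support comes from the third-party "zstandard" package (pip install zstandard).

    import gzip, bz2
    import zstandard

    data = open("some_file.txt", "rb").read()  # placeholder path
    print("gzip  :", len(gzip.compress(data, 9)))
    print("bzip2 :", len(bz2.compress(data, 9)))
    print("zstd  :", len(zstandard.ZstdCompressor(level=19).compress(data)))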