
I ran an experiment like this and in my setting it didn't help. I'm not ruling out a bug in my setup, but I think attending over the current position sort of already solves this problem; i.e., when a head should not speak, it just emits the current position's value.

Edit to add details in case anyone is interested:

I didn't add one to the softmax denominator. Instead I added a learned parameter (the attention sink) that was appended to the beginning of the QK logits but removed after the softmax, so the weights multiplying V wouldn't sum to one. I tried variants that did and didn't include looking at the current position, and also variants that used an FFN to generate the sink per position instead of a learned parameter. In my setting neither approach really made much of a difference. But I also had a bunch of other weird stuff in there, so it may be worth trying again.
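For concreteness, here's a minimal single-head sketch of that kind of sink (a PyTorch-style illustration with made-up names and shapes, not the code I actually ran, which had per-head sinks and the other variants mixed in):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SinkAttention(nn.Module):
        # Single-head causal attention with a learned "sink" logit that is
        # prepended before the softmax and dropped afterwards, so the weights
        # over real positions no longer sum to one.
        def __init__(self, d_model):
            super().__init__()
            self.q = nn.Linear(d_model, d_model, bias=False)
            self.k = nn.Linear(d_model, d_model, bias=False)
            self.v = nn.Linear(d_model, d_model, bias=False)
            self.sink = nn.Parameter(torch.zeros(1))   # learned sink logit
            self.scale = d_model ** -0.5

        def forward(self, x):                          # x: (batch, seq, d_model)
            B, T, _ = x.shape
            q, k, v = self.q(x), self.k(x), self.v(x)
            logits = (q @ k.transpose(-2, -1)) * self.scale       # (B, T, T)
            causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
            logits = logits.masked_fill(causal, float("-inf"))
            sink = self.sink.expand(B, T, 1)           # one sink logit per query
            w = F.softmax(torch.cat([sink, logits], dim=-1), dim=-1)
            w = w[..., 1:]                             # drop the sink column: rows now sum to < 1
            return w @ v

The per-position variant I mention would just replace self.sink with a small FFN applied to each position's hidden state to produce the sink logit.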




When you say it didn't help, can you clarify what you're measuring? In the context of this post, I think both the performance on your task and the number of outlier weights (and their magnitude) are important.


I was only trying this in pretraining, so I was looking at pretraining losses. The difference was within the range of usual noise, so I didn't keep trying.


This is fixing a different issue, not the one you are measuring.


Fixing this issue wasn't really the goal of my experiment, for sure; I was trying to see whether you could improve attention by decoupling the key a position uses for itself from the one it uses for future tokens.

Open to being wrong here, but wouldn't it be functionally similar to adding a constant to the softmax denominator? The model could sort of learn for the sink and q to multiply to one at a specific position, and then removing it before multiplying with V would be exactly identical?
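(Sketching my reasoning: appending a sink logit s and dropping it after the softmax gives exp(l_j) / (exp(s) + sum_i exp(l_i)), i.e. just a learned constant exp(s) added to the denominator, which matches the +1 version exactly when s = 0. A rough numpy check of what I mean:)

    import numpy as np

    def sink_weights(logits, sink_logit):
        # softmax over [sink, logits], then drop the sink entry
        e = np.exp(np.concatenate([[sink_logit], logits]))
        w = e / e.sum()
        return w[1:]                           # sums to < 1

    def softmax_plus_one(logits):
        # the "add one to the denominator" variant
        e = np.exp(logits)
        return e / (1.0 + e.sum())

    l = np.array([1.0, -2.0, 0.5])
    print(sink_weights(l, sink_logit=0.0))     # identical to softmax_plus_one(l)
    print(softmax_plus_one(l))
    print(sink_weights(l, sink_logit=0.7))     # same form, with exp(0.7) in the denominator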


The question concerns outliers ... how did the change manage them?


He's advertising it as fixing the spiking outliers. Did your variant have those outliers beforehand?


I guess, yeah, I was mostly responding to this:

> Now it’s possible that softmax should be replaced wholesale, but it’s worked pretty well for the most part, except for this one wee little bug that prevents attention heads from saying nothing. So I propose a very small tweak on which I am willing to stake all future Internet claims to being correct. The tweak is so small, yet so obvious, and it’s been sitting here under everyone’s noses ever since attention was invented (2014).

I didn't test for outliers, but I don't think this will lead to a large improvement in attention overall or fix a lurking bug.


He’s not trying or claiming to improve attention. He’s trying to reduce outliers to improve the ability to quantize the parameters.


He refers throughout the blog post to an "error" in attention. Specifically, he says:

> The problem with using softmax is that it forces each attention head to make an annotation, even if it has no information to add to the output vector. Using softmax to choose among discrete alternatives is great; using it for optional annotation (i.e. as input into addition) is, like, not cool, man.

I'm saying attention already uses the current position to do this, and that if it were a significant error I would expect fixing it to improve the training loss. I sort of interpreted the blog post as being a bit more positive about the idea than just being about improving quantization.


I agree that he used the term "error" somewhat loosely. But he seems mainly to be making the point that softmax introduces a large outlier, which in turn is only an issue now that the community is aggressively trying to quantize models.


> I didn't test for outliers

Then you don't know whether the approach he is advocating actually improves what he is aiming for.



