
I ran an experiment like this and in my setting it didn't help. I'm not ruling out a bug in my setup, but I think attending over the current position sort of already solves this problem; i.e., when a head should not speak, it just emits the current position's value.

Edit to add details in case anyone is interested:

I didn't add one to the softmax denominator. Instead I added a learned parameter (the attention sink) that was appended to the beginning of the QK logits but removed after the softmax, so the weights multiplying V wouldn't sum to one. I tried variants that did and didn't include looking at the current position, and also variants that used an FFN to generate the sink per position instead of a learned parameter. In my setting neither approach really made much of a difference. But I also had a bunch of other weird stuff in there, so it may be worth trying again.
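For concreteness, here's a minimal single-head sketch of that kind of sink (a PyTorch-style illustration with made-up names and shapes, not the code I actually ran, which had per-head sinks and the other variants mixed in):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SinkAttention(nn.Module):
        # Single-head causal attention with a learned "sink" logit that is
        # prepended before the softmax and dropped afterwards, so the weights
        # over real positions no longer sum to one.
        def __init__(self, d_model):
            super().__init__()
            self.q = nn.Linear(d_model, d_model, bias=False)
            self.k = nn.Linear(d_model, d_model, bias=False)
            self.v = nn.Linear(d_model, d_model, bias=False)
            self.sink = nn.Parameter(torch.zeros(1))   # learned sink logit
            self.scale = d_model ** -0.5

        def forward(self, x):                          # x: (batch, seq, d_model)
            B, T, _ = x.shape
            q, k, v = self.q(x), self.k(x), self.v(x)
            logits = (q @ k.transpose(-2, -1)) * self.scale       # (B, T, T)
            causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
            logits = logits.masked_fill(causal, float("-inf"))
            sink = self.sink.expand(B, T, 1)           # one sink logit per query
            w = F.softmax(torch.cat([sink, logits], dim=-1), dim=-1)
            w = w[..., 1:]                             # drop the sink column: rows now sum to < 1
            return w @ v

The per-position variant I mention would just replace self.sink with a small FFN applied to each position's hidden state to produce the sink logit.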




When you say it didn't help, can you clarify what you're measuring? In the context of this post, I think both the performance on your task and the number of outlier weights (and their magnitude) are important.


I was only trying this in pretraining, so I was looking at pretraining losses. The difference was within the range of usual noise, so I didn't keep trying.


This is fixing a different issue, not the one you are measuring.


Fixing this issue wasn't really the goal of my experiment, for sure; I was trying to see whether you could improve attention by decoupling the key a position uses for itself from the one it uses for future tokens.

Open to being wrong here, but wouldn't it be functionally similar to adding a constant to the softmax denominator? The model could sort of learn for the sink and q to multiply to one at a specific position, and then removing it before multiplying with V would be exactly identical?
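(Sketching my reasoning: appending a sink logit s and dropping it after the softmax gives exp(l_j) / (exp(s) + sum_i exp(l_i)), i.e. just a learned constant exp(s) added to the denominator, which matches the +1 version exactly when s = 0. A rough numpy check of what I mean:)

    import numpy as np

    def sink_weights(logits, sink_logit):
        # softmax over [sink, logits], then drop the sink entry
        e = np.exp(np.concatenate([[sink_logit], logits]))
        w = e / e.sum()
        return w[1:]                           # sums to < 1

    def softmax_plus_one(logits):
        # the "add one to the denominator" variant
        e = np.exp(logits)
        return e / (1.0 + e.sum())

    l = np.array([1.0, -2.0, 0.5])
    print(sink_weights(l, sink_logit=0.0))     # identical to softmax_plus_one(l)
    print(softmax_plus_one(l))
    print(sink_weights(l, sink_logit=0.7))     # same form, with exp(0.7) in the denominator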


The question concerns outliers ... how did the change manage them?


He's advertising it as fixing the spiking outliers. Did your variant have those outliers beforehand?


I guess, yeah, I was mostly responding to this:

> Now it’s possible that softmax should be replaced wholesale, but it’s worked pretty well for the most part, except for this one wee little bug that prevents attention heads from saying nothing. So I propose a very small tweak on which I am willing to stake all future Internet claims to being correct. The tweak is so small, yet so obvious, and it’s been sitting here under everyone’s noses ever since attention was invented (2014).

I didn't test for outliers, but I don't think this will lead to a large improvement in attention overall or fix a lurking bug.


He’s not trying or claiming to improve attention. He’s trying to reduce outliers to improve the ability to quantize the parameters.


He refers throughout the blog post to an "error" in attention. Specifically, he says:

> The problem with using softmax is that it forces each attention head to make an annotation, even if it has no information to add to the output vector. Using softmax to choose among discrete alternatives is great; using it for optional annotation (i.e. as input into addition) is, like, not cool, man.

I'm saying attention already uses the current position to do this, and that if it were a significant error I would expect fixing it to improve the training loss. I sort of interpreted the blog post as being a bit more positive about the idea than just being about improving quantization.


I agree that he used the term "error" somewhat loosely. But he seems mainly to be making the point that softmax introduces a large outlier, which in turn is only an issue now that the community is aggressively trying to quantize models.


> I didn't test for outliers

Then you don't know whether the approach he is advocating actually improves what he is aiming for.



