Yeah we used to use this in our older models years ago... I don't recall the det...

ggerganov · on July 24, 2023

> I don't recall the details exactly, but I don't think it ever did very much.

How would you have known if the trick actually reduces the outliers in the weights? Even if the transformer quality does not improve overall, having fewer outliers as a result is very beneficial for more accurate quantization of the data.

danielmarkbruce · on July 24, 2023

Are you asking "why would you have bothered to look at"?

The "how" is pretty straightforward.

p1esk · on July 25, 2023

He's questioning the statement: "I don't think [the trick] ever did very much", because no one has yet looked at whether the trick helps reduce outliers in very large models. If it does help with this, as the blog author believes, then it is indeed a very useful trick.

danielmarkbruce · on July 25, 2023

Is he? A surface-level reading suggests he's asking "how would you know", and the answer is: by looking at the parameters. People do that.

> because no one has yet looked at whether the trick helps reduce outliers in very large models

Given that a softmax version doing exactly what the blog post says is baked into a Google library (see this thread), and that you can set it as a parameter in a PyTorch model (see this thread), this claim seems off. "Let's try X, oh, X doesn't do much, let's not write a paper about it" is extremely common for many X.
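
For readers who want the softmax variant spelled out, here is a minimal sketch. It assumes the trick under discussion is the softmax with an extra 1 in the denominator, which is the same as appending a zero logit and dropping its output. The function names are made up for illustration and are not the API of either library mentioned above.

```python
import math

def softmax(logits):
    """Standard softmax; outputs always sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_plus_one(logits):
    """Hypothetical name. Computes exp(x_i) / (1 + sum_j exp(x_j)).

    Assumption: this is the variant the thread refers to. The extra 1
    acts like an implicit logit fixed at 0, so the outputs can sum to
    less than 1 when every real logit is very negative.
    """
    m = max(max(logits), 0.0)  # include the implicit 0 logit in the max
    exps = [math.exp(x - m) for x in logits]
    total = math.exp(-m) + sum(exps)  # exp(0 - m) is the "+1" term, rescaled
    return [e / total for e in exps]

if __name__ == "__main__":
    quiet = [-10.0, -10.0, -10.0]
    print(sum(softmax(quiet)))           # standard softmax must spend all its mass
    print(sum(softmax_plus_one(quiet)))  # the variant can attend to almost nothing
```

Under that assumption, the only behavioral difference is that a head can output near-zero total attention instead of being forced to distribute a full unit of weight.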

tudorw · on July 25, 2023

This would seem like a really good argument as to why failures should be written up, otherwise where is the list of what has been tried before?

danielmarkbruce · on July 25, 2023

Yup, it is. But it isn't going to happen.

ggerganov · on July 25, 2023

Yes, I assumed that checking the weights for the presence and number of outliers is not something that is usually done, so effects on them can be overlooked. If my assumption is wrong and researchers do usually look at such metrics, then my question is not very relevant.

Agreed, the "how" is straightforward.
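
As a rough illustration of that "how", here is a minimal sketch of an outlier check on a flat list of weights. The metric (values more than `k` standard deviations from the mean, plus kurtosis), the threshold, and the function name are all assumptions chosen for the example; nobody in the thread specifies what they actually measure.

```python
import math

def outlier_stats(weights, k=6.0):
    """Hypothetical helper: summarize how heavy-tailed a weight list is.

    Assumption: an "outlier" is a value more than k standard deviations
    from the mean. Other definitions (e.g. a fixed absolute threshold)
    would be equally reasonable.
    """
    n = len(weights)
    mean = sum(weights) / n
    std = math.sqrt(sum((w - mean) ** 2 for w in weights) / n)
    if std == 0.0:
        # All weights identical: nothing can be an outlier.
        return {"count": 0, "fraction": 0.0, "max_abs_z": 0.0, "kurtosis": 0.0}
    z = [(w - mean) / std for w in weights]
    count = sum(1 for v in z if abs(v) > k)
    return {
        "count": count,
        "fraction": count / n,
        "max_abs_z": max(abs(v) for v in z),
        # Kurtosis is about 3 for a Gaussian; much larger means heavy tails.
        "kurtosis": sum(v ** 4 for v in z) / n,
    }

if __name__ == "__main__":
    import random
    random.seed(0)
    clean = [random.gauss(0.0, 0.02) for _ in range(10_000)]
    spiky = clean + [1.5, -2.0]  # two artificially large weights
    print(outlier_stats(clean))
    print(outlier_stats(spiky))
```

Running the same summary on checkpoints trained with and without the trick would be one way to compare them, and a large `max_abs_z` is what tends to hurt quantization because it stretches the range the quantizer has to cover.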