
> why did no one come up with this before

So it turns out someone did: specifically, Google. This exact idea has been in flaxformer since at least November 2021.

https://github.com/google/flaxformer/blame/ee62754ebe5a5eeb1...

To save people a click, the docstring reads:

> """Softmax function with an additional virtual logit equal to zero.

  For compatibility with some previously trained models.

  This is equivalent to adding one to the denominator.
  In the context of attention, it allows you to attend to nothing.
This creates exactly the same modified softmax as the essay proposes. I suppose only time will tell why it went unnoticed publicly before now: maybe it doesn't do much, maybe it just fell through the cracks, maybe Google just didn't push it. Who knows.
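For the curious, the trick is tiny in code. Here is a minimal JAX sketch of a softmax with one extra virtual logit pinned at zero (my own illustration, not the flaxformer source; the function name and the max-shift used for numerical stability are just choices for this sketch):

  import jax.numpy as jnp

  def softmax_with_zero_logit(logits, axis=-1):
      # Computes exp(x_i) / (1 + sum_j exp(x_j)): a softmax over the given
      # logits plus one virtual logit fixed at 0. The outputs can sum to
      # less than 1, so an attention head can effectively attend to nothing.
      #
      # Shift by max(logits, 0) for numerical stability; the virtual zero
      # logit is shifted too, which is where the exp(-m) term comes from.
      m = jnp.maximum(jnp.max(logits, axis=axis, keepdims=True), 0.0)
      unnormalized = jnp.exp(logits - m)
      return unnormalized / (jnp.exp(-m) + jnp.sum(unnormalized, axis=axis, keepdims=True))

  # With very negative logits the row mass shrinks toward 0 instead of
  # being renormalized to sum to 1, unlike a standard softmax:
  print(softmax_with_zero_logit(jnp.array([-8.0, -9.0, -10.0])))

Comparing the output against jax.nn.softmax on the same input makes the difference obvious: the standard version scales those tiny weights back up so the row sums to 1, while this one lets nearly all of the attention mass vanish.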


> I suppose only time will tell why it went unnoticed publicly before now: maybe it doesn't do much, maybe it just fell through the cracks, maybe Google just didn't push it. Who knows.

Maybe quantization wasn't as hot back then as it is now?


Yeah, the benefit isn't going to come from better performance for a given model, but from the model being easier to quantize efficiently.


Or maybe it doesn’t really do anything to improve performance.

