
> why did no one come up with this before

So it turns out someone did: specifically, Google. This exact idea has been in flaxformer since at least November 2021.

https://github.com/google/flaxformer/blame/ee62754ebe5a5eeb1...

To save people a click, the docstring reads:

> """Softmax function with an additional virtual logit equal to zero.

  For compatibility with some previously trained models.

  This is equivalent to adding one to the denominator.
  In the context of attention, it allows you to attend to nothing.
This creates exactly the same modified softmax as the essay proposes. I suppose only time will tell why it went unnoticed publicly before now: maybe it doesn't do much, maybe it just fell through the cracks, maybe Google just didn't push it. Who knows.
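For the curious, the trick is tiny in code. Here is a minimal JAX sketch of a softmax with one extra virtual logit pinned at zero (my own illustration, not the flaxformer source; the function name and the max-shift used for numerical stability are just choices for this sketch):

  import jax.numpy as jnp

  def softmax_with_zero_logit(logits, axis=-1):
      # Computes exp(x_i) / (1 + sum_j exp(x_j)): a softmax over the given
      # logits plus one virtual logit fixed at 0. The outputs can sum to
      # less than 1, so an attention head can effectively attend to nothing.
      #
      # Shift by max(logits, 0) for numerical stability; the virtual zero
      # logit is shifted too, which is where the exp(-m) term comes from.
      m = jnp.maximum(jnp.max(logits, axis=axis, keepdims=True), 0.0)
      unnormalized = jnp.exp(logits - m)
      return unnormalized / (jnp.exp(-m) + jnp.sum(unnormalized, axis=axis, keepdims=True))

  # With very negative logits the row mass shrinks toward 0 instead of
  # being renormalized to sum to 1, unlike a standard softmax:
  print(softmax_with_zero_logit(jnp.array([-8.0, -9.0, -10.0])))

Comparing the output against jax.nn.softmax on the same input makes the difference obvious: the standard version scales those tiny weights back up so the row sums to 1, while this one lets nearly all of the attention mass vanish.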


> I suppose only time will tell why it went unnoticed publicly before now: maybe it doesn't do much, maybe it just fell through the cracks, maybe Google just didn't push it. Who knows.

Maybe quantization wasn't as hot back then as it is now?


Yeah, the benefit isn't going to come from better performance for a given model, but from the model being easier to quantize efficiently.


Or maybe it doesn’t really do anything to improve performance.

