> """Softmax function with an additional virtual logit equal to zero.
For compatibility with some previously trained models.
This is equivalent to adding one to the denominator.
In the context of attention, it allows you to attend to nothing.
And creates the exact same modified softmax as this essay. I suppose only time will tell why it was ignored publicly before, maybe it doesn't do much, maybe it just fell through the cracks, maybe google just didnt push it, who knows
> I suppose only time will tell why it was ignored publicly before, maybe it doesn't do much, maybe it just fell through the cracks, maybe google just didnt push it, who knows
Maybe quantization wasn't as hot back then than it is now?
So it turns out someone did. Specifically google did. This exact same idea has been in flaxformers since at least November 2021.
https://github.com/google/flaxformer/blame/ee62754ebe5a5eeb1...
Specifically to save people a click it says:
> """Softmax function with an additional virtual logit equal to zero.
And creates the exact same modified softmax as this essay. I suppose only time will tell why it was ignored publicly before, maybe it doesn't do much, maybe it just fell through the cracks, maybe google just didnt push it, who knows
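To make the quoted docstring concrete, here is a minimal sketch (my own names, not flaxformer's actual code) of a softmax with one extra virtual logit pinned at zero. Since exp(0) = 1, that extra logit adds exactly 1 to the denominator, so the weights can sum to less than one and an attention head can effectively attend to nothing.

```python
# Minimal sketch of "softmax with a virtual zero logit" (softmax + 1).
# Not flaxformer's implementation; function and variable names are made up.
import jax.numpy as jnp


def softmax_plus_one(logits, axis=-1):
    """Softmax over `logits` plus one virtual logit fixed at zero.

    Equivalent to adding 1 to the softmax denominator, so the outputs
    can sum to less than 1.
    """
    # Stabilize against overflow; the max must also cover the virtual
    # zero logit, hence the clamp at 0.
    m = jnp.maximum(jnp.max(logits, axis=axis, keepdims=True), 0.0)
    unnormalized = jnp.exp(logits - m)
    # exp(0 - m) is the virtual logit's contribution to the denominator.
    denom = jnp.sum(unnormalized, axis=axis, keepdims=True) + jnp.exp(-m)
    return unnormalized / denom


# Sanity check: identical to the unstabilized form exp(x) / (sum(exp(x)) + 1).
x = jnp.array([1.0, 2.0, 3.0])
via_plus_one_denominator = jnp.exp(x) / (jnp.sum(jnp.exp(x)) + 1.0)
assert jnp.allclose(softmax_plus_one(x), via_plus_one_denominator)
```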