I agree with your conclusions, but not necessarily with the reasons you present. I don't think it's _that_ easy for a current transformer to pass information through unaltered (i.e. to effectively make the softmax output 0).

In particular, I think the feedforward point you list in your "Second" is actually wrong. Making the softmax output 0, as the OP wants to be possible, is tantamount to passing the information through unchanged, because the attention block sits inside a residual (skip) connection: if the attention output is zero, the block's output is identical to the previous layer's output. There is no way to recover this effect with the feedforward layer.
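A minimal sketch of that residual path (plain NumPy, with a hypothetical attention_block callable standing in for the real sub-layer, norms and dropout omitted):

    import numpy as np

    def attention_sublayer(x, attention_block):
        # The key point is the skip connection around the attention block.
        return x + attention_block(x)

    x = np.random.randn(4, 8)  # 4 tokens, hidden size 8

    # If the attention output is all zeros, the sub-layer is exactly the identity:
    zero_attention = lambda t: np.zeros_like(t)
    assert np.allclose(attention_sublayer(x, zero_attention), x)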

The point that V can be set to zero is true, but it's a somewhat different idea: the attention weights coming out of Q·K^T should be able to go to 0 if no token wants to be "close" to any other token, in some sense. But the V layer shouldn't "know" about this, because it can't look at other tokens. This is of course only how we think about transformers, which might or might not (more likely, the latter) be how they actually work. But nevertheless, having a 0 come out of the Q·K^T part alone would be very meaningful.
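To make that distinction concrete, here is standard single-head scaled dot-product attention written out as a sketch (no mask; Wq, Wk, Wv are just placeholder weight matrices): the mixing weights come only from Q·K^T, while V is a per-token projection that never sees the other tokens.

    import numpy as np

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(X, Wq, Wk, Wv):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv          # V[i] depends only on token i
        scores = Q @ K.T / np.sqrt(Q.shape[-1])   # all cross-token information lives here
        weights = softmax(scores, axis=-1)        # each row sums to 1: something must be attended to
        return weights @ V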

Your "first" point is technically true (albeit logically false): if you have a sequence of length 32k, like GPT4-32k, and your softmax logits all predict the same value, the result will be an average of the V layer, divided by 32k, which is effectively close to zero. However, calibrating "exactly the same value" is extremely hard for a neural network, and there is no "default value" it can predict to make sure that's the case - even if you push all the values to one side, the result doesn't change, because softmax is translation invariant. Plus, if you have a short sentence, that's not true anymore. If you only have two tokens, one of them must be activated, or both with only a 0.5 factor. Surely if you have very few tokens there's much more contamination between Q, K, and V, so in that case V can indeed take a 0 value, but it's non-trivial and requires more layers.

All in all, adding that "+1" isn't quite meaningless, I think. Nevertheless, I believe it won't change much: these very big models have ways to get around any clever small modification you make. If the intuition is right, you might be able to squeeze out 1% more accuracy in a handful of tests, after carefully optimizing all the other parameters, which would be enough to get you a paper at a top conference. And it might even become the standard from then on (because, in this case, it costs basically no extra computation, so it's "free"). But I would bet it won't be a major revolution.
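For reference, the proposed change is just an extra 1 in the denominator, exp(x_i) / (1 + sum_j exp(x_j)), so all the weights can shrink toward 0 when every logit is very negative. A rough sketch (the max-shift detail is my own guess at how you'd implement it stably):

    import numpy as np

    def softmax_plus_one(z):
        # exp(z_i) / (1 + sum_j exp(z_j)): the "+1" acts like an always-present
        # extra logit of 0, so the weights no longer need to sum to 1.
        m = z.max()
        e = np.exp(z - m)
        return e / (np.exp(-m) + e.sum())   # the 1 is shifted by the same max for stability

    logits = np.full(8, -10.0)              # every token says "nothing here is relevant"
    print(softmax_plus_one(logits).sum())   # ~4e-4: the head can effectively opt out
    # An ordinary softmax of the same logits would still return weights summing to 1.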

That said, as you say, the only way to know would be to train a few models with this option and check their actual quality (certainly not GPT-style, GPT4-sized models to begin with, but something quicker to train and easier to test in a fully automated way; old "boring" models like those in the BERT family would be a good starting point). But to do that effectively, you'd need somebody skilled in training this kind of model, with cleaned data ready at hand, etc. (and a small compute budget, of course, but nothing revolutionary - a few thousand dollars in GPU credits could be enough).
