This technique has been known for years and is already in PyTorch. It's not widely used because people have tried it and, in practice, it doesn't work any better.
OP calling it a "bug that's been overlooked for 8+ years" is click bait.
The add_zero_attn parameter in PyTorch exists for this, but it's off by default, so you get the regular softmax unless you opt in. It has also been in flaxformer for a couple of years, though there it's described as a compatibility variant for older models [2], and I haven't seen it mentioned in their recent papers (though I haven't checked exhaustively).
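For anyone who wants to see what this looks like in practice, here's a minimal sketch (my own, not from the article or the PyTorch docs) contrasting the built-in add_zero_attn flag with a hand-rolled "softmax with an extra zero logit" in the spirit of what the post calls softmax1. The softmax_plus_one helper is illustrative, not a library function:

```python
import torch
import torch.nn.functional as F

# 1. Built-in: add_zero_attn appends an all-zero key/value pair, so one
#    attention slot effectively "attends to nothing" (its logit is 0).
mha = torch.nn.MultiheadAttention(embed_dim=64, num_heads=4,
                                  add_zero_attn=True, batch_first=True)
x = torch.randn(2, 10, 64)            # (batch, seq, embed)
out, weights = mha(x, x, x)

# 2. Hand-rolled equivalent of the denominator trick:
#    exp(s_i) / (1 + sum_j exp(s_j)), computed by appending a zero logit
#    before the (numerically stable) softmax and dropping it afterwards.
def softmax_plus_one(scores):
    zeros = torch.zeros_like(scores[..., :1])
    padded = torch.cat([scores, zeros], dim=-1)
    return F.softmax(padded, dim=-1)[..., :-1]

scores = torch.randn(2, 4, 10, 10)    # (batch, heads, query, key)
attn = softmax_plus_one(scores)
print(attn.sum(dim=-1))               # rows sum to <= 1, not exactly 1
```

The point is just that the rows of attention weights are allowed to sum to less than 1, so a head can decline to attend to anything.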