This technique has been known for years and is already in PyTorch. It's not widely used because people have tried it and, in practice, it doesn't work any better.
OP calling it a "bug that's been overlooked for 8+ years" is click bait.
The add_zero_attn parameter in PyTorch exists for this, but it's off by default, so you get the regular softmax unless you opt in. It has also been in flaxformer for a couple of years, though there it's described as a compatibility variant for older models [2], and I haven't seen it mentioned in their recent papers (though I haven't checked exhaustively).
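For anyone who wants to see what this looks like in practice, here's a minimal sketch (my own, not from the article or the PyTorch docs) contrasting the built-in add_zero_attn flag with a hand-rolled "softmax with an extra zero logit" in the spirit of what the post calls softmax1. The softmax_plus_one helper is illustrative, not a library function:

```python
import torch
import torch.nn.functional as F

# 1. Built-in: add_zero_attn appends an all-zero key/value pair, so one
#    attention slot effectively "attends to nothing" (its logit is 0).
mha = torch.nn.MultiheadAttention(embed_dim=64, num_heads=4,
                                  add_zero_attn=True, batch_first=True)
x = torch.randn(2, 10, 64)            # (batch, seq, embed)
out, weights = mha(x, x, x)

# 2. Hand-rolled equivalent of the denominator trick:
#    exp(s_i) / (1 + sum_j exp(s_j)), computed by appending a zero logit
#    before the (numerically stable) softmax and dropping it afterwards.
def softmax_plus_one(scores):
    zeros = torch.zeros_like(scores[..., :1])
    padded = torch.cat([scores, zeros], dim=-1)
    return F.softmax(padded, dim=-1)[..., :-1]

scores = torch.randn(2, 4, 10, 10)    # (batch, heads, query, key)
attn = softmax_plus_one(scores)
print(attn.sum(dim=-1))               # rows sum to <= 1, not exactly 1
```

The point is just that the rows of attention weights are allowed to sum to less than 1, so a head can decline to attend to anything.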