Hacker News

> why did no one come up with this before? Because the author is intimately familiar with the softmax function from work outside of ML, and plausibly nobody who’s investigating these issues is remotely as familiar

I doubt that is true. Softmax is extremely well understood within the ML community: it's a very common trick, and its properties are well known. It feels very unlikely that nobody has thought of this before. That said, it's also plausible that the current softmax convention was chosen by accident and the author is right to identify this drawback.
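For reference, a minimal sketch of the standard softmax convention and the well-known properties the comment alludes to, such as shift invariance and the max-subtraction stability trick (function name and example inputs are illustrative, not from the original discussion):

```python
import numpy as np

def softmax(x):
    """Standard softmax: exponentiate, then normalize so outputs sum to 1.

    Subtracting the max before exponentiating is the common numerical-
    stability trick; it changes nothing mathematically because softmax
    is shift-invariant: softmax(x) == softmax(x - c) for any constant c.
    """
    z = x - np.max(x)   # shift for stability; output is unchanged
    e = np.exp(z)
    return e / e.sum()

probs = softmax(np.array([1.0, 2.0, 3.0]))
```

Because of the normalization, the outputs always sum to exactly 1, even when every logit is small, which is the behavior the linked article treats as a drawback of the convention.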



