The idea is that if your range of values is small enough, you need fewer bits to distinguish between meaningfully different values. The problem is that exp(x) << exp(y) for a sufficiently wide range [x, y], so after normalizing in the softmax and then quantizing, the small entries don't get the fidelity you need and too much information is lost between layers. The proposed fix is to modify the softmax step slightly so that x and y end up close enough to zero that exp(x) and exp(y) become comparable, which makes more compact quantizations useful instead of useless.
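A minimal numpy sketch of the effect (the logit rescaling at the end is only a hypothetical stand-in for whatever modification is actually being proposed, not the authors' method):

    import numpy as np

    def softmax(z):
        z = z - z.max()               # standard max-subtraction for stability
        e = np.exp(z)
        return e / e.sum()

    def quantize(p, bits=4):
        # naive uniform quantization of probabilities onto 2**bits - 1 levels
        levels = 2**bits - 1
        return np.round(p * levels) / levels

    # Wide logit range: exp(x) << exp(y), so after normalization most entries
    # underflow the 4-bit grid and quantize to exactly zero.
    wide = np.array([-12.0, -6.0, 0.0, 9.0])
    print(quantize(softmax(wide)))    # -> [0. 0. 0. 1.]  (all structure lost)

    # Narrower logit range (here just rescaled toward zero): the same 4-bit
    # grid now keeps three of the four entries at distinct nonzero levels.
    narrow = wide / 6.0
    print(quantize(softmax(narrow)))  # -> [0. 0.067 0.2 0.733]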


