I thinks that's far from the only problem.
To me the most obvious problem is that we use right-to-left numbers (think about the order you're writing digits when doing long addition) in a left-to-right language.
Without a special number-flipping step; the transformer is forced to produce the output token-by-token, i.e. from left-to-right. Without the ability to store additional internal state, this turns addition into an O(N²) problem purely due to the suboptimal output ordering!
The paper discusses this, and the approach taken in the paper implements a number-flip stage, so numbers are formatted with their least significant figure first.