
> You turn an O(n^2) operation into O(n log n). Sounds great until you realize that n is three on average.

Sure, but aren't long convolutions avoided precisely because they're expensive? This paper is proposing an alternative to the attention mechanism, which covers the entire context window, no? So isn't the point: you could use a long convolution for that instead, and thanks to the FFT, a long convolution doesn't have to be slow? (A minimal sketch of the FFT route is below.)
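
To make the O(n log n) claim concrete, here's a toy numpy sketch (mine, not from the paper or the parent comment; the sizes and function names are made up for illustration) of convolution via the FFT: zero-pad to the full output length, transform, multiply pointwise, transform back.

    import numpy as np

    def conv_direct(a, b):
        # O(n^2): each output sample sums over every overlap of a and b
        return np.convolve(a, b)

    def conv_fft(a, b):
        # O(n log n): zero-pad both inputs to the full output length,
        # multiply the spectra pointwise, and transform back
        n = len(a) + len(b) - 1
        return np.fft.irfft(np.fft.rfft(a, n) * np.fft.rfft(b, n), n)

    a = np.random.randn(4096)
    b = np.random.randn(4096)
    assert np.allclose(conv_direct(a, b), conv_fft(a, b))

Note that the intermediate spectra are complex even though the inputs and outputs are all real, which is presumably where the complex-number objection in the next quote comes from.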

> you have to use complex numbers for calculations, which are also less numerically stable

I haven't heard of numerical stability being a big deal in neural nets; in fact, don't people routinely use 16-bit floats for weights to save space? Does the error introduced by complex arithmetic even exceed the precision already given up to quantization? And are complex numbers really inherently less numerically stable, or are we just not very good at using them yet? (A rough way to check is sketched below.)
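
That seems like a measurable question. Here's a rough sketch (again mine, and only a crude proxy for real fp16 training): compare the relative error of an FFT convolution done entirely in float32/complex64 against the error from merely rounding the inputs to float16 before a direct convolution. I'm using scipy.fft because numpy's FFT always computes in double precision, while scipy keeps float32 inputs in single precision.

    import numpy as np
    from scipy.fft import rfft, irfft  # scipy.fft preserves single precision

    rng = np.random.default_rng(0)
    a = rng.standard_normal(4096)
    b = rng.standard_normal(4096)
    n = len(a) + len(b) - 1
    m = 1 << (n - 1).bit_length()  # round up to a power of two for the FFT

    ref = np.convolve(a, b)  # float64 direct convolution as the reference

    # FFT convolution carried out entirely in float32 / complex64
    a32, b32 = a.astype(np.float32), b.astype(np.float32)
    via_fft32 = irfft(rfft(a32, m) * rfft(b32, m), m)[:n]

    # Direct convolution, but with the inputs first rounded to float16
    a16 = a.astype(np.float16).astype(np.float64)
    b16 = b.astype(np.float16).astype(np.float64)
    via_quant16 = np.convolve(a16, b16)

    rel_err = lambda x: np.linalg.norm(x - ref) / np.linalg.norm(ref)
    # Expect the complex64 FFT error to sit near float32 eps (~1e-7),
    # and the float16 rounding error to be several orders of magnitude
    # larger (~1e-4), since float16 eps is about 1e-3.
    print(f"complex64 FFT error:    {rel_err(via_fft32):.1e}")
    print(f"float16 rounding error: {rel_err(via_quant16):.1e}")

If the float16 rounding error dwarfs the complex-arithmetic error, as I'd expect here, then the stability objection seems weak, at least at these sizes.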
