"Overall, while approaches such as FNet, Performer, and sparse transformers demonstrate that either fixed or approximate token mixing can reduce computational overhead, our adaptive spectral filtering strategy uniquely merges the efficiency of the FFT with a learnable, input-dependent spectral filter. This provides a compelling combination of scalability and adaptability, which is crucial for complex
sequence modeling tasks."
Except that the paper is written as if they discovered that you can use an FFT for attention. They even have a "proof". It's in the title. Then you discover everyone already knew this and all they do is add some extra learnable parameters.
Search engines don't always turn up prior art the way you'd like. Simple jargon discrepancies can cause a lot of mischief. Though I'm sure a case could be made for it being confirmation bias. It's hard to get people to search in earnest for bad news. If it's not in your face, they treat absence of evidence as evidence of absence.
"Overall, while approaches such as FNet, Performer, and sparse transformers demonstrate that either fixed or approximate token mixing can reduce computational overhead, our adaptive spectral filtering strategy uniquely merges the efficiency of the FFT with a learnable, input-dependent spectral filter. This provides a compelling combination of scalability and adaptability, which is crucial for complex sequence modeling tasks."
And a comparison section after that.