Working in the Fourier domain has long been a staple of scientific and engineering applications. Learning the interactions in that domain, rather than hardcoding them, has been fairly widely explored as well - the term to look for is Fourier Neural Operators [1][2]. It turns out you can prove universality even when the nonlinearity stays in the real domain [3].
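To make the idea concrete, here is a minimal sketch of an FNO-style spectral layer in PyTorch: FFT the input, mix channels with learned complex weights on the lowest few Fourier modes, transform back, and apply the nonlinearity in real space. The names, channel counts and number of retained modes are all illustrative choices of mine, not taken from any particular implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpectralConv1d(nn.Module):
        def __init__(self, in_channels, out_channels, modes):
            super().__init__()
            self.modes = modes  # number of low-frequency Fourier modes to keep
            scale = 1.0 / (in_channels * out_channels)
            # learned complex weights: one channel-mixing matrix per retained mode
            self.weight = nn.Parameter(
                scale * torch.randn(in_channels, out_channels, modes, dtype=torch.cfloat)
            )

        def forward(self, x):                       # x: (batch, in_channels, n_grid)
            n = x.size(-1)
            x_ft = torch.fft.rfft(x, dim=-1)        # to Fourier space
            out_ft = torch.zeros(
                x.size(0), self.weight.size(1), x_ft.size(-1),
                dtype=torch.cfloat, device=x.device
            )
            # linear channel mixing, applied only to the lowest `modes` frequencies
            out_ft[..., :self.modes] = torch.einsum(
                "bim,iom->bom", x_ft[..., :self.modes], self.weight
            )
            return torch.fft.irfft(out_ft, n=n, dim=-1)   # back to real space

    class FNOBlock(nn.Module):
        """Spectral path plus pointwise linear path; nonlinearity stays in real space."""
        def __init__(self, channels, modes):
            super().__init__()
            self.spectral = SpectralConv1d(channels, channels, modes)
            self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)

        def forward(self, x):
            return F.gelu(self.spectral(x) + self.pointwise(x))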
The concept is fairly mainstream nowadays, to the degree that Jensen talked about it in his GTC keynote in 2021 [4] and there’s even a mainstage TED talk about its applications [5].
A nice property of doing things this way is that your model ends up being resolution-invariant, which is particularly interesting for engineering domains. Scaling these methods up has sparked the "let's build a fully deep-learning-based weather model" race [6][7].
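To see where the resolution invariance comes from: the learned weights are indexed by Fourier mode, not by grid point, so the same weight tensor can be applied to inputs sampled at different resolutions. A rough, self-contained sketch (shapes and mode count are arbitrary):

    import torch

    modes, channels = 16, 1
    weight = torch.randn(channels, channels, modes, dtype=torch.cfloat)

    def spectral_apply(x, weight):                  # x: (batch, channels, n_grid)
        n = x.size(-1)
        x_ft = torch.fft.rfft(x, dim=-1)
        out_ft = torch.zeros_like(x_ft)
        out_ft[..., :modes] = torch.einsum("bim,iom->bom", x_ft[..., :modes], weight)
        return torch.fft.irfft(out_ft, n=n, dim=-1)

    coarse = spectral_apply(torch.randn(1, channels, 64), weight)    # 64-point grid
    fine = spectral_apply(torch.randn(1, channels, 256), weight)     # 256-point grid
    print(coarse.shape, fine.shape)   # same weights, two different resolutions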
As for using this on text data: my intuition is that it is not going to work as well, because of a fairly unique property of text. In images, video and scientific data each individual element is of approximately equal importance, whereas in text a single discrete token, like a "not" somewhere in the middle, can significantly change the meaning of everything around it, and you want full all-to-all interaction to capture that. Any kind of mixing that smooths things out is inherently at a disadvantage - probably true to some degree for most of these efficiency-saving methods, and likely why we're seeing more limited adoption on text.
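A toy way to see the smoothing problem (purely illustrative, not a claim about any specific model): represent the lone "not" token as a one-hot spike and keep only a handful of low Fourier modes - the spike gets diluted and smeared across the whole sequence.

    import torch

    n, modes = 128, 8
    spike = torch.zeros(n)
    spike[40] = 1.0                         # the lone token that flips the meaning

    spike_ft = torch.fft.rfft(spike)
    spike_ft[modes:] = 0                    # keep only the lowest `modes` frequencies
    smoothed = torch.fft.irfft(spike_ft, n=n)

    print(smoothed.abs().max().item())            # peak is far below 1: the spike is diluted
    print((smoothed.abs() > 0.01).sum().item())   # ...and spread over many positions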
[1] https://arxiv.org/abs/2010.08895
[2] https://www.nature.com/articles/s42254-024-00712-5
[3] https://jmlr.org/papers/v22/21-0806.html
[4] https://www.youtube.com/watch?v=jhDiaUL_RaM&t=2472s
[5] https://www.ted.com/talks/anima_anandkumar_ai_that_connects_...
[6] https://arxiv.org/abs/2202.11214 (Feb 2022)
[7] https://www.wired.com/story/ai-hurricane-predictions-are-sto...