The benefit over regular transformers is that it is more efficient (it performs fewer operations), since self-attention in the original transformer has quadratic complexity in the number of input tokens.
https://twitter.com/tanmingxing/status/1359301186734620675
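
To make the quadratic cost concrete, here is a minimal sketch that counts the work done by full self-attention for one head in one layer. The function names and the head dimension of 64 are assumptions for illustration only. It counts just the multiply-adds in the two attention matrix products and ignores the softmax and the linear projections. It does not model the more efficient architecture, only the baseline it is compared against.

```python
def attention_score_count(num_tokens: int) -> int:
    """Number of query-key pairs scored by full self-attention (one head, one layer)."""
    # Every token attends to every token, so the score matrix is n x n.
    return num_tokens * num_tokens


def attention_multiply_adds(num_tokens: int, head_dim: int = 64) -> int:
    """Rough multiply-add count for Q @ K^T plus the weighted sum over V."""
    # Q @ K^T costs n * n * d multiply-adds; (scores @ V) costs another n * n * d.
    return 2 * num_tokens * num_tokens * head_dim


if __name__ == "__main__":
    for n in (512, 1024, 2048):
        print(n, attention_score_count(n), attention_multiply_adds(n))
```

Doubling the number of tokens quadruples both counts, which is why long inputs become expensive for the original transformer.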