Among other reasons, because the decoder-only version of the original transformer architecture has proven weirdly resistant to these kinds of hacks and clever optimizations.
Ideas like sparse attention, tree attention, residual attention, and so on all sound good on paper, but when researchers try to reproduce them they either see no improvement or improvements that don't scale. Even ALiBi is turning out to be less powerful than scaled-down positional embeddings. It's almost a bitter lesson of its own: you can't beat the original transformer.
Optimizations that do stick around tend to be the ones that preserve the original algorithm exactly and just improve caching or memory access patterns.
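To make that concrete, here is a minimal sketch of one such optimization, the KV cache, in plain NumPy (single head, toy random weights, every name here is illustrative rather than taken from any real codebase). The attention math is unchanged; the cache only removes the redundant recomputation of keys and values for the prefix at each decode step.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, K, V):
    # Standard scaled dot-product attention for a single query vector.
    scores = q @ K.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ V

# Toy single-head setup: fixed projection weights, a short "token" sequence.
d = 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
tokens = rng.standard_normal((5, d))

# Without a cache: at every decode step, recompute K and V for the whole prefix.
out_no_cache = []
for t in range(1, len(tokens) + 1):
    prefix = tokens[:t]
    q = prefix[-1] @ Wq
    K = prefix @ Wk          # recomputed from scratch each step
    V = prefix @ Wv
    out_no_cache.append(attend(q, K, V))

# With a KV cache: append one new row per step instead of recomputing the prefix.
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
out_cache = []
for t in range(len(tokens)):
    x = tokens[t]
    K_cache = np.vstack([K_cache, x @ Wk])   # incremental update only
    V_cache = np.vstack([V_cache, x @ Wv])
    q = x @ Wq
    out_cache.append(attend(q, K_cache, V_cache))

# Same algorithm, same outputs; only the memory/compute pattern changed.
assert np.allclose(out_no_cache, out_cache)
```

The assertion at the end is the whole point: the cached and uncached paths produce the same attention outputs, which is why this kind of optimization survives while approximations of the attention computation itself tend not to.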