Right, but transformers are artificial neural networks, so it's not clear to me what you mean.

The leap from RNNs to the Transformer architecture around 2017 is similar to vision networks' leap from fully-connected layers to convolutional layers around 1989. In both cases, the key was changing the architecture to exploit some implicit structure in the data that earlier models didn't use effectively.

For images, each pixel is strongly related to its close neighbors, so convolution is a natural way to simplify models while capturing that locality.
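
To make the locality point concrete, here's a toy sketch of a single 2D convolution in plain numpy (nothing like how real frameworks implement it, and the function name and filter are just made up for illustration): each output pixel only ever sees a small k x k neighborhood of the input, and the same small filter is reused at every position.

    import numpy as np

    def conv2d(image, kernel):
        # Each output pixel is a weighted sum over a small
        # neighborhood of the input -- locality, baked in.
        kh, kw = kernel.shape
        H, W = image.shape
        out = np.zeros((H - kh + 1, W - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                patch = image[i:i + kh, j:j + kw]
                out[i, j] = np.sum(patch * kernel)
        return out

    img = np.random.rand(8, 8)
    edge = np.array([[1., 0., -1.]] * 3)  # arbitrary 3x3 filter
    print(conv2d(img, edge).shape)        # (6, 6)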

For language, the opposite is true: almost any part of the text can implicitly reference anything said previously, so you need representations that are somewhat position-independent. Architectures that model that structure directly work better than an RNN munching tokens one by one.
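
Same idea in code: a bare-bones single-head attention step in numpy (a sketch only; it leaves out the learned Q/K/V projections, multi-head split, masking, and positional encodings of an actual Transformer). The point is that every position gets a weighted view of every other position in one shot, with no recurrence, and the weights come from content similarity rather than distance.

    import numpy as np

    def attention(Q, K, V):
        # Every query attends to every key at once --
        # no recurrence, no built-in locality assumption.
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)  # (T, T) pairwise similarities
        scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
        return weights @ V  # mix values by similarity

    T, d = 5, 16  # 5 tokens, 16-dim embeddings
    x = np.random.rand(T, d)
    print(attention(x, x, x).shape)  # (5, 16): each token sees all tokens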

Each one was certainly a leap forward that unlocked its respective field, but at the end of the day, it's "just" an architecture tweak, you know?



