They cite the paper where the architecture was introduced. If you go to that paper, you'll see that it mostly consists of a very detailed and careful comparison with Transformer-XL.

In the new paper, they plug their memory system into vanilla BERT, which is a bidirectional, encoder-only model. That makes the resulting model essentially nothing like Transformer-XL, which was a strictly decoder-only generative language model.