For a while now, the usual answer I've seen is to start with "Attention Is All You Need", the original Transformers paper. It's still pretty good, but over the past year I've led a few working sessions on the computational fundamentals of transformers, and those sessions turned up some later resources that simplify and clarify what's going on.
Part of the problem with self-studying this stuff is that it's hard to know which resources are good without already being at least conversant with the material.
You can quickly get overwhelmed by the million good resources out there so I'll keep it to these three. If you have a strong CS background, they'll take you a long way:
(1) Transformers from Scratch: https://peterbloem.nl/blog/transformers
(2) Attention Is All You Need: https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547de...
(3) Formal Algorithms for Transformers: https://arxiv.org/abs/2207.09238
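If you want a concrete anchor before diving in, here's a minimal sketch of the scaled dot-product attention at the heart of all three resources, softmax(Q K^T / sqrt(d_k)) V. Plain NumPy, and the names, shapes, and toy data are my own illustration rather than anything taken from the papers:

    import numpy as np

    def softmax(x, axis=-1):
        # Numerically stable softmax over the last axis.
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        # Q, K: (seq_len, d_k); V: (seq_len, d_v)
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)     # (seq_len, seq_len) similarity scores
        weights = softmax(scores, axis=-1)  # each row sums to 1
        return weights @ V                  # weighted mix of value vectors

    # Toy usage: a sequence of 4 tokens with 8-dimensional embeddings.
    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 8))
    # In a real transformer Q, K, V come from learned linear projections of x;
    # reusing x directly keeps the sketch short.
    print(attention(x, x, x).shape)  # (4, 8)

Once that one computation clicks, the rest of the architecture in the resources above is mostly bookkeeping around it: learned projections, multiple heads, residual connections, and feed-forward layers.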