The Transformer Family (lilianweng.github.io)
254 points by alexmolas on Jan 29, 2023 | 46 comments



Somewhat off topic. As someone who did some neural network programming in Matlab a couple decades ago, I always feel a bit dismayed that I'm able to understand so little about modern AI, given the explosion of advances in the field starting in roughly the late 00s: things like convolutional neural networks and deep learning, transformers, large language models, etc.

Can anyone recommend some great courses or other online resources for getting up to speed on the state-of-the-art with respect to AI? Not really so much looking for an "ELI5" but more of a "you have a strong programming and very-old-school AI background, here are the steps/processes you need to know to understand modern tools".

Edit: thanks for all the great replies, super helpful!


A course by Andrej Karpathy on building neural networks, from scratch, in code. We start with the basics of backpropagation and build up to modern deep neural networks, like GPT.

https://karpathy.ai/zero-to-hero.html


For a while now, an answer I've seen is to start with "Attention Is All You Need", the original Transformers paper. It's still pretty good, but over the past year I've led a few working sessions on grokking transformer computational fundamentals and they've turned up some helpful later additions that simplify and clarify what's going on.

You can quickly get overwhelmed by the million good resources out there so I'll keep it to these three. If you have a strong CS background, they'll take you a long way:

(1) Transformers from Scratch: https://peterbloem.nl/blog/transformers

(2) Attention Is All You Need: https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547de...

(3) Formal Algorithms for Transformers: https://arxiv.org/abs/2207.09238


YouTube channel that explains the paper in detail: https://youtu.be/iDulhoQ2pro

And subsequent follow-ups (ROME, editing transformer arch): https://youtu.be/_NMQyOu2HTo

I find the channel amazing at explaining super complex topics in simple enough terms for people who have some background in AI.


Yannic Kilcher is great but this video worked better for me:

"LSTM is dead. Long live transformers!" (Leo Dirac): https://www.youtube.com/watch?v=S27pHKBEp30


I second the recommendation for Peter Bloem’s tutorial.

I’m also about to read the transformer chapter from this excellent upcoming book by Simon Prince:

udlbook https://udlbook.github.io/udlbook/


Part of the problem with self-studying this stuff is that it's hard to know which resources are good without already being at least conversant with the material.


That problem doesn't really disappear with teachers and classes ;)


The obvious answer is Fast AI's Practical Deep Learning for Coders - https://course.fast.ai/

They'll also be releasing the "From Deep Learning Foundations to Stable Diffusion" course soon, which is basically Part 2 of the course.


I think the concepts are simple. After all, everything is just a multi-variable derivative. However, I find the choice of notation very confusing, mostly because it's impossible to remember the shape of everything.

Even in this linked post, they have a "notations" section at the top, and almost immediately they start using a value k that isn't defined anywhere.
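
(For what it's worth, the thing that finally made the shapes stick for me was writing them out as comments next to a bare-bones numpy version of attention. A rough sketch; the dimension names below are my own, not the post's:)

    import numpy as np

    n, d_k, d_v = 6, 8, 8            # sequence length, query/key dim, value dim
    Q = np.random.randn(n, d_k)      # queries: one row per position
    K = np.random.randn(n, d_k)      # keys:    one row per position
    V = np.random.randn(n, d_v)      # values:  one row per position

    scores = Q @ K.T / np.sqrt(d_k)                   # (n, n): query i scored against key j
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over keys; each row sums to 1
    out = weights @ V                                  # (n, d_v): weighted sum of value rows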


Not that I think statistics terminology or notation is worthy of praise (it’s mostly horrible), but it frustrates me to no end how the ML world reappropriated terms seemingly just to be different.


I think this book is an excellent explanation of some architectures, how they work, and how you can quickly build them with Keras:

https://www.manning.com/books/deep-learning-with-python-seco...

Written by François Chollet, the creator of Keras


I put together a repository at the end of last year to walk through a basic use of a single layer Transformer: detect whether "a" and "b" are in a sequence of characters. Everything is reproducible, so hopefully helpful at getting used to some of the tooling too!

https://github.com/rstebbing/workshop/tree/main/experiments/...


For something totally different, try “Geometric Deep Learning”. There is a course by the authors of the book.

Once you understand how everything is operations on graphs, most kernels and functions become very easy to understand.


http://cs231n.stanford.edu/ is good for convolutional networks.


On the one hand, this looks really useful.

On the other hand:

> There are various forms of attention / self-attention, Transformer (Vaswani et al., 2017) relies on the scaled dot-product attention: given a query matrix Q, a key matrix K and a value matrix V, the output is a weighted sum of the value vectors, where the weight assigned to each value slot is determined by the dot-product of the query with the corresponding key

There HAS to be a better way of communicating this stuff. I'm honestly not even sure where to start decoding and explaining that paragraph.

We really need someone with the explanatory skills of https://jvns.ca/ to start helping people understand this space.


Complicated from whose perspective? I don't go around commenting on systems programming articles about how low-level memory management and concurrency algorithms are too complicated, or commenting on category theory articles that the terminology is too obtuse and monads are too hard.

I agree that there probably could be a better "on ramp" into this material than "take an undergraduate linear algebra course", but ultimately it is a mathematical model and you're going to have to deal with the math at some point if you want to actually understand what's going on. Linear algebra and calculus are entry-level table stakes for understanding how machine learning works, and there's really no way around that.


The idea of the transformer somehow being a trainable key-value store is kind of abstract and weird and has little to do with the mathematics of it. The math part of that is how the dot product encodes similarity between vectors, but beyond that it really is an "if you get it you get it" kind of thing.
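
A tiny made-up illustration of the dot-product-as-similarity part: the score is large when the query and key point in roughly the same direction and near zero when they don't, and that's all the "query matches key" step is doing.

    import numpy as np

    query         = np.array([1.0, 0.0, 1.0])
    key_similar   = np.array([0.9, 0.1, 0.8])   # points roughly the same way as the query
    key_unrelated = np.array([0.0, 1.0, 0.0])   # nearly orthogonal to the query

    print(query @ key_similar)     # 1.7 -> large score, large attention weight after softmax
    print(query @ key_unrelated)   # 0.0 -> small score, small attention weight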


I am absolutely certain it is possible to explain this stuff without using jargon and mathematical notation that is impenetrable to the majority of professional software engineers.


At some point, at some level, you really do need to just learn what a dot product is and what a matrix is. It's not weird notation or jargon, these are fundamental concepts.

Just like if you want to really learn how programs work you can't refuse any explanation that talks about "variables" or "functions" because that's jargon.

You can explain it, but it's going to be more at the level of "the network looks at the words" type explanation.


I agree, but that doesn't mean that every article on the subject needs to be written for that audience in that language.


The Illustrated Transformer is pretty great. I was pretty hazy after reading the paper back in 2017 and this resource helped a lot.

https://jalammar.github.io/illustrated-transformer/


Thanks, that one is really useful.


Excellent. Thanks for sharing.


It took me a long time to wrap my head around the whole "key/query/value" explanation (and to be honest I regularly forget which vector is which). I find the "weighted sum of vectors" explanation much simpler/more intuitive; this blog post is IMO the best on the subject:

https://peterbloem.nl/blog/transformers
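
In code, that framing is roughly the following (a toy sketch, not the blog post's actual code; for simplicity the queries, keys, and values are all just the input vectors):

    import numpy as np

    x = np.random.randn(5, 16)                  # 5 positions, 16-dim vectors
    w = np.exp(x @ x.T)
    w /= w.sum(axis=-1, keepdims=True)          # attention weights; each row sums to 1

    out = np.zeros_like(x)
    for i in range(len(x)):                     # output i is just a weighted sum
        for j in range(len(x)):                 # of ALL the input vectors
            out[i] += w[i, j] * x[j]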


I found this article on “transformers from scratch”[0] to be a perfect (for me) middle ground between high level hand-wavy explanations and overly technical in-the-weeds academic or code treatments.

[0] https://e2eml.school/transformers.html


Vector, matrix, weighted sum, and dot product are good places to start. In fact, these concepts are so useful that they are good places to start pretty much no matter where you want to go. 3D graphics, statistics, physics, ... and neural networks.



Pretty sure the current state is simply one of confused exploration … hopefully soon to be synthesized and simplified.


Agree. Each new "advance" seems like just another trick/quirk (and adds jargon). Still waiting for the real advance.


This amount of diversity in transformers is very impressive, but what's more impressive is that for models like GPT, scaling the models seems much more effective than engineering them.


I don't remember the exact paper (I think it's the vision transformers paper), but they say something like "scaling the model and having more data completely beats inductive bias". It's impressive how we went from feature engineering in classical ML, to inductive bias in early deep learning, to just having more data in modern deep learning.


> scaling the model and having more data completely beats inductive bias

The analogy in my mind is this: "burning natural oil/gas completely beats figuring out cleaner & more sustainable energy sources"

My point is that "more data" here simply represents the mental effort that has already been exerted in the pre-AI/DL era, which we're now capitalizing on while we can. Similar to how fossil fuels represent the energy storage efforts of earlier lifeforms that we're now capitalizing on, again while we can. It's a system way out of equilibrium, progressing while it can on borrowed resources from prior generations.

In the long run, the AI agents will be less wasteful as they reach the limits of what data or energy is available on the margins to compete within themselves and to reach their goals. It's just we haven't reached that limit yet, and the competition at this stage is on processing more data and scaling the models at any cost.


>My point is that "more data" here simply represents the mental effort that has already been exerted in pre-AI/DL era, which we're now capitalizing on while we can

Not really. It's not simply that modern architectures aren't adding additional inductive biases; they are actively throwing away the inductive bias that used to be used by everyone. For example, it was taken for granted that you should use CNNs to give you translation invariance, but apparently now vision transformers can match that performance with the same amount of compute.


There are big caveats here:

Vision transformers outperform CNNs in HUGE data regimes. On small datasets, CNNs still shine.

Also, if you take a CNN with modern tricks, it can be on par with vision transformers, e.g. ConvNeXt.

Transformers really dominate when you scale the amount of data to infinity.


Yes, that's what I was saying: scale > inductive bias. Apparently, scaling is all you need ;)


I don't think what you're saying contradicts what I'm saying. My baseline / reference point wasn't CNNs.


Perhaps another analogy: if you train something by repeatedly telling it many different stories about the same thing day in and day out, compared to mentioning something just once in passing, the system will know the repeated thing far better. Replaying an event it was only exposed to in passing in order to check it for parsimony requires more mental effort, and seems like something that requires explicitly setting aside the work to do so.


I see no evidence of that; transformers seem to follow the same trend as other architectures, with improved models coming out every month that demonstrate similar performance with orders of magnitude fewer parameters.


I think by "effective" they meant better overall performance, not necessarily better performance per parameter or hour of training.


> Later decoder-only Transformer was shown to achieve great performance in language modeling tasks, like in GPT and BERT.

Actually, BERT is an encoder-only architecture, not decoder-only. Aside from trying to solve the same problem, GPT and BERT are quite different. This kind of confusion about now-"classic" transformer models makes me kind of doubtful that the more recent and exotic ones are described very accurately...

(Clicking on the link with more details on BERT doesn't actually dispel much of the confusion; it stresses the fact that unlike GPT it's bidirectional, and indeed bidirectional is the "B" in BERT, but that's quite a disingenuous choice of terms itself: it's not "bidirectional" as in a Bi-LSTM, which goes left-to-right and right-to-left separately; it attends over the whole sequence at once, and that was the real innovation of BERT.)
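
One concrete way to see the GPT/BERT split is the attention mask: a decoder-only model masks out future positions, while BERT really does attend over the whole sequence at once. A toy sketch (not either model's actual code):

    import numpy as np

    n = 5
    scores = np.random.randn(n, n)                    # raw attention scores

    causal = np.where(np.tril(np.ones((n, n))) == 1,  # GPT-style: position i only
                      scores, -np.inf)                #   sees positions <= i
    full = scores                                     # BERT-style: every position sees everything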

Scrolling down, the Transformer-XL section starts talking about segments; from the context I _think_ it means that the input text is split into segments that are dealt with separately to cut down on the O(N^2) cost of the transformer, but I would have expected this kind of information to be spelled out in a survey article.

IMHO, review articles are really great and useful, because they let you cut through the BS that every paper has to add to get published, unify notations, and summarize the main points clearly. This article does a commendable job on the second point and, partly, on the first, but sadly lacks the third. Given the enormous task that it certainly was to compile this list, it would probably have profited from treating fewer models but putting things a bit more into perspective...


Probably a strong neural net only needs sparse connections to learn well. However, we simple humans cannot predict which sparse connections are important. Therefore, the net needs to learn which connections are important, but learning the connections means it needs to compute all of them during the training process, so the training process is slow. It's very challenging to break this cycle!


What you are describing is known as the "lottery ticket hypothesis" [0] in the ML world -- it is a well-studied phenomenon!

[0] https://arxiv.org/abs/1803.03635
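
The basic recipe in that line of work is: train the dense net, prune the smallest-magnitude weights, rewind the survivors to their initial values, and retrain. A toy numpy sketch of just the pruning step (not the paper's code):

    import numpy as np

    W = np.random.randn(256, 256)              # stand-in for a trained dense weight matrix
    threshold = np.quantile(np.abs(W), 0.9)    # keep only the largest 10% of weights by magnitude
    mask = np.abs(W) >= threshold
    W_sparse = W * mask                        # candidate "winning ticket" sub-network

    print(mask.mean())                         # ~0.1 of the connections survive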


Another article on this blog talks about it: https://lilianweng.github.io/posts/2019-03-14-overfit/#the-l...


Great compilation; it would be nice to see Vision Transformer (ViT) included.

Andrej Karpathy's GPT video is a must-have companion for this: https://youtu.be/kCc8FmEb1nY. I was going nuts trying to grok Key, Query, Position and Value until Andrej broke it down for me.


The Perceiver should be in there too, as well as its modified versions PerceiverIO and PerceiverAR.





