Somewhat off topic. As someone who did some neural network programming in Matlab a couple of decades ago, I always feel a bit dismayed that I understand so little about modern AI, given the explosion of advances in the field starting around the late 00s: convolutional neural networks and deep learning, transformers, large language models, and so on.
Can anyone recommend some great courses or other online resources for getting up to speed on the state-of-the-art with respect to AI? Not really so much looking for an "ELI5" but more of a "you have a strong programming and very-old-school AI background, here are the steps/processes you need to know to understand modern tools".
Edit: thanks for all the great replies, super helpful!
A course by Andrej Karpathy on building neural networks, from scratch, in code.
We start with the basics of backpropagation and build up to modern deep neural networks, like GPT.
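If it helps to see what "from scratch, in code" looks like, here's a minimal sketch of reverse-mode autodiff on scalars in the same spirit as the course (my own toy code with made-up names, not Karpathy's):

    # Minimal scalar autograd sketch (illustrative only, not the course's code).
    class Value:
        def __init__(self, data, parents=(), grad_fn=None):
            self.data = data
            self.grad = 0.0
            self._parents = parents
            self._grad_fn = grad_fn  # propagates self.grad back to parents

        def __add__(self, other):
            out = Value(self.data + other.data, (self, other))
            def grad_fn():
                self.grad += out.grad
                other.grad += out.grad
            out._grad_fn = grad_fn
            return out

        def __mul__(self, other):
            out = Value(self.data * other.data, (self, other))
            def grad_fn():
                self.grad += other.data * out.grad
                other.grad += self.data * out.grad
            out._grad_fn = grad_fn
            return out

        def backward(self):
            # Topological order, then apply the chain rule node by node.
            order, seen = [], set()
            def visit(v):
                if v not in seen:
                    seen.add(v)
                    for p in v._parents:
                        visit(p)
                    order.append(v)
            visit(self)
            self.grad = 1.0
            for v in reversed(order):
                if v._grad_fn:
                    v._grad_fn()

    # d(x*y + x)/dx = y + 1 = 4, d(x*y + x)/dy = x = 2
    x, y = Value(2.0), Value(3.0)
    z = x * y + x
    z.backward()
    print(x.grad, y.grad)  # 4.0 2.0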
For a while now, an answer I've seen is to start with "Attention Is All You Need", the original Transformers paper. It's still pretty good, but over the past year I've led a few working sessions on grokking transformer computational fundamentals and they've turned up some helpful later additions that simplify and clarify what's going on.
You can quickly get overwhelmed by the million good resources out there so I'll keep it to these three. If you have a strong CS background, they'll take you a long way:
Part of the problem with self-studying this stuff is that it's hard to know which resources are good without already being at least conversant with the material.
I think the concepts are simple. After all, everything is just a multi-variable derivative. However, I find the choice of notation very confusing, mostly because it's impossible to remember the shape of everything.
Even in this linked post, they have a "notations" section at the top, and almost immediately they start using a value k that isn't defined anywhere.
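One workaround I've found for both problems (notation and shapes) is to write things out in code, where the shapes are explicit and you can sanity-check the derivative numerically. A toy sketch with made-up sizes:

    import numpy as np

    # Toy example: loss = mean((x @ W - t)**2), with every shape written out.
    rng = np.random.default_rng(0)
    x = rng.normal(size=(5, 3))   # batch of 5, input dim 3
    W = rng.normal(size=(3, 2))   # weights: 3 -> 2
    t = rng.normal(size=(5, 2))   # targets

    def loss(W):
        return np.mean((x @ W - t) ** 2)

    # Analytic gradient via the chain rule ("just a multi-variable derivative").
    grad_analytic = 2.0 * x.T @ (x @ W - t) / (5 * 2)

    # Finite-difference check of one entry.
    eps = 1e-6
    Wp = W.copy(); Wp[1, 0] += eps
    Wm = W.copy(); Wm[1, 0] -= eps
    grad_numeric = (loss(Wp) - loss(Wm)) / (2 * eps)
    print(grad_analytic[1, 0], grad_numeric)  # should agree to several decimals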
Not that I think statistics terminology or notation is worthy of praise (it’s mostly horrible), but it frustrates me to no end how the ML world reappropriated terms seemingly just to be different.
I put together a repository at the end of last year to walk through a basic use of a single-layer Transformer: detect whether "a" and "b" are in a sequence of characters. Everything is reproducible, so hopefully it's helpful for getting used to some of the tooling too!
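For anyone who wants a feel for what such a toy setup looks like without cloning anything, here's a rough sketch along the same lines (my own hypothetical code, not the repo's; it assumes the label is "does the sequence contain both characters" and skips positional encodings, which this particular task doesn't need):

    import torch
    import torch.nn as nn

    # Hypothetical sketch, not the linked repo's actual code.
    VOCAB = "abcdefgh"
    stoi = {c: i for i, c in enumerate(VOCAB)}

    class TinyTransformer(nn.Module):
        def __init__(self, d_model=32):
            super().__init__()
            self.embed = nn.Embedding(len(VOCAB), d_model)
            self.layer = nn.TransformerEncoderLayer(d_model, nhead=4,
                                                    dim_feedforward=64,
                                                    batch_first=True)
            self.head = nn.Linear(d_model, 2)

        def forward(self, tokens):                # tokens: (batch, seq_len) int64
            h = self.layer(self.embed(tokens))    # (batch, seq_len, d_model)
            return self.head(h.mean(dim=1))       # pool over positions -> 2 logits

    def encode(s):
        return torch.tensor([[stoi[c] for c in s]])

    model = TinyTransformer()
    print(model(encode("acab")).shape)  # torch.Size([1, 2])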
> There are various forms of attention / self-attention, Transformer (Vaswani et al., 2017) relies on the scaled dot-product attention: given a query matrix Q, a key matrix K and a value matrix V, the output is a weighted sum of the value vectors, where the weight assigned to each value slot is determined by the dot-product of the query with the corresponding key
There HAS to be a better way of communicating this stuff. I'm honestly not even sure where to start decoding and explaining that paragraph.
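For what it's worth, that paragraph compresses into a few lines of numpy once the shapes are written down (the sizes below are made up for illustration):

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def scaled_dot_product_attention(Q, K, V):
        # Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
        scores = Q @ K.T / np.sqrt(K.shape[-1])   # similarity of each query to each key
        weights = softmax(scores, axis=-1)        # each row sums to 1
        return weights @ V                        # weighted sum of the value vectors

    rng = np.random.default_rng(0)
    Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 16))
    out = scaled_dot_product_attention(Q, K, V)
    print(out.shape)  # (4, 16): one mixed value vector per query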
We really need someone with the explanatory skills of https://jvns.ca/ to start helping people understand this space.
Complicated from whose perspective? I don't go around commenting on systems programming articles about how low-level memory management and concurrency algorithms are too complicated, or commenting on category theory articles that the terminology is too obtuse and monads are too hard.
I agree that there probably could be a better "on ramp" into this material than "take an undergraduate linear algebra course", but ultimately it is a mathematical model and you're going to have to deal with the math at some point if you want to actually understand what's going on. Linear algebra and calculus are entry-level table stakes for understanding how machine learning works, and there's really no way around that.
The idea of the transformer somehow being a trainable key-value store is kind of abstract and weird and has little to do with the mathematics of it. The math part of that is how the dot product encodes for similarity between vectors, but beyond that it really is a "if you get it you get it" kind of thing.
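The "dot product encodes similarity" part, at least, is easy to see concretely (toy vectors, nothing more):

    import numpy as np

    a = np.array([1.0, 0.0, 1.0])
    b = np.array([0.9, 0.1, 1.1])    # points roughly the same way as a
    c = np.array([-1.0, 0.0, -1.0])  # points the opposite way

    print(a @ b)  #  2.0 -> large positive: similar direction
    print(a @ c)  # -2.0 -> negative: dissimilar
    # In attention, these scores get softmaxed into the weights of the weighted sum.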
I am absolutely certain it is possible to explain this stuff without using jargon and mathematical notation that is impenetrable to the majority of professional software engineers.
At some point, at some level, you really do need to just learn what a dot product is and what a matrix is. It's not weird notation or jargon, these are fundamental concepts.
Just like if you want to really learn how programs work, you can't refuse any explanation that talks about "variables" or "functions" because that's jargon.
You can explain it, but it's going to be more at the level of "the network looks at the words" type explanation.
It took me a long time to wrap my head around the whole "key/query/value" explanation (and to be honest I regularly forget which vector is which). I find the "weighted sum of vectors" explanation much simpler/more intuitive; this blog post is IMO the best on the subject:
I found this article on “transformers from scratch”[0] to be a perfect (for me) middle ground between high level hand-wavy explanations and overly technical in-the-weeds academic or code treatments.
Vector, matrix, weighted sum, and dot product are good places to start. In fact, these concepts are so useful that they are good places to start pretty much no matter where you want to go. 3D graphics, statistics, physics, ... and neural networks.
This amount of diversity in transformers is very impressive, but what's more impressive is that for models like GPT, scaling the model seems much more effective than engineering it.
I don't remember which paper it was (I think the Vision Transformer paper), but they say something like "scaling the model and having more data completely beats inductive bias". It's impressive how we went from feature engineering in classical ML, to inductive bias in early deep learning, to just having more data in modern deep learning.
> scaling the model and having more data completely beats inductive bias
The analogy in my mind is this: "burning natural oil/gas completely beats figuring out cleaner & more sustainable energy sources"
My point is that "more data" here simply represents the mental effort that has already been exerted in pre-AI/DL era, which we're now capitalizing on while we can. Similar to how fossil fuels represent the energy storage efforts by earlier lifeforms that we're now capitalizing on, again while we can. It's a system way out of equilibrium, progressing while it can on borrowed resources from the prior generations.
In the long run, AI agents will be less wasteful as they reach the limits of what data or energy is available at the margin to compete among themselves and to reach their goals. It's just that we haven't reached that limit yet, and the competition at this stage is about processing more data and scaling the models at any cost.
>My point is that "more data" here simply represents the mental effort that has already been exerted in pre-AI/DL era, which we're now capitalizing on while we can
Not really. It's not simply that modern architectures aren't adding additional inductive biases; they are actively throwing away the inductive biases that everyone used to rely on. For example, it was taken for granted that you should use CNNs to give you translation invariance, but apparently vision transformers can now match that performance with the same amount of compute.
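To make the inductive-bias point concrete: a convolution is translation-equivariant by construction, i.e. shifting the input just shifts the output, whereas a plain transformer has to learn any such structure from data. A rough sketch (circular padding keeps the shift exact):

    import torch
    import torch.nn as nn

    # Illustrative sketch of the "translation" bias baked into convolutions.
    torch.manual_seed(0)
    conv = nn.Conv1d(1, 1, kernel_size=3, padding=1,
                     padding_mode="circular", bias=False)

    x = torch.randn(1, 1, 16)                  # (batch, channels, length)
    shifted = torch.roll(x, shifts=5, dims=-1)

    y = conv(x)
    y_shifted = conv(shifted)

    # Shifting the input just shifts the output by the same amount.
    print(torch.allclose(torch.roll(y, shifts=5, dims=-1), y_shifted, atol=1e-6))  # True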
Perhaps another analogy: if you train something by repeatedly telling it many different stories about the same thing, day in and day out, versus mentioning something just once in passing, the system will know the former much better. Replaying the event it was exposed to in passing, in order to check it for parsimony, requires more mental effort and seems like something that requires explicitly setting aside the work to do so.
I see no evidence of that; transformers seem to follow the same trend as other architectures, with improved models coming out every month that demonstrate similar performance with orders of magnitude fewer parameters.
> Later decoder-only Transformer was shown to achieve great performance in language modeling tasks, like in GPT and BERT.
Actually, BERT is an encoder-only architecture, not decoder-only. Aside from trying to solve the same problem, GPT and BERT are quite different. This kind of confusion about now-"classic" transformer models makes me kind of doubtful that the more recent and exotic ones are described very accurately...
(Clicking through to the link with more details on BERT doesn't actually dispel much of the confusion; it stresses the fact that, unlike GPT, it's bidirectional, and indeed "bidirectional" is the "B" in BERT, but that's quite a disingenuous choice of term itself: it's not "bidirectional" as in Bi-LSTMs, which go left-to-right and right-to-left separately; it attends over the whole sequence at once, and that was the real innovation of BERT.)
Scrolling down, the Transformer-XL section starts talking about segments; from the context I _think_ it means that the input text is split into segments that are dealt with separately to cut down on the O(N^2) cost of the transformer, but I would have expected this kind of information to be spelled out in a survey article.
IMHO, review articles are really great and useful, because they let you cut through the BS that every paper has to add to get published, unify notation, and summarize the main points clearly. This article does a commendable job on the second point and, partly, on the first, but sadly lacks the third. Given the enormous task it certainly was to compile this list, it would probably have benefited from treating fewer models and putting things a bit more into perspective...
Probably a strong neural net only needs sparse connections to learn well. However, we simple humans cannot predict which sparse connections are important. So the net needs to learn which connections matter, but learning that means computing all of them during training, which makes training slow. It's very challenging to break this cycle!
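That cycle is roughly what the pruning / lottery-ticket line of work pokes at: you still have to train (and pay for) the dense network before you can find the small set of connections that mattered. A crude sketch of magnitude pruning, just to illustrate the idea:

    import torch
    import torch.nn as nn

    # Crude magnitude-pruning sketch (illustrative, not from any particular paper):
    # after dense training you would zero out the smallest weights and keep the rest.
    layer = nn.Linear(512, 512)

    with torch.no_grad():
        w = layer.weight
        threshold = w.abs().flatten().kthvalue(int(0.9 * w.numel())).values
        mask = (w.abs() > threshold).float()   # keep only the top ~10% of weights
        w.mul_(mask)

    print(f"nonzero weights: {int(mask.sum())} / {mask.numel()}")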
Great compilation, would be great to see Vision Transformer (ViT) included.
Andrej Karpathy's GPT video (https://youtu.be/kCc8FmEb1nY) is a must-have companion for this. I was going nuts trying to grok Key, Query, Position, and Value until Andrej broke it down for me.