What is a transformer model? (2022) (nvidia.com)
301 points by Anon84 on June 23, 2023 | 58 comments



If you'd prefer something readable and explicit instead of empty handwaving and UML-like diagrams, read "The Transformer model in equations" [0] by John Thickstun [1].

[0] https://johnthickstun.com/docs/transformers.pdf

[1] https://johnthickstun.com/docs/


See also the original paper:

https://arxiv.org/abs/1706.03762


The original paper is quite deceptive and hard to understand, IMHO. It relies on jumping between several figures and mapping shapes between them, on top of guessing what the unlabeled inputs are.

Just a few more labels, making the implicit explicit, would make it far more intelligible. Plus, the last time I went through it, I'm pretty sure the order of the three inputs was either swapped between figures or diagrammed incorrectly.


Some other good resources:

[0]: The original paper: https://arxiv.org/abs/1706.03762

[1]: Full walkthrough for building a GPT from Scratch: https://www.youtube.com/watch?v=kCc8FmEb1nY

[2]: A simple inference-only implementation in plain NumPy, in only 60 lines: https://jaykmody.com/blog/gpt-from-scratch/

[3]: Some great visualizations and high-level explanations: http://jalammar.github.io/illustrated-transformer/

[4]: An implementation that is presented side-by-side with the original paper: https://nlp.seas.harvard.edu/2018/04/03/attention.html


Done [1]. It is a jaw-dropper! Especially if you have done the rest of the series and seen the results of the older architectures. I was like "where is the rest of it, you ain't finished!" ... and then ... ah, I see why they named the paper "Attention Is All You Need".

But even the crappy (small, 500k params IIRC) Transformer model trained on a free Colab in a couple of minutes was relatively impressive. Looking only 8 chars back and trained on an HN thread, it got the structure/layout of the page pretty much right, interspersed with drunken-looking HN comments.



Maybe this is more of a general ML question, but I faced it when transformers became popular. Do you know of a project-based tutorial that talks more about neural net architecture, hyperparameter selection, and debugging? Something that walks through getting poor results and makes the reasoning behind each tweak explicit?

When I try to use transformers or any AI thing on a toy problem I come up with, it never works. And there's this black box of training that's hard to debug into. Yes, in the available resources, if you pick the exact problem, the exact NN architecture, and the exact hyperparameters, it all works out. But surely they didn't get that on the first try. So what's the tweaking process?


There is A. Karpathy's recipe for training NNs, though it is not a walkthrough with an example:

https://karpathy.github.io/2019/04/25/recipe/

But the general idea of "get something that can overfit first" is probably pretty good.
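
For what it's worth, here is a minimal PyTorch-style sketch of that "get something that can overfit first" check, with a made-up toy model and data (none of this is from Karpathy's post):

    import torch
    import torch.nn as nn

    # One small, fixed batch of made-up data (hypothetical shapes).
    x = torch.randn(32, 16)           # 32 examples, 16 features
    y = torch.randint(0, 4, (32,))    # 4 classes

    model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    # Train on the same batch until the loss approaches zero. If the model
    # can't even memorize 32 examples, something upstream (data, labels,
    # loss, learning rate, architecture wiring) is broken.
    for step in range(1000):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
        if step % 100 == 0:
            print(step, loss.item())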

In my experience, getting the data right is probably the most underappreciated thing. Karpathy has data as step one, but in my experience data representation and sampling strategy also work wonders.

In Part II of our book we do an end-to-end project, including a moment where nothing works until we crop around "regions of interest" to balance the per-pixel classes in the training data for the U-Net. That is something I have pasted into the PyTorch forums every now and then, too.


Thanks for linking me to that post! It's much better at expressing what I'm trying to say. I'll have a careful read of it now.

I think I'm still at the step before overfitting. It doesn't converge to a solution on its training data (fit or overfit). And all my data is artificially generated, so no cleaning is needed (though choosing a representation still matters). I don't know if that's what you mean by getting the data right or something else. Example problems that "don't work": FizzBuzz, reversing all characters in a sentence.


[1] is thoroughly recommended.


It really is amazing. To be fair, if you are actually following along and writing the code yourself, you have to stop and play back quite frequently, and the part about turning the attention layer into a "block" is a little hard to grok because he starts to speed up around 3/4 of the way through. But yeah, this is amazing. I went through it the week before starting as lead prompt engineer at an AI startup, and it was super useful and honestly a ton of fun. Reserve 5 hours of your life and go through it if you like this stuff! It's a great crash course for any interested dev.


Recommended for who ?


Masochists! In a good way! I recommend you do the full course rather than jump straight into that video. I did the full course, paused around lecture 2 to do some of a university course to really understand a few things, then came back and finished it off.

By the end of it you will have done stuff like working out back-propagation by hand through sums, broadcasting, batchnorm, etc. Fairly intense for a regular programmer!


From looking at the video, probably someone who has a good working knowledge of PyTorch, familiarity with NLP fundamentals and transformers, and something of a working understanding of how GPT works.


I found this lecture and the one following it very helpful as well: https://www.youtube.com/watch?v=ptuGllU5SQQ&list=PLoROMvodv4...


And also the ones before that explain the attention mechanism:

https://youtu.be/wzfWHP6SXxY?t=4366

https://youtu.be/gKD7jPAdbpE (up to 25:42)


I feel like I have a schematic understanding of how transformer models process text. But I struggle to understand how the same concept could be used with images or other less linear data types. I understand that such data can be vectorized so it’s represented as a point or series of points in a higher dimensional space, but how does such a process capture the essential high level perceptual aspects well enough to be able to replicate large scale features? Or equivalently, how are the dimensions chosen? I have to imagine that simply looking at an image as an RGB pixel sequence would largely miss the point of what it “is” and represents.


Just to add, the first user who replied to you is quite wrong. You can use CNNs to get features first...but it doesn't happen anymore. It's unnecessary and adds nothing.

Pixels don't get fed into transformers, but that's more about expense than anything else. Transformers need to learn how each piece relates to every other piece, and that gets very costly very fast when the "pieces" are pixels. Images are split into patches instead, with positional embeddings added.
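
Roughly, in code, the patch step looks something like this (a minimal PyTorch-style sketch; the 16x16 patch size, image size, and embedding width are illustrative numbers, not from any particular model):

    import torch
    import torch.nn as nn

    B, C, H, W, P, D = 1, 3, 224, 224, 16, 768   # batch, channels, height, width, patch size, embed dim
    img = torch.randn(B, C, H, W)

    # Cut the image into non-overlapping P x P patches and flatten each one.
    patches = img.unfold(2, P, P).unfold(3, P, P)                           # (B, C, H/P, W/P, P, P)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)   # (B, N, C*P*P)

    # Each flattened patch is linearly projected to an embedding (a "patch token"), and a
    # learned positional embedding is added so the model knows where the patch came from.
    to_embedding = nn.Linear(C * P * P, D)
    pos = nn.Parameter(torch.zeros(1, patches.shape[1], D))
    tokens = to_embedding(patches) + pos      # (B, N, D), fed to a standard transformer
    print(tokens.shape)                       # torch.Size([1, 196, 768])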

As for how it learns the representations anyway: it's not like there's any specific intuition to it. After all, the original authors didn't anticipate the use case in vision.

And the fact that you don't need CNNs to extract features first didn't really come to light until this paper: https://arxiv.org/abs/2010.11929

It basically just comes down to lots of layers and training.


The original commenter is trying to build intuition so my statement was simply to help the commenter understand that Transformers can operate on patches.

As for:

> after all, the original authors didn't anticipate the use case in Vision.

To quote the original "Attention is all you need" paper: "We are excited about the future of attention-based models and plan to apply them to other tasks. We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video"

To say that the original authors did not anticipate the Transformers use in Vision is false.


I'm talking about using transformers on their own, beating SOTA CNNs. People thought you needed CNNs even with transformers... until that was shown to be wrong. The point is that there isn't any special intuition that makes this fit vision.


Why would you feed pixels into transformers? Surely you could translate the overall image into frequencies using Fourier transforms and then into embeddings which feed the LLM?


It's essentially a process of denoising a spectrogram, and as such the data is processed that way.


If I understand what you're asking, the Transformer isn't initially treating the image as a sequence of pixels like p1, p2, ..., pN. Instead, you can use a convolutional neural network to respect the structure of the image to extract features. Then you use the attention mechanism to pay attention to parts of the image that aren't necessarily close together but that when viewed together, contribute to the classification of an object within the image.


Vision Transformers don't use CNNs to extract anything first (https://arxiv.org/abs/2010.11929). You could, but it's not necessary and doesn't add anything, so it doesn't happen anymore.

Vision transformers won't treat the image as a sequence of pixels but that's mostly because doing that gets very expensive very fast. The image is split into patches and the patches have positional embeddings.


They split the image into small patches (I think 16x16 is standard), and then treat each patch as a token. The image becomes a sequence of tokens, which gets analyzed the same as if it was text.

Doing it this way obviously throws away a lot of information, but I assume the advantages of using a transformer outweigh the disadvantages.


This is a very useful answer, thanks. So is every possible 16x16 grid of pixels a legal token value, used more or less verbatim, or are the patches encoded in some other way?

Prompted by some of the other replies, I've read up a bit on positional embeddings. The fact that they are used even with text transformers, which otherwise lack the sequential ordering of RNNs etc., helps tremendously to clarify things.


> how does such a process capture the essential high level perceptual aspects well enough to be able to replicate large scale features

Layers and lots of training. Upper layers can capture/recognize large-scale features.


(2022)

It feels like a joke, but the fact that they mention GPT-3 and Megatron-Turing as the hottest new things makes this piece seem so outdated.


The relevant lifespan of a paper in this field appears to average about three months. Can't really hold that against them.


If you say "Transformer" and "Megatron" in the same sentence, you have my full-attention for at least enough time to start making a serious observation about ML, or frankly, anything.


> If you say "Transformer" and "Megatron" in the same sentence, you have my full-attention

And as they say, attention is all you need


Once they realize they can distil and quantize Megatron, I suppose they'll name that one "Starscream"?


Any self-respecting AI researcher writes self-updating papers.


Papers that update themselves with GPT-4?


GPT-42


Self-writing papers, AI quine


There was an excellent thread/discussion about it a while ago:

https://news.ycombinator.com/item?id=35977891


Great piece. At the end, though, there is a concerning bit.

> researchers are studying ways to eliminate bias or toxicity if models amplify wrong or harmful language. For example, Stanford created the Center for Research on Foundation Models to explore these issues

We've seen numerous times now that censoring/railroading degrades quality. This is why SD 2 was so much worse at the human form than 1.4.

I get they want a PC AI but this isn't the way.


Sutton’s bitter lesson http://www.incompleteideas.net/IncIdeas/BitterLesson.html resonates.

General learning + search methods that scale with data + compute win out.

The transformer is, in a sense, one of the most general algorithms, able to tackle a whole bunch of domains: audio, video, images, language.

I'm sure another, better super-algorithm will be invented, but the Transformer is what we have for now.


It’s interesting though, isn’t RL even more general?


I think this is a transformer model: https://t.ly/se-M


There's one thing I haven't quite grasped about transformers. Are the query, key, and value vectors trained per token? Does each token have specific QKV vectors in each head, or does each attention head have one set that is trained across a lot of tokens?


The q, k, v projections are learned, usually one set per head, but every token goes through the same projections (per head). In some architectures (see the Falcon model) the k and v projections are shared across heads.

During the forward pass a "query" is created for each token using the query projection (again, one projection per head; all the tokens run through the same projection), and likewise keys via the key projection and values via the value projection. The attention weights come from the dot product of each token's query with every token's key, and the output for each token is the weighted sum of the values.

But again, different models do different things. Some models bring the positional encoding into the attention calculation rather than adding it in earlier. Practically every combination of things has been tried since that paper was published.
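
To make the vanilla version above concrete, here is a rough single-head sketch in PyTorch (the sizes and names are illustrative; real implementations batch the sequences and fuse the per-head projections):

    import math
    import torch
    import torch.nn as nn

    T, d_model, d_head = 10, 512, 64       # sequence length, model width, head width
    x = torch.randn(T, d_model)            # one sequence of token embeddings

    # One set of learned projections for this head; every token runs through the same ones.
    W_q = nn.Linear(d_model, d_head, bias=False)
    W_k = nn.Linear(d_model, d_head, bias=False)
    W_v = nn.Linear(d_model, d_head, bias=False)

    q, k, v = W_q(x), W_k(x), W_v(x)       # (T, d_head) each
    scores = q @ k.T / math.sqrt(d_head)   # how strongly each token attends to every other token
    # (a GPT-style decoder would mask out future positions here, before the softmax)
    weights = scores.softmax(dim=-1)       # each row sums to 1
    out = weights @ v                      # per-token weighted sum of the values, (T, d_head)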


Thanks, good explanation. More questions.

Models take the initial N embeddings and pass them through the attention+linear blocks, each of which adds something to its input; a sort of memory bus. After the last block we still have an array with the same dimensions as the N initial embeddings, but the entries mean something else now. How do we select the next token? Based on the last element in that array?

Another question: that bus technically doesn't have to have the same width everywhere, right? It should be possible, for the same model size, to trade bus width for the number of heads, or even have a sort of U-Net.

And the last one: on each loop the resulting embedding (1) is converted into a token, which is added to the input after being converted back to an embedding. It should be possible to just reuse embedding (1) and use the token only for user output, right?

PS: I'm not sure every combination has been tried; we've only just started. Some of the problems still don't have satisfying solutions, like hallucinations, online training, or responsibility.


In go X tokens; out comes a probability distribution across every possible next token.
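
Roughly, that last step looks like this (a sketch; the random unembedding matrix stands in for a learned one, and real models often also apply temperature or top-k sampling on top):

    import torch

    T, d_model, vocab = 10, 512, 50000
    hidden = torch.randn(T, d_model)          # output of the last block, one row per input token

    W_unembed = torch.randn(d_model, vocab)   # maps a hidden vector back to a score per vocabulary entry

    logits = hidden[-1] @ W_unembed           # only the last position predicts the next token
    probs = torch.softmax(logits, dim=-1)     # probability distribution over every possible next token
    next_token = torch.multinomial(probs, 1)  # sample it (or take argmax for greedy decoding)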

In your analogy, the bus could be any width. In practice people tend to trade "bus width" for heads, but it need not be that way.

I'm not sure I understand the last part.

It's quite easy to try stuff with LLMs/transformers. The fact that a paper hasn't been written on every combination doesn't mean they haven't been tried in some way. It's not as though the architecture is the only thing.


It's more than meets the eye


I think it's actually much harder to explain this stuff with hand-wavy names. I want to see the actual code, how this stuff ACTUALLY works in reality.


Great summary. I'm still working through a class that's just getting to RNNs and CNNs. Crazy that it already seems like obsolete tech.


While they are largely obsolete for practical purposes, learning about them is still valuable, as they illustrate the natural evolution of the thought process behind the development of transformers.


You learn linked lists in comp sci yet rarely if ever use them. Still, they are a basic concept required to understand trees, DAGs, etc later…


CNNs are still relatively competitive at vision tasks.


This is from March 25, 2022 - over a year old now.


It's not Kanye's latest pair of shoes.


Weird, still no diagram of a transformer-based connectome. Or did I miss it?


The best teacher of the transformer model is ChatGPT itself.


One thing I wonder about is where it gets its data about itself from. Did they feed it a bunch of accurate information about itself?


I reckon they did


Quite possible! But given how ChatGPT hallucinates, and my general lack of knowledge about LLMs in general and ChatGPT in particular, I would be hesitant to take what it says at face value. I'm especially hesitant to trust anything it says about itself in particular, since much of its specifics are not publicly documented and are essentially unverifiable.

I wish there were some way for it to communicate that certain responses about itself were more or less hardcoded.



