The author is suggesting that we add 1 to the denominator of the softmax that is used within attention mechanisms (not the final output softmax).
The softmax inside an attention unit allows it to see key/query matches as probabilities; those probabilities support a continuous-valued version of a key-value lookup (instead of 1/0 output of a lookup, we get weights where a high weight = the desired key-value lookup).
Adding 1 to the denominator would change an attention unit by no longer working with a true probability vector of weights, but rather working with weights that add up to less than 1. The motivation is that the network can learn to provide high weights so that the adjusted softmax is very close to a probability vector; and it has a new option to provide all-low weights which give all-low output weights, meaning it can opt out of having high confidence in anything.
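For concreteness, here is a minimal sketch of the two functions in plain Python. The name softmax1 follows the article; the example scores are made up for illustration, and none of this is taken from a particular library.

```python
import math

def softmax(scores):
    """Standard softmax: the weights always sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax1(scores):
    """Softmax with 1 added to the denominator: the weights sum to less than 1."""
    exps = [math.exp(s) for s in scores]
    total = 1.0 + sum(exps)
    return [e / total for e in exps]

confident = [10.0, 0.0, 0.0]   # one strong key/query match (illustrative values)
abstain = [-5.0, -5.0, -5.0]   # no good match anywhere (illustrative values)

print(sum(softmax(abstain)))     # still sums to 1 even with no good match
print(sum(softmax1(confident)))  # very close to 1
print(sum(softmax1(abstain)))    # close to 0: the unit has opted out
```

With high scores the adjusted weights are nearly a probability vector; with uniformly low scores they all shrink toward zero, which is the "opt out" behaviour described above.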
(switching to opinion mode)
2. How can we tell if this is good?
2a. We should just try it out: Train an LLM with this, see if it works.
2b. There are two reasons I suspect it won't make a big difference.
First, if an attention node has low confidence, it can already assign similar scores pre-softmax. Then we get what looks like a uniform distribution as output. Then we're basically taking an average of a bunch of vectors (vs a weighted average that is more like choosing one of them). Statistically, we expect that averaged vector to be close to zero. In other words, the node already has a way to effectively opt-out by providing a near-zero output vector.
Second, in a transformer, each attention unit has many other learned weights that can support the ability to opt out. Both the V matrix and the feed-forward layer after the attention unit give that module a way to provide low values to the activation function after the feed-forward layer, which would result in a value as small as you like — again, a way to opt out.
3. I appreciate the non-academic tone of the article and the willingness to play around with fundamental ideas. Although I'm not totally convinced by the note, I'd love to read more stuff like this.
The way I understood it, the author is saying that, with this change, big values disappear, and we can then use fewer bits to encode the output of transformers, which means reducing the memory requirements of the network. Memory being the limiting factor in running large models, this would be a big deal.
> The Qualcomm AI researchers found that 97%+ of outlier activations in LLMs occur in whitespace and punctuation positions.
This is striking. If true, why not try to ignore whitespace and punctuation?
In old Latin, scriptio continua [1] was a way to write continuously, for the exact same reason: to save space. Some modern languages still do that, and are no less parseable.
Granted, it's unlikely a commercial LLM would become popular if it produced output without spaces or punctuation; but an open source one that promised to be much more compressible, and therefore work on smaller machines, might be super useful.
It's not hard for a human to add spaces afterwards. It used to be a job for beginning journalists at the time of telex machines: press releases were sent in all caps without spaces, and interns were tasked with adding slashes between words. In French it was called "bâtonner les dépêches" (literally: add sticks to press releases -- not sure about the idiomatic English translation).
In the Qualcomm paper cited, they explain/hypothesize that Transformers learn to attend to these low-meaning tokens when they want to avoid adding too much extra info to the residual stream. So it's not an issue that the models attend to spaces and punctuation in these outliers; it's the workaround the models come up with to get around the fact that attention has to go somewhere.
This post's author has a different solution, and one that theoretically could avoid causing large outliers that prevent efficient quantization. These large outliers seem to be an unfortunate side-effect of the models' learned solution.
So getting rid of spaces would do nothing to solve the problem, and would instead force the models to learn a new solution, one that presumably isn't as optimal.
> This is striking. If true, why not try to ignore whitespace and punctuation?
It is initially, but thinking about it some more, there's a lot of information packed in whitespace and punctuation choice.
Scriptio continua may have worked because the few readers who lived back then expected it to encode some form of legal or religious prose, but even then they could learn things from the overall shape of the document. LLMs are working in a much richer domain of document types, but the only thing they can "see" is a stream of tokens. There's no spatial or geometric data attached there. So whitespace and punctuation are the only thing an LLM has to make inferences about otherwise textually identical inputs. Such as:
(see: other) -- vs -- {see: other}
One being likely a text fragment, the other likely a piece of code.
Or how spacing may imply Markdown or YAML being used. Or how it may imply a list. Or a poem. Or a song. Or specific writing style, such as "lol im a casual who not care bout comms" vs. "I am a distinguished professor, about to retire. Elites like us put two spaces after full stop."
those paratextual phenomena probably are important for the model's representations.. not to get rid of and not easily compressible either. have a look at predictive features for authorship attribution in stylometry for example. whitespace and punctuation are always decisive.
For the spaces and for some (maybe most?) languages you don't even need a NN to add spaces: as words made of two or more words aren't that common, and when those occur you probably want to use the composite one, it boils down to starting from the beginning of the text and looking up in a dictionary the longest string that is a valid word. The only language that I know of that uses a lot of composite words (I mean words made by sticking two or more words together) is German, but I think that looking for the longest sequence occurring in a dictionary would be correct most of the time.
I think you're significantly underestimating how many words could be retokenized into multiple words even before considering how concatenation affects things. For example: Concatenate is a word, but so are con, catenate, cat, and enate. Yes, no two of those are likely to be used in sequence, but I don't think that's a very reliable rule overall—"a" and "an" are both common words and negative prefixes.
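To make the disagreement concrete, here is a small sketch of the greedy longest-match idea in Python. The function name and the toy dictionary are hypothetical, chosen only to show one input where the rule works and one where it picks a different split than a reader would.

```python
def segment(text, dictionary):
    """Greedy longest-match segmentation (a toy sketch, not a real tokenizer)."""
    longest = max(len(word) for word in dictionary)
    words = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first and shrink until a dictionary word fits.
        for size in range(min(longest, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            # Fall back to a single character when nothing in the dictionary matches.
            if candidate in dictionary or size == 1:
                words.append(candidate)
                pos += size
                break
    return words

toy_dictionary = {"hello", "world", "the", "cat", "cats", "sat", "at"}

print(segment("helloworld", toy_dictionary))  # ['hello', 'world']
print(segment("thecatsat", toy_dictionary))   # ['the', 'cats', 'at']
```

The second result is made of valid words, but "the cat sat" is the more likely reading, so a purely greedy dictionary lookup would need some extra scoring to resolve cases like this.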
Yeah, good to bring it back to the original point. Reading the article felt exciting, but in hindsight I am now missing a key detail.
The equations all seem to be matrix operations with a fixed number of rows / columns (you can take me as a real layman here). Unless you change that, I don't understand _how_ you can reduce memory needs. Granted, I'm probably putting my foot in my mouth not understanding transformers.
More ELI5 than the other comments. Considering the softmax network:
During quantization we find that values in the network vary from 0->5000, but 95% of values are <100. Quantizing this to 8bits would mean that our values would be in increments of about 20. Remembering that 95% of our values are below 100, we would only have about 5 discrete values for 95% of our values - so we would be losing a lot of "resolution" (entropy/information). For example (assuming rounding is used), an original value of 19 would be quantized to 20 and 30 would be quantized to 40. The original values differ by 11, but the quantized values differ by 20!
This is where exotic encodings come into play. We might try to use a logarithmic scheme, for example. This would result in higher value densities at lower values - but we would probably still waste bits and it would require more APU cycles.
Now switch to the softmax1 network:
The range of values is less important than the distribution - instead of 95% of the values falling in a small range, we would see the values more evenly spread out. Assuming that the range is now 105 (so the 5% outlying neurons from the softmax network are still >100), we would have 243 values to represent everything under 100. The same example with 19 and 30 would result in 19.27 and 30.34 respectively, a difference of 11.07 - which is very close to the unquantized difference of 11. We have retained more information in the quantized version of the network.
Information is lost either way, but what's important is how much information is lost.
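The arithmetic above can be sketched in a few lines of Python. This assumes a plain uniform 8-bit quantizer with round-to-nearest (the function name and the 256-level step are my assumptions), so the individual quantized values can differ slightly from the figures in the comment depending on the rounding convention, but the gap between them comes out about the same.

```python
def quantize(value, max_value, levels=256):
    """Uniform quantization: snap value to the nearest multiple of the step size."""
    step = max_value / levels
    return round(value / step) * step

# Wide range (0 to 5000): the step is about 19.5, so 19 and 30 land far apart.
wide_19 = quantize(19, 5000)
wide_30 = quantize(30, 5000)
print(wide_19, wide_30, wide_30 - wide_19)        # 19.53125 39.0625 19.53125

# Narrow range (0 to 105): the step is about 0.41, so the gap of 11 survives.
narrow_19 = quantize(19, 105)
narrow_30 = quantize(30, 105)
print(narrow_19, narrow_30, narrow_30 - narrow_19)  # gap is about 11.07
```

The point is the same as in the comment: with the same number of bits, the narrower range preserves the difference between nearby values much more faithfully.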
The reason that the large values appear is because the heads attempt to "scream really loud" when they are certain that they are right. This is an emergent behavior due to softmax - it ironically sucks at paying attention to a few of the heads: it boosts the volume of the heads that are trying to abstain, and mutes the volume of the heads that are trying to vote.
> During quantization we find that values in the network vary from 0->5000, but 95% of values are <100. Quantizing this to 8bits would mean that our values would be in increments of about 20.
Instead of using an 8bit integer with even step size quantification, wouldn't they still use an 8bit float?
No one quantizes blindly without accounting for data. If 95% of your values are in 0-100 you’ll probably do something like have 20 values for 0-100 and the remaining 12 for 101-5000. You don’t have to apply a uniform distribution and shouldn’t when your data is that concentrated.
If I'm following correctly, does this mean that with this change along with a model being quantized, we could see models that are 5% the size (on file system) and memory usage but almost identical in output?
It has to do with the precision of the values stored in those rows and columns. If they could be coerced into a narrower range (without losing information) then we could effectively store them each with 8 bits or something. The +1 prevents blowups when the denominator in its current form approaches 0, and without those blowups, then we can use fewer bits, in theory.
That is only true if using the new softmax changes the dynamic range of the values. We are using floating point not fixed point. So if before our values went from 1 to 5000 and now they go from 0.0002 to 1 we still have the same dynamic range and so still need the same resolution.
The activations (outputs) of one layer must be encoded in the same way as the weights of that layer as well as the weights of the next layer or the computation fails (unless you manage to write clever kernels for doing math at different levels of precision simultaneously, but even then you're introducing even more lossiness than just using a binary representation for those values).
Example: multiplying a bunch of float16s together gives you a float16. That is passed on to the next layer of float16s. Why should forcing the output of the first step to be float8 confer any advantage here? The only way I can see this argument working is if you make all the layers float8 too, and the reason you can do that is that the output of the first step can be faithfully represented as float8 because it doesn't ever blow up. If that's what the author is saying, it wasn't very clear.
I actually prefer the conceptual model the author suggests:
> Originally I wanted to call this function ghostmax, as you can think of there being an extra zero-valued entry in x (as exp(0)=1), as well as a zero vector in the V matrix that attenuates the result.
Don't think of this as weighting the options so that some of the time none of them is chosen. ("Weights that add up to less than 1.") Instead, think of this as forcing the consideration of the option "do nothing" whenever any set of options is otherwise considered. It's the difference between "when all you have is a hammer, everything looks like a nail [and gets hammered]" and "when all you have is a hammer, nails get hammered and non-nails get ignored".
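The "ghost entry" reading can be checked numerically. Below is a small sketch in plain Python (the helper names and example numbers are mine, for illustration): softmax1 over the real scores gives the same attention output as an ordinary softmax over the scores plus one extra zero score whose value vector is all zeros.

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax1(scores):
    exps = [math.exp(s) for s in scores]
    total = 1.0 + sum(exps)
    return [e / total for e in exps]

def weighted_sum(weights, vectors):
    """Attention output: weights applied to the value vectors."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]

scores = [2.0, -1.0, 0.5]                      # illustrative key/query scores
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # illustrative value vectors

out_softmax1 = weighted_sum(softmax1(scores), values)

# Same thing with an explicit "do nothing" option: score 0, value vector 0.
ghost_scores = [0.0] + scores
ghost_values = [[0.0, 0.0]] + values
out_ghost = weighted_sum(softmax(ghost_scores), ghost_values)

print(out_softmax1)
print(out_ghost)  # matches out_softmax1 up to floating-point error
```

So the two framings describe the same computation; the ghost version just makes the always-available "do nothing" option explicit.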
I like this framing because, as an example, it bothers me that our speech-to-text systems use this method:
1. A human predetermines what language the input will use.
2. Audio in that language is fed to transcribing software.
3. You get, with modern technology, a pretty decent transcription.
3(a). ...if the audio sample was really in the language chosen in step 1.
If you ignore the choice of language and feed French audio to an English transcriber, you get gibberish. This is wildly at odds with how humans do transcription, where absolutely the first thing that a system that only knows how to transcribe English will do, when given French audio, is object "hey, this is definitely not English".
Most STT systems also tend to still train on normalized text which is free of the punctuation and capitalization complexities and other content you find in text LLMs. I suspect we continue in this way in part due to lack of large scale resources for training, and due to quality issues - Whisper being an outlier here. Anecdotally 8bit quantization of larger pre-normalized STT models seems to not suffer the same degradation you see with LLMs but I can't speak to whether that's due to this issue.
This seems like a good way to look at it. Another way to put it is, there is a certain "origin" or "default" confidence which is pinned to some fixed value pre-softmax, ie, all outputs are necessarily compared to that fixed value (pretending zero is another input to the softmax) rather than merely each other.
I like your description because it's relatively succinct and intuitively suggests why the modified softmax can help the model handle edge cases. It's nice to ask: How could the model realistically learn to correctly handle situation X?
It doesn't need to be two huge models. If there is an advantage to doing this, I'd expect that you would see it even in a small test case. I'm sure we'll see something by the end of the week if not earlier if there's something to it.
One of the most significant quantization papers of the last year [1] found precisely that these outliers only start occurring with LLMs at 6.7B parameters and above.
One of the most important keys to the success of deep learning in the last couple years has been the fact that emergent features exist after certain scales, so I wouldn't be too quick to dismiss things that don't help at smaller scales, nor would I be certain that all the tricks that help in small data/parameter regimes will necessarily help in larger models. Unfortunately!
Looking at that paper, they appear to be saying that 6.7B is where the problem becomes so intense that no single quantization method can keep up. From what I gather, the paper claims that such outliers start to occur in models as small as 125M params, then at around 1.3B they begin to affect the FFN, and at around 6.7B is when the issue really starts to become apparent because "100% of layers use the same dimension for outliers."
So while you obviously wouldn't be able to conclusively prove the idea fixes the issue in larger models, if you know what you are looking for you should be able to validate that the method works in general down to very small models.
That said, consumer grade cards should be able to train an 8B model with quantization, so you might as well train the whole thing.
The reason it might need to be huge is because the long tail of extreme weights might only begin to show up then, but yes best to just start w something you can run on a laptop.
That is a good start. I wonder though if the change affects the ideal hyperparameters. Do you need more or less dropout if you make the change? What about learning rate?
So you might want to re-search the hyper params for a fair shot.
> First, if an attention node has low confidence, it can already assign similar scores pre-softmax. Then we get what looks like a uniform distribution as output.
Disagree here, I think neural nets are quite bad at implicitly learning low entropy transforms, similar to how they struggle to model the identity function, necessitating residual connections. In both cases the change doesn't increase expressivity, but it does bake these needle-in-a-haystack transformations into the model that may be hard to access with gradient descent.
This is a technique that's been known for years and is in PyTorch. It's not widely used because people tried it and, in practice, it doesn't work as well.
OP calling it a "bug that's been overlooked for 8+ years" is click bait.
The add_zero_attn parameter in PyTorch is used for this, but by default their softmax is the regular kind. It has been in flaxformer for a couple years now though, however it claims to be a compatibility variant for older models [2] and I haven't seen any mention of it in their recent papers (though I've not checked exhaustively).
> Statistically, we expect that averaged vector to be close to zero.
I'm not sure that's the case, especially in high dimensions.
The expected absolute value of the sum of n random variables, each uniform on [-1,1], grows with n. I'm pretty sure it's proportional to the sqrt of n.
Also, random walks in high dimension return to zero with probability zero, so the sum of random variables in high dimensions going close to zero seems unlikely as well.
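A quick simulation can separate the two quantities being discussed here: the sum of n independent uniform[-1,1] variables, and their average. This is only a sketch with an arbitrary seed and trial count; the closed form in the comment assumes the normal approximation, with each variable having variance 1/3.

```python
import math
import random

def mean_abs_sum(n, trials, rng):
    """Monte Carlo estimate of E|X_1 + ... + X_n| for X_i uniform on [-1, 1]."""
    total = 0.0
    for _ in range(trials):
        total += abs(sum(rng.uniform(-1.0, 1.0) for _ in range(n)))
    return total / trials

rng = random.Random(0)  # fixed seed so the run is repeatable
small = mean_abs_sum(100, 2000, rng)
large = mean_abs_sum(400, 2000, rng)

# Normal approximation: E|S_n| is roughly sqrt(2n / (3*pi)).
predicted_small = math.sqrt(2 * 100 / (3 * math.pi))

print(small, predicted_small)   # both roughly 4.6
print(large / small)            # roughly 2: four times the terms, twice the size
print(small / 100, large / 400) # the average shrinks like 1/sqrt(n)
```

So the sum does grow like sqrt(n), while the average (which is what a uniform attention distribution computes) shrinks like 1/sqrt(n).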
Both of your points are basically true, but I think a better way to model the problem is as a set of similar-length vectors being linearly combined by a probability vector.
Mathematically, we can write v_out = V * w,
where v_out is the vector of output from the attention unit, w is the probability vector from the softmax, and V is the set of input vectors, where each column is an input vector.
For a moment, pretend that the columns of V are orthonormal to each other. This might not be true, but it's an interesting case.
When the model wants the output to be small, it can set w = 1/n, meaning all coordinates of vector w are 1/n. (n = the number of columns in V)
In that case, the length ||v_out|| will be 1/sqrt(n) exactly, which is small compared to the input lengths of 1 (since we're pretending they were orthonormal).
Now if we stop pretending they are orthonormal, the worst case is that they're all the same vector, in which case the weights w can't change anything. But that's a mighty weird case, and in high dimensions, if you have any randomness at all to a set of vectors, they tend to point in wildly different directions with dot products close to zero, in which case the same intuition for the orthonormal case applies, and we'd expect a uniform distribution coming out of the softmax to give us a vector that's much smaller than any of the input vectors.
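Here is a small numerical check of both cases, as a sketch in plain Python. The sizes (64 vectors, 256 dimensions), the Gaussian entries, and the seed are arbitrary choices of mine for illustration.

```python
import math
import random

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def average(vectors):
    """Uniform weights w = 1/n applied to the input vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]

n = 64

# Orthonormal case: the standard basis vectors, each of length 1.
basis = [[1.0 if i == j else 0.0 for i in range(n)] for j in range(n)]
orthonormal_norm = norm(average(basis))
print(orthonormal_norm)  # 0.125, i.e. 1/sqrt(64)

# Random case: independent Gaussian vectors in a higher-dimensional space.
rng = random.Random(0)
dim = 256
random_vectors = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
mean_input_norm = sum(norm(v) for v in random_vectors) / n
ratio = norm(average(random_vectors)) / mean_input_norm
print(ratio)  # close to 1/sqrt(64) as well
```

Note that the averaged vector is small relative to the inputs, not necessarily small in absolute terms: its norm is about 1/sqrt(n) times a typical input norm.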
One caveat is that the average of many normally distributed vectors in many dimensions is normally distributed with 0 mean but is not typically close to 0. In fact the average norm is quite large. Try it yourself and see!
Don't most softmax implementations include an epsilon in the denominator which likely serves the same purpose? So the suggestion is to set that epsilon to 1?
I agree with your conclusions, but not necessarily with the reasons you present. I don't think it's _that_ easy for a current transformer to pass the information unaltered (i.e. to effectively replace softmax with 0).
In particular, I think the feedforward point you list in your "Second" is actually wrong. Replacing a softmax with 0, as the OP wants to do, is tantamount to passing the information unchanged, because the attention block is within a residual (skip) connection. If it's set to zero, the next output is identical to the previous layer output. There is no way to recover this effect with the feedforward layer.
The part that you can set V to zero is true, but somehow a different idea: the Q and K should be able to set to 0 if no token wants to be "close" to some other token, in some sense. But the V layer shouldn't "know" about this, because it can't look at other tokens. This is of course only how we think of transformers, which might or might not (more likely, the latter) be how it actually works. But nevertheless, having a 0 value coming out of the K.Q^T part only would be very meaningful.
Your "first" point is technically true (albeit logically false): if you have a sequence of length 32k, like GPT4-32k, and your softmax logits all predict the same value, the result will be an average of the V layer, divided by 32k, which is effectively close to zero. However, calibrating "exactly the same value" is extremely hard for a neural network, and there is no "default value" it can predict to make sure that's the case - even if you push all the values to one side, the result doesn't change, because softmax is translation invariant. Plus, if you have a short sentence, that's not true anymore. If you only have two tokens, one of them must be activated, or both with only a 0.5 factor. Surely if you have very few tokens there's much more contamination between Q, K, and V, so in that case V can indeed take a 0 value, but it's non-trivial and requires more layers.
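The translation-invariance point is easy to check. In this sketch (plain Python, made-up scores), shifting every score by the same constant leaves the ordinary softmax unchanged, whereas the +1 variant does respond to the shift, and with only two equal scores the ordinary softmax is stuck at 0.5 each.

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax1(scores):
    exps = [math.exp(s) for s in scores]
    total = 1.0 + sum(exps)
    return [e / total for e in exps]

scores = [1.0, 2.0, 3.0]
shifted_up = [s + 100.0 for s in scores]
shifted_down = [s - 10.0 for s in scores]

# Ordinary softmax: pushing all values to one side changes nothing.
print(softmax(scores))
print(softmax(shifted_up))

# softmax1: pushing all values down shrinks the total weight toward 0.
print(sum(softmax1(scores)))
print(sum(softmax1(shifted_down)))

# Two tokens with equal scores: softmax has to hand out 0.5 each.
print(softmax([-50.0, -50.0]))
```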
All in all, adding that "+1" isn't quite meaningless, I think. Nevertheless, I believe it won't change much: these very big models have ways to get around any kind of smart small modification you do. If the intuition is very right, it might be that you can squeeze out 1% more accuracy in a handful of tests, after you carefully optimize all other parameters, which would be enough to get you a paper in a top conference. And it might also be implemented as a standard from then on (because, in this case, it basically doesn't cost any more computations, so it's "free"). But I would bet it won't be a major revolution.
That said, as you say, the only way to know would be to train a few models with this option and check the actual quality of them (certainly not GPT-style, nor GPT4-size, models, to begin with, but something quicker to train and easier to test in a fully automated way; old "boring" models like those in the BERT family would be a good point to start testing). But to do that effectively, you'd need somebody skilled in training this kind of model, with the cleaned data ready at hand, etc. (and a small compute budget, of course, but nothing revolutionary, a few thousand $ in GPU credits could be enough).
I might be missing something obvious, but I am not sure why everyone in the comments thinks it's a big deal. I've seen this trick in practice multiple times.
Yeah we used to use this in our older models years ago... I don't recall the details exactly, but I don't think it ever did very much.
I certainly don't think it will help at all with stability. Things like Q/K layernorm are better tricks for softmax stability when scaling: https://arxiv.org/pdf/2302.05442.pdf
> I don't recall the details exactly, but I don't think it ever did very much.
How would you have known if the trick actually reduces the outliers in the weights? Even if the transformer quality does not improve overall, having fewer outliers as a result is very beneficial for more accurate quantization of the data.
He's questioning the statement: "I don't think [the trick] ever did very much", because no one has yet looked at whether the trick helps reduce outliers in very large models. If it does help with this, as the blog author believes, then it is indeed a very useful trick.
Is he? A surface level reading suggests he's asking "how would you know".. and the answer is... by looking at the parameters. People do that.
>> because no one has yet looked at whether the trick helps reducing outliers in very large models
Given a softmax version doing exactly as the blog post says is baked into a google library (see this thread), and you can set it as a parameter in a pytorch model (see this thread), this claim seems off. "Let's try X, oh, X doesn't do much, let's not write a paper about it" is extremely common for many X.
Yes, I assumed that checking the weights for presence and amount of outliers is not something that is usually done and effects on this can be overlooked. If my assumption is wrong and researchers do usually look at such metrics, then my question is not very relevant.
If popular models are still making this mistake then it still seems noteworthy and making a blog post or paper to increase awareness definitely seems worthwhile. Also multiple independent discovery of good ideas is quite common.
The question is whether people have attempted quantization (the int8 / GGML / GPTQ approaches) and whether the "flattening" of distribution due to a larger denominator results in a better quantization behavior. You'd have to specifically try quantization with and without the +1 to understand the advantage. OP argues that the advantage could be significant.
Technically softmax is not implemented as presented but through exp(x_i-max(x)), and summing over it in the denom. But maybe I am missing something.
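For reference, here is a sketch of the max-subtraction form in plain Python, plus one way the +1 could be carried through it. The softmax1 variant is my own derivation (treat the implicit extra score of 0 like any other score when taking the max), not something taken from an existing library.

```python
import math

def stable_softmax(scores):
    """exp(x_i - max(x)) / sum_j exp(x_j - max(x)); same result, no overflow."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax1(scores):
    """softmax1 with the same trick: the '+1' becomes exp(0 - m) after shifting."""
    m = max(max(scores), 0.0)  # include the implicit zero score in the max
    exps = [math.exp(s - m) for s in scores]
    total = math.exp(-m) + sum(exps)
    return [e / total for e in exps]

def naive_softmax1(scores):
    exps = [math.exp(s) for s in scores]
    total = 1.0 + sum(exps)
    return [e / total for e in exps]

moderate = [1.0, -2.0, 0.3]
print(stable_softmax1(moderate))
print(naive_softmax1(moderate))              # same values as the stable version

print(stable_softmax1([1000.0, 999.0]))      # fine; math.exp(1000.0) alone would overflow
print(stable_softmax1([-1000.0, -1000.0]))   # all weights are 0.0: a full opt-out
```

So the max-subtraction implementation detail does not get in the way of the proposal; the extra 1 just turns into an extra exp(-m) term in the denominator.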
Furthermore, the residuals are used exactly because the networks can't learn the identity function; but they can learn zero; at which point the residual is `f(x): x+g(x)` with `g: x ~> 0` (i.e. approximately 0).
It is also the case that `f(x): x+g(x)` makes it easier for gradients to flow through.
How can `x -> -inf` occur in the first place when nearly everything is within [-2,2] and doing a dot product plus before that there's normalization too?
The use of the "nearly" in your comment is exactly occluding the issue as presented.
Enough weights don't fall under that "nearly" that we require more bits per weight to cover those edge cases. If we were able to delete the "nearly" we would need fewer bits (smaller models).
The idea is that if your range of values is small enough you need fewer bits to distinguish between meaningfully different values. The problem is that exp(x) << exp(y) for sufficiently wide ranges [x, y], so that when normalizing in the softmax and subsequently quantizing you don't get the fidelity you need and too much information is lost between layers. The proposed solution is that modifying the softmax step slightly brings x and y close enough to zero that exp(x) and exp(y) are close enough so that more compact quantizations are useful instead of useless.
This trick “they found” is part of the standard torch implementation of multi-head attention, where it is called add_zero_attn. They add a zero to the logits, resulting in a one in the denominator, as e^0=1: https://pytorch.org/docs/stable/generated/torch.nn.Multihead...
I find its documentation quite poor though: "If specified, adds a new batch of zeros to the key and value sequences at dim=1."
Doesn't describe the implications even briefly. If they add just your second sentence to that description, it'll immediately become so much more useful.
It probably means they have tried it for _some_ purpose, but not necessarily the one described in OP's post here. The claim is that this is specifically useful for quantization. It seems reasonable to assume that this would have initially been tried and potentially discarded for having little or no impact on general accuracy. But that's a different issue. I suppose we'll hear something definitive in a month or so.
If you take the inner product between a lot of more or less random vectors (the key and query vectors in attention) most values are going to be close to 0. This means they contribute by e^0 to the denominator. Now, if you have a context length of say 2000, your denominator is already ~ 2000. Increasing it to 2001 doesn't really make a difference.
Adding 1 to the denominator can be useful if you have softmax with just a few options. Not in self-attention where you have thousands.
That simple comment is a strong counterpoint to the entire blog post?
Except with the +1 denominator, it might be that the model trains all of the inputs to become very negative so softmax chucks out close to zeros, whereas it wouldn't bother before because making one prob bigger makes another smaller.
> it might be that the model trains all of the inputs to become very negative
It still can't do this because of L2 regularization / weight decay. If two vectors are norm 1, their inner product is at least -1, so with 2000 vectors that's still 2000 * e^(-1) =~ 735.
Not saying it's theoretically impossible that it could happen. But you would have to try _really_ hard to make it happen.
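The numbers in this exchange can be reproduced with a short sketch. It makes the simplifying assumption that every key gets the same score (the function name is mine), and reports how much total attention weight survives the +1 in the denominator.

```python
import math

def total_weight(num_keys, logit=0.0):
    """Sum of softmax1 weights when all num_keys scores equal `logit`."""
    mass = num_keys * math.exp(logit)
    return mass / (1.0 + mass)

print(total_weight(2000, 0.0))    # 2000/2001: the +1 barely matters
print(total_weight(2, 0.0))       # 2/3: with few options it matters a lot
print(2000 * math.exp(-1.0))      # about 735.8, the denominator bound for unit-norm vectors
print(total_weight(2000, -1.0))   # still above 0.998
print(total_weight(2000, -10.0))  # below 0.1: needs scores far below that bound
```

Under this simplification, both comments hold together: at context length 2000 the +1 only bites if the scores can be driven well below what unit-norm keys and queries allow.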
Are dummy tokens just tokens that don't have an associated input/output token? Like, a way to give more computational power to the model without splitting the text into more actual tokens?
TL;DR sort of yes. But they're also useful for reasons not related to computational "power".
An example here with an actual algorithm, although it's been a couple of years so my explanation might be a bit wrong in places. and/or i might have gotten the completely wrong end of the stick with the current thread.
--
The CTC (Connectionist Temporal Classification [0]) algorithm maps a sequence x with length X -> sequence y with length Y.
i.e. in speech to text we might have some audio features that correspond to the following class predictions (post softmax classification)
x -> hellllloooooooooo wwwooorrrllld
we want to get this as the output
y -> hello world
we have the alphabet as classes we try to predict for each sequence item in x.
we could just remove all the duplicates in the first long sequence, but we would end up with `helo world` ... we need to preserve one of the early `l` characters in `hello` somehow
CTC uses a blank (aka dummy) token to handle potentially deliberately repeated items in sequence x.
By adding the blank token to the classes predictions, we can get the model to predict something like this (post softmax classification)
y* -> hel~l~~oooo~~~~~~ w~~o~~r~~l~~d
The CTC decoder (non-ML decoding algo) heuristically merges repeated tokens and then drops the blanks, turning the above into ...
y -> hello world
... the duplicate `o` and `~` characters are removed.
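The collapse step described above can be sketched in a few lines of Python. This covers only the greedy "merge repeats, then drop blanks" rule applied to an already-predicted string, using `~` as the blank as in the example; a real CTC decoder works on per-frame class probabilities and may use beam search.

```python
from itertools import groupby

BLANK = "~"

def collapse_repeats(sequence):
    """Merge each run of identical consecutive symbols into a single symbol."""
    return "".join(symbol for symbol, _ in groupby(sequence))

def ctc_greedy_decode(sequence, blank=BLANK):
    """Merge repeats first, then remove blanks, so 'l~l' survives as 'll'."""
    return collapse_repeats(sequence).replace(blank, "")

# Without a blank token the double 'l' is lost.
print(collapse_repeats("hellllloooooooooo wwwooorrrllld"))     # helo world

# With blanks separating the deliberate repeat, it is preserved.
print(ctc_greedy_decode("hel~l~~oooo~~~~~~ w~~o~~r~~l~~d"))   # hello world
```

The order matters: removing blanks before merging repeats would glue the two `l` characters together and lose one of them again.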
It was a decent enough algorithm for speech-to-text prior to attention/transformers etc.
However, it makes CTC vulnerable to well designed adversarial example attacks because there is a massive bias within models to predict the blank token -- meaning it's very easy to modify input sequence x to switch the output sequence y to include blank tokens for nefarious purposes (the subject of my unfinished phd).
> By adding the blank token to the classes predictions, we can get the model to predict something like this (post softmax classification)
> y* -> hel~l~~oooo~~~~~~ w~~o~~r~~l~~d
This is a great solution. Though that's a dummy token in the output rather than the input. I guess you could do something inverse to do text to speech, but it might be hard to say where to insert the dummy tokens in that case.
While not about AI or the algorithm mentioned, on the subject of little errors that you can't convince anyone are errors....
In 2011, I wanted to copy the reddit ranking algorithm in a project of my own, so I went to source code to look at it... the algorithm in the source code I found wasn't doing anything at all sensible with negative-sum voted posts.
I thought I discovered the error, some terms swapped in the simple equation, the sign for positive/negative was misapplied.
I blogged it, and [posted it to reddit](https://www.reddit.com/r/programming/comments/td4tz/reddits_...), only to have MANY people, including reddit employees, tell me I am definitely definitely wrong, and the algorithm was working as intended. And that I was in fact not the first to notice what I thought I noticed, and point it out, and be told by everyone I was wrong.
OK, I didn't really understand what was going on, I couldn't make sense of the algorithm if it wasn't wrong, but so be it. I updated my blog post to say that people smarter than me said there was no error in the reddit algorithm, all I can say is this variation makes more sense to me.
Then, three years later in 2014, a commit was made to the reddit source code with exactly the correction I (and others before me) had suggested all along. The one that everyone piled on to tell me how dare I have the temerity to suggest reddit source code is wrong.
Open source means there are lots of eyes that can find bugs, but sometimes they can't convince anyone they've found a bug. (And of course, then reddit close-sourced their code in 2017).
I never did end up using the ranking feature that I had wanted to copy from reddit in my own project. I didn't end up adding "vote" features to the app.
When I was an intern at Yahoo working on OAuth back in 2008 (2007? It was long ago and I'm old) I had the pleasure of implementing an internal tool for generating OAuth 1.0 URLs, which meant encoding a lot of things in query parameters. My tool did not generate URLs which were compatible with Yahoo's implementation (certain parameters effectively should be encoded twice, which my tool did). The implementing engineer insisted my tool was wrong, cited my status as a lowly intern, and even pulled out the OAuth spec and bent over backwards to say how his implementation was correct and I'm clearly reading it wrong. It literally took bringing in Eran Hammer-Lahav to weigh in on the topic to say I was correct, at which point the engineer agreed that of course that was correct. I got zero acknowledgment or apology for the days of ad hominem attacks against me.
I did learn an important lesson that more senior people are not always right, and as someone who's usually more senior than my colleagues now I try to remember it daily.
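For readers wondering where double encoding comes up in OAuth 1.0: one place is the signature base string, where each parameter name and value is percent-encoded, the pairs are sorted and joined, and the joined string is then percent-encoded again. Whether that is the exact discrepancy in the story above is an assumption; the sketch below only shows the mechanism, and `signature_base_string` and the example URL are illustrative names, not anything from Yahoo's tooling.

```python
from urllib.parse import quote

def pct(text: str) -> str:
    # Percent-encode everything except unreserved characters
    # (letters, digits, "-", ".", "_", "~").
    return quote(text, safe="")

def signature_base_string(method: str, url: str, params: dict) -> str:
    # First pass: encode each name and value, then sort the encoded pairs.
    pairs = sorted((pct(k), pct(v)) for k, v in params.items())
    normalized = "&".join(f"{k}={v}" for k, v in pairs)
    # Second pass: the whole normalized parameter string is encoded again.
    return "&".join([method.upper(), pct(url), pct(normalized)])

print(signature_base_string("get", "http://example.com/", {"q": "a b"}))
```

A space in a value therefore shows up as `%2520` in the base string, which is easy to mistake for a bug if you expect a single round of encoding.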
> It literally took bringing in Eran Hammer-Lahav to weigh in on the topic to say I was correct, at which point the engineer agreed that of course that was correct. I got zero acknowledgment or apology for the days of ad hominem attacks against me.
If it weren’t for the torturous gaslighting, this is borderline hilarious. Appeal-to-authority types have a way of submitting so effortlessly when a grander poobah comes around. Spine made of jelly.
I work at a FAANG and it was absolutely astonishing to find out how often this happens.
You can make a long, impactful career by just being "the guy who adds log statements throughout the codebase and reasons through it", doing this at even a simplistic level has always shown me an astonishing fix to some long-standing issue.
n.b. It also attracts a ton of political fun. People's first order reaction is denial, and it only gets worse from there. Absolutely no one except 1-2 colleagues will see it as "oh we should fix that", and at least one person will make sure your boss' boss' boss is CCd on an email with a nice version of "no he's just insufficiently concerned about {concurrency, memory management, take your pick}" Just wait it out quietly when that happens, do not engage or complain. If nothing happens and you're never asked about it by leadership, but your peers ask, make plans to move onto another team.
A long impactful career, or a career of horrible frustration and alienation as everyone gets mad at you for pointing out their bugs? (or, from their point of view, making trouble insisting that something is a bug which isn't and is causing no problems)
I've been at big tech companies for most of my career and I've never seen anyone deny the existence of a technical bug. I've seen plenty of teams mark a bug as lower priority and never fix it because other things are higher priority. But denying that the bug exists, especially after a detailed explanation? That doesn't resonate with my experiences.
It used to be writing the outputs from the C/C++ preprocessor (.i files) to disk took forever (5+ minutes IIRC) with Microsoft's compilers. I asked one of the lead compiler developers why, and he waved me away saying it was just really complicated. Around that time a bunch of tools existed for GCC that worked with .i files, but none existed in the Microsoft ecosystem likely because writing .i files was so slow.
I was on the compiler test team at the time and we did lots of stuff with .i files, our tests were distributed across a large cluster of test machines (see my post about that https://meanderingthoughts.hashnode.dev/how-microsoft-tested...) so it wasn't a big deal, but it still annoyed me.
One day I decided to find out what was going on, so I loaded up process monitor while outputting a .i file and watched what was happening. Much to my surprise, only 1 byte was being written at a time! No wonder writes were taking forever.
A quick dive into the source code revealed a comment above the file write call that read to the effect
// to work around a bug in windows 98
So anyway I opened a bug against the compiler saying we should probably fix that. :)
But that's not the type of story that's being claimed from the person I responded to.
Of course the lead developer waved you off. You wondered why things took forever, and the lead developer knew it was a complicated system and figured it wasn't worth their time investigating. It happened to be incorrect, but the lead developer wasn't in denial. They just filtered the issue out because they can't afford to go down every rabbit-hole they come across. I'm sure once you found the actual bug, it was later fixed.
The person I was responding to seems to think a large number of people are in denial when a bug is filed against them. That doesn't make sense, and isn't something I see. It'd be as if when you pointed out the actual bug, the lead developer continued to say it wasn't actually a bug (which is of course ridiculous and I bet didn't happen).
You are ascribing an absurdly maximalist viewpoint to me, one that would be obviously wrong at its face.
I know it's not so confusing as to get that sort of interpretation, because of the score on the comment, and comments like the above that explain to you how this happens.
As a result, I don't feel comfortable providing more detail publicly about my situation. That far off the mark tends to indicate an aggressive rather than curious interlocutor.
I am comfortable building on their example. The particulars of the issue are quite similar in a very helpful way.
I did the investigation, did a fix, worked it up to my manager and my manager's manager. Elated, we worked diligently for a couple of weeks to document it concisely: a 3-page tech doc, the briefest code deltas possible, one page + slides with simple diagrams.
It gets bogged down at my manager's manager's co-lead's sub-manager for the platform team implicated. They basically say "reading a single byte at a time means it's provably serial and thus has no concurrency bugs", as indicated in my original comment.
I used to work on a large-ish open source project. When I was bored I used to go bug-picking (not hunting, picking). I'd go and browse the source wherever my intuition told me there was likely a bug and I indeed found a few with such a "method".
Oh, it was! And there was no way to "prove" it, it's just like "look, the equation this way actually makes a lot of sense and is very clever and produces results that seem reasonable and much respect to whoever designed it... and the equation the way it is in code now... does not. It seems like clearly it got accidentally transposed at some point?"
And the response was just like "We disagree, we think it makes sense the way it is and the product is correct"
That's kind of the end of the argument, there's nothing more one can say!
It didn't help that I came in assuming that of course everyone would see which version was correct (as you just did! although i didn't find it obvious, it took me lots of study to figure out), instead of producing a narrative designed to gently persuade them that. (That's on me -- I think I've learned something about technical communications around bugs and disagreements since then, although I'm still far from perfect).
The real answer, I think was given in one of the reddit thread comments -- the way it's broken for the most part _doesn't matter_ in the usual operations of reddit, it matters only in edge cases, and not very important ones, so really people mostly don't notice and we don't care.
Fair enough, I guess? But they did fix it three years later? I forget how I even found out they had fixed it; at this point I can't find any context for _why_ they fixed it, or who with power finally noticed/agreed it could use fixing.
(And if it had happened three years after that, it would not have been in public source, and my gloating satisfaction would have been stolen!)
* Outlier values are used to prune values.
* Transformers seem to undergo a "phase shift" in how outlier features are treated around 6.7B parameters. This could complicate research on removing them.
Maybe you and Tim Dettmers would have a lot to talk about :)
The author identifies a real problem and poses a simple solution. It passes all my crank tests (why did no one come up with this before? Because the author is intimately familiar with the softmax function from work outside of ML, and plausibly nobody who’s investigating these issues is remotely as familiar, so despite researchers narrowing the issue down to “something to do with softmax”, they don’t have a deep enough understanding of softmax to see what’s wrong).
If the author is reading any of these comments, though, I would urge them to expand on their claim that “I’m 99.44% sure that it will resolve the outlier feedback loop”. As it stands, that’s the only explanation we get of how the outliers might be related to softmax!
> """Softmax function with an additional virtual logit equal to zero.
For compatibility with some previously trained models.
This is equivalent to adding one to the denominator.
In the context of attention, it allows you to attend to nothing.
And creates the exact same modified softmax as this essay. I suppose only time will tell why it was ignored publicly before, maybe it doesn't do much, maybe it just fell through the cracks, maybe google just didnt push it, who knows
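The equivalence claimed in that docstring is easy to check numerically. A minimal sketch in plain Python (the function names are made up for illustration and are not the library's API): a softmax over the logits plus one extra logit fixed at zero, with the extra entry dropped, gives the same weights as adding one to the denominator.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_plus_one(logits):
    """exp(x_i) / (1 + sum_j exp(x_j)), computed stably."""
    m = max(max(logits), 0.0)  # the implicit zero logit takes part in the max
    exps = [math.exp(x - m) for x in logits]
    total = math.exp(-m) + sum(exps)  # exp(0 - m) is the "+1" term
    return [e / total for e in exps]

def softmax_with_virtual_zero(logits):
    """Ordinary softmax over [0] + logits, then drop the virtual entry."""
    return softmax([0.0] + list(logits))[1:]

confident = softmax_plus_one([10.0, 0.0, -5.0])   # nearly a probability vector
abstaining = softmax_plus_one([-10.0, -10.0, -10.0])  # weights sum to almost zero
print(sum(confident), sum(abstaining))
```

With all-low logits the weights sum to nearly zero, which is the "attend to nothing" behaviour; with one high logit the result is almost indistinguishable from ordinary softmax.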
> I suppose only time will tell why it was ignored publicly before, maybe it doesn't do much, maybe it just fell through the cracks, maybe google just didnt push it, who knows
Maybe quantization wasn't as hot back then as it is now?
You seem to really disregard the positions of this author. They seem to have invested substantial efforts in that specific area of research.
To validate the idea the author has, it would be required to train a LLM from zero. If the author is right, you would get similar results to the current generation of LLMs, but with (a lot) less space required for the intermediate layers.
The time to achieve that is still measured in kilo- to mega-dollars, why is it wrong to put that idea in the open to substantially criticize or adopt?
You don't need to train a ChatGPT-sized LLM, a toy nanoGPT would have been enough. You can train those on a consumer GPU in an afternoon.
And yes I do disregard his research effort. There are hundreds of well-justified and well-researched "clever tricks" for improving Transformers, and almost all of them don't work. I'll believe it when I see the results.
Outliers only begin to appear around 3B parameters (as per the original LLM.int8 paper) so unfortunately not consumer GPU in an afternoon kinda stuff to prove you've managed to suppress them.
I tried to test this with nanoGPT in an afternoon, since the code change is pretty minimal. It's hard to get conclusive results at that scale though - to be able to say anything with confidence you'd need to run multiple tests, figure out if the 'outliers' mentioned only appear above a certain scale, find good tests for quantization performance that work on small enough models that you can iterate quickly ... It's doable but still lots of work, enough that putting out the idea and hoping others with more time+compute will try it out seems a valid strategy to me :)
More generally though I definitely agree that the trend among 'improvements' to transformers has been things that don't turn out to work in practice.
Do you know of handy testing steps? I suppose I could ask ChatGPT, but if someone has a validated "here, this is how you do it" I have a 3090 that I can do it on, but I'm not keen to debug anything here.
Testing steps (based on thinking about this for 30 seconds - so probably can be improved):
Train a Transformer based model with and without the modified Softmax (Suggestions: GPT-2 or nanoGPT)
Measure performance - I'd probably start with Perplexity and see if there is any difference (we'd expect little difference).
Quantize both models with different quantization strategies.
Measure the perplexity of the quantized models of different sizes. We'd expect the performance to drop off quicker for the non-modified model than the modified one if this is working.
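To make the last two steps a bit more concrete, here is a toy sketch of why outliers are expected to hurt quantized models. It assumes simple absmax quantization with one scale per tensor (real quantization schemes differ), and the numbers are made-up toy activations, not measurements from any model. `perplexity` is the usual exp of the mean per-token negative log-likelihood.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def absmax_roundtrip(values, bits=8):
    """Quantize to signed integers with a single absmax scale, then dequantize."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return list(values)  # nothing to scale
    scale = peak / qmax
    return [round(v / scale) * scale for v in values]

def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)

# Toy activations in [-0.5, 0.5], with and without one large outlier.
small = [i / 100 for i in range(-50, 51)]
with_outlier = small + [60.0]

err_clean = mean_abs_error(small, absmax_roundtrip(small))
# Compare only the shared (non-outlier) entries.
err_outlier = mean_abs_error(small, absmax_roundtrip(with_outlier)[:len(small)])
print(err_clean, err_outlier)
```

A single outlier stretches the scale so far that the ordinary values collapse onto a handful of integer levels. If the modified softmax really suppresses outliers, that is the effect that should show up as a smaller perplexity gap after quantization.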
I was thinking about a different problem as I was typing that and got some mental memory alias bug. I wanted to know a set of steps to take to train a model. My apologies.
I did a writeup like this (not as nicely as Simon, though) where I used modal.com (cloud GPU, containers, quick starts, free $30/m spend) for their GPUs (e.g. T4, A100).
T4 I think was good enough for the job, not much need for the A100.
Since this post I am working on an easy way to do this with a script called lob.py that requires no code changes to the nanoGPT repo (or whatever repo you are using) and runs in modal.com. The script exists but gets refined as I use it. Once it is battle tested a bit more I will do a post.
(It is named lob.py as it "lobs the code over to the server" where lob is UK slang for throw)
Thank you. FWIW I often find write-up + script superior to a script alone because I often want to modify things. E.g. I want to run GPU-only, and another script can provide a part-way solution once a textual description is added. Therefore, much appreciated.
I think there might be some curse of the auto-didact here, hinging on the meaning of publish: it would be embarrassing if he was capital-P publishing, as in a scientific paper.
The blog goes to great lengths to point out it is _not_ capital-P publishing.
> why did no one come up with this before? Because the author is intimately familiar with the softmax function from work outside of ML, and plausibly nobody who’s investigating these issues is remotely as familiar
I doubt that is true. Softmax is extremely well understood within the ML community. It's a very common trick, these properties are well-known as well. It feels very unlikely that nobody has thought of this before. That said, it's also plausible that the current softmax convention was chosen by accident and the author is right to identify this drawback.
And because the effects of the problem are subtle. Supposing the diagnosis is correct, full-precision LLMs still avoid the issue through large attention weights given to meaningless tokens to give harmless attention outputs. The problem only matters when quantizing weights, and quantized performance isn't really the goal of recent cutting-edge LLM development.
I know it's in vogue on HN to complain about academia, but the blog post is not making a good argument.
The post could probably have gotten the point across in less than 1/4 of the overall length (probably even less than 1/8th); instead the author wrapped the post in lots of informalisms and a thinly veiled complaint about academic publishing.
The result of this is reflected in the discussion here, nobody actually writes about the result/idea behind the post, instead we have ~200 comments discussing the merits of academic publishing vs blog posts and formal vs informal writing.
So I guess if you want to get your blog-post on the front page of HN it's a good writing style. If you want someone to consider and discuss the merits of your idea, maybe not so much.
That's the fundamental reason we end up with an Attention Economy - people have limited Attention to pay to everything, but unlimited capacity/need to receive Attention (via Michael Goldhaber).
This plants the seed for the info explosion (those 200 bikeshedding comments or those 6 billion videos on how to boil an egg).
To counter it we have rankings of comments and links and news feeds from Google to FB to HN. But it's just another layer of bullshit, because most of the pool of what is being ranked is bullshit.
We are yet to design Information systems that take into account what Goldhaber said about Attention 3-4 decades ago.
You sneer at “getting on the front page of HN” but when you rephrase it to, “discuss something you’ve observed informally” your dismissal loses its oomph.
The point may be to entertain as well as inform. Many humans enjoy the unfocused discussion around the main point, and perhaps the author prefers it to the clinical and formal tone an academic paper tends to take.
For what it’s worth, someone pointed out that PyTorch has an optional workaround for it in the Multihead Attention API. But yes, I had to skip over 200 comments ranting off-topic, which was mildly annoying (to me).
I ran an experiment like this and in my setting it didn't help. Not saying there may not have been a bug or something, but I think attending over the current position sort of solves this problem. IE when it should not speak it just emits the current pos value.
edit to add details in case anyone is interested
I didn't add one to the softmax denom. I added a learned parameter (the attention sink) that would be appended to the beginning of QK but would be removed after softmax, so when multiplying by V the totals wouldn't sum to one. I tried variants that included looking at the current pos and not, and also variants that used an ffn to generate the sink per position instead of a learned param. In my setting neither approach really made much of a difference. But I also had a bunch of other weird stuff in there too, so it may be worth trying again.
When you say it didn't help, can you clarify what you're measuring? In the context of this post, I think both the performance on your task, and the number of outlier weights (and their magnitude) are important.
I was just looking at doing this in pretraining, so I was looking at pretraining losses. The difference was within the range of usual noise so I didn't keep trying.
It wasn't really the goal of my experiment to fix this issue for sure, I was trying to see if you could improve attention by decoupling the key used by a position for itself and for future tokens.
Open to being wrong here, but wouldn't it be functionally similar to adding a constant to the softmax denom? The function could sort of learn a specific position to have sink and q multiply to one; then removing it before multiplying with v would be exactly identical?
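The relationship can be checked with a few lines. This is a sketch under the assumption that the sink is a single scalar logit prepended to one row of attention scores and dropped after the softmax (the experiment described above may have differed in detail; the function names are illustrative). In that setup the sink adds exp(sink_logit) to the denominator, so a sink logit of 0 reproduces the "+1" variant exactly, while other values give other constants.

```python
import math

def attention_weights_with_sink(scores, sink_logit):
    """Softmax over [sink_logit] + scores, then drop the sink's weight."""
    padded = [sink_logit] + list(scores)
    m = max(padded)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in padded]
    total = sum(exps)
    return [e / total for e in exps[1:]]

def softmax_plus_constant(scores, constant):
    """exp(x_i) / (constant + sum_j exp(x_j)); fine for small toy scores."""
    denom = constant + sum(math.exp(x) for x in scores)
    return [math.exp(x) / denom for x in scores]

scores = [0.7, -1.2, 2.1]
print(attention_weights_with_sink(scores, 0.0))
print(softmax_plus_constant(scores, 1.0))
```

Both calls print the same weights, and they sum to less than one because the sink's share is discarded.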
Now it’s possible that softmax should be replaced wholesale, but it’s worked pretty well for the most part, except for this one wee little bug that prevents attention heads from saying nothing. So I propose a very small tweak on which I am willing to stake all future Internet claims to being correct. The tweak is so small, yet so obvious, and it’s been sitting here under everyone’s noses ever since attention was invented (2014).
I didn't test for outliers, but I don't think this will lead to a large improvement in attention overall/it will fix a lurking bug.
He refers all over the blog post to an "error" in attention. Specifically, he says:
The problem with using softmax is that it forces each attention head to make an annotation, even if it has no information to add to the output vector. Using softmax to choose among discrete alternatives is great; using it for optional annotation (i.e. as input into addition) is, like, not cool, man.
I'm saying it uses the current position to do this, that if it was a significant error I would expect it to improve the training loss. I sort of interpreted the blog post as being a bit more positive on the idea than just being about improving the quantization
I agree that he used the term error somewhat incorrectly. But he seems mainly to just be making the point that softmax introduces a large outlier, which is only becoming an issue now that the community is aggressively trying to quantize models.
I don't see any results; it'd be more impactful and convincing if there were numbers supplementing the theory. It's not that hard to fine-tune an existing LM on a small dataset and verify that it works.
I am, however, of the similar opinion that there could be better attention formulations. A paper from 2020, https://arxiv.org/abs/2005.09561, helped a lot in one of the transformer models I trained (not a vanilla LM but a specialised multi-modal graph problem).
It proposes normalised attention which if I'm not wrong should help with the quantisation problem too.
This method was frequently used prior to the ubiquity of dummy tokens. XLNet was the paper that introduced me to this idea. I believe it’s been in PyTorch since 2019/2020. I would not be surprised if someone finds an earlier reference.
I’m surprised by the pompousness in the OP. Especially about something that most people who do transformer research understand. I’m also surprised that so many in the replies are taking the position of “this is what research should look like” when this is clearly an example of why research doesn’t work like this. Peer review is good for many things and one of those things is saving yourself some embarrassment.
He's not being pompous, people appreciate the informality and straightforwardness and self-deprecation, which are the opposite of pompous.
You are reading some of the more ambiguous self-deprecation as genuine claims.
TL;DR on why this is important and he's sharing: it's a sort of niche thing that really only matters if you're trying to run pale imitations of ChatGPT on constrained hardware. That's why it's entirely possible the big guns didn't see it as important, they're not trying to run LLMs on a 3090
He’s writing in a colloquial and self-deprecating and humorous tone. I can’t speak to the merits, but I can follow the reasoning perfectly fine. It’d be hard to find something further from pompous.
> saving yourself some embarrassment
Implying of course that being wrong, or not the first one to discover this, is embarrassing. And that’s not pompous?
This is similar to the (old) trick of adding a Uniform distribution component to a Mixture of Gaussians model. It doesn't really change the math wrt parameter optimization and probability evaluation, but provides a place to capture "background" or "unimportant" data points and improve the model robustness to outliers.
The motivation follows from the same problem the author points out in the original softmax formulation that it always "forces a choice" when it may be more useful to put a "Not Applicable" option into the model itself.
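A small sketch of that trick, in one dimension and with made-up parameters (the component weights, means and the `[-50, 50]` background range are all illustrative assumptions): each point gets a responsibility per Gaussian plus one for the uniform "background" component, and far-away points end up almost entirely in the background instead of being forced onto a Gaussian.

```python
import math

def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def responsibilities(x, components, bg_weight, lo, hi):
    """Posterior weight of each Gaussian and of a uniform background on [lo, hi].

    components is a list of (weight, mean, std); the weights plus bg_weight
    should sum to 1, and x is assumed to lie inside [lo, hi].
    """
    background = bg_weight / (hi - lo) if lo <= x <= hi else 0.0
    parts = [w * gaussian_pdf(x, mean, std) for w, mean, std in components]
    total = background + sum(parts)
    return [p / total for p in parts], background / total

components = [(0.45, 0.0, 1.0), (0.45, 5.0, 1.0)]
inlier_gauss, inlier_bg = responsibilities(0.0, components, 0.1, -50.0, 50.0)
outlier_gauss, outlier_bg = responsibilities(40.0, components, 0.1, -50.0, 50.0)
print(inlier_bg, outlier_bg)
```

The point at 0 is claimed by the first Gaussian, while the point at 40 goes to the background, so it would barely influence the Gaussians' mean and variance updates.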
This reminds me of the normalization bug in StyleGAN. It had this obvious visual artifact of a 'blob' which would appear in otherwise photorealistic images, which was puzzling because it was so obvious: how did the Discriminator not squash it? It turned out to be a flaw in the normalization of the AdaIN style layers, IIRC, where the Generator was pumping up numbers and doing weird things to force through information.
I don't really understand the subject matter enough, so I apologize in advance for the meta-comment...
The author mentions that he would maybe have written this as a scientific paper:
> I tried writing a serious-looking research paper about the bug and my proposed fix, but I lost a series of pitched battles against Pytorch and biblatex, so I figured I’d just write a blog post instead. (History is written by the winners; blogs are written by…)
Honestly, thank god he didn't. This paper is so much more readable and approachable than what gets published in "serious" journals. The tone is self-effacing, it does not have an "ego" the way scientific papers tend to have. If all science read like this, and if we were "allowed" to cite research that reads like this, I think we would be much better off. This reads like a conversational, approachable textbook, not like an impenetrable wall.
Is it because I don't understand attention at a PhD level that I hold this opinion? Maybe. Could he be writing like this because he's a layman and utterly wrong about the topic, unlike those Serious Science Authors? Maybe, I don't know.
But my god, wouldn't it be nice to be allowed to write like this?
Nah, scientific papers are supposed to be precise and technical. This reads like those quite frequent suggestions here of switching all equations in papers to plain English or code: it honestly comes from a place of ignorance, and I say that as basically a layman myself.
What should be encouraged is for academics to blog about their research as well. It would even help when recruiting and onboarding new members. Right now the sociological and economical incentives don't promote this at all.
There was this sociologist who had written a paper for us all to read ahead of time. I started to read the damn thing, and my eyes were coming out: I couldn’t make head nor tail of it! I figured it was because I hadn’t read any of the books on the list. I had this uneasy feeling of “I’m not adequate,” until finally I said to myself “I’m gonna stop, and read one sentence slowly so I can figure out what the hell it means.”
So I stopped, at random, and read the next sentence very carefully. I can’t remember it precisely, but it was very close to this: “The individual member of the social community often receives his information via visual, symbolic channels.” I went back and forth over it, and translated. You know what it means? “People read.”
Then I went over the next sentence, and realised that I could translate that one also. Then it became a kind of empty business: “Sometimes people read; sometimes people listen to the radio,” and so on, but written in such a fancy way that I couldn’t understand it at first, and when I finally deciphered it, there was nothing to it.
-- Feynman
I disagree. After going through quite a few research papers in my time, I've found the best are the ones that are direct and to the point. Many papers I've spent many hours/days trying to unravel just to realize the concepts were straightforward, not very novel, and there wasn't much of real substance to the paper.
Meanwhile, some of the most impactful papers I've read are direct and to the point. Kademlia, Bitcoin, BitTorrent, DynamoDB, Firecracker, etc.
It seems like, when you have something of substance to say, you say it. When you don't you overcompensate by falling back on building an intricate puzzle of jargon and convoluted equations in an attempt to make what you're saying sound far more important than it really is.
As LLMs get better, I look forward to the day where every journal has a standard LLM filter you're required to apply to your paper that unravels all of this nonsense and rewrites it in a more straightforward way, if not to directly publish then just for the editors to verify there isn't a simpler way to convey your ideas. I suspect that if we had an ELI5 filter for most journal articles, we'd discover that a majority of the words that get published have very little substance at all.
Systems research papers do not represent all research papers out there, not even in computer science.
In cryptography, certainly a paper with formal definitions and proofs can be much more valuable than a corresponding blog post. It's a field where formalism is desired, if not necessary. Otherwise you can't check other people's "proofs", or even know what model you're working in.
I think, since people haven't come up with better formalisms, sometimes it's quite obtuse, which gets mistaken as "academic writing", when really it's a best effort to formalize.
Requiring formalism does not preclude attaching an informal but intuitive description of the formal definition or proof. Unless the authors don't understand very clearly what they are talking about, or they want to prevent others from understanding their concepts too easily, I don't see any reason for the authors not to attach an ELI5 in addition to the formalism.
Sure. But it's an ELI5 "in addition to formalism", not "in lieu of formalism". In theory conferences like STOC or FOCS, the first section of the paper often comprises such an overview.
Certainly some papers are better written than others. But sometimes a blog post cannot replace a paper, unless it also goes into the depth and detail that formalism requires. (Then it becomes a 30 page blog post, where most people don't read past the intro.)
Papers are mostly read by other researchers, where the added background is actively bad because it obscures the real meat of the paper to the main audience.
If you just wanted a digestible intro then you would usually buy a textbook.
I think the argument that every research paper ought to be a mashup of a textbook + the actual research is a bit silly from a “people should specialize at what they’re good at” standpoint.
Put in another context, I also don’t want every recipe to reintroduce what it means to “fry” or “braise” or “marinate”. We have Google for that.
I've long wanted an informational slider for bits of text. Something where you can zoom in and out to the level of desired complexity. LLMs might be able to fill in some of those gaps. You could turn any paper into an introduction to the subject it's a part of.
This sounds like a good use case for local llm models. Browser plugin, precooked prompts for different levels of detail, maybe a lora to give the model some idea of expected output. I bet some of the 13b models could do a useful job on this even if they were imperfect.
I don't know that much about AI, but my experience in other areas has shown me that 'more grown up' literature that feels harder to parse when you're starting out later becomes the precise technical information you need as you get deeper into a subject. Like W3Schools when you start out in web dev vs MDN when your skills are more mature.
I believe Feynman understood that he was oversimplifying, and I believe he was able to do so because his reason for reading the paper was not the same as the reason another sociologist might have. Thus a sentence like "The individual member of the social community often receives his information via visual, symbolic channels" does, to a non-expert, mean "people read", but to another sociologist or a researcher in related fields, phrases like "individual member", "social community", and "visual, symbolic channels" would be terms of art. That means an expert in the field could read "social community" and it would mean, cognitively, an entire set of concepts in the field.
In short, jargon matters. People here can talk about functional, procedural, and object-oriented programming because each of the three words has more than just the dictionary meaning - to those of us in the field. In the same way we can talk about linear algebra and know it doesn't mean "algebra on lines".
Yes, it's possible to write scientifically without jargon and wordiness, but it's a lot of effort and takes much more space to say "a group who follow a social structure within a society (culture, norms, values, status). They may work together to organise social life within a particular place, or they may be bound by a sense of belonging sustained across time and space"[1]
Visual symbols could be anything from written words to police uniforms. It's not oversimplifying— it's flat-out wrong. It would be like reading
Expressions representing numbers may be combined with an expression representing a primitive procedure (such as + or *) to form a compound expression that represents the application of the procedure to those numbers.
And an English professor haughtily responding, "you know what that means? 'Computers compute!' This SICP book is just a pile of jargon that could be dramatically simplified!"
His dismissal revealed nothing about the topic, but a whole lot about how so many in the "hard" sciences view others. Don't understand the text? It's the text's fault! For I am a real scientist, and if I don't understand it, it's not understandable!
He might have been a genius, but he should have stuck to subatomic particles and left exploring human behavior up to the people who'd done the prerequisite reading.
Well, maybe, but you can rationalize arbitrary amounts of pointless jargon that way.
Besides, in the example Feynman gives, the simple sentence is actually shorter. Maybe that shorter sentence loses some information that the jargon carried, but Occam's razor suggests the writer was just trying to sound smarter.
Some bad writing certainly comes from trying to sound “academic” or “scholarly” but there’s more to it than that.
A lot of research involves lumping and splitting: what underlying properties do these seemingly different things share (or vice versa)? For example, reading text is just one possible instantiation of a “visual symbolic channel.” Traffic lights, road signs, gauges and dials, logos, and clocks also carry information the same way. If you want to discuss “reading and reading-like activities”, you may want some kind of umbrella term.
Plus, you may want to contrast them with other ways of sharing information: non-symbolic systems that literally depict the item in question (photos on a picture menu, for example) or using a different sense altogether, like church bells for telling time.
Not an academic here, but I've read (and continue to read) through research papers regularly.
The original bitcoin paper is a great example. I was able to follow the paper almost fully at my first read itself—despite my not having a formal background in maths.
...and as you said, many of the insubstantial papers hide behind jargon and unnecessarily complex equations, just to camouflage their lack of substance. It's frustrating to spend time deciphering a paper, only to realize that you've essentially wasted that time.
The criticism was """Haraway's work has been criticized for being "methodologically vague"[39] and using noticeably opaque language that is "sometimes concealing in an apparently deliberate way""""
>Haraway's work has been criticized for being "methodologically vague"[39] and using noticeably opaque language that is "sometimes concealing in an apparently deliberate way
So you're saying that "Her work is basically handwaving and bullshitting".
Yes, but also, wrapping the handwaving and bullshitting in a layer of obfuscation:
"Michel Foucault’s biopolitics is a flaccid premonition of cyborg politics, a very open field.
By the late twentieth century, our time, a mythic time, we
are all chimeras, theorized and fabricated hybrids of machine
and organism—in short, cyborgs. The cyborg is our ontology; it
gives us our politics. The cyborg is a condensed image of both
imagination and material reality, the two joined centers structuring any possibility of historical transformation. In the traditions of “Western” science and politics—the tradition of racist,
male-dominant capitalism; the tradition of progress; the tradition of the appropriation of nature as resource for the productions of culture; the tradition of reproduction of the self from
the reflections of the other—the relation between organism
and machine has been a border war"
Donna Haraway was born 6 years after “stay woke” in its sense as an admonition to maintain alertness to the racist context was coined. Leaving aside a debate over whether her work is a good match for “woke”, she very much cannot have been woke before woke was a thing. (Before its recent replacement of “politically correct” as the American Right’s preferred, meaning-stripped, label for everything it disagrees with, sure, but “woke” was a thing long before that.)
>Leaving aside a debate over whether her work is a good match for “woke”, she very much cannot have been woke before woke was a thing
A game of being pedantic is always welcome:
She very well could have been "woke before woke was a thing", because "woke" as the parent means it in her case, refers to the modern usage (of like, 2 decades), not the original term of the 40s that might have preceded her birth.
So take the parent's comment to mean:
"She was woke, in the modern, circa-2000s+ sense, before woke, in the modern circa-2000s+ sense was a thing, not in the 1950s namesake sense".
Similar to how somebody could have been a hipster (in the 2000s+ sense [1]) before a hipster was a thing (before 2000s), even if they have been born in the 70s. Sure, the term already existed before the 70s, but it referred to a different thing.
The 1938 sense in which it was coined is exactly the sense of the 1950s and the sense that got increased attention circa 2000s and catapulted to attention alongside BLM (which itself was a response to the same kind of event that the art in which the phrase was coined responded to).
The only newer sense is the American Right’s use of the term to replace “political correctness” as an empty epithet for everything and everyone it disagrees with.
Language grows organically, and the American right gets as much as the American left to define what a word means or how proponents of a movement or social fad are seen in practice (besides woke's standard definition is just "awake", if someone insists on the "original meaning")
So, one side could see woke in theory as a noble activist/social consciousness practice, which can not go wrong and helps liberate us all.
The other side might see woke in practice as intolerable virtue signalling and self-aggrandizing whose actions often border on farcical.
Thanks; that's exactly what I meant. I leave these things out because I assume not everybody is pedantically waiting to call me out on a slight variation on their personal belief system.
> I disagree. After going through quite a few research papers in my time, I've found the best are the ones that are direct and to the point. Many papers I've spent many hours/days trying to unravel just to realize the concepts were straightforward, not very novel, and there wasn't much of real substance to the paper.
Can say the same thing about code. Some people just honestly don't want to give away how simple the core logic is seemingly, and will lead you through myriad twists and turns to finally see the point.
The writing quality of academic papers is very poor, whatever its intended characteristics are, and we deserve better.
I'm skeptical that the only way for them to be precise and technical is to make them impenetrable. I think there is a culture of academic writing (many different cultures, really) that has adopted a voice and writing style which became a parody of itself over time.
Here's a trivial example: You frequently see papers use the passive voice, something a middle school English teacher would mark with a red pen. 500 participants were asked, vs. we asked 500 participants. In what sense is the former more precise and technical? It's not. It does not convey any additional meaning. People use it to sound objective and distant, even when they really aren't.
Realistically, academic writers usually don't even think about it as much as that. They're just copying the tone of other papers, because there is a culture and it enforces certain behaviors on its members irrespective of the value.
In philosophy papers you see authors often use the pronoun "I", similar to blog posts. But they have other ways to make them hard to parse for outsiders.
Either your example is too trivial to justify your point, or the point itself is trivial. It's right for an academic to distance themselves from the subject of their study because we do need researchers who try not to be biased. If they fail that and then correct themselves, then what's the problem? Complaining about inconsequential uses of tone is obsessing about form over function and reeks too much of insecurity, to be honest.
Of course language does not guarantee that the study is objective—that would be in the design of the experiment, the reproducibility of results, and the absence of conflicts of interest among the researchers. Using the passive voice however elevates the outcomes being reported as facts that actually happened, instead of mere personal experiences.
People complain all the time about news being biased for being told from a reporter’s point of view, but complain all the same when events are reported in an encyclopedic manner as researchers do when they remove themselves from the events and the outcomes of their studies.
I'm convinced that the value of active voice is not precision and clarity, but rather the subliminal egocentrism away from the object (the research) towards the subject (the researchers) who need to receive credit for the work. The royal "we" also helps frame the work as a collaborative effort with the audience.
That's rubbish, passive voice has a number of detrimental effects, it increases text length without adding information, it makes subject (acting entity) and object (entity acted upon) easier to confuse and it confuses the reader about who actually did things (what some people often confuse with objectivity).
That said, the assertion that most scientific articles are written in passive voice has been outdated for quite some time. Most journal style guides advise using active voice, e.g. https://www.nature.com/nature-portfolio/for-authors/write
> it confuses the reader about who actually did things
When scientific papers have a clear list of authors and delineated section headings, this point is moot. And in such papers, again, repetitive strings of sentences that begin with the same "we..." emphasize the producers of the work over the work itself.
I agree with everything you say. Though papers really are a bit too hard to read sometimes, but I'd argue it's often not for an overly technical tone so much as writers cutting out a lot of background material for brevity and assumed familiarity.
>What should be encouraged is for academics to blog about their research as well. It would even help when recruiting and onboarding new members. Right now the sociological and economical incentives don't promote this at all.
I will add onto this that a lot of journals have been pushing for video abstracts and "plain English" abstracts. For the most part I don't see these too often but when they're there they're appreciated, and I vaguely recall that someone found that citations go up when they're used (specifically plain English, I don't think anything has been on video abstracts).
There are a lot of good blogs for computational academic subjects (ml, bioinformatics, comp neuro, etc) but I see less for bio and non-software engineering. Math and physics seems to have some really notable blogs, but beyond what gets posted to HN and linked further on those blogs, I can't comment.
"it honestly comes from a place of ignorance, and I say that as basically a layman myself"
Here is an added complication: succinct technical communication can be efficient when communicating to peers who work on exactly the same domain and similar problems as you, and want to digest your main ideas quickly.
On the other hand, for any particular paper, the size of the audience to whom it is directly relevant and addressed to can be small. The size of the audience who got to reading it anyway may be vast. (Maybe I am reading your paper because someone cited a method paper that in lieu of a proof or explanation writes just two words and citation to your paper. Maybe I am a freshly minted new student reading it for my first seminar. Maybe I am from a neighboring field and trying to understand what is happening in yours. Maybe I tried to find what people have already done with particular idea I just had and search engine gave your paper. And so on.)
During my (admittedly lackluster) academic career I recall spending much more time trying to read and understand papers that were not addressed to me than papers that were and where I enjoyed the succinct style that avoids details and presents the results. (Maybe it is just an idiosyncratic trust issue on my part, because I am often skeptical of stated results and their interpretation, finding the methods more interesting). But that is not all.
I also noticed that genuine misunderstandings coming from "brief" communication of technical "details" were quite common; two different researchers would state they "applied method X to avoid Y/seek Z[citation]" in exactly so many and almost exactly the same words, where X, Y and Z were complicated technical terms, yet the authors would have quite different opinions about what those words meant, what the intended reading was, and how and why X should be implemented.
In conclusion, I think many a scientific field would benefit from a style where authors were expected to clearly explain what they did and why (as clearly as possible).
>Nah, scientific papers are supposed to be precise and technical.
They're also, more often than not, tedious, badly explained, error prone, oft-skipped, and hardly ever read carefully, even during peer review for the paper that contains them. That's how mistakes stay unnoticed for decades in influential papers with tons of citations.
In essence, a paper's tone and language is often more formality, academic tradition, ritual, and padding for publication purposes, than serving a real purpose.
Well, I'm not so sure. It seems to me that someone could perfectly well devise an experiment based off of this (another poster chastised me for saying paper, so) blog post.
Equations are perfectly clear. I was able to follow his reasoning perfectly well.
I cannot say the same for so many papers (tm) that I've read. Mostly in a similarly computational (though non- deeplearning) applied math domain.
Strongly agree. “Why are academic papers always written in such mumbo jumbo?” is the same complaint as “Why are contracts written in such legalese?”, which is a manifestation of “I’m smart and I don’t get this, so the author is dumb for not writing clearly.” It’s a natural human bias that most HN denizens insist they don’t possess, but of course we do.
> Nah, scientific papers are supposed to be precise and technical.
> What should be encouraged is for academics to blog about their research as well.
Why so binary? A blog would be hard to find, why not have both in the paper?
My view is similar to that of code vs docs: code should be as small, and as precise as possible, whereas docs are best when they’re explaining to humans how things fit together, high level. Also easier to maintain.
Hyper technical natural language mixed in with math is almost the worst of both worlds: low density of the actual formulas, with an incomprehensible wall of text surrounding it. And clearly this is an issue also for phd domain experts.
Not saying academic writing could be super simple but I also see no reason that the status quo is optimized more for comprehension than say social posturing.
I disagree because it isn't possible for language to be precise on its own syntactic merit. There is meaning and there is context and the biggest problem with research papers is that the context of many statements in the paper is incredibly ambiguous. The reason for that is that the papers are trying to be "concise". Context can only be disambiguated with more statements. You must eliminate potential interpretations that a reader could make.
"Spectrum sharing in an “apple-like” or a fixed set sense is not a coexistence. ". What does that mean? Coexist? Who knows, the author thought they were being precise, but they understood the statement they made with a head full of context that gave it precise meaning. As readers, we can only scratch our own heads as to what that context could possibly be.
Leslie Lamport definitely doesn’t share your opinion. A known fact about the Paxos paper is that there are no dumbed down summaries worth reading because the proper thing is so approachable. Not sure if you only have to sound smart if you’ve got nothing to say but certainly feels like it could be the case.
Paxos is so mystifyingly hard that Raft was invented as part of a project to understand Paxos (and the advisor and proponent of the project was John Ousterhout, who's pretty badass). There are also, I believe, a few papers trying to explain Paxos more clearly.
I've read a lot of scientific papers in the comp sci / machine learning space and they are rarely precise. It's been over a decade since I've read many papers so maybe this has changed, but I remember reading a paper out of Microsoft about how to make spell correcting auto-completion for search, and it was nearly impossible to figure out precisely how it was implemented. Precision would have been achieved easily by providing code and a sample data set. Instead it was a mix of prose and math equations with many gaps where you had to guess how to fill.
Ah yes, my old supervisor was very fond of that strategy.
"Make it sound like we do cool stuff; but don't make it so precise that they can re-implement what we do. Let them come to us so we can co-author papers."
I think maybe it's because he didn't have experimental results that show that it worked. Not a knock against the author, there are just so many things that seem like good ideas that don't end up working well in practice, a paper like this without results is hard to value.
Yes, definitely. If he tried to have it published, the lack of experimental results would definitely be a glaring error.
But this is still scientific communication. It's really nice that it's legible!
> Even though softmax1 is facially quite boring, I’m 99.44% sure that it will resolve the outlier feedback loop that’s making quantization the subject of cascades of research. If you want to run some experiments and prove me right, DM me on Twitter and we’ll get a paper going.
I'm guessing that in the stodgy world of science, a communication like this might happen over lunch at a conference, limited to a small clique of researchers who are zealously guarding their next paper. Who could blame them, publish or perish!
But someone will probably test this theory out (after my read, it will probably happen in llama.cpp with preliminary results on GPT-2 by next week) and achieve results, and it will happen quickly and legibly to the outside world, because this was published openly and without all of the pretension that formal science (tm) has. If it works, it works. Stuff like this is the soul of the internet. Sharing knowledge and making it legible for all.
Then again, if you don't have access to giant compute clusters you can't test this, so it's either a blog post or nothing. I believe the outlier problem that this solves only appears for very large models.
That isn’t true at all. Train a smaller model on a smaller dataset. You can even train on your laptop. It’s definitely feasible. This is just a proof of concept, it doesn’t need to beat state of the art.
Sure, emergent properties can arise as parameters increase. Everyone knows that. That’s a much less specific claim than to say that the benefit of modifying softmax can only arise as an emergent property after N parameters, and therefore the benefit can only be evaluated on models above a certain size. To my understanding the author of TFA isn’t suggesting the same issue as the one in your linked paper.
In a weird way my mental model is: blog posts are the recon team discovering a new idea. They might have errors. They might be incomplete. Maybe they’re outright wrong. Stakes are lower as it took less effort to get there and less loss if a position is abandoned.
Then papers are authored, often much later, and they’re the regulars coming in to fortify a newly captured idea. They provide (or at least are supposed to) rigor to the idea. A fortification of a position that we decide is worth holding.
Yeah, this analogy is probably sloppy. But in my brain there’s an eternal conflict against ignorance as we keep advancing into the unknown.
Counterargument: this blogpost is worthless. You get all the way to the end and then find out he hasn't actually tried it, not even on a toy model. It's just a neat idea he thinks will work.
Among other reasons, because the decoder-only version of the original transformer architecture has proven weirdly resistant to these kinds of hacks and clever optimizations.
Ideas like sparse attention, tree attention, residual attention, etc, all sound good on paper, but when researchers try to reproduce them they either find no results or results that don't scale. Even ALiBi is turning out to be less powerful than scaled-down positional embeddings. It's almost a bitter lesson on its own: you can't beat the original transformer.
Optimizations that do stick around tend to be the ones that preserve the original algorithm but help with caching or memory accesses.
Because there are a thousand ideas a minute in this field that meet the "it's worth trying" bar but don't actually pan out to make any difference. It's the equivalent of a blogpost that says "if someone else turned my idea into a business, it would be a billion dollar business. But I won't bother."
With say system architecture, you can muse on stuff like "well if Kubernetes made this decision, it would definitely be more secure" or "it would scale up quicker" without empirical evidence and other people could argue "yes I agree because" or "no I don't because"... etc.
With large ML models, there probably is no intuition like this. We just don't know "if I do the common sense thing X, it surely will produce better results for a given benchmark" ... well we have no idea until it is tried out.
Well if you yourself are trying to publish in a scientific venue you can't always cite exactly what you want to cite. Though it's probably uncommon for a peer reviewer to ask for a specific citation to be removed, the review process absolutely does affect the references list, and expectations about this process affect it doubly so.
In ML, no one is going to police your citation list. I've cited some weird stuff in my papers, including ideas from tweets and random quotes from Jeff Dean. It's never been a problem.
A lot of thoughts in this thread on what academic papers are or should be, let me give my own opinion as a person who tries to write papers.
Papers should be structured like fractals - that is, they should be "self-similar". The main text of the paper after the introduction should go into all the necessary details demonstrating the origins of the idea and proving that it has value. Then the introduction section should summarize all this, and take a less rigorous tone. The abstract should be a summary of the introduction. And then the title should summarize the abstract. If you really have a lot of technical work to do, maybe you can write a super long appendix and have the main body summarize that.
I myself probably spend as much time reading paper introductions as I do reading paper bodies, which means that probably 90% of the papers I read, I only read the introduction. I do this because I enjoy it more - I like new ideas, and the intros are a great way to get a lot of them. This blog post reads like a great paper introduction to me. It's easy to trick yourself into believing something is easy though, so an academic paper would have to back this up with an experiment.
There isn't much difference between a blog and a whitepaper, other than that people tend to write blogs more casually and whitepapers more seriously (and some academics even only accept things that look more serious).
But a good writer can write great articles in whatever format they wish.
> it does not have an "ego" the way scientific papers tend to have.
What do you call it when somebody takes the time to write about "a big discovery" they've made, but don't take the time to check if somebody else already did it? It's not like it's in some forgotten paper nobody has seen. It's in Pytorch itself.
Also this: "I’m 99.44% sure that it will resolve the outlier feedback loop that’s making quantization the subject of cascades of research."
It's interesting, because as a scientist who reads and writes these kinds of papers, my first impression was: This guy has a pretty big ego or is otherwise badly miscalibrated if he believes his genius idea has a "99.44%" chance of preventing outlier activations without doing any experiments.
I'm not an academic, but some of the notation and terminology they use makes me want to hunt them down and 'clockwork orange their eyes open' until they can show me how their math is "intended" to work.
Inconsistent math notation in papers along with vague terms in descriptions makes me so mad.
The proposed replacement definitely makes more sense (and I've always found the absence of a "failed query" to be puzzling in standard attention), but, in deep learning, things that make more sense don't always actually get better results. So I'm curious whether this has been tried and carefully evaluated.
The "missing 1" is a waste-category that is implicitly re-scaled.
The explicit 1 formulation is used in binary softmax, and the implicit (not seen 1) is used in multinomial softmax. I suspect this is the old "notation B looks silly in terms of notation A's standards."
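One way to write out the point above. This is the standard identity between the logistic sigmoid and a two-class softmax with one logit pinned to zero; the symbols $x$, $x_i$ and the name $\operatorname{softmax}_1$ are just notation chosen here, not anything from the article.

```latex
\sigma(x) \;=\; \frac{e^{x}}{1 + e^{x}} \;=\; \frac{e^{x}}{e^{0} + e^{x}} \;=\; \operatorname{softmax}\big([x,\,0]\big)_{1},
\qquad
\operatorname{softmax}_1(x)_i \;=\; \frac{e^{x_i}}{1 + \sum_{j=1}^{n} e^{x_j}} \;=\; \operatorname{softmax}\big([x_1,\dots,x_n,\,0]\big)_{i}.
```

Read this way, the "1" is the $e^{0}$ contribution of a class whose logit is fixed at zero and whose probability is simply not reported.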
If I followed this correctly and didn’t mess up my indices, adding 1 to the softmax denominator is exactly equivalent to appending an extra zero to the softmax input (effectively casting an exp(0) vote for a new null option) and appending an extra row of zeros to V (so the null option is all zeros).
The latter seems like something training could figure out by itself (zero doesn’t seem like a hard place to land with the weights producing V, although a bunch of zero weights would be needed), but the former is a bit awkward, as QK^T is quadratic in the weights.
In any case, this seems intuitively quite reasonable. But I do wonder whether the 1 in the denominator (equivalent to an exp(0) vote) is the best choice if the goal is to quantize well. 0 is in the middle of the numerical range, and perhaps the implicit null vote should be weighted lower than the middle of the range.
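The equivalence described in this comment (a +1 in the denominator versus an extra zero logit plus an extra all-zero row in V) can be checked numerically. A minimal NumPy sketch, assuming a single head with made-up toy shapes; `softmax1`, `scores` and `V` are illustrative names, not code from the article or from any library:

```python
import numpy as np

def softmax(x):
    # Standard softmax over the last axis, shifted by the row max for stability.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def softmax1(x):
    # Softmax with an extra 1 in the denominator (naive form, fine for small inputs).
    e = np.exp(x)
    return e / (1.0 + e.sum(axis=-1, keepdims=True))

rng = np.random.default_rng(0)
scores = rng.normal(size=(3, 5))   # hypothetical pre-softmax scores: 3 queries, 5 keys
V = rng.normal(size=(5, 4))        # hypothetical value matrix: 5 keys, 4 channels

out_plus_one = softmax1(scores) @ V

# Append a zero logit (an exp(0) = 1 vote for a "null" key) and a zero row in V.
scores_ext = np.concatenate([scores, np.zeros((3, 1))], axis=1)
V_ext = np.concatenate([V, np.zeros((1, 4))], axis=0)
out_null_key = softmax(scores_ext) @ V_ext

equivalent = np.allclose(out_plus_one, out_null_key)
```

Replacing the appended `0` with some other constant `c` would give the "null vote weighted lower" variant speculated about above, with `exp(c)` in place of the 1; that variant is only this comment's suggestion, not something the article proposes.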
I follow the argument but the proof of the pudding is in the eating. I don’t know what “battles” the author lost to PyTorch lately but a good test would be to modify one of the smaller models (maybe nanogpt) and swap out all of the softmax calls for his quiet softmax.
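For anyone who wants to try the swap suggested above, here is a rough sketch of an overflow-safe version of the function. It is written in NumPy for illustration only; nanoGPT itself is PyTorch, and this is not the article author's implementation. The one design choice worth noting is shifting by `max(x, 0)`, because the implicit null logit is 0 and has to take part in the max.

```python
import numpy as np

def softmax1(x, axis=-1):
    """exp(x_i) / (1 + sum_j exp(x_j)), computed without overflow."""
    x = np.asarray(x, dtype=np.float64)
    # Shift by max(x, 0): the implicit null logit is 0, so it takes part in the max.
    m = np.maximum(x.max(axis=axis, keepdims=True), 0.0)
    e = np.exp(x - m)
    return e / (np.exp(-m) + e.sum(axis=axis, keepdims=True))

confident = softmax1(np.array([1000.0, 0.0, -5.0]))   # one dominant score
abstain = softmax1(np.array([-20.0, -20.0, -20.0]))   # every score very negative
```

With one dominant score the weights still sum to (almost) 1, as in ordinary softmax; when every score is very negative the weights all go toward 0, which is the "opt out" behaviour the article is after. Whether that helps a trained model is exactly the experiment being asked for here.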
I didn’t see anything relevant on alternatives to softmax, since TFA is specifically questioning softmax in a multihead attention context.
Ultimately, neural networks are arbitrary function approximators. It doesn’t necessarily have to be “right” internally to fit the data. But if this new softmax allows transformers to learn more, that’s great.
1. Have attention spans been declining? (slimemoldtimemold.com)
338 points by janandonly 4 hours ago | flag | hide | 254 comments
2. Attention Is Off By One (evanmiller.org)
400 points by elbasti 4 hours ago | flag | hide | 129 comments
Note that the #1 post is probably there because the title earlier had the provocative "Yes, 65%" appended to it. So even more numerical.
The author's use of "kurtotic barbarities" to describe this situation is absolutely my new favorite phrase. English is a beautiful language in which to express frustrations.
The author says to add a unity vector to the context, i presume of each layer, to not mess with gradient calculations. But most modern DL frameworks compute the gradient for you, (i know this is true for JAX and Pytorch). Is it maybe that hand coded gradient for a well-known enough dl architecture like transformer is faster than letting the framework autodiff it?
Otherwise, i fear some of the 'magic' of transformer networks is that this amplification effect allows it to encode/memorize some results verbatim. And we often are seeing a heavily tuned internet regurgitator. So similar to the rise of RNNs with attention, which supposedly allowed them to focus on some things and ignore others but really often was just overfitting stuff, yielded more interesting results with the overfitting than without.
Don't transformers typically have a <bot> token at the beginning of the prompt? This seems equivalent to letting the network attend to this token, and produce a zero value if that's what it wants.
Yes, it has to in fact. If you have zero context to attend to in a transformer, and you try to predict the first token, you effectively are multiplying a zero-vector by the attention head, making all tokens equally likely in the final softmax (unless the lm_head has a bias, but at least in GPT it does not).
So the <|beginning of text|> token, with no context before it, learns to predict the first-token-in-a-document distribution. That's not quite the same as predicting nothing at all.
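The linear-algebra part of the claim above is easy to see in isolation: a zero vector pushed through a bias-free output projection gives all-zero logits, hence a uniform distribution. A toy NumPy sketch, where `W_head` is a random stand-in for `lm_head` and the sizes are made up; it deliberately ignores layer norm, residual connections and everything else in a real GPT:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 8, 10                      # hypothetical toy sizes
W_head = rng.normal(size=(d_model, vocab))  # bias-free output projection (stand-in for lm_head)

hidden = np.zeros(d_model)                  # a zero vector reaching the head
logits = hidden @ W_head                    # all zeros, whatever W_head is
probs = np.exp(logits) / np.exp(logits).sum()
```

Every entry of `probs` is `1 / vocab` here, regardless of the weights, which is why the model needs something non-zero to attend to (such as a start-of-text token) to predict a sensible first-token distribution.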
not a token, and not the transformers, but yes, commercial chat models are fine-tuned on text transcripts containing dialogues. (i believe llama-2 was as well)
Are you sure? I have never seen an LLM that did not have a special token for start of text, I'm certain that llama had one and I don't remember anywhere in the llama-2 paper where they said they removed it.
it's messy though, bear with me for the full explanation:
- your initial post says "<bot>" token, which looked like a mix of "chatbot" and ChatML, used by OpenAI
- there is a bo_S_ token, which acts as you described
- I averaged my attention over your post and the initial reply, which answers as if you were using "<bot>" in the misunderstood way
- when I go back and read your post, I realize the chatbot interpretation doesn't quite make sense, since you're referring to much more technical aspects than general "how do I AI", i.e. you understand <X> as a way to denote special tokens, not necessarily an XML tag
This part of his post where he explains vector embeddings of the input/output tokens just looks wrong to me:
>This vector seems to get taller every model year, for example the recent LLaMA 2 model from Meta uses an embedding vector of length 3,204, which works out to 6KB+ in half-precision floating-point, just to represent one word in the vocabulary, which typically contains 30,000 - 50,000 entries.
>Now if you’re a memory-miserly C programmer like me, you might wonder, why in the world are these AI goobers using 6KB to represent something that ought to take, like 2 bytes tops? If their vocabulary is less than 2^16=65,384, we only need 16 bits to represent an entry, yeah?
>Well, here is what the Transformer is actually doing: it transforms (eh?) that input vector to an output vector of the same size, and that final 6KB output vector needs to encode absolutely everything needed to predict the token after the current one. The job of each layer of the Transformer is quite literally adding information to the original, single-word vector. This is where the residual (née skip) connections come in: all of the attention machinery is just adding supplementary material to that original two bytes’ worth of information, analyzing the larger context to indicate, for instance, that the word pupil is referring to a student, and not to the hole in your eye.
Firstly, he is confusing representation with encoding--he's right that 2 bytes is enough to encode any token. That is in fact approximately how it's done: a code book is indexed into (with a longint in pytorch, at least last I worked with it ~6 months ago). The purpose of the embedding is to allow the model to learn a representation of the token, a la word2vec. (Though this representation is purely based on the characters comprising the token and does not distinguish between "student" and "eye" in the case of "pupil" as in his example.)
Secondly, his description of each layer's function as adding information to the original vector misses the mark IMO--it is more like the original input is convolved with the weights of the transformer into the output. I am probably missing the mark a bit here as well.
Lastly, his statement that the embedding vector of the final token output needs all the info for the next token is plainly incorrect. The final decoder layer, when predicting the next token, uses all the information from the previous layer's hidden layer, which is the size of the hidden units times the number of tokens so far.
> Secondly, his description of each layer's function as adding information to the original vector misses the mark IMO--it is more like the original input is convolved with the weights of the transformer into the output. I am probably missing the mark a bit here as well.
> Lastly, his statement that the embedding vector of the final token output needs all the info for the next token is plainly incorrect. The final decoder layer, when predicting the next token, uses all the information from the previous layer's hidden layer, which is the size of the hidden units times the number of tokens so far.
I think the author is correct. Information is only moved between tokens in the attention layers, not in the MLP layers or in the final linear layer before the softmax. You can see how it’s implemented in nanoGPT:
https://github.com/karpathy/nanoGPT/blob/f08abb45bd2285627d1...
At training time, probabilities for the next token are computed for each position, so if we feed in a sequence of n tokens, we basically get n training examples, one for each position, but at inference time, we only compute the next token since we’ve already output the preceding ones.
>This vector seems to get taller every model year, for example the recent LLaMA 2 model from Meta uses an embedding vector of length 3,204, which works out to 6KB+ in half-precision floating-point, just to represent one word in the vocabulary, which typically contains 30,000 - 50,000 entries.
>Now if you’re a memory-miserly C programmer like me, you might wonder, why in the world are these AI goobers using 6KB to represent something that ought to take, like 2 bytes tops? If their vocabulary is less than 2^16=65,384, we only need 16 bits to represent an entry, yeah?
The reason we have 3204 2B allocations is that each of the 2B contains info (a latent space dimension). If you go to just the 2B representation, it is effectively a one-hot encoding, which completely defeats the purpose of word embedding.
> The reason we have 3204 2B allocations is that each of the 2B contains info (latent space dimension).
I think the author is more correct than you are. It is not necessarily the case that we need 3,204 dimensions to represent the information contained in the tokens; in fact, the token embeddings live in a low-dimensional subspace; see footnote 6 here:
> We performed PCA analysis of token embeddings and unembeddings. For models with large d_model, the spectrum quickly decayed, with the embeddings/unembeddings being concentrated in a relatively small fraction of the overall dimensions. To get a sense for whether they occupied the same or different subspaces, we concatenated the normalized embedding and unembedding matrices and applied PCA. This joint PCA process showed a combination of both "mixed" dimensions and dimensions used only by one; the existence of dimensions which are used by only one might be seen as a kind of upper bound on the extent to which they use the same subspace.
So some of the embedding dimensions are used to encode the input tokens and some are used to pick the output tokens (some are used for both), and everything else is only used in intermediate computations. This suggests that you might be able to improve on the standard transformer architecture by increasing (or increasing and then decreasing) the dimension, rather than using the same embedding dimensionality at each layer.
I know we're not allowed to talk about neuroscience in AI threads, but the "megalodon" reference got me thinking about pyramidal cells in the brain. I mean, why are they so big? Maybe this isn't a problem, maybe it's an external bit of evidence that, no, really, some things matter a lot more. Which definitely jibes with lived experience.
TL;DR: The author proposes that instead of using the Softmax function in each head,
Softmax(x_i) = exp(x_i) / sum_j exp(x_j),
we should use instead what the author calls the Softmax_1 function,
Softmax_1(x_i) = exp(x_i) / (1 + sum_j exp(x_j)),
which would make it possible for each transformer head's attention probabilities to be zero, i.e., attend to nothing, by computing x_i's with values well below zero.
Giving each transformer head the ability to ignore all tokens surely can't hurt, but it remains to be seen if it will actually improve transformer performance.
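For concreteness, here is a minimal sketch of the two functions in plain Python. The names `softmax` and `softmax_1`, the example scores, and the max-shift used for numerical stability are my own choices for illustration, not code from the article.

```python
import math

def softmax(xs):
    """Standard softmax: the weights always sum to 1."""
    m = max(xs)  # shift by the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_1(xs):
    """Softmax with an extra 1 in the denominator: the weights sum to less than 1."""
    m = max(max(xs), 0.0)  # the "+1" acts like an extra logit of 0, so include it in the shift
    exps = [math.exp(x - m) for x in xs]
    total = math.exp(-m) + sum(exps)  # exp(-m) is the shifted "+1" term
    return [e / total for e in exps]

scores = [-20.0, -21.0, -22.0]   # hypothetical head with nothing to say
print(sum(softmax(scores)))      # sums to 1 (up to rounding): forced to attend somewhere
print(sum(softmax_1(scores)))    # close to 0: the head abstains
```

With large positive scores the two functions give nearly the same weights, so the extra 1 only matters when every score is well below zero.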
I also saw the author distinguished internal versus output softmax. I think he'd apply his modification only to internal softmax and let the external force an output.
Yes, it makes sense to apply this only to the Softmax we use to compute attention. It makes no sense to apply it to the output Softmax, which must compute a probability distribution over the vocabulary.
Activation sparsity and packing sparse matrices will surely be important, so that is one kind of performance. However, the other kind, perplexity, needs a good demonstration. It might require a big model, but nowadays you can fine-tune even a 30B model on a big cloud GPU box.
I'm not an expert in whether this technique yields better or worse results. But it seems plausible that the proposal would yield a reduction in memory requirements and thus be beneficial.
But in terms of a written piece of technical content, this is brilliantly written. Easy to follow and stay engaged. Well done.
>The problem with using softmax is that it forces each attention head to make an annotation, even if it has no information to add to the output vector. Using softmax to choose among discrete alternatives is great; using it for optional annotation (i.e. as input into addition) is, like, not cool, man. The problem here is exacerbated with multi-head attention, as a specialized head is more likely to want to “pass” than a general-purpose one. These attention heads are needlessly noisy, a deafening democracy where abstention is disallowed.
Can't the MLP that processes the concatenated outputs of the attention heads handle this? I don't understand why it should be critical that a head be allowed to put something close to zero in its segment of the concatenated vector if it's immediately going to get projected by an MLP anyway.
But you are wasting some of the model's capacity to learn to ignore some of that information. I think it wouldn't hurt. However, if I followed the reasoning correctly, I think the biggest win is to reduce the range of the weights more than improving performance.
> This is what’s been happening in LLMs – for reasons that are only partially understood, Transformer models contain these outlier weights and are emitting Black Swan mega-activations that are much, much, much larger, like orders of magnitude larger, than their peers ...
meaning that once quantized you can either have a finer quantization since the range of possible values is smaller or you can pick a coarser strategy that saves bits for each weight.
Right, I get the goal of removing the outlier activations, but I just don't understand why outlier activations are a consequence of the model trying to "pass". The story from the linked paper earlier in the post (https://arxiv.org/pdf/2306.12929.pdf) is that the model is doing the following:
-Learn a near-zero representation for some otherwise low-importance token, like delimiters or whitespace.
-When a head wants to "pass", emit an outlier activation to attend to that token nearly-exclusively.
But I'm surprised the model can't just use its existing tools (the post-concat projection layer and the following MLP block) to achieve the same thing. And if the answer is that it could do that, but tends to learn to use the outlier activation trick instead, will giving it a new tool that still allows the use of outlier activations be sufficient?
The projection and MLP layers don't compare all embedding pairs like attention does, so they can't distinguish between contexts where delimiters are low- vs high-importance. The projection layer mixes the multi-heads in the same way always, and the same MLP is applied to every input.
Reading this, I'm mostly thankful that real brain power and the general smart programming community are seriously taking a close look at all these things. I barely feel the need to try to compete for insight gathering; it finally feels very healthily analyzed from every perspective.
It's a funny turn of phrase and also either extremely sarcastic or extremely wrong. Everyone knows about exponentials blowing up. exp(89) or so blows up a float32.
Interesting read. As others have said, it will be much more convincing with some experimental numbers.
I'm confused what his goal is though:
I could imagine some theoretical reason to add a 1 there, but he starts by saying this can lead to smaller, more compressible models. Is he talking about the size of the compressed weights? Or pruning to a smaller model? Or more resistance to quantization?
Parts of the essay seemed to throw me off track, because I'm not sure if they are relevant at all to the proposal (e.g. the size of the initial embedding and how many bits it would take to store the vocab size, etc.).
He says this in the article: if you ever want to jam a multi-trillion-parameter model into a phone app or a Raspberry Pi, you must quantize. I've seen some quantization go from doubles to bytes (64 bits to 8) per weight, reducing the RAM requirement by 8x. A simple quantization (I'm sure there are much better ones) is to round everything to the nearest 1/255th of your number range, then multiply by 255. So your resolution is (max-min)/255. You also store the min and max so you can reverse it, of course. Say you're trying to quantize these sets of numbers:
1. { -1.4, 0.8, 2.7, 7.3 } : With a range of 8.7, you have a resolution of 0.034. This set quantizes to { 0, 64, 120, 255 }.
2. { -1400, 800, 2700, 7300 } : Resolution 34.1, quantizing to the same as the above { 0, 64, 120, 255 }.
3. { -0.008, -0.001, 0.009, 0.019 } : resolution 0.000106. This set quantizes to { 0, 66, 161, 255 }.
4. { -1.4, 0.8, 2.7, 7329 } : Resolution 28.7. This set quantizes to { 0, 0, 0, 255 }. Oops -- we can no longer tell most of our weights apart.
You can see how this quantization works really well when all the numbers are close together, regardless of their absolute scale. Major outliers completely mess up the entire system. You can make more and more complicated quantization algorithms, but those will always come with tradeoffs. The best option would be to tame your weights so that they are again close together.
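The scheme above can be sketched in a few lines of Python. This is only the simple min-max rounding described in this comment, not what any real quantization library does, and the `quantize` helper name is made up for illustration.

```python
def quantize(values, levels=255):
    """Min-max quantization: map each value to an integer in 0..levels."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / levels  # the "resolution" from the comment above
    codes = [round((v - lo) / step) for v in values]
    return codes, step

def dequantize(codes, lo, step):
    """Reverse the mapping using the stored minimum and step size."""
    return [lo + c * step for c in codes]

print(quantize([-1.4, 0.8, 2.7, 7.3])[0])            # [0, 64, 120, 255]
print(quantize([-0.008, -0.001, 0.009, 0.019])[0])   # [0, 66, 161, 255]
print(quantize([-1.4, 0.8, 2.7, 7329])[0])           # [0, 0, 0, 255]
```

In the last set the single outlier stretches the step to about 28.7, so the three small values all land on code 0 and dequantize back to the same number.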
Thanks for the reply, that makes sense. It's not immediately clear why the modified softmax (allowing output of 0) will "tame the weights", but I need to read the blog post more closely and think about it...
Seems to be entirely a conjecture ... not based on even limited empirical testing. Which then links to the elephant in the room, which is that the theoretical basis of why GPTs work so well is not well understood in the first place.
One of the downsides of having a field be so empirically driven is that theoretical arguments suggesting improvements just don't have much value until someone actually tests it and shows that it works (and not just tests it, but tests it at scale).
Like others have said, I think this might already have been explored. However, I like the view that you have to let the machine be able to 'do nothing', which is what ResNet was first for. But I'm willing to bet this so-called wisdom can still be squeezed for performance, because people tend not to think about it.
My hot take is that if you don't do the trick, you basically get a mean of all vectors in the value matrix if all x are very small. Then the next sequence of linear layers will probably be able to interpret that the same way as if you do the +1 trick and produce a 0?
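To make the difference concrete, here is a small sketch comparing the two cases for one query. The `attend` helper and the value vectors are made up for illustration; it is a toy, not a real attention implementation.

```python
import math

def attend(scores, values, extra_one=False):
    """Weighted sum of value vectors using softmax weights (or the +1 variant)."""
    m = max(max(scores), 0.0) if extra_one else max(scores)
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps) + (math.exp(-m) if extra_one else 0.0)
    weights = [e / denom for e in exps]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

values = [[1.0, 2.0], [3.0, -4.0], [5.0, 6.0]]  # made-up value vectors
scores = [-30.0, -30.0, -30.0]                  # all scores very small and equal

print(attend(scores, values))                   # approximately the mean: [3.0, 1.33]
print(attend(scores, values, extra_one=True))   # every component is near 0
```

So without the +1 the head emits the mean of the value vectors, which is only close to zero if the values happen to average out. Whether the later layers can learn to treat that mean as "nothing" is exactly the open question.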
I wonder why the author didn't try it on some metal?
What is the minimum cost to get information that would say "this is better" or at least "this has a good chance of being better, spending a million training a bigger model is worth it"?
Could you rent a bank of 8 A100s for a day, try it out on a smaller model, and prove something? Not cheap, but it doesn't need VC money either. Probably about $400 on LambdaLabs to test with/without "quiet attention".
The OP can be as sure as he wants, but it is not worth sounding all "I told you so" without a single benchmark. Should be more "what about this - did you miss it ML people...?" in tone.
Do outlier features emerge in sub-100M parameter models? I haven't seen any research discuss it below the 124M scale (bert-base). At that scale training a model takes ~4 days on an 8xA100 node.
That is a fair question, and in addition I'm unsure that a simple metric like perplexity is likely to pick it up.
However, I do think that if perplexity showed a lower drop-off using this modified softmax under quantization that would be an exciting finding and enough to indicate further experiments would definitely be worth doing.
But you are right - if it doesn't show an improvement it doesn't necessarily rule out that it could be helping.
Edit: In the Qualcomm AI paper mentioned in this post, they experiment on BERT uncased (109M param) and OPT 125M and are able to show the effects using perplexity.
I hadn't read the paper when I suggested the same approach, so I guess that is good validation it is worth trying.
Edit2: Actually they also test on ViT 22M, which would be even quicker to try I think.
It would be hard to say if either of the two completely crap models is more or less crap though. Maybe by repeating it and seeing consistent results despite changing other variables I guess?
I suggest ways to measure it here: https://news.ycombinator.com/item?id=36855881 but the TL;DR is to choose a metric and compare the reduction in performance for quantized versions of the LM compared to the same LM without the modified Softmax.
> I’m thinking those numbers will make for a handsome table in a soon-to-be influential arXiV paper, either when those Qualcomm AI researchers step off the plane from Italy, or someone in an LLM hacker channel figures out biblatex, whichever happens first.
If the author just proposed his new function neo_max() and quoted The Architect from The Matrix everyone here would be enjoying the April Fools Day joke for 2024…
“ Your life is the sum of a remainder of an unbalanced equation inherent to the programming of the matrix. You are the eventuality of an anomaly, which despite my sincerest efforts I have been unable to eliminate from what is otherwise a harmony of mathematical precision. While it remains a burden assiduously avoided, it is not unexpected, and thus not beyond a measure of control. Which has led you, inexorably, here.”
identify the algorithmic obstacle(s) that stand in your way.
summarize your results in the style of a blog post that will persuade a significant chunk of the AI developer community to commit resources to an effort to eliminate said obstacles.
make it interesting enough to entice FrameworkFred to begin to read the article, snarky enough that he finishes it, and use concepts and notational conventions that ensure while he reads it his inner dialogue will roughly approximate the mood evoked by Homer Simpson saying "it's nu-cul-ar"."""
Here is an article that explains more about the outliers that emerge in large transformer models, which is what this modified softmax is being proposed to fix:
Unless he gives a good reason why he has not demonstrated his claim (eg. "This effect only presents at a scale beyond my resource"), the thesis seems severely weakened by the lack of effort to prove it in a toy version.
He just says he doesn't want to spend any more time on it, which is unlikely to convince or motivate anybody else that he has discovered something important.
It’s interesting. It looks like if you’re trying to improve the accuracy/perplexity of the model and using fp32, it doesn’t make a difference, but if you want to quantize it or make it compressible, a modified softmax makes a huge difference (this is what I understand from the Qualcomm paper). Different goals, different findings?
Whereas vanilla scaled dot-product attention forces outputs to fall in the convex hull of V, your proposal allows them to drift into the convex hull of V+{0}. It's pretty unlikely that the origin is already contained in the convex hull of V especially for long inputs, so your proposal genuinely changes the quality of outputs.
This reminds me of the time I implemented a semantic segmentation network using deconvolution and forgot to add an output layer for "this pixel is part of the background and not part of any of the classes". Until it was fixed, background pixels got lit up in different places in output layers and drove me nuts.
I believe the author is correct. I've read the paper he's referring to and the code in the pytorch lib, and linked Google code from sibling posts. To me it is extremely impressive to have come up with a simple change to a known algorithm to deal with a recently identified issue. There is a strange, but very human, tendency to discount discoveries which are not novel. As if somehow genius and insight can only occur once. (It is, unfortunately, especially common among academics.) So I hope the author feels pride in being right and knowing enough to come to an excellent conclusion.
I really really really hate the tone of this article. What would take me one sentence to understand, took me reading 5+ paras.
And then the sneering tone of this article, sounds unprofessional and disrespectful in my opinion.
I also am pretty sure he’s wrong or at least he has to change layernorm to make this work. Attention simply does a weighted average of the Value Vectors, his change breaks that and I think will push the output closer to 0 as you stack the layers (especially considering Layer Norm). He really should do some small experiments to validate his idea first!
I thought everyone knew that softmax (and specifically exp functions in it) are poison. I have always worked around them, for example by using large epsilons (approaching one actually), and using low-order polynomial approximations for the exp functions.
I thought everyone does that, because you don't need to work long with these models to get NaNs, and when you check why you see it's because of the exp functions. Then you fix it. Apparently people don't.
It's not like the neural models care if you approximate functions. They couldn't care less actually.
It is pretty easy to avoid NaNs when working with softmax, you certainly don't need any epsilons. Just subtract the largest value from everything, and you will have no rounding problems or catastrophic cancellation.
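A minimal sketch of that trick in plain Python, with function names of my own choosing. Python floats are float64, so the naive version raises `OverflowError` here; in a float32 tensor library you would typically see inf or NaN instead.

```python
import math

def softmax_naive(xs):
    exps = [math.exp(x) for x in xs]  # math.exp raises OverflowError for large x
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]  # every exponent is <= 0, so no overflow
    total = sum(exps)                     # at least 1, because exp(0) is one of the terms
    return [e / total for e in exps]

logits = [1000.0, 999.0, 998.0]
try:
    softmax_naive(logits)
except OverflowError:
    print("naive version overflowed")
print(softmax_stable(logits))  # well-defined probabilities that sum to 1
```

Subtracting the maximum does not change the result mathematically, since the common factor exp(-m) cancels between numerator and denominator.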
Clearly softmax is not too bad, if it is used extensively in all the most powerful models.
This makes sense. One tweak for the press: I think it would be an improvement to call it OptionalAttention rather than QuietAttention since the goal is to permit an attention head to opt-out.
You might attract more, ahem, attention if it was immediately apparent from the name only what this attention head does that the current one does not. There's also that small matter of distinguishing the internal vs output softmax functions.
A couple thoughts. 1) An alternative might be to have an extra NULL output where the attention can be diverted. This might be what existing models are using commas for, but make it explicit. 2) What he proposes has a similar effect on the other weights without explicitly having the NULL present. In this light it should work, but does it have the advantage he thinks?
OP is right in that his change would make the softmax in the attention output zero if it "has nothing to add" (QuietAttention, as he said).
Buuut, it's missing the forest for the trees. The goal of the last step of attention (ref., Fig. 2, left in https://arxiv.org/abs/1706.03762) is not to add/say anything (as the author is saying) but to compute the relationship between the tokens (QK^T) and V -- in layman terms, simplifying, which tokens are related to each other. The softmax is there because it gives a representation that is nicer to work with, it gives probabilities, instead of unscaled matrix multiplication.
TLDR; author isn't wrong but he isn't right, practically speaking, either.
What's wrong with unscaled matrix multiplication? Softmax has some kind of intuition in the context, but why not layer norm or something else instead (if anything is needed at all)?
If you hesitate to read it, let me say that the post denounces “kurtotic barbarities.” If that expression alone doesn’t convince you to read it, you might not be in the intended audience.