Attention Is Off By One (evanmiller.org)
1040 points by elbasti 11 months ago | 331 comments



1. Summary

The author is suggesting that we add 1 to the denominator of the softmax that is used within attention mechanisms (not the final output softmax).

The softmax inside an attention unit lets it treat key/query matches as probabilities; those probabilities support a continuous-valued version of a key-value lookup (instead of the 0/1 output of a discrete lookup, we get weights where a high weight marks the desired key-value pair).

Adding 1 to the denominator would change an attention unit so that it no longer works with a true probability vector of weights, but instead with weights that add up to less than 1. The motivation is that the network can learn to provide high scores so that the adjusted softmax is very close to a probability vector; and it has a new option to provide all-low scores, which give all-low output weights, meaning it can opt out of having high confidence in anything.
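
In code, the change is a one-line tweak. Here is a minimal numpy sketch (the function names are mine and the numbers are just an illustration); numerical-stability details come up further down the thread:

    import numpy as np

    def softmax(x):
        # Standard softmax: the output weights always sum to exactly 1.
        e = np.exp(x)
        return e / e.sum()

    def softmax1(x):
        # Proposed variant: an extra 1 in the denominator, so the output
        # weights can sum to anything between 0 and 1.
        e = np.exp(x)
        return e / (1.0 + e.sum())

    scores = np.array([-4.0, -5.0, -3.0])   # all-low scores: the head "wants out"
    print(softmax(scores).sum())            # 1.0  -- forced to commit to something
    print(softmax1(scores).sum())           # ~0.07 -- it can effectively abstain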

(switching to opinion mode)

2. How can we tell if this is good?

2a. We should just try it out: Train an LLM with this, see if it works.

2b. There are two reasons I suspect it won't make a big difference.

First, if an attention node has low confidence, it can already assign similar scores pre-softmax. Then we get what looks like a uniform distribution as output. Then we're basically taking an average of a bunch of vectors (vs a weighted average that is more like choosing one of them). Statistically, we expect that averaged vector to be close to zero. In other words, the node already has a way to effectively opt-out by providing a near-zero output vector.

Second, in a transformer, each attention unit has many other learned weights that can support the ability to opt out. Both the V matrix and the feed-forward layer after the attention unit give that module a way to provide low values to the activation function after the feed-forward layer, which would result in a value as small as you like — again, a way to opt out.

3. I appreciate the non-academic tone of the article and the willingness to play around with fundamental ideas. Although I'm not totally convinced by the note, I'd love to read more stuff like this.


The way I understood it, the author is saying that, with this change, big values disappear, and we can then use fewer bits to encode the output of transformers, which means reducing the memory requirements of the network. Memory being the limiting factor to running large models, this would be a big deal.


> The Qualcomm AI researchers found that 97%+ of outlier activations in LLMs occur in whitespace and punctuation positions.

This is striking. If true, why not try to ignore whitespace and punctuation?

In old Latin, scriptio continua [1] was a way to write continuously, for the exact same reason: to save space. Other modern languages still do that, and are no less parseable.

Granted, it's unlikely a commercial LLM would become popular if it produced output without spaces or punctuation; but an open source one that promised to be much more compressible, and therefore work on smaller machines, might be super useful.

It's not hard for a human to add spaces afterwards. It used to be a job for beginning journalists at the time of telex machines: press releases were sent in all caps without spaces, and interns were tasked with adding slashes between words. In French it was called "bâtonner les dépêches" (literally: add sticks to press releases -- not sure about the idiomatic English translation).

[1] https://simple.wikipedia.org/wiki/Scriptio_continua


In the Qualcomm paper cited, they explain/hypothesize that Transformers learn to attend to these low-meaning tokens when they want to avoid adding too much extra info to the residual stream. So it's not an issue that the models attend to spaces and punctuation at these outlier positions – it's the workaround the models come up with to get around the fact that attention has to go somewhere.

This post's author has a different solution, and one that theoretically could avoid causing large outliers that prevent efficient quantization. These large outliers seem to be an unfortunate side-effect of the models' learned solution.

So getting rid of spaces would do nothing to solve the problem, and would instead force the models to learn a new solution, one that presumably isn't as optimal.


> This is striking. If true, why not try to ignore whitespace and punctuation?

It is initially, but thinking about it some more, there's a lot of information packed in whitespace and punctuation choice.

Scriptio continua may have worked because the few readers who lived back then expected it to encode some form of legal or religious prose, but even then they could learn things from the overall shape of the document. LLMs are working in a much richer domain of document types, but the only thing they can "see" is a stream of tokens. There's no spatial or geometric data attached there. So whitespace and punctuation are the only thing an LLM has to make inferences about otherwise textually identical inputs. Such as:

  (see: other)  -- vs -- {see: other}
One being likely a text fragment, the other likely a piece of code.

Or how spacing may imply Markdown or YAML being used. Or how it may imply a list. Or a poem. Or a song. Or specific writing style, such as "lol im a casual who not care bout comms" vs. "I am a distinguished professor, about to retire. Elites like us put two spaces after full stop."


> the few readers who lived back then expected it to encode some form of legal or religious prose

The Latin literature was extremely rich, from Cicero to Tacitus, and was certainly not limited to legal information.

Here's part of your comment with white space and punctuation stripped:

scriptiocontinuamayhaveworkedbecausethefewreaderswholivedbackthenexpectedittoencodesomeformoflegalorreligiousprosebuteventhentheycouldlearnthingsfromtheoverallshapeofthedocumentllmsareworkinginamuchricherdomainofdocumenttypesbuttheonlythingtheycanseeisastreamoftokenstheresnospatialorgeometricdataattachedtheresowhitespaceandpunctuationaretheonlythinganllmhastomakeinferencesaboutotherwisetextuallyidenticalinputs

It's a little hard to read, but not that hard. I think one would get used to it.

Also, for creative use of LLM, it may be a feature, as trying to find the words could be inspiring.

I think it would be worth a try.


Now do a modern structured document with sections and bullet points and logical connectives.


    string.replace(/[\s\.\*<>\!\?,;:\-–\|"'\[\]\(\)]/g, '')


...there's only 1 space there though


Those paratextual phenomena probably are important for the model's representations... not something to get rid of, and not easily compressible either. Have a look at predictive features for authorship attribution in stylometry, for example: whitespace and punctuation are always decisive.


LLMs are also pretty good at programming and I would expect this to nuke programming ability completely, wouldn't it?


Seems you could make a pipeline where a much simpler model adds spaces and punctuation to output from the main model.


I suspect punctuation adds significant meaning to the models, that could be why so much computation is applied to it.

That's not to say a pipeline couldn't be effective.


For spaces, and for some (maybe most?) languages, you don't even need a NN to add them: since words made of two or more words aren't that common, and when those occur you probably want to use the composite one anyway, it boils down to starting from the beginning of the text and looking in a dictionary for the longest string that is a valid word. The only language I know of that uses a lot of composite words (I mean words made by sticking two or more words together) is German, but I think that looking for the longest sequence occurring in a dictionary would be correct most of the time.
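
A minimal sketch of that greedy longest-match idea, assuming you have a word list at hand (the toy dictionary and function name here are just illustrative); as the replies below note, greedy matching can still pick the wrong split:

    def add_spaces(text, dictionary, max_word_len=20):
        # Greedy longest-match segmentation: at each position, take the longest
        # dictionary word that matches; otherwise emit a single character.
        words, i = [], 0
        while i < len(text):
            match = None
            for j in range(min(len(text), i + max_word_len), i, -1):
                if text[i:j] in dictionary:
                    match = text[i:j]
                    break
            if match is None:
                match = text[i]          # unknown character: pass it through
            words.append(match)
            i += len(match)
        return " ".join(words)

    toy_dict = {"hello", "hell", "low", "or", "world"}
    print(add_spaces("helloworld", toy_dict))   # -> "hello world"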


I think you're significantly underestimating how many words could be retokenized into multiple words even before considering how concatenation affects things. For example: Concatenate is a word, but so are con, catenate, cat, and enate. Yes, no two of those are likely to be used in sequence, but I don't think that's a very reliable rule overall—"a" and "an" are both common words and negative prefixes.


Maybe you're right. I was biased by my native language, which doesn't have the a/an problem that English has.


Yes I was thinking about that, it should be quite easy afterwards.


Yeah, good to bring it back to the original point. Reading the article felt exciting, but in hindsight I am now missing a key detail.

The equations all seem to be matrix operations with a fixed number of rows / columns (you can take me as a real layman here). Unless you change that, I don't understand _how_ you can reduce memory needs. Granted, I'm probably putting my foot in my mouth not understanding transformers.


More ELI5 than the other comments. Considering the softmax network:

During quantization we find that values in the network vary from 0->5000, but 95% of values are <100. Quantizing this to 8bits would mean that our values would be in increments of about 20. Remembering that 95% of our values are below 100, we would only have about 5 discrete values for 95% of our values - so we would be losing a lot of "resolution" (entropy/information). For example (assuming rounding is used), an original value of 19 would be quantized to 20 and 30 would be quantized to 40. The original values differ by 11, but the quantized values differ by 20!

This is where exotic encodings come into play. We might try to use a logarithmic scheme, for example. This would result in higher value densities at lower values - but we would probably still waste bits and it would require more APU cycles.

Now switch to the softmax1 network:

The range of values is less important than the distribution - instead of 95% of the values falling in a small range, we would see the values more evenly spread out. Assuming that the range is now 105 (so the 5% outlying neurons from the softmax network are still >100), we would have 243 values to represent everything under 100. The same example with 19 and 30 would result in 19.27 and 30.34 respectively, a difference of 11.07 - which is very close to the unquantized difference of 11. We have retained more information in the quantized version of the network.

Information is lost either way, but what's important is how much information is lost.
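
A tiny sketch of the arithmetic above, using plain uniform 8-bit quantization (the exact numbers differ slightly from the ones in the text depending on the step-size convention, but the effect is the same):

    import numpy as np

    def quantize(values, max_val, bits=8):
        # Uniform quantization: map [0, max_val] onto 2**bits evenly spaced levels.
        step = max_val / (2**bits - 1)
        return np.round(np.array(values) / step) * step

    # Softmax network: outliers stretch the range out to 5000.
    print(quantize([19, 30], max_val=5000))   # ~[19.6, 39.2]: the gap of 11 becomes ~19.6
    # Softmax1 network: the range is compressed to ~105.
    print(quantize([19, 30], max_val=105))    # ~[18.9, 30.1]: the gap stays ~11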

The reason the large values appear is that the heads attempt to "scream really loud" when they are certain that they are right. This is an emergent behavior due to softmax - it ironically sucks at paying attention to a few of the heads: it boosts the volume of the heads that are trying to abstain, and mutes the volume of the heads that are trying to vote.


> During quantization we find that values in the network vary from 0->5000, but 95% of values are <100. Quantizing this to 8bits would mean that our values would be in increments of about 20.

Instead of using an 8-bit integer with even step-size quantization, wouldn't they still use an 8-bit float?


Possibly, it depends on the distribution of the values. It would also make my examples far less straightforward :)

Either way you would still only have 256 discrete values.


No one quantizes blindly without accounting for data. If 95% of your values are in 0-100 you'll probably do something like have 244 of the 256 levels for 0-100 and the remaining 12 for 101-5000. You don't have to apply a uniform distribution and shouldn't when your data is that concentrated.


Third paragraph.


If I'm following correctly, does this mean that with this change along with a model being quantized, we could see models that are 5% the size (on file system) and memory usage but almost identical in output?


The values I selected were arbitrary. The size reduction will be 32 bits / 8 bits - so it will be 4 times smaller.


It has to do with the precision of the values stored in those rows and columns. If they could be coerced into a narrower range (without losing information) then we could effectively store them each with 8 bits or something. The +1 prevents blowups when the denominator in its current form approaches 0, and without those blowups we can use fewer bits, in theory.


That is only true if using the new softmax changes the dynamic range of the values. We are using floating point, not fixed point. So if before our values went from 1 to 5000 and now they go from 0.0002 to 1, we still have the same dynamic range and so still need the same resolution.


The quantized versions are not floats but ints.


The activations (outputs) of one layer must be encoded in the same way as the weights of that layer as well as the weights of the next layer or the computation fails (unless you manage to write clever kernels for doing math at different levels of precision simultaneously, but even then you're introducing even more lossiness than just using a binary representation for those values).

Example: multiplying a bunch of float16s together gives you a float16. That is passed on to the next layer of float16s. Why should forcing the output of the first step to be float8 confer any advantage here? The only way I can see this argument working is if you make all the layers float8 too, and the reason you can do that is that the output of the first step can be faithfully represented as float8 because it doesn't ever blow up. If that's what the author is saying, it wasn't very clear.


You can reduce the number of bits per float (scalar).


I actually prefer the conceptual model the author suggests:

> Originally I wanted to call this function ghostmax, as you can think of there being an extra zero-valued entry in x (as exp(0)=1), as well as a zero vector in the V matrix that attenuates the result.

Don't think of this as weighting the options so that some of the time none of them is chosen. ("Weights that add up to less than 1.") Instead, think of this as forcing the consideration of the option "do nothing" whenever any set of options is otherwise considered. It's the difference between "when all you have is a hammer, everything looks like a nail [and gets hammered]" and "when all you have is a hammer, nails get hammered and non-nails get ignored".
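
That equivalence is easy to check numerically: appending a zero logit to an ordinary softmax and then dropping that slot gives exactly the +1-denominator weights (a small sketch, names mine):

    import numpy as np

    def softmax(x):
        e = np.exp(x - np.max(x))
        return e / e.sum()

    def ghostmax(x):
        # Append an implicit "do nothing" logit of 0, softmax over the extended
        # vector, then discard the ghost slot's weight.
        return softmax(np.append(x, 0.0))[:-1]

    def softmax1(x):
        return np.exp(x) / (1.0 + np.exp(x).sum())

    x = np.array([1.5, -0.3, 0.8])
    print(np.allclose(ghostmax(x), softmax1(x)))   # True: the two formulations agree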

I like this framing because, as an example, it bothers me that our speech-to-text systems use this method:

1. A human predetermines what language the input will use.

2. Audio in that language is fed to transcribing software.

3. You get, with modern technology, a pretty decent transcription.

3(a). ...if the audio sample was really in the language chosen in step 1.

If you ignore the choice of language and feed French audio to an English transcriber, you get gibberish. This is wildly at odds with how humans do transcription, where absolutely the first thing that a system that only knows how to transcribe English will do, when given French audio, is object "hey, this is definitely not English".


Most STT systems also tend to still train on normalized text which is free of the punctuation and capitalization complexities and other content you find in text LLMs. I suspect we continue in this way in part due to lack of large scale resources for training, and due to quality issues - Whisper being an outlier here. Anecdotally 8bit quantization of larger pre-normalized STT models seems to not suffer the same degradation you see with LLMs but I can't speak to whether that's due to this issue.


This seems like a good way to look at it. Another way to put it is, there is a certain "origin" or "default" confidence which is pinned to some fixed value pre-softmax, ie, all outputs are necessarily compared to that fixed value (pretending zero is another input to the softmax) rather than merely each other.


I like your description because it's relatively succinct and intuitively suggests why the modified softmax can help the model handle edge cases. It's nice to ask: How could the model realistically learn to correctly handle situation X?


Yea - the way we can "tell if this is good" is by

a) train two identical models on a large dataset, one with the +1 in the denominator for the softmax steps of the attention modules, one without

b) show that they have similar performance (doubt the +1 will make performance better, but we need to show it doesn't make things worse)

c) show that there are fewer "blowups" in the model with the +1, and that it can therefore be quantized more effectively.


> train two identical models on a large dataset

Yes but how much would this cost?

Would it be possible to build a small dataset that produces known outlier values, and test on that?


It doesn't need to be two huge models. If there is an advantage to doing this, I'd expect that you would see it even in a small test case. I'm sure we'll see something by the end of the week if not earlier if there's something to it.


One of the most significant quantization papers of the last year [1] found precisely that these outliers only start occurring with LLMs at 6.7B parameters and above.

One of the most important keys to the success of deep learning in the last couple years has been the fact that emergent features exist after certain scales, so I wouldn't be too quick to dismiss things that don't help at smaller scales, nor would I be certain that all the tricks that help in small data/parameter regimes will necessarily help in larger models. Unfortunately!

[1] https://timdettmers.com/2022/08/17/llm-int8-and-emergent-fea...


Looking at that paper, they appear to be saying that 6.7B is where the problem becomes so intense that no single quantization method can keep up. From what I gather, the paper claims that such outliers start to occur in models as small as 125M params, then at around 1.3B they begin to affect the FFN, and at around 6.7B the issue really starts to become apparent because "100% of layers use the same dimension for outliers."

So while you obviously wouldn't be able to conclusively prove the idea fixes the issue in larger models, if you know what you are looking for you should be able to validate that the method works in general down to very small models.

That said, consumer grade cards should be able to train an 8B model with quantization, so you might as well train the whole thing.


The reason it might need to be huge is because the long tail of extreme weights might only begin to show up then, but yes best to just start w something you can run on a laptop.


That is a good start. I wonder though if the change affects the ideal hyperparameters. Do you need more or less dropout if you make the change? What about learning rate?

So you might want to re-search the hyper params for a fair shot.


> First, if an attention node has low confidence, it can already assign similar scores pre-softmax. Then we get what looks like a uniform distribution as output.

Disagree here, I think neural nets are quite bad at implicitly learning low entropy transforms, similar to how they struggle to model the identity function, necessitating residual connections. In both cases the change doesn't increase expressivity, but it does bake these needle-in-a-haystack transformations into the model that may be hard to access with gradient descent.

Can't speak to how useful it is though.


Surely you mean high-entropy, ie, uniform? We are talking about extremely low-entropy predictions as being the problem here.


yep - always get that the wrong way round haha


This is a technique that's been known for years and is in PyTorch. It's not widely used because people tried it and, in practice, it doesn't work as well.

OP calling it a "bug that's been overlooked for 8+ years" is click bait.


> ...is in PyTorch

Could anyone kindly point me to this as I can't find it.


The add_zero_attn parameter in PyTorch [1] is used for this, but by default their softmax is the regular kind. It has been in flaxformer for a couple of years now, though it claims to be a compatibility variant for older models [2], and I haven't seen any mention of it in their recent papers (though I've not checked exhaustively).

[1]: https://pytorch.org/docs/stable/generated/torch.nn.Multihead... [2]: https://github.com/google/flaxformer/blob/main/flaxformer/co...
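
For reference, enabling it is just a constructor flag; a minimal sketch (the dimensions are arbitrary):

    import torch
    import torch.nn as nn

    # add_zero_attn=True appends an all-zero entry to the key/value sequences,
    # which plays the role of the extra "attend to nothing" slot.
    attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, add_zero_attn=True)

    x = torch.randn(10, 2, 64)      # (seq_len, batch, embed_dim), batch_first=False
    out, _ = attn(x, x, x)          # self-attention
    print(out.shape)                # torch.Size([10, 2, 64])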


> Statistically, we expect that averaged vector to be close to zero.

I'm not sure that's the case, especially in high dimensions.

The expected absolute value of the sum of n random variables, each uniform on [-1,1], grows with n. I'm pretty sure it's proportional to the sqrt of n.

Also, random walks in high dimension return to zero with probability zero, so the sum of random variables in high dimensions going close to zero seems unlikely as well.


Both of your points are basically true, but I think a better way to model the problem is as a set of similar-length vectors being linearly combined by a probability vector.

Mathematically, we can write v_out = V * w,

where v_out is the vector of output from the attention unit, w is the probability vector from the softmax, and V is the set of input vectors, where each column is an input vector.

For a moment, pretend that the columns of V are orthonormal to each other. This might not be true, but it's an interesting case.

When the model wants the output to be small, it can set w = 1/n, meaning all coordinates of vector w are 1/n. (n = the number of columns in V)

In that case, the length ||v_out|| will be 1/sqrt(n) exactly, which is small compared to the input lengths of 1 (since we're pretending they were orthonormal).

Now if we stop pretending they are orthonormal, the worst case is that they're all the same vector, in which case the weights w can't change anything. But that's a mighty weird case, and in high dimensions, if you have any randomness at all to a set of vectors, they tend to point in wildly different directions with dot products close to zero, in which case the same intuition for the orthonormal case applies, and we'd expect a uniform distribution coming out of the softmax to give us a vector that's much smaller than any of the input vectors.
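
A quick numeric check of that intuition, using random unit vectors (which are nearly orthogonal in high dimensions) rather than an exactly orthonormal set; the dimensions are arbitrary:

    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 512, 64                     # embedding dimension, number of value vectors

    V = rng.standard_normal((d, n))
    V /= np.linalg.norm(V, axis=0)     # make each column a unit-length value vector

    w = np.full(n, 1.0 / n)            # the uniform weights a "confused" softmax gives
    v_out = V @ w

    print(np.linalg.norm(v_out))       # ~1/sqrt(64) = 0.125, much smaller than 1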


One caveat is that the average of many normally distributed vectors in many dimensions is normally distributed with 0 mean but is not typically close to 0. In fact the average norm is quite large. Try it yourself and see!


Don't most softmax implementations include an epsilon in the denominator which likely serves the same purpose? So the suggestion is to set that epsilon to 1?


I agree with your conclusions, but not necessarily with the reasons you present. I don't think it's _that_ easy for a current transformer to pass the information unaltered (i.e. to effectively replace softmax with 0).

In particular, I think the feedforward point you list in your "Second" is actually wrong. Replacing a softmax with 0, as the OP wants to do, is tantamount to passing the information unchanged, because the attention block is within a residual (skip) connection. If it's set to zero, the next output is identical to the previous layer output. There is no way to recover this effect with the feedforward layer.

The point that you can set V to zero is true, but it's somewhat a different idea: the Q and K projections should be able to drive the product to 0 if no token wants to be "close" to some other token, in some sense. But the V layer shouldn't "know" about this, because it can't look at other tokens. This is of course only how we think of transformers, which might or might not (more likely, the latter) be how it actually works. But nevertheless, having a 0 value coming out of the K.Q^T part alone would be very meaningful.

Your "first" point is technically true (albeit logically false): if you have a sequence of length 32k, like GPT4-32k, and your softmax logits all predict the same value, the result will be an average of the V layer, divided by 32k, which is effectively close to zero. However, calibrating "exactly the same value" is extremely hard for a neural network, and there is no "default value" it can predict to make sure that's the case - even if you push all the values to one side, the result doesn't change, because softmax is translation invariant. Plus, if you have a short sentence, that's not true anymore. If you only have two tokens, one of them must be activated, or both with only a 0.5 factor. Surely if you have very few tokens there's much more contamination between Q, K, and V, so in that case V can indeed take a 0 value, but it's non-trivial and requires more layers.

All in all, adding that "+1" isn't quite meaningless, I think. Nevertheless, I believe it won't change much: these very big models have ways to get around any kind of smart small modification you do. If the intuition is very right, it might be that you can squeeze out 1% more accuracy in a handful of tests, after you carefully optimize all other parameters, which would be enough to get you a paper in a top conference. And it might also be implemented as a standard from then on (because, in this case, it basically doesn't cost any more computation, so it's "free"). But I would bet it won't be a major revolution.

That said, as you say, the only way to know would be to train a few models with this option and check the actual quality of them (certainly not GPT-style, nor GPT4-size, models, to begin with, but something quicker to train and easier to test in a fully automated way; old "boring" models like those in the BERT family would be a good point to start testing). But to do that effectively, you'd need somebody skilled in training this kind of models, with the cleaned data ready at hand, etc. (and a small compute budget, of course, but nothing revolutionary, a few thousand $ in GPU credits could be enough)


I am a transformer, it should definitely work


I might be missing something obvious, but I am not sure why everyone in the comments think it's a big deal. I've seen this trick in practice multiple times.

For example, see this snippet from an old Google repo: https://github.com/google/flaxformer/blob/ee62754ebe5a5eeb11...


Yeah we used to use this in our older models years ago... I don't recall the details exactly, but I don't think it ever did very much.

I certainly don't think it will help at all with stability. Things like Q/K layernorm are better tricks for softmax stability when scaling: https://arxiv.org/pdf/2302.05442.pdf


> I don't recall the details exactly, but I don't think it ever did very much.

How would you have known whether the trick actually reduces the outliers in the weights? Even if the transformer quality does not improve overall, having fewer outliers is very beneficial for more accurate quantization of the data.


Are you asking "why would you have bothered to look at"?

The "how" is pretty straightforward.


He's questioning the statement: "I don't think [the trick] ever did very much", because no one has yet looked at whether the trick helps reduce outliers in very large models. If it does help with this, as the blog author believes, then it is indeed a very useful trick.


Is he? A surface level reading suggests he's asking "how would you know".. and the answer is... by looking at the parameters. People do that.

>> because no one has yet looked at whether the trick helps reducing outliers in very large models

Given a softmax version doing exactly as the blog post says is baked into a google library (see this thread), and you can set it as a parameter in a pytorch model (see this thread), this claim seems off. "Let's try X, oh, X doesn't do much, let's not write a paper about it" is extremely common for many X.


This would seem like a really good argument as to why failures should be written up, otherwise where is the list of what has been tried before?


Yup, it is. But it isn't going to happen.


Yes, I assumed that checking the weights for the presence and number of outliers is not something that is usually done, so effects like this can be overlooked. If my assumption is wrong and researchers do usually look at such metrics, then my question is not very relevant.

Agree - the "how" is straightforward


If popular models are still making this mistake then it still seems noteworthy and making a blog post or paper to increase awareness definitely seems worthwhile. Also multiple independent discovery of good ideas is quite common.


The question is whether people have attempted quantization (the int8 / GGML / GPTQ approaches) and whether the "flattening" of the distribution due to a larger denominator results in better quantization behavior. You'd have to specifically try quantization with and without the +1 to understand the advantage. OP argues that the advantage could be significant.


The argument / reasoning is a bit dubious.

Technically softmax is not implemented as presented, but as exp(x_i - max(x)), summing over that in the denominator. But maybe I am missing something.

Furthermore, the residuals are used exactly because the networks can't learn the identity function; but they can learn zero, at which point the residual is `f(x): x+g(x)` with `g: x ~> 0` (i.e. approximately 0).

It is also the case that `f(x): x+g(x)` makes it easier for gradients to flow through.


You are misreading things.

Regardless of numerical stability tricks (e.g. exp(x_i-max(x))), you are still simply normalizing the logits such that the probabilities sum to 1.

The blog adds an additional hidden logit (equal to 0) to allow for softmax(x) = 0 when x -> -inf.


How can `x -> -inf` occur in the first place when nearly everything is within [-2,2] going into a dot product, and there's normalization before that too?


The use of the "nearly" in your comment is exactly occluding the issue as presented.

Enough weights don't fall under that "nearly" that we require more bits per weight to cover those edge cases. If we were able to delete the "nearly" we would need fewer bits (smaller models).


So the concern is not that x->-inf due to values but it happens due to numerical issues arising out of lower precision?


The idea is that if your range of values is small enough you need fewer bits to distinguish between meaningfully different values. The problem is that exp(x) << exp(y) for sufficiently wide ranges [x, y], so that when normalizing in the softmax and subsequently quantizing you don't get the fidelity you need and too much information is lost between layers. The proposed solution is that modifying the softmax step slightly brings x and y close enough to zero that exp(x) and exp(y) are close enough so that more compact quantizations are useful instead of useless.


Implementations usually replace the 1 in the denominator with exp(-max(x)) for this reason.
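
In code terms, the hidden zero logit just gets folded into the usual max-subtraction trick; a sketch of what a stable version might look like (not taken from any particular library):

    import numpy as np

    def softmax1_stable(x):
        # Include the implicit 0 logit in the max, so exp() never sees a large
        # positive argument; the "+1" becomes exp(-m) after the shift.
        m = np.maximum(np.max(x), 0.0)
        e = np.exp(x - m)
        return e / (np.exp(-m) + e.sum())

    x = np.array([1000.0, 999.0, 998.0])   # a naive exp(x) would overflow here
    print(softmax1_stable(x))              # finite, well-behaved weights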


This trick "they found" is part of the standard torch implementation of multi-head attention, namely the add_zero_attn option. They add a zero to the logits, resulting in a one in the denominator, as e^0 = 1: https://pytorch.org/docs/stable/generated/torch.nn.Multihead...


I find its documentation quite poor though: "If specified, adds a new batch of zeros to the key and value sequences at dim=1."

Doesn't describe the implications even briefly. If they add just your second sentence to that description, it'll immediately become so much more useful.


It's an option which is set to false by default. Does that mean people have tried it and it's not usually helpful...?


It probably means they have tried it for _some_ purpose, but not necessarily the one described in OP's post here. The claim is that this is specifically useful for quantization. It seems reasonable to assume that this would have initially been tried and potentially discarded for having little or no impact on general accuracy. But that's a different issue. I suppose we'll hear something definitive in a month or so.


Yes.


Can you elaborate? (It wouldn't be the first time there was an extraneous feature that no one has ever used in some code!)


If you take the inner product between a lot of more or less random vectors (the key and query vectors in attention) most values are going to be close to 0. This means they contribute by e^0 to the denominator. Now, if you have a context length of say 2000, your denominator is already ~ 2000. Increasing it to 2001 doesn't really make a difference.

Adding 1 to the denominator can be useful if you have softmax with just a few options. Not in self-attention where you have thousands.
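
A quick illustration of that point (the context length and logit scale here are made up):

    import numpy as np

    rng = np.random.default_rng(0)
    logits = rng.normal(0.0, 0.5, size=2000)   # near-zero scores over a long context

    e = np.exp(logits)
    print(e.sum())                  # on the order of 2000 already
    print(e.max() / e.sum())        # largest weight with the plain softmax
    print(e.max() / (e.sum() + 1))  # with the +1: essentially the same number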


That simple comment is a strong counterpoint to the entire blog post?

Except with the +1 denominator, it might be that the model trains all of the inputs to become very negative so the softmax chucks out values close to zero, whereas it wouldn't bother before, because making one probability bigger makes another smaller.


> it might be that the model trains all of the inputs to become very negative

It still can't do this because of L2 regularization / weight decay. If two vectors are norm 1, their inner product is at least -1, so with 2000 vectors that's still 2000 * e^(-1) =~ 735.

Not saying it's theoretically impossible that it could happen. But you would have to try _really_ hard to make it happen.


I guess you could add a sort of gating operation with a learnable parameter that sends the value to -inf if doesn't reach the threshold.

Of course it might have some other serious repercussions.


It’s useful but it’s less used than dummy tokens.


Are dummy tokens just tokens that don't have an associated input/output token? Like, a way to give more computational power to the model without splitting the text into more actual tokens?


TL;DR sort of yes. But they're also useful for reasons not related to computational "power".

An example here with an actual algorithm, although it's been a couple of years, so my explanation might be a bit wrong in places, and/or I might have gotten completely the wrong end of the stick with the current thread.

--

The CTC (Connectionist Temporal Classification [0]) algorithm maps a sequence x with length X -> sequence y with length Y.

i.e. in speech to text we might have some audio features that correspond to the following class predictions (post softmax classification)

    x -> hellllloooooooooo wwwooorrrllld
we want to get this as the output

    y -> hello world
we have the alphabet as classes we try to predict for each sequence item in x.

we could just remove all the duplicates in the first long sequence, but we would end up with `helo world`... we need to preserve one of the early `l` characters in `hello` somehow

CTC uses a blank (aka dummy) token to handle potentially deliberate repeats in sequence x.

By adding the blank token to the classes predictions, we can get the model to predict something like this (post softmax classification)

    y* -> hel~l~~oooo~~~~~~ w~~o~~r~~l~~d

The CTC decoder (non-ML decoding algo) heuristically removes repeated tokens. Turning the above into ...

    y -> hello world
... the repeated characters are collapsed and the `~` blanks are removed.

It was a decent enough algorithm for speech-to-text prior to attention/transformers etc.
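
For concreteness, the non-ML decoding step is tiny; a rough sketch of greedy CTC collapse, assuming `~` is the blank:

    def ctc_greedy_decode(pred, blank="~"):
        # Collapse runs of repeated characters, then drop the blank token.
        out, prev = [], None
        for ch in pred:
            if ch != prev and ch != blank:
                out.append(ch)
            prev = ch
        return "".join(out)

    print(ctc_greedy_decode("hel~l~~oooo~~~~~~ w~~o~~r~~l~~d"))   # -> "hello world"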

However, it makes CTC vulnerable to well designed adversarial example attacks because there is a massive bias within models to predict the blank token -- meaning it's very easy to modify input sequence x to switch the output sequence y to include blank tokens for nefarious purposes (the subject of my unfinished phd).

[0]: www.cs.toronto.edu/~graves/preprint.pdf


> By adding the blank token to the classes predictions, we can get the model to predict something like this (post softmax classification)
>
> y* -> hel~l~~oooo~~~~~~ w~~o~~r~~l~~d

This is a great solution. Though that's a dummy token in the output rather than the input. I guess you could do something inverse to do text to speech, but it might be hard to say where to insert the dummy tokens in that case.


Nice catch! Hopefully OP will see this.



While not about AI or the algorithm mentioned, on the subject of little errors that you can't convince anyone are errors....

In 2011, I wanted to copy the reddit ranking algorithm in a project of my own, so I went to source code to look at it... the algorithm in the source code I found wasn't doing anything at all sensible with negative-sum voted posts.

I thought I discovered the error, some terms swapped in the simple equation, the sign for positive/negative was misapplied.

I blogged it, and [posted it to reddit](https://www.reddit.com/r/programming/comments/td4tz/reddits_...), only to have MANY people, including reddit employees, tell me I am definitely definitely wrong, and the algorithm was working as intended. And that I was in fact not the first to notice what I thought I noticed, and point it out, and be told by everyone I was wrong.

OK, I didn't really understand what was going on, I couldn't make sense of the algorithm if it wasn't wrong, but so be it. I updated my blog post to say that people smarter than me said there was no error in the reddit algorithm, all I can say is this variation makes more sense to me.

Then, three years later in 2014, a commit was made to the reddit source code with exactly the correction I (and others before me) had suggested all along. The one that everyone piled on to tell me how dare I have the temerity to suggest reddit source code is wrong.

https://github.com/reddit-archive/reddit/commit/50d35de04b92...

¯\_(ツ)_/¯

Open source means there are lots of eyes that can find bugs, but sometimes they can't convince anyone they've found a bug. (And of course, then reddit close-sourced their code in 2017).

I never did end up using the ranking feature in my own project that I had wanted to copy from reddit. I didn't end up adding "vote" features to the app.


When I was an intern at Yahoo working on OAuth back in 2008 (2007? It was long ago and I'm old) I had the pleasure of implementing an internal tool for generating OAuth 1.0 URLs, which meant encoding a lot of things in query parameters. My tool did not generate URLs which were compatible with Yahoo's implementation (certain parameters effectively should be encoded twice, which my tool did). The implementing engineer insisted my tool was wrong, cited my status as a lowly intern, and even pulled out the OAuth spec and bent over backwards to say how his implementation was correct and I'm clearly reading it wrong. It literally took bringing in Eran Hammer-Lahav to weigh in on the topic to say I was correct, at which point the engineer agreed that of course that was correct. I got zero acknowledgment or apology for the days of ad hominem attacks against me.

I did learn an important lesson that more senior people are not always right, and as someone who's usually more senior than my colleagues now I try to remember it daily.


> It literally took bringing in Eran Hammer-Lahav to weigh in on the topic to say I was correct, at which point the engineer agreed that of course that was correct. I got zero acknowledgment or apology for the days of ad hominem attacks against me.

If it weren’t for the torturous gaslighting, this is borderline hilarious. Appeal-to-authority types have a way of submitting so effortlessly when a grander poobah comes around. Spine made of jelly.


I work at a FAANG and it was absolutely astonishing to find out how often this happens.

You can make a long, impactful career by just being "the guy who adds log statements throughout the codebase and reasons through it", doing this at even a simplistic level has always shown me an astonishing fix to some long-standing issue.

n.b. It also attracts a ton of political fun. People's first order reaction is denial, and it only gets worse from there. Absolutely no one except 1-2 colleagues will see it as "oh we should fix that", and at least one person will make sure your boss' boss' boss is CCd on an email with a nice version of "no he's just insufficiently concerned about {concurrency, memory management, take your pick}" Just wait it out quietly when that happens, do not engage or complain. If nothing happens and you're never asked about it by leadership, but your peers ask, make plans to move onto another team.


A long impactful career, or a career of horrible frustration and alienation as everyone gets mad at you for pointing out their bugs? (or, from their point of view, making trouble insisting that something is a bug which isn't and is causing no problems)


What FAANG have you seen this at?

I've been at big tech companies for most of my career and I've never seen anyone deny the existence of a technical bug. I've seen plenty of teams mark a bug as lower priority and never fix it because other things are higher priority. But denying that the bug exists, especially after a detailed explanation? That doesn't resonate with my experiences.


I've told this story before!

It used to be writing the outputs from the C/C++ preprocessor (.i files) to disk took forever (5+ minutes IIRC) with Microsoft's compilers. I asked one of the lead compiler developers why, and he waved me away saying it was just really complicated. Around that time a bunch of tools existed for GCC that worked with .i files, but none existed in the Microsoft ecosystem likely because writing .i files was so slow.

I was on the compiler test team at the time and we did lots of stuff with .i files, our tests were distributed across a large cluster of test machines (see my post about that https://meanderingthoughts.hashnode.dev/how-microsoft-tested...) so it wasn't a big deal, but it still annoyed me.

One day I decided to find out what was going on, so I loaded up process monitor while outputting a .i file and watched what was happening. Much to my surprise, only 1 byte was being written at a time! No wonder writes were taking forever.

A quick dive into the source code revealed a comment above the file write call that read to the effect

// to work around a bug in windows 98

So anyway I opened a bug against the compiler saying we should probably fix that. :)


But that's not the type of story that's being claimed from the person I responded to.

Of course the lead developer waved you off. You wondered why things took forever, and the lead developer knew it was a complicated system and figured it wasn't worth their time investigating. It happened to be incorrect, but the lead developer wasn't in denial. They just filtered the issue out because they can't afford to go down every rabbit-hole they come across. I'm sure once you found the actual bug, it was later fixed.

The person I was responding to seems to think a large number of people are in denial when a bug is filed against them. That doesn't make sense, and isn't something I see. It'd be as if when you pointed out the actual bug, the lead developer continued to say it wasn't actually a bug (which is of course ridiculous and I bet didn't happen).


You are ascribing an absurdly maximalist viewpoint to me, one that would be obviously wrong at its face.

I know it's not so confusing as to get that sort of interpretation, because of the score on the comment, and comments like the above that explain to you how this happens.

As a result, I don't feel comfortable providing more detail publicly about my situation. That far off the mark tends to indicate an aggressive rather than curious interlocutor.

I am comfortable building on their example. The particulars of the issue are quite similar in a very helpful way.

I did the investigation, did a fix, and worked it up to my manager and my manager's manager. Elated, we worked diligently for a couple of weeks to document it concisely: a 3-page tech doc, the briefest code deltas possible, one page + slides with simple diagrams.

It got bogged down at my manager's manager's co-lead's sub-manager for the platform team implicated. They basically said "reading a single byte at a time means it's provably serial and thus has no concurrency bugs", as indicated in my original comment.


It's 2 anecdotes to your 1. Anecdotes are useless, but you are down by 1. I suggest you call for reinforcements, or make a hasty retreat.


I used to work on a large-ish open source project. When I was bored I used to go bug-picking (not hunting, picking). I'd go and browse the source wherever my intuition told me there was likely a bug, and I indeed found a few using such a "method".


Wow, that must have been frustrating. I just looked at the code and it's just so clearly wrong.


Oh, it was! And there was no way to "prove" it, it's just like "look, the equation this way actually makes a lot of sense and is very clever and produces results that seem reasonable and much respect to whoever designed it... and the equation the way it is in code now... does not. It seems like clearly it got accidentally transposed at some point?"

And the response was just like "We disagree, we think it makes sense the way it is and the product is correct"

That's kind of the end of the argument, there's nothing more one can say!

It didn't help that I came in assuming that of course everyone would see which version was correct (as you just did! although I didn't find it obvious, it took me lots of study to figure out), instead of producing a narrative designed to gently persuade them of that. (That's on me -- I think I've learned something about technical communications around bugs and disagreements since then, although I'm still far from perfect).

The real answer, I think was given in one of the reddit thread comments -- the way it's broken for the most part _doesn't matter_ in the usual operations of reddit, it matters only in edge cases, and not very important ones, so really people mostly don't notice and we don't care.

Fair enough, I guess? But they did fix it three years later? I forget how I even found out they had fixed it; I can't at this point find any context for _why_ they fixed it, or who with power finally noticed/agreed it could use fixing, or why.

(And if it had happened three years after that, it would not have been in public source, and my gloating satisfaction would have been stolen!)


In light of the past couple of months, I guess I should not be surprised that the interaction with reddit staff went that way.


Some very interesting discussion of outlier features and quantization: https://timdettmers.com/2022/08/17/llm-int8-and-emergent-fea...

* Outlier values are used to prune values.
* Transformers seem to undergo a "phase shift" in how outlier features are treated around 6.7B parameters. This could complicate research on removing them.

Maybe you and Tim Dettmers would have a lot to talk about :)


The author identifies a real problem and poses a simple solution. It passes all my crank tests (why did no one come up with this before? Because the author is intimately familiar with the softmax function from work outside of ML, and plausibly nobody who’s investigating these issues is remotely as familiar, so despite researchers narrowing the issue down to “something to do with softmax”, they don’t have a deep enough understanding of softmax to see what’s wrong).

If the author is reading any of these comments, though, I would urge them to expand on their claim that “I’m 99.44% sure that it will resolve the outlier feedback loop”. As it stands, that’s the only explanation we get of how the outliers might be related to softmax!


>why did no one come up with this before

So it turns out someone did. Specifically google did. This exact same idea has been in flaxformers since at least November 2021.

https://github.com/google/flaxformer/blame/ee62754ebe5a5eeb1...

Specifically to save people a click it says:

> """Softmax function with an additional virtual logit equal to zero.

  For compatibility with some previously trained models.

  This is equivalent to adding one to the denominator.
  In the context of attention, it allows you to attend to nothing.
This creates exactly the same modified softmax as this essay. I suppose only time will tell why it was ignored publicly before: maybe it doesn't do much, maybe it just fell through the cracks, maybe Google just didn't push it, who knows.


> I suppose only time will tell why it was ignored publicly before, maybe it doesn't do much, maybe it just fell through the cracks, maybe google just didnt push it, who knows

Maybe quantization wasn't as hot back then as it is now?


Yea the benefit is not going to come in terms of performance for a given model, but in terms of ability to be efficiently quantized.


Or maybe it doesn’t really do anything to improve performance.


Yeah, but it lacks the most important test: results. He hasn't actually tried it, he just thinks it will work.

For such a simple change to the softmax it wouldn't take long to verify. It's really embarrassing to not do that before publishing.


You seem to really disregard the positions of this author. They seem to have invested substantial efforts in that specific area of research.

To validate the idea the author has, it would be required to train a LLM from zero. If the author is right, you would get similar results to the current generation of LLMs, but with (a lot) less space required for the intermediate layers.

The cost to achieve that is still measured in kilo- to mega-dollars, so why is it wrong to put the idea in the open to be substantially criticized or adopted?


You don't need to train a ChatGPT-sized LLM, a toy nanoGPT would have been enough. You can train those on a consumer GPU in an afternoon.

And yes I do disregard his research effort. There are hundreds of well-justified and well-researched "clever tricks" for improving Transformers, and almost all of them don't work. I'll believe it when I see the results.


Outliers only begin to appear around 3B parameters (as per the original LLM.int8 paper) so unfortunately not consumer GPU in an afternoon kinda stuff to prove you've managed to suppress them.


I tried to test this with nanoGPT in an afternoon, since the code change is pretty minimal. It's hard to get conclusive results at that scale though - to be able to say anything with confidence you'd need to run multiple tests, figure out if the 'outliers' mentioned only appear above a certain scale, find good tests for quantization performance that work on small enough models that you can iterate quickly ... It's doable but still lots of work, enough that putting out the idea and hoping others with more time+compute will try it out seems a valid strategy to me :) More generally though I definitely agree that the trend among 'improvements' to transformers has been things that don't turn out to work in practice.


Google used it in flaxformers since 2021 apparently


Do you know of handy testing steps? I suppose I could ask ChatGPT, but if someone has a validated "here, this is how you do it" I have a 3090 that I can do it on, but I'm not keen to debug anything here.


Testing steps (based on thinking about this for 30 seconds - so probably can be improved):

Train a Transformer based model with and without the modified Softmax (Suggestions: GPT-2 or nanoGPT)

Measure performance - I'd probably start with Perplexity and see if there is any difference (we'd expect little difference).

Quantize both models with different quantization strategies.

Measure the perplexity of the quantized models of different sizes. We'd expect the performance to drop off quicker for the non-modified model than the modified one if this is working.


I was thinking about a different problem as I was typing that and got some mental memory alias bug. I wanted to know a set of steps to take to train a model. My apologies.

In any case, that was an lmgtfy-level question. Here's what I found: https://til.simonwillison.net/llms/training-nanogpt-on-my-bl...

I shall try that soon.


Shaaaaameless plug:

I did a writeup like this (not as nicely as Simon, though) where I used modal.com (cloud GPU, containers, quick starts, free $30/month spend) and their GPUs (e.g. T4, A100).

https://martincapodici.com/2023/07/15/no-local-gpu-no-proble...

T4 I think was good enough for the job, not much need for the A100.

Since this post I am working on an easy way to do this with a script called lob.py that requires no code changes to the nanoGPT repo (or whatever repo you are using) and runs in modal.com. The script exists but gets refined as I use it. Once it is battle tested a bit more I will do a post.

(It is named lob.py as it "lobs the code over to the server" where lob is UK slang for throw)

Watch this space.


Thank you. FWIW I often find a write-up + script superior to a script alone, because I often want to modify things. E.g. I want to run GPU-only, and a bare script only gets me part-way there; the textual description fills the gap. Therefore, much appreciated.


In the Qualcomm AI paper linked in this post it turns out they use a similar testing approach:

BERT 109M, testing perplexity

OPT 125M, testing perplexity

ViT 22M, testing on ImageNet top-1.


It's not embarrassing at all.

I think there might be some curse of the auto-didact here, hinging on the meaning of publish: it would be embarrassing if he was capital-P publishing, as in a scientific paper.

The blog goes to great lengths to point out it is _not_ capital-P publishing.


It's a blog post. And it includes a call for help in testing the idea.


> why did no one come up with this before? Because the author is intimately familiar with the softmax function from work outside of ML, and plausibly nobody who’s investigating these issues is remotely as familiar

I doubt that is true. Softmax is extremely well understood within the ML community. It's a very common trick, these properties are well-known as well. It feels very unlikely that nobody has thought of this before. That said, it's also plausible that the current softmax convention was chosen by accident and the author is right to identify this drawback.


> why did no one come up with this before?

And because the effects of the problem are subtle. Supposing the diagnosis is correct, full-precision LLMs still avoid the issue by giving large attention weights to meaningless tokens, which yields harmless attention outputs. The problem only matters when quantizing weights, and quantized performance isn't really the goal of recent cutting-edge LLM development.


I interpreted it as cracking a joke about miscalibrated probabilities in softmax: it tends to be 99.9% sure, or 0.1%, with little in between.


I know it's in vogue on HN to complain about academia, but the blog post is not making a good argument.

The post could have probably gotten the point across in less than 1/4 of the overall length (probably even less than 1/8th); instead the author wrapped the post in lots of informalisms and a thinly veiled complaint about academic publishing.

The result of this is reflected in the discussion here, nobody actually writes about the result/idea behind the post, instead we have ~200 comments discussing the merits of academic publishing vs blog posts and formal vs informal writing.

So I guess if you want to get your blog-post on the front page of HN it's a good writing style. If you want someone to consider and discuss the merits of your idea, maybe not so much.


That's the fundamental reason we end up with an Attention Economy - people have limited Attention to pay to everything, but unlimited capacity/need to receive Attention (via Michael Goldhaber).

This plants the seed for the info explosion (those 200 bikeshedding comments or those 6 billion videos on how to boil an egg).

To counter it we have rankings of comments and links and news feeds from google to fb to hn. But its just another layer of bullshit cause most of the pool of what is being ranked is bullshit.

We are yet to design Information systems that take into account what Goldhaber said about Attention 3-4 decades ago.


You sneer at “getting on the front page of HN” but when you rephrase it to, “discuss something you’ve observed informally” your dismissal loses its oomph.

The point may be to entertain as well as inform. Many humans enjoy the unfocused discussion around the main point, and perhaps the author prefers it to the clinical and formal tone an academic paper tends to take.


For what it's worth, someone pointed out that PyTorch has an optional workaround for it in the Multihead Attention API. But yes, I had to skip over 200 comments ranting off-topic, which was mildly annoying (to me).


I ran an experiment like this and in my setting it didn't help. Not saying there may not have been a bug or something, but I think attending over the current position sort of solves this problem. I.e., when it should not speak, it just emits the current position's value.

edit to add details in case anyone is interested

I didn't add one to the softmax denom. I added a learned parameter (the attention sink) that would be appended to the beginning of QK but would be removed after softmax, so when multiplying by V the totals wouldn't sum to one. I tried variants that included looking at the current pos and not, and also variants that used an FFN to generate the sink per position instead of a learned param. In my setting neither approach really made much of a difference. But I also had a bunch of other weird stuff in there too, so it may be worth trying again.
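
For anyone curious, here's a rough single-head sketch of the variant described above (a learned sink logit prepended to the attention scores and dropped after the softmax); the names and shapes are mine, not the actual experiment code, and causal masking is omitted:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SinkAttention(nn.Module):
        # Single-head self-attention with a learned "attention sink" logit.
        def __init__(self, dim):
            super().__init__()
            self.qkv = nn.Linear(dim, 3 * dim)
            self.sink = nn.Parameter(torch.zeros(1))     # learned sink logit

        def forward(self, x):                            # x: (batch, seq, dim)
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            scores = q @ k.transpose(-2, -1) / (x.shape[-1] ** 0.5)   # (B, S, S)
            sink = self.sink.expand(scores.shape[0], scores.shape[1], 1)
            scores = torch.cat([sink, scores], dim=-1)   # prepend the sink logit
            weights = F.softmax(scores, dim=-1)[..., 1:] # drop the sink column,
            return weights @ v                           # so rows sum to <= 1

    x = torch.randn(2, 5, 16)
    print(SinkAttention(16)(x).shape)                    # torch.Size([2, 5, 16])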


When you say it didn't help, can you clarify what you're measuring? In the context of this post, I think both the performance on your task and the number of outlier weights (and their magnitude) are important.


I was just looking at doing this in pretraining, so I was looking at pretraining losses. The difference was within the range of usual noise so I didn't keep trying.


this is fixing a different issue, not the one you are measuring.


It wasn't really the goal of my experiment to fix this issue for sure, I was trying to see if you could improve attention by decoupling the key used by a position for itself and for future tokens.

Open to being wrong here, but wouldn't it be functionally similar to adding a constant to the softmax denominator? The function could sort of learn a specific position whose sink and q multiply to one, and then removing it before multiplying with V would be exactly identical.


The question concerns outliers ... how did the change manage them?


He's advertising it as fixing the spiking outliers. Did your variant have those outliers beforehand?


I guess, yeah, I was mostly responding to this:

> Now it’s possible that softmax should be replaced wholesale, but it’s worked pretty well for the most part, except for this one wee little bug that prevents attention heads from saying nothing. So I propose a very small tweak on which I am willing to stake all future Internet claims to being correct. The tweak is so small, yet so obvious, and it’s been sitting here under everyone’s noses ever since attention was invented (2014).

I didn't test for outliers, but I don't think this will lead to a large improvement in attention overall or fix a lurking bug.


He’s not trying or claiming to improve attention. He’s trying to reduce outliers to improve the ability to quantize the parameters.


He refers all over the blog post to an "error" in attention. Specifically, he says:

> The problem with using softmax is that it forces each attention head to make an annotation, even if it has no information to add to the output vector. Using softmax to choose among discrete alternatives is great; using it for optional annotation (i.e. as input into addition) is, like, not cool, man.

I'm saying attention already uses the current position to do this, and that if it were a significant error I would expect fixing it to improve the training loss. I sort of interpreted the blog post as being a bit more positive on the idea than just being about improving quantization.


I agree that he used the term "error" somewhat incorrectly. But he seems mainly to be making the point that softmax introduces large outliers, which has only become an issue now that the community is aggressively trying to quantize models.


> I didn't test for outliers

Then you don't know if the approach he is advocating actually improves what he is aiming for


I don't see any results; it'd be more impactful and convincing if there were numbers supplementing the theory. It's not that hard to finetune an existing LM on a small dataset and verify that it works.

I am, however, of the similar opinion that there could be better attention formulations. A paper from 2020 https://arxiv.org/abs/2005.09561 helped a lot in one of the transformer models I trained (not a vanilla LM but a specialised multi-modal graph problem).

It proposes normalised attention which if I'm not wrong should help with the quantisation problem too.


This method was frequently used prior to the ubiquity of dummy tokens. XLNet was the paper that introduced me to this idea. I believe it’s been in PyTorch since 2019/2020. I would not be surprised if someone finds an earlier reference.

I’m surprised by the pompousness in the OP. Especially about something that most people who do transformer research understand. I’m also surprised that so many in the replies are taking the position of “this is what research should look like” when this is clearly an example of why research doesn’t work like this. Peer review is good for many things and one of those things is saving yourself some embarrassment.


He's not being pompous; people appreciate the informality, straightforwardness, and self-deprecation, which are the opposite of pompous.

You are reading some of the more ambiguous self-deprecation as genuine claims.

TL;DR on why this is important and why he's sharing it: it's a sort of niche thing that really only matters if you're trying to run pale imitations of ChatGPT on constrained hardware. That's why it's entirely possible the big guns didn't see it as important; they're not trying to run LLMs on a 3090.


> I’m surprised by the pompousness in the OP.

He’s writing in a colloquial and self-deprecating and humorous tone. I can’t speak to the merits, but I can follow the reasoning perfectly fine. It’d be hard to find something further from pompous.

> saving yourself some embarrassment

Implying of course that being wrong, or not the first one to discover this, is embarrassing. And that’s not pompous?


This is similar to the (old) trick of adding a Uniform distribution component to a Mixture of Gaussians model. It doesn't really change the math wrt parameter optimization and probability evaluation, but it provides a place to capture "background" or "unimportant" data points and improves the model's robustness to outliers.

The motivation follows from the same problem the author points out in the original softmax formulation that it always "forces a choice" when it may be more useful to put a "Not Applicable" option into the model itself.

https://link.springer.com/article/10.1007/s10260-021-00578-2
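
In case it helps, a minimal numpy sketch (my own illustration, not from the linked paper) of the idea: a fixed uniform "background" component over a bounded range gives outliers somewhere to go without distorting the Gaussian components:

    import numpy as np

    def mixture_loglik(x, means, stds, weights, bg_weight, lo, hi):
        """Log-likelihood of 1-D data under a Gaussian mixture plus a uniform
        'background' component on [lo, hi] that catches outliers."""
        x = np.asarray(x)[:, None]
        gauss = weights * np.exp(-0.5 * ((x - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
        background = bg_weight / (hi - lo)     # constant density: the "Not Applicable" option
        return np.log(gauss.sum(axis=1) + background).sum()

    data = np.concatenate([np.random.normal(0, 1, 500), [25.0, -30.0]])   # two wild outliers
    print(mixture_loglik(data, means=np.array([0.0]), stds=np.array([1.0]),
                         weights=np.array([0.95]), bg_weight=0.05, lo=-50.0, hi=50.0))

Without the background term the two outliers drive the log-likelihood to -inf (or drag the fitted components toward them during EM); with it, they just land in the uniform bucket, much like a head landing on the extra "1" in the softmax denominator.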


This reminds me of the normalization bug in StyleGAN. It had this obvious visual artifact of a 'blob' which would appear in otherwise photorealistic images, which was puzzling because it was so obvious how did the Discriminator not squash it? It turned out to be a flaw in the normalization of the AdaIn style layers, IIRC, where the Generator was pumping up numbers and doing weird things to force through information.


I don't really understand the subject matter enough, so I apologize in advance for the meta-comment...

The author mentions that he would maybe have written this as a scientific paper:

> I tried writing a serious-looking research paper about the bug and my proposed fix, but I lost a series of pitched battles against Pytorch and biblatex, so I figured I’d just write a blog post instead. (History is written by the winners; blogs are written by…)

Honestly, thank god he didn't. This paper is so much more readable and approachable than what gets published in "serious" journals. The tone is self-effacing, it does not have an "ego" the way scientific papers tend to have. If all science read like this, and if we were "allowed" to cite research that reads like this, I think we would be much better off. This reads like a conversational, approachable textbook, not like an impenetrable wall.

Is it because I don't understand attention at a PhD level that I hold this opinion? Maybe. Could he be writing like this because he's a layman and utterly wrong about the topic, unlike those Serious Science Authors? Maybe, I don't know.

But my god, wouldn't it be nice to be allowed to write like this?


Nah, scientific papers are supposed to be precise and technical. This reads like those quite frequent suggestions here of switching all equations in papers to plain English or code: it honestly comes from a place of ignorance, and I say that as basically a layman myself.

What should be encouraged is for academics to blog about their research as well. It would even help when recruiting and onboarding new members. Right now the sociological and economical incentives don't promote this at all.


    There was this sociologist who had written a paper for us all to read ahead of time. I started to read the damn thing, and my eyes were coming out: I couldn’t make head nor tail of it! I figured it was because I hadn’t read any of the books on the list. I had this uneasy feeling of “I’m not adequate,” until finally I said to myself “I’m gonna stop, and read one sentence slowly so I can figure out what the hell it means.”
    
    So I stopped-at random-and read the next sentence very carefully. I can’t remember it precisely, but it was very close to this: “The individual member of the social community often receives his information via visual, symbolic channels.” I went back and forth over it, and translated. You know what it means? “People read.”
    
    Then I went over the next sentence, and realised that I could translate that one also. Then it became a kind of empty business: “Sometimes people read; sometimes people listen to the radio,” and so on, but written in such a fancy way that I couldn’t understand it at first, and when I finally deciphered it, there was nothing to it.

  -- Feynman

I disagree. After going through quite a few research papers in my time, I've found the best are the ones that are direct and to the point. I've spent hours or days trying to unravel many papers, just to realize the concepts were straightforward, not very novel, and there wasn't much of real substance to them.

Meanwhile, some of the most impactful papers I've read are direct and to the point. Kadmellia, Bitcoin, BitTorrent, DynamoDB, Firecracker, etc.

It seems like, when you have something of substance to say, you say it. When you don't you overcompensate by falling back on building an intricate puzzle of jargon and convoluted equations in an attempt to make what you're saying sound far more important than it really is.

As LLMs get better, I look forward to the day when every journal has a standard LLM filter you're required to apply to your paper that unravels all of this nonsense and rewrites it in a more straightforward way, if not to publish directly then just for the editors to verify there isn't a simpler way to convey your ideas. I suspect that if we had an ELI5 filter for most journal articles, we'd discover that a majority of the words that get published have very little substance at all.


Systems research papers do not represent all research papers out there, not even in computer science.

In cryptography, certainly a paper with formal definitions and proofs can be much more valuable than a corresponding blog post. It's a field where formalism is desired, if not necessary. Otherwise you can't check other people's "proofs", or even know what model you're working in.

I think, since people haven't come up with better formalisms, sometimes it's quite obtuse, which gets mistaken as "academic writing", when really it's a best effort to formalize.


Requiring formalism does not preclude attaching an informal but intuitive description of the formal definition or proof. Unless the authors don't understand very clearly what they are talking about, or they want to prevent others from understanding their concepts too easily, I don't see any reason for them not to attach an ELI5 in addition to the formalism.


Sure. But it's an ELI5 "in addition to formalism", not "in lieu of formalism". In theory conferences like STOC or FOCS, the first section of the paper often comprises such an overview.

Certainly some papers are better written than others. But sometimes a blog post cannot replace a paper, unless it also goes into the depth and detail that formalism requires. (Then it becomes a 30 page blog post, where most people don't read past the intro.)


The complaint about research papers is that almost all of them omit the ELI5 and provide only the formalism.

You can have both and weave them together into a digestible narrative. I see Physics textbooks sometimes written this way.


Papers are mostly read by other researchers, where the added background is actively bad because it obscures the real meat of the paper to the main audience.

If you just wanted a digestible intro then you would usually buy a textbook.

I think the argument that every research paper ought to be a mashup of a textbook + the actual research to be a bit silly from a “people should specialize at what they’re good at” standpoint.

Put in another context, I also don’t want every recipe to reintroduce what it means to “fry” or “braise” or “marinate”. We have Google for that.


I've long wanted an informational slider to bits of text. Something where you can zoom in and out to the level of desired complexity. LLM's might be able to fill in some of those gaps. You could turn any paper into a introduction of the subject it's a part of.


This sounds like a good use case for local llm models. Browser plugin, precooked prompts for different levels of detail, maybe a lora to give the model some idea of expected output. I bet some of the 13b models could do a useful job on this even if they were imperfect.


Look for "stretchtext"


I don't know that much about AI, but my experience in other areas has shown me that 'more grown up' literature that feels harder to parse when you're starting out later becomes the precise technical information you need as you get deeper into a subject. Like W3Schools when you start out in web dev vs MDN when your skills are more mature.


I believe Feynman understood that he was oversimplifying, and I believe he was able to do so because his reason for reading the paper was not the same as the reason another sociologist might have. Thus a sentence like, "The individual member of the social community often receives his information via visual, symbolic channels", does, to a non-expert, mean "people read", but to another sociologist or a researcher in related fields, phrases like "individual member", "social community", and "visual, symbolic channels" would be terms of art. That means an expert in the field could read "social community" and it would mean, cognitively, an entire set of concepts in the field.

In short, jargon matters. People here can talk about functional, procedural, and object-oriented programming because each of the three words has more than just the dictionary meaning, to those of us in the field. In the same way we can talk about linear algebra and know it doesn't mean "algebra on lines".

Yes, it's possible to write scientifically without jargon and wordiness, but it's a lot of effort and takes much more space to say "a group who follow a social structure within a society (culture, norms, values, status). They may work together to organise social life within a particular place, or they may be bound by a sense of belonging sustained across time and space"[1]

1 https://othersociologist.com/2013/11/20/sociology-of-communi...


Visual symbols could be anything from written words to police uniforms. It's not oversimplifying— it's flat-out wrong. It would be like reading

Expressions representing numbers may be combined with an expression representing a primitive procedure (such as + or *) to form a compound expression that represents the application of the procedure to those numbers.

And an English professor haughtily responding, "you know what that means? 'Computers compute!' This SICP book is just a pile of jargon that could be dramatically simplified!"

His dismissal revealed nothing about the topic, but a whole lot about how so many in the "hard" sciences view others. Don't understand the text? It's the text's fault! For I am a real scientist, and if I don't understand it, it's not understandable!

He might have been a genius, but he should have stuck to subatomic particles and left exploring human behavior up to the people who'd done the prerequisite reading.


Well, maybe, but you can rationalize arbitrary amounts of pointless jargon that way.

Besides, in the example Feynman gives, the simple sentence is actually shorter. Maybe that shorter sentence loses some information that the jargon carried, but Occam's razor suggests the writer was just trying to sound smarter.


Some bad writing certainly comes from trying to sound “academic” or “scholarly” but there’s more to it than that.

A lot of research involves lumping and splitting: what underlying properties do these seemingly-different things share (or vice versa)? For example, reading text is just one possible instantiation of a "visual symbolic channel." Traffic lights, road signs, gauges and dials, logos, and clocks also carry information the same way. If you want to discuss "reading and reading-like activities", you may want some kind of umbrella term.

Plus, you may want to contrast them with other ways of sharing information: non-symbolic systems that literally depict the item in question (photos on a picture menu, for example) or using a different sense altogether, like church bells for telling time.


> It seems like, when you have something of substance to say, you say it.

And this blog post probably could be condensed into 1/4 of its size or less with a less conversational/bloggy tone.


There are words that are added to drive the point in multiple ways, ease into it, and make the text more engaging.

And there are words that are added to add empty padding, keep up academic pretenses, and appear smart.

The post could have been condensed, but it would lose the former, not the latter.


Good rhetoric takes time and energy from both the author and reader


Not an academic here, but I've read (and continue to read) through research papers regularly.

The original bitcoin paper is a great example. I was able to follow the paper almost fully at my first read itself—despite my not having a formal background in maths.

...and as you said, many of the insubstantial papers hide behind jargon and unnecessarily complex equations, just to camouflage their lack of substance. It's frustrating to spend time deciphering a paper, only to realize that you've essentially wasted that time.


I hadn't seen that Feynman quote before, but I ran into the same phenomenon when reading Donna Haraway's books (Cyborg Manifesto, Modest_Witness@Second_Millennium.FemaleMan©Meets_OncoMouse, Primate Visions).

The criticism was: "Haraway's work has been criticized for being 'methodologically vague'[39] and using noticeably opaque language that is 'sometimes concealing in an apparently deliberate way'"


>Haraway's work has been criticized for being "methodologically vague"[39] and using noticeably opaque language that is "sometimes concealing in an apparently deliberate way

So you're saying that "Her work is basically handwaving and bullshitting".


Yes, but also, wrapping the handwaving and bullshitting in a layer of obfuscation:

"Michel Foucault’s biopolitics is a faccid premonition of cyborg politics, a very open feld. By the late twentieth century, our time, a mythic time, we are all chimeras, theorized and fabricated hybrids of machine and organism—in short, cyborgs. The cyborg is our ontology; it gives us our politics. The cyborg is a condensed image of both imagination and material reality, the two joined centers structuring any possibility of historical transformation. In the traditions of “Western” science and politics—the tradition of racist, male-dominant capitalism; the tradition of progress; the tradition of the appropriation of nature as resource for the productions of culture; the tradition of reproduction of the self from the refections of the other—the relation between organism and machine has been a border war"

(donna was woke before woke was a thing)


> (donna was woke before woke was a thing)

Donna Haraway was born 6 years after “stay woke” in its sense as an admonition to maintain alertness to the racist context was coined. Leaving aside a debate over whether her work is a good match for “woke”, she very much cannot have been woke before woke was a thing. (Before its recent replacement of “politically correct” as the American Right’s preferred, meaning-stripped, label for everything it disagrees with, sure, but “woke” was a thing long before that.)


>Leaving aside a debate over whether her work is a good match for “woke”, she very much cannot have been woke before woke was a thing

A game of being pedantic is always welcome:

She very well could have been "woke before woke was a thing", because "woke" as the parent means it in her case refers to the modern usage (of the last two decades or so), not the original term from the 40s that might have preceded her birth.

So take the parent's comment to mean:

"She was woke, in the modern, circa-2000s+ sense, before woke, in the modern circa-2000s+ sense was a thing, not in the 1950s namesake sense".

Similar to how somebody could have been a hipster (in the 2000s+ sense [1]) before a hipster was a thing (before 2000s), even if they have been born in the 70s. Sure, the term already existed before the 70s, but it referred to a different thing.

[1] https://en.wikipedia.org/wiki/Hipster_(contemporary_subcultu...


The 1938 sense in which it was coined is exactly the sense of the 1950s, and the sense that got increased attention circa the 2000s and was catapulted to prominence alongside BLM (which itself was a response to the same kind of event that the art in which the phrase was coined responded to).

The only newer sense is the American Right’s use of the term to replace “political correctness” as an empty epithet for everything and everyone it disagrees with.


Language grows organically, and the American right gets as much say as the American left in defining what a word means or how proponents of a movement or social fad are seen in practice (besides, woke's standard definition is just "awake", if someone insists on the "original meaning").

So, one side could see woke in theory as a noble activist/social consciousness practice, which can not go wrong and helps liberate us all.

The other side might see woke in practice as intolerable virtue signalling and self-aggrandizement, with actions that often border on the farcical.


Thanks; that's exactly what I meant. I leave these things out because I assume not everybody is pedantically waiting to call me out on a slight variation on their personal belief system.


> I disagree. After going through quite a few research papers in my time, I've found the best are the ones that are direct and to the point. Many papers I've spent many hours/days trying to unravel just to realize the concepts were straightforward, not very novel, and there wasn't much of real substance to the paper.

Can say the same thing about code. Some people just honestly don't want to give away how simple the core logic is seemingly, and will lead you through myriad twists and turns to finally see the point.


> There was this sociologist

Found the problem.


The writing quality of academic papers is very poor, whatever its intended characteristics are, and we deserve better.

I'm skeptical that the only way for them to be precise and technical is to make them impenetrable. I think there is a culture of academic writing (many different cultures, really) that has adopted a voice and writing style which became a parody of itself over time.

Here's a trivial example: You frequently see papers use the passive voice, something a middle school English teacher would mark with a red pen. 500 participants were asked, vs. we asked 500 participants. In what sense is the former more precise and technical? It's not. It does not convey any additional meaning. People use it to sound objective and distant, even when they really aren't.

Realistically, academic writers usually don't even think about it as much as that. They're just copying the tone of other papers, because there is a culture and it enforces certain behaviors on its members irrespective of the value.


A pain in the ass was observed while writing was performed in the passive voice.

Nobody likes doing it, I think. We just do it because we’re scared our papers won’t be accepted otherwise.


In philosophy papers you see authors often use the pronoun "I", similar to blog posts. But they have other ways to make them hard to parse for outsiders.


Either your example is too trivial to justify your point, or the point itself is trivial. It's right for an academic to distance themselves from the subject of their study because we do need researchers who try not to be biased. If they fail that and then correct themselves, then what's the problem? Complaining about inconsequential uses of tone is obsessing about form over function and reeks too much of insecurity, to be honest.


They aren't magically "objective" because they used the passive voice. It's a performance.


Of course language does not guarantee that the study is objective—that would be in the design of the experiment, the reproducibility of results, and the absence of conflicts of interest among the researchers. Using the passive voice however elevates the outcomes being reported as facts that actually happened, instead of mere personal experiences.

People complain all the time about news being biased for being told from a reporter’s point of view, but complain all the same when events are reported in an encyclopedic manner as researchers do when they remove themselves from the events and the outcomes of their studies.


I'm convinced that the value of active voice is not precision and clarity, but rather the subliminal egocentrism away from the object (the research) towards the subject (the researchers) who need to receive credit for the work. The royal "we" also helps frame the work as a collaborative effort with the audience.


That's rubbish. Passive voice has a number of detrimental effects: it increases text length without adding information, it makes the subject (acting entity) and object (entity acted upon) easier to confuse, and it leaves the reader unsure who actually did things (which some people confuse with objectivity).

That said, the assertion that most scientific articles are written in the passive voice has been outdated for quite some time. Most journal style guides advise using the active voice, e.g. https://www.nature.com/nature-portfolio/for-authors/write


> it confuses the reader about who actually did things

When scientific papers have a clear list of authors and delineated section headings, this point is moot. And in such papers, again, repetitive strings of sentences that begin with the same "we..." emphasize the producers of the work over the work itself.


I agree with everything you say. Papers really are a bit too hard to read sometimes, though I'd argue it's often not because of an overly technical tone so much as writers cutting out a lot of background material for brevity and assumed familiarity.

>What should be encouraged is for academics to blog about their research as well. It would even help when recruiting and onboarding new members. Right now the sociological and economical incentives don't promote this at all.

I will add onto this that a lot of journals have been pushing for video abstracts and "plain English" abstracts. For the most part I don't see these too often, but when they're there they're appreciated, and I vaguely recall that someone found that citations go up when they're used (specifically plain English; I don't think anything has been done on video abstracts).

There are a lot of good blogs for computational academic subjects (ml, bioinformatics, comp neuro, etc) but I see less for bio and non-software engineering. Math and physics seems to have some really notable blogs, but beyond what gets posted to HN and linked further on those blogs, I can't comment.


"it honestly comes from a place of ignorance, and I say that as basically a layman myself"

Here is an added complication: succinct technical communication can be efficient when communicating with peers who work in exactly the same domain, on similar problems as you, and who want to digest your main ideas quickly.

On the other hand, for any particular paper, the size of the audience to whom it is directly relevant and addressed to can be small. The size of the audience who got to reading it anyway may be vast. (Maybe I am reading your paper because someone cited a method paper that in lieu of a proof or explanation writes just two words and citation to your paper. Maybe I am a freshly minted new student reading it for my first seminar. Maybe I am from a neighboring field and trying to understand what is happening in yours. Maybe I tried to find what people have already done with particular idea I just had and search engine gave your paper. And so on.)

During my (admittedly lackluster) academic career I recall spending much more time trying to read and understand papers that were not addressed to me than papers that were and where I enjoyed the succinct style that avoids details and present the results. (Maybe it is just an idiosyncratic trust issue on my part, because I am often skeptical of stated results and their interpretation, finding the methods more interesting). But that is not all.

I also noticed that genuine misunderstandings coming from "brief" communication of technical "details" were quite common; two different researchers would state they "applied method X to avoid Y/seek Z [citation]" in exactly so many and almost exactly the same words, where X, Y and Z were complicated technical terms, yet the authors would have quite different opinions about what those words meant, what the intended reading was, and how and why X should be implemented.

In conclusion, I think many a scientific field would benefit from a style where authors were expected to clearly explain what they did and why (as clearly as possible).


>Nah, scientific papers are supposed to be precise and technical.

They're also, more often than not, tedious, badly explained, error prone, oft-skipped, and hardly ever read carefully, even during peer review for the paper that contains them. That's how mistakes stay unnoticed for decades in influential papers with tons of citations.

In essence, a paper's tone and language are often more formality, academic tradition, ritual, and padding for publication purposes than they are serving a real purpose.


Well, I'm not so sure. It seems to me that someone could perfectly well devise an experiment based off of this (another poster chastised me for saying paper, so) blog post.

Equations are perfectly clear. I was able to follow his reasoning perfectly well.

I cannot say the same for so many papers (tm) that I've read. Mostly in a similarly computational (though non- deeplearning) applied math domain.


Strongly agree. “Why are academic papers always written in such mumbo jumbo?” is the same complaint as “Why are contracts written in such legalese?”, which is a manifestation of “I’m smart and I don’t get this, so the author is dumb for not writing clearly.” It’s a natural human bias that most HN denizens insist they don’t possess, but of course we do.


> Nah, scientific papers are supposed to be precise and technical.

> What should be encouraged is for academics to blog about their research as well.

Why so binary? A blog would be hard to find, why not have both in the paper?

My view is similar to that of code vs docs: code should be as small, and as precise as possible, whereas docs are best when they’re explaining to humans how things fit together, high level. Also easier to maintain.

Hyper technical natural language mixed in with math is almost the worst of both worlds: low density of the actual formulas, with an incomprehensible wall of text surrounding it. And clearly this is an issue also for phd domain experts.

Not saying academic writing could be super simple, but I also see no reason to believe the status quo is optimized more for comprehension than for, say, social posturing.


I disagree, because it isn't possible for language to be precise on its own syntactic merit. There is meaning and there is context, and the biggest problem with research papers is that the context of many statements in the paper is incredibly ambiguous. The reason for that is that the papers are trying to be "concise". Context can only be disambiguated with more statements: you must eliminate potential interpretations that a reader could make.

"Spectrum sharing in an “apple-like” or a fixed set sense is not a coexistence. ". What does that mean? Coexist? Who knows, the author thought they were being precise, but they understood the statement they made with a head full of context that gave it precise meaning. As readers, we can only scratch our own heads as to what that context could possibly be.


Leslie Lamport definitely doesn’t share your opinion. A known fact about the Paxos paper is that there are no dumbed-down summaries worth reading because the proper thing is so approachable. Not sure whether you only have to sound smart when you've got nothing to say, but it certainly feels like it could be the case.


Paxos is so mystifyingly hard that Raft was invented as part of a project to understand Paxos (and the advisor and proponent of the project was John Ousterhout, who's pretty badass). There are also, I believe, a few papers trying to explain Paxos more clearly.


Just as a quick source to my claims:

1. The raft paper is titled "In Search of an Understandable Consensus Algorithm"

2. The abstract of this tutorial on Understanding Paxos https://www.ux.uis.no/~meling/papers/2013-paxostutorial-opod...

3. Lamport's own "Paxos made simple" https://lamport.azurewebsites.net/pubs/paxos-simple.pdf


> A known fact about the Paxos paper is that there are no dumbed down summaries worth reading because the proper thing is so approachable.

A known fact is that it's impossible to actually implement it correctly, and the "approachable" paper seems to be a significant factor in this.


I've read a lot of scientific papers in the comp sci / machine learning space and they are rarely precise. It's been over a decade since I've read many papers, so maybe this has changed, but I remember reading a paper out of Microsoft about how to make spell-correcting auto-completion for search, and it was nearly impossible to figure out precisely how it was implemented. Precision would have been achieved easily by providing code and a sample data set; instead it was a mix of prose and math equations with many gaps you had to guess how to fill.


Ah yes, my old supervisor was very fond of that strategy.

"Make it sound like we do cool stuff; but don't make it so precise that they can re-implement what we do. Let them come to us so we can co-author papers."


Not always; ReLU is a fucking line, and most papers write stuff in the most complicated way possible to sound smart.


More fundamentally he's postulating that this will work in a blog post but he doesn't do any experiment to prove that it does.


I think maybe it's because he didn't have experimental results showing that it worked. Not a knock against the author; there are just so many things that seem like good ideas but don't end up working well in practice, and a paper like this without results is hard to value.


Yes, definitely. If he tried to have it published, the lack of experimental results would definitely be a glaring error.

But this is still scientific communication. It's really nice that it's legible!

> Even though softmax1 is facially quite boring, I’m 99.44% sure that it will resolve the outlier feedback loop that’s making quantization the subject of cascades of research. If you want to run some experiments and prove me right, DM me on Twitter and we’ll get a paper going.

I'm guessing that in the stodgy world of science, a communication like this might happen over lunch at a conference, limited to a small clique of researchers who are zealously guarding their next paper. Who could blame them, publish or perish!

But someone will probably test this theory out (after my read, it will probably happen in llama.cpp with preliminary results on GPT-2 by next week) and achieve results, and it will happen quickly and legibly to the outside world, because this was published openly and without all of the pretension that formal science (tm) has. If it works, it works. Stuff like this is the soul of the internet. Sharing knowledge and making it legible for all.


There's a perfectly good venue for this communication: a workshop.

Workshop submissions often don't need evidence. They just need a small kernel to spur discussion.

Without experiments, there is no hope of publishing this in anything more than a workshop. Nor should there be.


Then again, if you don't have access to giant compute clusters you can't test this, so it's either a blog post or nothing. I believe the outlier problem that this solves only appears for very large models.


That isn’t true at all. Train a smaller model on a smaller dataset. You can even train on your laptop. It’s definitely feasible. This is just a proof of concept, it doesn’t need to beat state of the art.


Maybe I edited my comment too late.


> I believe the outlier problem that this solves only appears for very large models.

Any reason to believe this? The author never mentioned it, and I can’t think of any other a priori reason why it should be true.


See figure 1:

https://arxiv.org/pdf/2208.07339.pdf

Outliers appear at model size 6.7B and are not present at 2.7B


Sure, emergent properties can arise as parameters increase. Everyone knows that. That’s a much less specific claim than to say that the benefit of modifying softmax can only arise as an emergent property after N parameters, and therefore the benefit can only be evaluated on models above a certain size. To my understanding the author of TFA isn’t suggesting the same issue as the one in your linked paper.


The second heading in the TFA is "It’s All About Outliers"


6.7B isn't "needs a datacenter" scale.


It's in the million dollar range. XLNet, which is a 1.3B model, cost $245,000 to train, for example.


To finish the author’s analogy:

Blog posts are written by those who arrive first.

In a weird way my mental model is: blog posts are the recon team discovering a new idea. They might have errors. They might be incomplete. Maybe they're outright wrong. Stakes are lower: it took less effort to get there, and there's less to lose if the position is abandoned.

Then papers are authored, often much later, and they’re the regulars coming in to fortify a newly captured idea. They provide (or at least are supposed to) rigor to the idea. A fortification of a position that we decide is worth holding.

Yeah, this analogy is probably sloppy. But in my brain there’s an eternal conflict against ignorance as we keep advancing into the unknown.


Counterargument: this blogpost is worthless. You get all the way to the end and then find out he hasn't actually tried it, not even on a toy model. It's just a neat idea he thinks will work.


I wouldn’t quite say its value is zero. It’s worth something, but a lot less than if it had been shown to work empirically.

Explainers and their folksy, imprecise tone are good for things we already know are true. I’m skeptical on things which are unproven.


Why would that make it worthless?


Among other reasons, because the decoder-only version of the original transformer architecture has proven weirdly resistant to these kinds of hacks and clever optimizations.

Ideas like sparse attention, tree attention, residual attention, etc, all sound good on paper, but when researchers try to reproduce them they either find no results or results that don't scale. Even AliBi is turning out to be less powerful than scaled-down positional embeddings. It's almost a bitter lesson on its own: you can't beat the original transformer.

Optimizations that do stick around tend to be the ones that preserve the original algorithm but help with caching or memory accesses.


Because there are a thousand ideas a minute in this field that meet the "it's worth trying" bar but don't actually pan out to make any difference. It's the equivalent of a blogpost that says "if someone else turned my idea into a business, it would be a billion dollar business. But I won't bother."


Because until he tries it, who knows if it works?

There are a thousand papers out there making minor tweaks to the transformer architecture. 99% of them are also worthless and forgotten.


> Because until he tries it, who knows if it works?

That's precisely what he shared this for, though. So someone willing to train a model with this tweak tries it.


With say system architecture, you can muse on stuff like "well if Kubernetes made this decision, it would definitely be more secure" or "it would scale up quicker" without empirical evidence and other people could argue "yes I agree because" or "no I don't because"... etc.

With large ML models, there probably is no intuition like this. We just don't know "if I do the common sense thing X, it surely will produce better results for a given benchmark" ... well we have no idea until it is tried out.


He says in the very first paragraph:

> I lost a series of pitched battles against Pytorch and biblatex, so I figured I’d just write a blog post instead.

So I think your accusation of his burying the lede on the lack of experiment is unwarranted.


> The tone is self-effacing, it does not have an "ego" the way scientific papers tend to have.

I can't imagine judging scientific papers based on whether the author might be looking down on me, or thinks he knows better than me.

> if we were "allowed" to cite research that reads like this

Maybe you're looking down on yourself? You can cite anything you want to cite.


Well if you yourself are trying to publish in a scientific venue you can't always cite exactly what you want to cite. Though it's probably uncommon for a peer reviewer to ask for a specific citation to be removed, the review process absolutely does affect the references list, and expectations about this process affect it doubly so.


In ML, no one is going to police your citation list. I've cited some weird stuff in my papers, including ideas from tweets and random quotes from Jeff Dean. It's never been a problem.


> This paper

It's not a paper. It's an idea that sounds plausible, presented in a highly entertaining form.


A lot of thoughts in this thread on what academic papers are or should be, let me give my own opinion as a person who tries to write papers.

Papers should be structured like fractals - that is, they should be "self-similar". The main text of the paper after the introduction should go into all the necessary details demonstrating the origins of the idea and proving that it has value. Then the introduction section should summarize all this, and take a less rigorous tone. The abstract should be a summary of the introduction. And then the title should summarize the abstract. If you really have a lot of technical work to do, maybe you can write a super long appendix and have the main body summarize that.

I myself probably spend as much time reading paper introductions as I do reading paper bodies, which means that probably 90% of the papers I read, I only read the introduction. I do this because I enjoy it more - I like new ideas, and the intros are a great way to get a lot of them. This blog post reads like a great paper introduction to me. It's easy to trick yourself into believing something is easy though, so an academic paper would have to back this up with an experiment.


There isn't much difference between a blog and a whitepaper, except that people tend to write blogs more casually and whitepapers more seriously (and some academics even only accept things that look more serious).

But a good writer can write great articles in whatever format they wish.


I learned more from this post than from a thousand papers. Amazing writing!


> it does not have an "ego" the way scientific papers tend to have.

What do you call it when somebody takes the time to write about "a big discovery" they've made, but doesn't take the time to check if somebody else already did it? It's not like it's in some forgotten paper nobody has seen. It's in Pytorch itself.

Also this: "I’m 99.44% sure that it will resolve the outlier feedback loop that’s making quantization the subject of cascades of research."


It's interesting, because as a scientist who reads and writes these kinds of papers, my first impression was: This guy has a pretty big ego or is otherwise badly miscalibrated if he believes his genius idea has a "99.44%" chance of preventing outlier activations without doing any experiments.


Not ego, he's playing on the old Ivory Soap slogan "99+44⁄100% Pure"

https://en.m.wikipedia.org/wiki/Ivory_(soap)


This is why folks like gwern publish their own research this way, e.g. his analysis of GPT-3: https://gwern.net/gpt-3

We call him an "independent AI researcher" because his google scholar is "bland" compared to many academics who play the academia game - https://scholar.google.com/citations?user=yk1QMowAAAAJ&hl=en


I can see AI being used to make scientific papers more approachable like this.


Are most AI papers even published beyond arxiv anyway?


It would be amazing if academia started replacing papers with videos + code

I want to see: an explainer of the science/ideas/experiments/hypotheses

And instructions on how to reproduce the experiments/results

Some YouTubers are going in this direction


+1 to including code with your paper. It improves reproducibility and transparency. There’s even a well-known website dedicated to this purpose.

For the rest of it I don’t care. As long as researchers understand what’s going on, that’s what matters.


I'm not an academic, but some of the notation and terminology they use makes me want to hunt them down and 'clockwork orange their eyes open' until they can show me how their math is "intended" to work.

Inconsistent math notation in papers, along with vague terms in descriptions, makes me so mad.


Most papers already have code, and videos are very common.


Videos showing some result, but almost never a video of someone explaining the thing they are doing

When they include good videos, they really stand out


oh god, please, no more videos...


The proposed replacement definitely makes more sense (and I've always found the absence of a "failed query" to be puzzling in standard attention), but, in deep learning, things that make more sense don't always actually get better results. So I'm curious whether this has been tried and carefully evaluated.


It would be an amusing find if "Black Swan mega-activations" actually, if unintentionally, made the model smarter...


The "missing 1" is a waste-category that is implicitly re-scaled.

The explicit 1 formulation is used in binary softmax (the sigmoid), and the implicit (unseen) 1 is used in multinomial softmax. I suspect this is the old "notation B looks silly by the standards of notation A."
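
Concretely (a quick numeric check, my own illustration): the sigmoid's "+1" is just an exp(0) vote for a null option, and the proposed softmax1 is ordinary softmax over the logits plus a fixed zero logit:

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    # Binary case: sigmoid's explicit "+1" is an exp(0) vote for the null class.
    x = 1.7
    print(np.isclose(np.exp(x) / (np.exp(x) + 1.0),
                     softmax(np.array([x, 0.0]))[0]))        # True

    # Multinomial case: softmax1 == softmax over the logits with an extra 0 appended.
    logits = np.array([2.0, -1.0, 0.5])
    softmax1 = np.exp(logits) / (1.0 + np.exp(logits).sum())
    print(np.allclose(softmax1, softmax(np.append(logits, 0.0))[:-1]))   # True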


If I followed this correctly and didn’t mess up my indices, adding 1 to the softmax denominator is exactly equivalent to appending an extra zero to the softmax input (effectively casting an exp(0) vote for a new null option) and appending an extra row of zeros to V (so the null option is all zeros).

The latter seems like something training could figure out by itself (zero doesn’t seem like a hard place to land with the weights producing V, although a bunch of zero weights would be needed), but the former is a bit awkward, as QK^T is quadratic in the weights.

In any case, this seems intuitively quite reasonable. But I do wonder whether the 1 in the denominator (equivalent to an exp(0) vote) is the best choice if the goal is to quantize well. 0 is in the middle of the numerical range, and perhaps the implicit null vote should be weighted lower than the middle of the range.
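
If it helps anyone sanity-check the indices, here's a tiny PyTorch verification of that equivalence (my own sketch, single head, no masking):

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    t, d = 5, 4
    logits = torch.randn(t, t)     # one head's Q K^T / sqrt(d) scores for a length-t sequence
    V = torch.randn(t, d)

    # softmax1: add 1 to the denominator.
    weights1 = torch.exp(logits) / (1.0 + torch.exp(logits).sum(dim=-1, keepdim=True))
    out1 = weights1 @ V

    # Equivalent: append a zero logit (an exp(0) "null vote") and a zero row to V.
    padded_logits = torch.cat([logits, torch.zeros(t, 1)], dim=-1)
    padded_V = torch.cat([V, torch.zeros(1, d)], dim=0)
    out2 = F.softmax(padded_logits, dim=-1) @ padded_V

    print(torch.allclose(out1, out2, atol=1e-6))   # True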


I follow the argument but the proof of the pudding is in the eating. I don’t know what “battles” the author lost to PyTorch lately but a good test would be to modify one of the smaller models (maybe nanogpt) and swap out all of the softmax calls for his quiet softmax.

I didn’t see anything relevant on alternatives to softmax, since TFA is specifically questioning softmax in a multihead attention context.

Ultimately, neural networks are arbitrary function approximators. It doesn’t necessarily have to be “right” internally to fit the data. But if this new softmax allows transformers to learn more, that’s great.
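
If someone wants to try that, the swap itself is small. A numerically stable drop-in might look like this (untested sketch; the stabilizing shift has to preserve the implicit zero logit, so the "1" becomes exp(-m)):

    import torch

    def softmax_one(x, dim=-1):
        """'Quiet softmax' from the post: softmax with an extra +1 in the denominator,
        so a row of scores can assign (almost) no weight to anything."""
        m = x.max(dim=dim, keepdim=True).values.clamp(min=0)   # shift only when logits are large
        e = torch.exp(x - m)
        return e / (torch.exp(-m) + e.sum(dim=dim, keepdim=True))

Then retrain with F.softmax replaced by softmax_one wherever the attention scores are normalized (in nanoGPT that would be the manual attention path in CausalSelfAttention, if I remember the code right); as the reply below notes, you'd have to train with it rather than just swapping it in at inference time.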


> a good test would be to modify one of the smaller models (maybe nanogpt) and swap out all of the softmax calls for his quiet softmax.

You'd have to train the model with the quiet softmax before inferencing with it would work.


This is right below the "Have Attention Spans Been Declining? – Yes, 65%" post, lol, brilliant. In general: human attention decreasing, AI attention increasing.


For posterity:

    1. Have attention spans been declining? (slimemoldtimemold.com)
       338 points by janandonly 4 hours ago | flag | hide | 254 comments

    2. Attention Is Off By One (evanmiller.org)
       400 points by elbasti 4 hours ago | flag | hide | 129 comments

Note that the #1 post is probably there because the title earlier had the provocative "Yes, 65%" appended to it. So even more numerical.


"In this post, I prove that attention spans have actually declined by 64%, contrary to widely-publicized reports of 65%..."

