Scaling Transformers to 1B Tokens (arxiv.org)
234 points by mottiden on July 6, 2023 | 68 comments



The benefit of "traditional" O(N^2) transformer attention is you correlate every token to every other token. So, in the limit, your network won't "miss" much.

When you abandon O(N^2) attention, you are forced to start adding heuristics to choose what to correlate. Any time you see one of those giant context window LLMs, you need to be asking what heuristics they added, what is getting correlated, and what is not getting correlated.

This paper chooses an exponential heuristic where tokens further in the past get exponentially less attention. This heuristic is fine for certain tasks like responding in a chat room, where the most recent tokens are the most important, but bad for tasks where tokens are roughly equally important throughout the text, such as a dense academic paper or a reference manual.
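
To make the kind of heuristic I mean concrete, here's a toy numpy sketch where each pairwise attention score is damped by alpha per token of distance. Illustrative only: it still builds the full O(N^2) matrix, and it is not the paper's actual dilated-attention scheme.

    import numpy as np

    def decayed_attention(q, k, v, alpha=0.99):
        # q, k, v: (seq_len, d) arrays. Ordinary softmax attention, except each
        # pairwise score is damped by alpha**distance between the two positions.
        n, d = q.shape
        scores = q @ k.T / np.sqrt(d)
        dist = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
        scores = scores + dist * np.log(alpha)   # == multiplying the weights by alpha**dist
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ v

    x = np.random.default_rng(0).normal(size=(128, 16))
    out = decayed_attention(x, x, x)             # (128, 16)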

The bitter lesson [1] is going to eventually come for all of these. Eventually we'll figure out how to machine-learn the heuristic rather than hard code it. Recurrent neural networks (RNNs) do this implicitly, but we don't yet know how to effectively train RNNs on ultra-deep sequences.

Another possibility is learning a heuristic for non-recurrent LLMs via reinforcement learning, such as in [2], which is basically a reinforcement learned "auto-researcher" that was trained in a style reminiscent of AlphaGo.

[1] http://www.incompleteideas.net/IncIdeas/BitterLesson.html

[2] https://arxiv.org/pdf/2109.00527.pdf


I would like to take a parallel view of the Bitter Lesson and how it's playing out. There are exceptions. It's not only computation but also a mix of:

1. Decades of theoretical breakthroughs coming together.
2. Collective human creativity and perseverance.

Take Yann LeCun, Geoff Hinton, etc.: they have been working since the '90s, and several milestones were hit, but it only caught fire/went on steroids once the application (and the associated funding) was found, thanks to creativity in the tech sector. And if the computation had somehow been available earlier, I am not sure it would have happened as quickly.

Another example: not all methods under the AI umbrella depend on crazy amounts of computation and data. Take autoregressive models in the social/life sciences. For example, look at Stan, which broadly does hierarchical Bayesian inference using Monte Carlo methods in social science.

It took some hard theoretical advances to move the needle on Monte Carlo simulation methods: detecting convergence, getting posterior sampling to work with non-conjugate priors, etc. The new methods are better by leaps and bounds than the conventional methods in the field, and 2013-era computation would be enough to run the modern models in most cases.


Neither of your points is really valid. There have been decades of theoretical breakthroughs in computational linguistics too (Have there been any in Deep Learning?). There has also been a large amount of human creativity and perseverance in computational linguistics, arguably more than I have seen in Deep Learning. Yet not one useful algorithm has come from linguistics. In fact, the old adage about speech processing can be applied to Natural Language Processing: "Every time I fire a linguist, my performance improves by a few percent".

The bitter lesson is bitter, and important to keep in mind, exactly because human creativity and perseverance do not matter in the face of it. Consistently, the only methods that work are those that scale with computation; everything else does not matter. I would take a more extreme view: if computation hadn't followed Moore's law, we wouldn't have invented alternate methods that do not require massive computation; we would simply have failed at even the most basic tasks of intelligence and be stuck in the 1960s. A scary thought, but a true one, I reckon. If computation had kept following Moore's law but a few stalwarts like Yann LeCun etc. hadn't existed, we would likely have found alternative architectures that scale and work. Maybe not as good as ConvNets, but transformers aren't as good as ConvNets either; they just need to scale.


I'm not sure that the Bitter Lesson is the end of the story. The Bitter Corollary seems to be that scaling computation also requires scaling data.

Sometimes that's easy; self-play in Go, for example, can generate essentially infinite data.

On the other hand, sometimes data isn't infinite. It can seem infinite, as in the aforementioned NLP work, where a computation-heavy ML system can process more data than a human can read in a lifetime. However, our LLMs are already within an order of magnitude of reading every bit of human writing ever, and we're scaling our way toward that data limit.

"Clever" human algorithms are all a way of doing more with less. People are still more data-efficient learners than large ML systems, and I'm less sure that we'll be able to compute our way to that kind of efficiency.


I think Geoffrey Hinton addresses this point well in his recent podcast with Pieter Abbeel. He says, and I paraphrase: current deep learning methods are great at learning from large amounts of data with a relatively small amount of compute. The human brain, on the other hand, with around 150 trillion synapses/parameters, has the opposite problem: parameters/compute are cheap but data is expensive. It needs to learn a large amount from very little data, and it is likely that a large amount of regularization (things like dropout) will be required to do this without overfitting. I think we will have a real shot at AGI once 100-trillion-parameter models become feasible, which might happen within this decade.


> The bitter lesson [1] is going to eventually come for all of these. Eventually we'll figure out how to machine-learn the heuristic rather than hard code it. Recurrent neural networks (RNNs) do this implicitly, but we don't yet know how to effectively train RNNs on ultra-deep sequences.

Linear RNNs and RWKV are examples of RNNs on deep sequences:

https://arxiv.org/abs/2303.06349

https://arxiv.org/abs/2305.13048


The work out of that group, starting with S4 layers, is 10000% the stuff to be paying attention to.

https://srush.github.io/annotated-s4/

HiPPO was brilliant - instead of working with the raw sequence, you work with its weighted Laplace transform, and instead of actually computing the Laplace transform you find the rule to update it when new data is added. Furthermore, we can 'band limit' the Laplace transform (similar to PCA) to keep only the 'most important' components while still preserving most of the information in the sequence - a common and quite effective compression technique.
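
The flavour of that update rule, very hand-wavily - the summary has a fixed size and gets updated online as the stream comes in. Made-up diagonal A and B here, not the actual HiPPO matrices:

    import numpy as np

    d = 64                                     # size of the fixed compressed summary
    A = np.diag(np.linspace(0.90, 0.999, d))   # placeholder "memory" dynamics (stable)
    B = np.ones(d) / d                         # placeholder input projection

    state = np.zeros(d)                        # summary of everything seen so far
    for x_t in np.sin(0.01 * np.arange(10_000)):   # any input stream
        state = A @ state + B * x_t            # fold in the new sample; never revisit old data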

Any 'fast' transformer is going to be working with some kind of sampling or aggregation or compression of the long sequence. Sampling is ultimately going to be too noisy, and standard aggregations are going to be too coarse. So the thing to bet on is better compression techniques, which is what the S4/RWKV group are ultimately working on.


Can you point to anything public on your last point about compression? What is being compressed?


The sequence of model activations is being compressed. S4 treats each activation channel as an independent sequence, applies a learned version of the Laplace transform, and drops the less-significant components.

This is similar to the basic compression you get with PCA or Fourier transforms. These transforms are fully invertible until you drop the less-significant components. Dropping those components lets you reconstruct a somewhat degraded version of the input, and the transform makes it easy to pick the right components to drop.
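
The Fourier version of that "transform, drop the small components, invert" idea, as a quick illustration (not S4's actual code):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.cumsum(rng.normal(size=1024))       # a smooth-ish signal (random walk)

    X = np.fft.rfft(x)                         # invertible transform of the whole sequence
    small = np.argsort(np.abs(X))[:-64]        # all but the 64 largest components
    X[small] = 0                               # drop the less-significant ones
    x_hat = np.fft.irfft(X, n=len(x))          # degraded but recognizable reconstruction

    rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
    print(f"kept 64/{len(X)} components, relative error {rel_err:.3f}")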


I think the jury is still out on whether these will actually scale to ultra-long language-understanding sequences. RWKV, for example, is still trained like GPT, but is architected so it can be run as an RNN at inference time. This is awesome, but it is unclear whether the training regime will limit the effective use of long-range recurrent context.


Training as GPT vs. RNN will give you numerically identical results with RWKV; it's just two ways of computing the same thing. It's trained in GPT mode because it's cheaper to train that way -- you can parallelize over the sequence length. In practice it isn't going to be any different from training with back-propagation through time for the same sequence length.
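
A toy version of the "two ways of computing the same thing" point, for a plain exponentially-decayed sum (not actual RWKV):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=200)
    w = 0.9                                    # per-step decay

    # "RNN mode": one running state, strictly left to right.
    h, rnn_out = 0.0, []
    for x_t in x:
        h = w * h + x_t
        rnn_out.append(h)

    # "GPT mode": every position written directly as a weighted sum over the past,
    # so all positions can be computed independently (i.e. in parallel) at train time.
    t = np.arange(len(x))
    par_out = [np.sum(w ** (i - t[: i + 1]) * x[: i + 1]) for i in t]

    assert np.allclose(rnn_out, par_out)       # same numbers, two schedules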


> RWKV

The current versions of RWKV slowly go insane when exposed to sequences that are too long, because the state slowly diverges over time as you increase past the context length of the training session. They are experimenting with ways to avoid this though: https://github.com/Blealtan/RWKV-LM-LoRA/tree/dev-infctx


Can you share more details about the divergence?


This comment makes so much sense relative to what I've seen with Claude's 1M context window. It reliably fails at a task when I just stuff a big blob of data into the middle of the prompt as context. But when I use embeddings to select only a small relevant subset of that data, it always passes the task.


Yes, Claude 1M is using all sorts of approximation tricks to get that 1M context window. IMO this is actually quite deceptive marketing.


Claude's context is 100K not 1M [1]. If you're somehow shoving in a million tokens that could explain the issue you're having!

[1] https://www.anthropic.com/index/100k-context-windows


Misremembered; the main thrust of the comment still stands. The 100K context window isn't "real": it would be absurdly expensive to do it for real. They are using a lot of approximation tricks to get there.


Yep, mistype on my end as well. Claude just fails to process the request if you get above 100k tokens (I've done that, heh).


Yes, that's the point now for competing for-profit AI research companies: whatever metric is technical and sounds important is going to be used in marketing and valuation determinations. It will be explored in research, I'm sure, and then they'll determine its product viability. It's nice competition, but I agree that it can be deceptive.


Having studied Sutton for a long time now, what I take away from the bitter lesson is that the only pathway to generally capable agents is to have the same scale of computational capacity in an embodied system as humans or other intelligent systems have.

Sutton's point is that it's effectively a product of physics, and we keep trying to outsmart physics - and you just can't outsmart physics.

So, while the method probably is important in terms of efficiency or functionality within the current state of technological systems, the method is less important than the scale, and we’re not even close to the scale necessary yet.


I don't think it's obvious that we don't have sufficient computational scale already...

The human brain has ~86 billion neurons, but they only fire at like 2Hz, so you get on the order of 170 billion firings per second. GPT-3 has 175 billion parameters, and can apply those parameters much faster than the brain can fire neurons.

Lots of folks like to point out that there's more complexity in neurons than model weights, which is fine, but it's not clear what the impact of that additional complexity actually is. Extra slow channels (eg, hormonal signals) also aren't an impossibility.

So /maybe/ the models need to scale more, or maybe we need to figure out some better combination of training tasks to get the next big leap. There's massive progress being made with multi-modal inputs (which helps the model create a more coherent world-model, by relating text to images, audio, or video). Data selection - picking good data instead of just throwing in everything - is also showing lots of promise.

I tend to think there's some need for more 'interactive' or 'social' component to training, eg active/online learning with robotics - and figuring out how to get models to 'play'. Unstructured play is an essential mechanism in smart animals associated with hormonal signals and rewards - it's important, and we don't really know how to harness it yet.

But overall, I don't think we're yet at a local maximum. There's too much cool stuff going on and the iron is still quite hot.


“Embodied” being one of the key things that you’re ignoring

The brain =/= a human agent

You need sensors and effectors and a highly variable motor system

You can’t be “generally intelligent” if you do not have boundaries on your computing system which are mobile and have independent actions.

In order to perform as well as, if not better than, a human, you need to perform as well as, if not better than, a human in all possible environments, and those include the top of the world, the bottom of the ocean, every factory line, flying airplanes, etc…


How could you do anything intelligent without a strong beak?

https://falseknees.tumblr.com/post/654380023602708480

The interactivity I mentioned is the bit that I think is actually important from embodiment - the ability to take an action in the world, see the result, and adjust expectations. What you've called 'independent actions.'

But there's certainly no proof that a general intelligence needs to be bounded and mobile - a pedantic thought-experiment-counterexample would be an 'uploaded' human mind: the people of San Junipero don't stop being generally intelligent once they are in a distributed simulation...

More generally, we don't actually know the boundaries on how general intelligence could arise and what shape it could take, because we don't really understand intelligence at all.


The only thing we do know about “intelligence” is that more compute = better performance on the tasks we use to evaluate increasing generalization.

So map out the computing power and hardware requirements of an average adult human with IQ 100 and no disabilities and that should tell you what you need.

It’s probably 3x harder than just a brain.


I don't think that's even true. Where are you getting this? I can create an extremely compute heavy model that performs very poorly. Just adding more compute does not mean better performance on its own.

Given the same model, too much compute can even mean overfitting and worse generalization. It doesn't all come down to compute.


It doesn't sound to me like it's quite "tokens further in the past get exponentially less attention." What they say is "attention allocation decreases exponentially as the distance between tokens grows." Instead of being quadratic because every pair of tokens gets the same attention, the tokens farther apart from each other get exponentially less. It doesn't matter how far they are from the final token.
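
In other words, the (unnormalized) weight on a pair looks like alpha**|i - j|, rather than something that depends only on how far token j is from the end:

    import numpy as np

    n, alpha = 8, 0.5
    i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")

    pairwise_decay = alpha ** np.abs(i - j)            # what the paper describes: per pair
    recency_decay  = alpha ** (n - 1 - np.arange(n))   # "older tokens matter less": per key only

    print(np.round(pairwise_decay, 3))
    print(np.round(recency_decay, 3))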

This seems to me more like a general computational approach than a hand-coded heuristic. David Shapiro claims it's similar to how the brain works, and has a neat analogy for it here: https://www.youtube.com/watch?v=R0wBMDoFkP0


This is intriguing but I don't quite follow - really naive, but:

isn't the final token at some position N?

And given context size limit Y, when we generate the next token, right now I get attention from N - Y to N?

And this supposes I get attention from 0 to N, but the attention decreases exponentially as we approach token 0?


One clever trick I've seen is where, before the text goes out of the context window, it gets summarized by the LLM, and then that smaller summary is put into the context window and continuously updated. It also reminds me of how human memory works.
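
Roughly like this sketch, where `summarize` is a stand-in for whatever model call you'd actually use:

    def summarize(text, max_chars=500):
        # Stand-in: in practice this would be an LLM call that compresses `text`.
        return text[:max_chars]

    def add_message(history, new_message, window=20):
        history = history + [new_message]
        if len(history) > window:
            # Fold the messages about to fall out of the window into a rolling
            # summary, and keep only that summary plus the most recent messages.
            overflow = history[: len(history) - window]
            rolling = summarize("\n".join(overflow))
            history = ["[summary of earlier conversation] " + rolling] + history[-window:]
        return history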


It seems like building a context tree with a convex branch cross-attention estimator, then using branch and bound to prune the tree while descending (computing exact cross-attention when the estimate is above a threshold), would work pretty well - assuming the cross-attention matrix actually is very sparse and the trouble is just accurately guessing the non-sparse elements.


this sounds to me like a dollar cost averaging strategy - only buy in when the current price falls below an n-day moving average.

I doubt there is any risk-adjusted alpha to the strategy - in practice it's my (newbie) understanding that the only thing that differentiates such strategies in the broader scheme of things is tax efficiency.

however I am also not an ML expert


What are you talking about? Wrong thread?


i am suggesting the two strategies might have similar trade offs/benefits though I am not familiar enough with attention mechanisms to say for sure.

it's a comparison/analogy?


> Recurrent neural networks (RNNs) do this implicitly, but we don't yet know how to effectively train RNNs on ultra-deep sequences.

What would you call "ultra-deep"? [1] shows how to train an RNN in a GPT-like mode, using parallel scan, with great performance on Path-X, which has a sequence length of 16k. It's based on prior papers doing the same thing but from a state space model perspective.

[1] https://arxiv.org/abs/2303.06349
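
For reference, the parallel-scan trick hinges on the fact that steps of a linear recurrence h_t = a_t*h_{t-1} + b_t compose associatively, so they can be combined as a balanced tree instead of strictly left to right (toy sketch, not the paper's code):

    import numpy as np

    def combine(f, g):
        # Composing h -> a1*h + b1 then h -> a2*h + b2 gives h -> (a1*a2)*h + (a2*b1 + b2).
        (a1, b1), (a2, b2) = f, g
        return a1 * a2, a2 * b1 + b2

    def fold(steps):
        acc = (1.0, 0.0)                       # identity map h -> h
        for s in steps:
            acc = combine(acc, s)
        return acc

    rng = np.random.default_rng(0)
    a = rng.uniform(0.5, 1.0, size=1024)
    b = rng.normal(size=1024)
    steps = list(zip(a, b))

    # Sequential RNN evaluation of the final state (h0 = 0).
    h = 0.0
    for a_t, b_t in steps:
        h = a_t * h + b_t

    # Divide and conquer: fold each half independently, then combine once;
    # a real parallel scan does this recursively, for O(log n) depth.
    a_all, b_all = combine(fold(steps[:512]), fold(steps[512:]))
    assert np.isclose(b_all, h)                # composed map applied to h0 = 0 gives the same final state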


> When you abandon O(N^2) attention, you are forced to start adding heuristics to choose what to correlate. Any time you see one of those giant context window LLMs, you need to be asking what heuristics they added, what is getting correlated, and what is not getting correlated.

Well, having a small context window and everything correlated with everything else is equivalent to having a large context window, but a particularly dumb heuristic.
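
In mask terms: the small context window is just the big all-pairs matrix with everything outside a band forced to zero, which is indeed a pretty dumb fixed heuristic:

    import numpy as np

    n, window = 12, 4
    i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")

    # Full attention would keep all n*n pairs; a "small context window" keeps only
    # the pairs within `window` positions of each other.
    window_mask = np.abs(i - j) < window
    print(window_mask.astype(int))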


Good point


Recurrent networks are exponential too: their signals blow up and decay exponentially over time. So this is not necessarily worse than RNNs.
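
e.g. the contribution of an input 100 steps back gets multiplied by the recurrent weight 100 times:

    for w in (0.9, 1.0, 1.1):
        print(w, w ** 100)    # 0.9 -> ~2.7e-05 (vanishes), 1.1 -> ~13781 (explodes)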


While I agree with the beginning of your post, you lost me here:

> The bitter lesson [1] is going to eventually come for all of these. Eventually we'll figure out how to machine-learn the heuristic rather than hard code it.

Inefficiently re-learning over and over patterns that can be more explicitly encoded as smart inductive biases for better sample efficiency is what ML research is.

The "bitter lesson" doesn't mean "throw the towel and your brain and just buy more GPUs". It means that the inductive biases / modeling strategies that win will always be the ones that are more hardware-friendly.


I agree with you that learning certain things is wasteful.

For instance, one could imagine an RNN that learned to do some approximation of tree search for playing Chess and Go. But we have very good reason to think that tree search is basically exactly what you want, so even systems like AlphaGo have the tree search implemented outside the neural net, while still using a learned system to heuristically guide the search.

The reference to the bitter lesson here is that feature engineering has, thus far, typically lost out to more general end-to-end methods in the long run.

This paper tries to do feature engineering by hand-coding an exponentially decaying mechanism, where tokens further in the past are assumed to be less important.

My comment is that this type of hand-engineering will lose out to methods that are more end-to-end learned. These methods do not necessarily need to be hugely computationally intensive ("buy more GPUs").

That said, I could see it being the case that in the short term, we do just buy more GPUs, learn a general end-to-end algorithm, but eventually figure out how to re-implement that end-to-end learned algorithm in code significantly more efficiently.


By and large, we don't really know what inductive biases we ought to be shoving in to models. Sometimes we think we do, but we're wrong more often than not. So methods with the least inductive biases work better.


> Any time you see one of those giant context window LLMs, you need to be asking what heuristics they added, what is getting correlated, and what is not getting correlated.

Exactly. The paper doesn't even contain any experiments with context windows over 32K tokens. Presumably because it doesn't really attend to the rest of those tokens at all. In practice it's just a 32K attention window with some "theoretical" opportunity for attending a bit further than that.


These models seem to be able to cope with absolutely massive training sets, whereas the prompt input has to be quite small in comparison.

I wonder if we could leverage this state of affairs by shifting the prompt from input to training data. Like take a generic model and run a little bit of fine-tuning on the prompt.


Do you get sparsity issues though? Say with a million tokens, what does attention between token 123887 “the” and token 4 “ing” even mean? You probably want less density of connections? Genuine question.


Not disagreeing with your comment in general, but this particular sentence annoys me a bit:

> where tokens are roughly equally important throughout the text, such as a dense academic paper or a reference manual.

Even in these, not all tokens are equal: most of a text is actually pretty low-information, with key clusters of tokens that contain most of the information you're going to need throughout the entire text (that's why we use highlighters when learning). And that's why O(n²) attention is pretty wasteful. At the same time, you need to be able to pick the proper tokens, and I agree with you that picking them through a simple heuristic is probably not going to be enough.


Better phrasing would have been "the important tokens are roughly evenly distributed throughout the text", that was the intended reading.


Are you thinking more like a research paper or more like a textbook?

For a textbook at least, it often seems to be the case that you need to have fully ingested the big picture ideas of one chapter to move on to some later ones, but this seems to me at least more like updating your model, rather than sampling context from the whole book (I mean it is an analogy of course, so neither matches perfectly).


I need to carefully read the article, but sparse attention is an interesting technique that has been used previously (as in BigBird) but has often proved to perform (way) worse than full attention. The sliding component that performs full attention is indeed useful (much like the Blockwise Parallel Transformer), but the sparse patterns are elements that don't intuitively resonate with me.

The model might select random words in the context. There's definitely a case where this could be unfortunate if it ends up selecting irrelevant words.

The graph on the first page, in my opinion, seems like a needless flex


> The graph on the first page, in my opinion, seems like a needless flex

Indeed - they used half of the cover page of their paper to show a chart which illustrates... nothing...


Well, this looks promising. The key idea is to collect a different set of tokens, with different levels of sparsity for each head, apply regular (dense) self-attention over all heads, weighted by pairwise distance, and spread and add the output residuals to their corresponding location in the original sequence. It seems to work really well, judging by the perplexity scores shown in the paper -- though we don't yet know if those perplexity scores will translate into good performance on real-world tasks.
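
If I'm reading it right, the skeleton is roughly "gather a strided subset, run ordinary dense attention on it, scatter the outputs back", with different sparsity per head (very rough sketch; the paper's segmenting and distance weighting are glossed over here):

    import numpy as np

    def dense_attention(q, k, v):
        s = q @ k.T / np.sqrt(q.shape[-1])
        w = np.exp(s - s.max(axis=-1, keepdims=True))
        return (w / w.sum(axis=-1, keepdims=True)) @ v

    def sparse_head(x, stride, offset=0):
        # Gather a strided subset of positions, attend densely within that subset,
        # then scatter the outputs back to where the gathered tokens came from.
        idx = np.arange(offset, len(x), stride)
        out = np.zeros_like(x)
        out[idx] = dense_attention(x[idx], x[idx], x[idx])
        return out

    rng = np.random.default_rng(0)
    x = rng.normal(size=(4096, 64))            # (seq_len, d_model); no projections in this toy
    # Different "heads" use different sparsity; summing their scattered outputs gives
    # every position some coverage at several scales, at far less than O(N^2) cost.
    y = sum(sparse_head(x, stride) for stride in (2, 8, 32, 128))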

I'm going to take a closer look.


They use perplexity on github data to demonstrate the effectiveness of their model.

I suspect GitHub data has a lot of copy-pasted code, i.e. a good chunk of what you are asking the model to do is to go back X million tokens and copy a chunk verbatim.

Sure, the model might also be looking back at some code X million tokens ago and using that to improve its guess of the next token (oh look, the API definition of the API I am using is back here, that'll help me get this right!).

But the perplexity number alone doesn't differentiate those cases - and considering how much code copying/templating happens in software, I suspect that affects the perplexity a lot more than smartly using stuff from the context window.

I wonder if these models work well on other kinds of data?


A title with a 10-digit number, a meaningless first-page figure, and no experiments related to the main claim. Did a rogue author post it without permission again?


As noted, they only did experiments up to 32k length, which is silly considering the title


Silly is a charitable interpretation.


Yes, it’s bunk. https://twitter.com/theshawwn/status/1676822953210662913?s=4... (A researcher at Brain confirmed it’s not worth reading: https://twitter.com/giffmana/status/1676864336764055552?s=46...)

I hate being dismissive, but I’ve been dragged to the conclusion that headlines matter in research, and people chase headlines.

Three orders of magnitude jump, instantly, isn’t plausible. It almost never happens.


Important note: They only did experiments up to 32k length


Without any experiment showing that language modeling performance actually continues to improve past 32k tokens using this scheme, how are we supposed to tell whether this is actually viable?


What does the "number of tokens" characteristic of an LLM mean exactly? How does 1B compare with GPT-3.5 or GPT-4?


A sequence of characters is encoded into tokens; tokens are groups of characters, and each token is mapped to a vector representation. When you give text to an LLM, the text is encoded into tokens, and each token corresponds to an index. Each index corresponds to one vector. The model produces vectors, then finds the most similar vector and selects the corresponding index as the next token.

This is a spectrum: you can write a model that works at the bit level (2 vectors), the byte level (256), pairs of bytes (2^16), and so on and so forth.

These days, we use statistical approaches to build the tokens, and a token can be 1, 2 or 3 or N characters long.
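
For example, the same string at a few points along that spectrum:

    text = "Scaling transformers"

    byte_tokens = list(text.encode("utf-8"))               # byte level: vocabulary of 256 ids
    bit_tokens = [int(b) for byte in byte_tokens
                  for b in format(byte, "08b")]             # bit level: vocabulary of 2 ids

    print(len(text), len(byte_tokens), len(bit_tokens))     # 20 chars -> 20 bytes -> 160 bits
    # A learned (BPE-style) tokenizer sits further along the spectrum: a vocabulary of
    # tens of thousands of entries, so this string becomes only a handful of tokens.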

So when you give a sequence of characters to the model, it turns them into a sequence of tokens and loads a vector for each one, and when doing computations, it needs to consider all tokens together. This is called the context window.

In this case, scaling the number of tokens means scaling the context window to a large number.

GPT3.5 can do 2Ki tokens iirc, OpenAI’s GPT4 can do 4Ki iirc, Claude from anthropic can do 1Mi iirc.

The context window is kinda analogous to your working memory, the higher the better, unless there are approximations that trade off quality for length, which is what is happening here.


Original GPT3.5 can do 4k tokens and there is a recent version with 16k tokens (gpt-3.5-turbo-16k)


Ahhh thanks for the correction! And iirc GPT-4 has a 32k version too.


Context length / window. Think of it as the "number of words" that the model can effectively process. 1 token is roughly equal to 4 characters or 0.75 words for English text. The number of tokens is the total number that can fit into a context window, which again is the space for "input" (i.e. prompts) and output (responses/completions) that the model can handle.


It means the maximum length of your query. The longer the context window, the more complex the questions you can pose. For example, being able to paste the text of a whole book and ask for a summary.



How stupendous to not put the first figure on a log scale...


Cute trick but the opposite of helpful. Is the goal of your paper to brag or educate?


this paper leans towards the former


It is almost an XKCD style piss-take.


Is assuming the sequence length is directly correlated to the context window a meaningful thought?

Does this imply similar increases in context in practice?


Does anyone know if 1B tokens is enough to solve sudoku puzzles?



