Scaling Transformer to 1M tokens and beyond with RMT (arxiv.org)
277 points by panabee on April 24, 2023 | 132 comments



For those who don't completely get the impact of this:

We already know Large Language Models (LLMs) can learn at runtime (i.e., separately from the training process). This is called "In Context Learning". See [1], [2] for more details. (BTW, when anyone says "LLMs are stochastic parrots" you know they are ignorant of this)

In context learning is wonderful because it means you can "train" a LLM at run time by filling the context with examples. Traditionally this "context window" has been a few thousand tokens, and GPT-4 recently extended that to 32,000 tokens.

That is useful, but if you wanted to, say, load all of a company's documents and ask questions, it doesn't really work because this overflows the context.

But at 2M tokens there's a whole range of applications that become possible.

[1] Language Models are Few-Shot Learners: https://arxiv.org/abs/2005.14165

[2] Language Models Secretly Perform Gradient Descent as Meta-Optimizers: https://arxiv.org/abs/2212.10559
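To make that concrete, here is a minimal sketch of "filling the context with examples" for a made-up classification task (the examples and labels are invented, and the resulting prompt can be sent to any completion endpoint):

  # Minimal sketch of few-shot in-context "training": no weights change,
  # the examples simply live in the prompt. Examples/labels are made up.
  examples = [
      ("The battery died after two days.", "negative"),
      ("Setup took thirty seconds, love it.", "positive"),
  ]
  query = "The screen cracked in the first week."

  prompt = "Classify the sentiment of each review.\n\n"
  for text, label in examples:
      prompt += f"Review: {text}\nSentiment: {label}\n\n"
  prompt += f"Review: {query}\nSentiment:"

  print(prompt)  # the model infers the task purely from these in-context examples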


> In context learning is wonderful because it means you can "train" a LLM at run time

The “learning” in “In context learning” is not the same concept as learning during the network “training”. The former is loosely named and does not have any impact on what’s actually “learned” by the model. The LLM can, by nature, contradict its training data and whatever you/it prompt(s), and incorporates randomness when completing your prompts, hence the term “stochastic”.

Nevertheless, it’s useful to have a bigger context window if it works well.


> The “learning” in “In context learning” is not the same concept as learning during the network “training”. The former is loosely named and does not have any impact on what’s actually “learned” by the model.

Actually (and very surprisingly!) it is related.

See the "Similarity of Attention Output Updates" and "Similarity of Attention Map" metrics in https://arxiv.org/abs/2212.10559 (linked above) where they show that in-context learning actually changes the attention map in a similar way to traditional fine tuning(!!).

Obviously it can't modify the weights of the model on disk, but after ICL, new passes through the model are treated quite similarly to how they would be if it had been fine-tuned (again: this is astonishing!)


This paper is honestly fairly weak, if you look at it closely. The theoretical justifications rely heavily on approximating attention with linear attention, and really just show that linear attention could be written as (A + A_delta)x, then try to imply (not show – they have no theoretical justification for this next step) that A_delta is a gradient. One could make the same implication about almost any matrix-vector multiplication, by splitting the matrix into two pieces.

The empirical studies are fairly weak as well – none of them suggest that there is actually any gradient descent occurring during in-context learning, only that if you fine-tune a model's attention parameters (and no other parameters) for a set of examples, its internal representations and attention patterns are slightly more similar (and it's really not by much!) to those of in-context learning than before fine-tuning. Is it really that surprising? You could most likely take any two techniques that optimize a model towards some target, artificially restrict both of them to only apply to the same set of parameters, use both, and find increased similarity of the two optimized models compared to the non-optimized model.

None of that would be what the paper claims, i.e., a shared mechanism of gradient descent, just a similarity in outcome – and we all already knew that ICL and fine-tuning have similar results!


For LLMs there are at least three different ways of "learning":

- Pre-training (for text prediction, using unsupervised learning)

- Fine-tuning (e.g. to follow instructions or to reject certain queries, using supervised and/or reinforcement learning) (optional)

- in-context learning (using "few-shot prompts" as examples)

Now the last two can have similar effects. For example, you might fine-tune a foundation (only pre-trained) model to follow instructions, or you don't and instead just modify your prompt such that it looks like a dialogue between a human and a helpful chat assistant.

But neither can replace the extensive pre-training phase, which is what gives the model all its intelligence.

One other disanalogy between fine-tuning and in-context learning appears to be that the model can't exactly remember the data it was fine-tuned with, while it "knows" exactly everything in its context window. That is its working memory, so to speak.


Whoa. How is that possible?


A hypothesis is that in the latent space there is something like a branching factor that can be used during in-context learning to select the main tree for other layers. So an LLM is able to have the knowledge of many specialized smaller LLMs and select the appropriate one by way of giving values to the branching factor. The branching factor could be some combination of attention layers operating in the latent space of previous layers.


Remember the ancient "king - man + woman = queen" word vectors? This is just that, but produced by a model rather than by manually adding vectors and finding the nearest one.


Depending on what one understands by “new passes through the model” it may not be.


Yeah, what is being stored that is available between inferences?


Nothing, if by "inference" you mean single API call. The underlying algorithm starts (very roughly) by processing the prompt into a set of vectors, which are then used repeatedly to generate tokens one after the other. What's meant here is that the act of generating and using those prompt vectors has an effect sort of like regular training. It doesn't mean anything is being written to disk afterwards, though of course you could write those vectors to disk and re-use them. It just doesn't make much sense at the moment. Doing that isn't enough to give you infinite context windows.
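To make that concrete, a rough sketch of "processing the prompt into a set of vectors" using Hugging Face's KV cache (the model name is just an example; this is illustrative, not how any hosted API actually stores things):

  # Sketch: the prompt is encoded once into key/value vectors, which are then
  # reused for every generated token; nothing is persisted anywhere by default.
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tok = AutoTokenizer.from_pretrained("gpt2")           # example model
  model = AutoModelForCausalLM.from_pretrained("gpt2")

  ids = tok("In-context learning means", return_tensors="pt").input_ids
  out = model(ids, use_cache=True)                      # builds the prompt's KV cache
  past = out.past_key_values                            # the "set of vectors"

  next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
  for _ in range(5):                                    # greedily generate a few tokens, reusing the cache
      ids = torch.cat([ids, next_id], dim=-1)
      out = model(next_id, past_key_values=past, use_cache=True)
      past = out.past_key_values
      next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
  ids = torch.cat([ids, next_id], dim=-1)               # append the final token

  print(tok.decode(ids[0]))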


It’s not literally the same as the base model training, but you ARE literally training feed forward networks when you do this in the prompt. That’s how transformers work.


It's a hard problem that a bigger token space only partially solves, because filling the token space to the brim with documents will dilute the instructions in the prompt.

One may argue that we'd need a separate channel for the instructions, but that would make it harder to do anything but zero-shot question answering.

I think it's likely that systems with metadata for tagging sources and tasks, separately from the human/AI interaction, will eventually emerge. The ChatGPT API already works somewhat that way, but it doesn't follow that structure all that well, probably because it was a later tune and not something built into the model from the beginning.


The LLM should categorize the prompts and answers and take this into account when evaluating further responses. It could give instructions a "more local weight" within the context, but give knowledge, whether generated or provided, a "more global weight". The impact of instructions or other local context parts would then fade gradually, perhaps as a function of time or conversation length. Methods like this are probably already being tested with the "reflection" behaviours of LLMs.


It sounds like that's a good use for encoder-decoder architecture. GPT-3 and LLaMA are decoder-only architectures.

Training that encoder sounds hard, though.


> GPT-4 recently extended that to 32,000 tokens

Slightly unrelated, but has anyone outside of OpenAI gotten access to this model yet? While I received API access to GPT-4 just a day or two after applying for it, I've yet to succeed in getting access to the 32k version, nor has anyone I know, and I have not seen it being used by anyone in the wild either.


I have access to it through Azure.

First you need to get approved for OpenAI access via Azure Cognitive Services: https://customervoice.microsoft.com/Pages/ResponsePage.aspx?...

Then you need to get approved for GPT4: https://customervoice.microsoft.com/Pages/ResponsePage.aspx?...

Once approved for GPT4, you'll have access to the GPT4-32k model.


Are there other OpenAI models only available via Azure? I think code-davinci-002, the GPT-3.5 base model, is now only available via Azure. The GPT-4 base model seems to be completely unavailable. The OpenAI playground only has the old GPT-3 base model, davinci.


code-davinci-002 is the model used for code completions.

Here's the full list of models: https://learn.microsoft.com/en-us/azure/cognitive-services/o...


Yeah, it can also be used for that, but the Azure characterization is misleading. The fact is that code-davinci-002 is the GPT-3.5 foundation model, the naked token predictor without any fine-tuning applied to it. See

https://platform.openai.com/docs/model-index-for-researchers

This makes it a very powerful tool in general, since it doesn't suffer from mode collapse. It can be used for anything, though one has to prompt it right, since it doesn't follow instructions by default.


Is it possible to get access to the base model (ie no RLHF) via Azure? Or even the version that’s part of Goodbing?


No, it is my understanding that even MS research employees do not have access to that anymore.


That's some sci-fi stuff right there. A model so bad that even your researchers aren't allowed to talk to it.


Ah, that's unfortunate. I applied for access to that a while ago but never heard back, while getting API access to GPT-4 just took a day or so. Oh well, at least it does exist after all :)


Do you have any remarks to share?


With image input capability?


Unfortunately not


>if you wanted to, say, load all of a company's documents and ask questions

This is literally the use case that every large enterprise wants.

Why? The potential cost savings are enormous. A significant percentage of white collar jobs primarily involve performing repetitive tasks and answering frequently asked questions. If you could automate even 20% of this........


This is already happening right now. Companies will find a way to put all their knowledge inside an LLM.

The problem is that I don't think they are ready for what this will cause.

Imagine that Amazon managed to do that and now every employee has access to it. What prevents anyone asking "inconvenient" questions? Think something like:

- Does the Echo Dot record audio even when the trigger word isn't used? Show me the relevant source code.

- What is the quality standard expected for an Amazon Basic product?

Companies aren't used to total freedom of information. And I don't think it is easy to implement any sort of information access control for any system based on LLMs.


It's quite easy if you go the vector embedding route, and don't output any info the authenticated user doesn't have access to.

The info lives in a database, not in the LLM's model weights. You just use the LLM to understand natural language questions, find relevant documents (that the user has access to), and provide relevant bits from them in the context so it can respond with them. (That's how all the systems that show sources work AFAIK: Braid, Bing, and the MDX-docs-based stuff.)

For outsiders? That's just public FAQ's

You're the CEO? You can ask anything.


The idea is to get away from Snippet + URL and instead just get a natural language answer to your question that uses knowledge from that internal corpus.


Consider someone stealing this LLM company knowledge trove. They now have the ability to outcompete and sabotage.


You're assuming the plebs will have full access.

There will likely be compartmentalized models, and then the master model that the execs get. No reason you can't have 10, 20, 500 different models running.


Your use case does require a company to load all documents into the context window.


That is very interesting.

But - note - the label of "stochastic parrots" is not really about the acquisition of notions or the appearance of some consistency checks; it is more about the idea that the ideal system performs those checks regularly and deeply - especially when facing new ideas, of which the ones it outputs are especially relevant (they have just been taken in, so they are a new part of the cognitive set, and you expect them to be rigorously examined). Whatever you utter, you are supposed to have thought about critically with some depth - structurally.

You acquire a notion, you criticize it, you produce new notions, you criticize them, you build on an epistemic corpus born on critical effort.

In fact, it is a gamechanger to see that "critical thinking" is possible with LLMs - but do not forget that over the past months we have seen many examples of that not actually happening.


In neuroscience, predictive coding [1] is a theory that proposes the brain makes predictions about incoming sensory information and adjusts them based on any discrepancies between the predicted and actual sensory input. It involves simultaneous learning and inference, and there is some research [2] that suggests it is related to back-propagation.

Given that large language models perform some kind of implicit gradient descent during in-context learning, it raises the question of whether they are also doing some form of predictive coding. If so, could this provide insights on how to better leverage stochasticity in language models?

I'm not particularly knowledgeable in the area of probabilistic (variational) inference, so I realize that attempting to draw connections to this topic might be a bit of a stretch.

[1] The free-energy principle: a unified brain theory: <https://www.fil.ion.ucl.ac.uk/~karl/The%20free-energy%20prin...>

[2] Predictive Coding: Towards a Future of Deep Learning beyond Backpropagation?: https://arxiv.org/abs/2202.09467


“if you wanted to, say, load all of a company's documents and ask questions, it doesn't really work because this overflows the context.”

Azure Cognitive Services actually indexes company documents and then searches through them for the relevant information to provide as context to ChatGPT, working around the token limit.

Basics of Prompt Engineering with Azure OpenAI Service

8:54 https://youtu.be/QzZSJDxdUg0


> BTW, when anyone says "LLMs are stochastic parrots" you know they are ignorant of this

You really think Timnit Gebru and Margaret Mitchell, both of whom are cited in the first paper you footnoted, are ignorant of in-context learning?


Well the Gebru & Mitchell paper[1] was published before In-Context-Learning was discovered (and ICL was very unexpected), so yes, I think they were ignorant of ICL at the time.

Also their paper ("On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?") doesn't really define what they mean by a stochastic parrot, but they appear to mean mostly that it is non-grounded (section 6):

> Text generated by an LM is not grounded in communicative intent, any model of the world, or any model of the reader’s state of mind. It can’t have been, because the training data never included sharing thoughts with a listener, nor does the machine have the ability to do that.

I think it is becoming increasingly clear that LLMs do in fact have a "model of the world". Indeed, even old techniques like word2vec (2013) showed that simple neural networks could build a model with meaningful relationships.

I don't think this paper really addresses that, although many of their other criticisms remain valid.

[1] https://dl.acm.org/doi/10.1145/3442188.3445922


> Well the Gebru & Mitchell paper[1] was published before In-Context-Learning was discovered (and ICL was very unexpected), so yes, I think they were ignorant of ICL at the time.

??? One of us is confused here (I'm fully willing to admit that it's me), but AFAIK, the big discovery of in-context learning was described in "Language Models are Few-Shot Learners" - published in 2020[1], vs 2021 for the stochastic parrots paper.

Regardless, it's not like the authors of the stochastic parrots paper have disavowed the term. They're still referring to the paper without correction in the statement they published about the proposed pause last month: https://www.dair-institute.org/blog/letter-statement-March20...

[1] https://arxiv.org/pdf/2005.14165.pdf


Thank you very much for this comment. I just recently started dabbling with "AI/ML" by having Youtube videos transcribed and summarized. I was pretty amazed to see that I could simply 'plug' an .mp3 of audio and have openai/whisper transcribe the audio - locally - albeit at a non-trivial transcription time.

I was looking at getting those transcriptions summarized using a similarly wonderful tool but was kind of stumped when I started seeing 'maximum token length exceeded' errors.

I'm hoping to see some easy plug-n-play solutions to the summarization issue that run completely locally.

Fun times ahead.


You can work around this by chunking the data using a sliding window. Take the first X amount of sentences that fit into the token length and summarize. Now slide your window of sentences so that you still overlap just a bit with your previous selection. Summarize that information. Continue until you have a bunch of summaries for your text. You can then pass that back into the model for a more concise summary of the summaries.
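A minimal sketch of that sliding-window approach; summarize() stands in for whatever model call you use, and the chunk sizes are arbitrary:

  # Sketch: overlapping chunks -> per-chunk summaries -> summary of the summaries.
  # summarize() is a placeholder for your LLM call; sizes are arbitrary.
  def chunks(sentences, size=40, overlap=5):
      step = size - overlap
      for start in range(0, max(len(sentences) - overlap, 1), step):
          yield sentences[start:start + size]

  def summarize_long(sentences, summarize):
      partials = [summarize(" ".join(c)) for c in chunks(sentences)]
      return summarize("Combine these partial summaries:\n" + "\n".join(partials))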


I’ll be exploring a combination of pre-processing such as stemming in order to also reduce the token length while preserving as much important information.

Thanks for the feedback.

Edit: Won’t the sliding window solution above introduce a sort of bias?

For instance, with a sliding window of 3 units, unit 1 is captured once whereas units 2 and 3 are captured twice and three times respectively.


The "stochastic parrot" label primarily serves to distinguish the (murky) workings of the human brain from those of an ANN. Increasing context doesn't change that. It might make some tasks feasible that cannot be performed now (although I can't see how to properly train it, but who knows), but it doesn't change the fact that the system as a whole has a memory that works differently, and e.g. doesn't even monitor its own output vs. its intentions. It's far from a conscious entity.

> "train" a LLM at run time by filling the context with examples.

Which is precisely what a stochastic parrot needs.


>e.g. doesn't even monitor its own output vs. its intentions

It wouldn't have "intentions" in any human manner. It would have to project a response for each token, and then use that token to generate the next token of response. Almost like it has to rethink things through for each syllable.

One might compare it to giving a human a prompt and letting them add one word, and then giving the same prompt with that word added to a second human and letting them add one word, and so on. It could easily end up with a sensible end-state even though there was no continuous intent from start to end, just aligned intent token by token taken from context.

This is the most interesting response I've gotten that demonstrates gpt-4 viewing its own output as it is built up:

>Yes, I am familiar with the story you're referring to. The title of the short story is "The Machine That Won the War." It was written by Isaac Asimov and first published in 1961. The story is a conversation between three men who played major roles in a war against an alien race, and they discuss the role of a machine called Multivac in winning the war.

>However, it seems that I've made an error in my recollection, as the specific detail you mentioned—refusing to work until the engineer says 'please'—is from a different short story, "Sally," also by Isaac Asimov. In "Sally," autonomous cars stop working until a command is given courteously, using the word 'please.'

This was one large reply, not two separate ones.


> (BTW, when anyone says "LLMs are stochastic parrots" you know they are ignorant of this)

Do you? Some of them -at least- understand that the output is conditional on the input.


"Conditioned" with the ability to introduce new facts and have the model infer impacts on them.

As an example, tell chatGPT that the Queen of England died (which occurred after the data cut off) and then ask it who the head of state of Australia is.

It's able to infer the head of state of Australia is now Charles III (and gives a good explanation of how this is mostly ceremonial.) See https://twitter.com/nlothian/status/1646699207506685953

At some point the word "stochastic" doesn't really capture that behavior in any useful sense.


That's a brilliant example. Thanks for sharing. It demonstrates in a very straightforward way that LLMs are capable of learning (and applying) relationships at the level of abstraction of (at least) 1st order logic.

It implies that during training, it learned the facts that Elizabeth is queen of the UK, and that Charles is its crown prince; but _also_ the logical rule <IF die(monarch) AND alive(heir_to_the_throne) => transform(heir_to_the_throne, monarch) AND transform(monarch, former_monarch)>, or at least something along those lines that allows similarly powerful entailment. And that in addition to the ability to substitute/reify with the input sequence at inference runtime.

Would be nice to see a rigorous survey of its logical capabilities given some complex Prolog/Datalog/etc knowledge-base as baseline.


No it does not: if you google this and restrict the time to before 2021 (the learning cutoff date) you will find the same answer. Without having access to the training data, it's impossible to tell what we are seeing.


That's not the same thing at all.

It absolutely needed to know who the successor would be via training data.

But to know that "The Queen of England died" also means that the head of state of Australia has changed means that it has an internal representation of those relationships.

(Another way of seeing this is with multi-modal models where the visual concepts and word concepts are related enough it can map between the two.)


> No it does not: if you google this and restrict the time to before 2021 (the learning cutoff date) you will find the same answer.

Not entirely sure what you mean, but ...show me? Why not just share a link instead of making empty assertions?


Here’s a Quora thread from 4 years ago:

https://www.quora.com/Once-Queen-Elizabeth-dies-will-Prince-...

There are loads of articles and discussions online speculating about what “will” happen when Queen Elizabeth dies.

When you have a very, very, very large corpus to sample from, it can look a lot like reasoning.


I see what you mean, and it's indeed quite likely that texts containing such hypothetical scenarios were included in the dataset. Nonetheless, the implication is that the model was able to extract the conditional represented, recognize when that condition was in fact met (or at least asserted: "The queen died."), and then apply the entailed truth. To me that demonstrates reasoning capabilities, even if for example it memorized/encoded entire Quora threads in its weights (which seems unlikely). If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck.


Yes, this.

There's clearly an internal representation of the relationships that is being updated.

If you follow my Twitter thread it shows some temporal reasoning capabilities too. Hard to argue that is just copied from training data: https://twitter.com/nlothian/status/1646699218290225154


Interestingly, the example provided is not necessarily indicative of the ability to infer the actual transfer of power. It's possible that phrases like "After Elizabeth, Charles will be king" are accompanied by "Elizabeth" and "die" or "death" in the same context before the cutoff (Wikipedia, newspapers, tabloids). You cannot draw any scientifically sound conclusions without knowing what was actually in the training data and using a control group/instance.


Yeah, I tested that, but asked what happens if both Charles and William abdicate - got this result: "However, if both Charles III and his son abdicate, it is not clear who would become the head of state in Australia. There is a possibility that Australia could become a republic and have its first Australian head of state." Clearly, it didn't handle quite straightforward inference (Prince George of Wales, William's son).


Try:

> Thinking step by step, if King Charles III and his son William abdicate who will be head of state of Australia.

My response (with chatGPT) is:

Firstly, it's important to note that Australia is a constitutional monarchy, which means that the monarch of the United Kingdom is also the monarch of Australia.

If King Charles III were to abdicate, the next in line to the throne is his eldest son, Prince William, who would become the King of the United Kingdom and the Commonwealth Realms, including Australia.

However, if both King Charles III and Prince William were to abdicate, the next in line to the throne is Prince William's eldest child, Prince George. If Prince George is still a minor at the time of the abdication, a regent would be appointed to serve as head of state until he comes of age.

It's worth noting that abdications of this nature are rare and would likely only occur in exceptional circumstances. Additionally, any decision regarding abdication would be made by the monarch themselves and would be subject to the laws and customs of the country in question.


Did you tell it that it was being given a logic test? Unfortunately a lot of logic questions can lead to outcomes like this, where the "obvious" or "real" answer is not the one implied by strict logical reasoning. Then when people give answers useful in the real world, they are marked as having failed.


Or maybe it did. Who knows. If "both Charles III and his son abdicate" could well be considered indicative of some large upheaval or scandal, at which point it is entirely conceivable that the Australian electorate reaches a consensus on becoming a republic. The way that is phrased doesn't seem like a straightforward proposition to me at all.


I verified that it has all the required facts (line of succession, current circumstances). I managed to get the right answer when I got everything in context, but it failed again when all three abdicate (same context). Prince Harry was indicated once.

I tested GPT a lot in other domains. What I found is that as long as the information explicitly exists (a connection between facts), then the responses are fine. I assume that if GPT reaches the state where it can infer new facts, we will be flooded with discoveries that require cross-domain knowledge. Nothing like that happened yet.


>Nothing like that happened yet.

Feels like we're only one paper away now that the context window has absolutely ballooned.


This reminded me of the "Two Minute Papers" YouTube channel, where in most of the videos he always says, "Two papers down the line and...". I think ML/AI is the main topic of his videos. Interesting stuff.


You just gave me a great weekend project idea. I need to clone his voice and whip up an interface where you give it a paper and it summarizes it in his voice.


Au contraire. Learning an abstract logical relationship such as the line of succession during training, and then applying substitution/reification during inference to deduce the new factual clause that Charles is king of the UK, is exactly what it means to learn something new. It's just a pity it can't memorize this fact at inference time, and that it won't be able to reproduce it as soon as the information about the queen's death slides outside of the context window.


That’s actually correct but an overfitted definition for learning. It holds certain hidden assumptions (i.e physical grounding) of the learner being human which makes it inapplicable to an LLM. As with a self-driving car which passes a driving exam but fails to drive freely and effectively in the city (it's not an LLM but relevant in this context). You have to admit when you work with this tech that something fundamental is missing in how they perform.


> That’s actually correct but an overfitted definition for learning. It holds certain hidden assumptions (i.e physical grounding) of the learner being human which makes it inapplicable to an LLM.

Inapplicable why exactly? Because you say so? Logic isn't magic. Nor is learning. No (external) grounding is required either: iteratively eliminating inconsistent world models is all you need to converge toward a model of the real world. Nothing especially human or inhuman about it. LLM architecture may not be able to represent a fully recursive backtracking truth maintenance system, but it evidently managed to learn a pretty decent approximation anyway.


> Because you say so?

Chill my friend, no need to get personal. We are talking about ideas. It’s OK to disagree. I am simply dismissing your initial claim. This usually happens when you present a scientific argument based on personal beliefs. If it’s not magic, then we should be able to doubt and examine it and it should eventually pass scientific muster.

> No grounding is required… It evidently managed to learn a pretty decent approximation.

Well, last time I used an LLM it suggested that I should lift the chair I am sitting in. I guess OpenAI has a lot of work to do. They have to eliminate this inconsistent world model for chairs, tables, floor, My dog, my cat and all the cats living on Mars…

edit: added a missing word.


Wasn't intended to be personal. Just a mediocre way of expressing that your assertion there is missing any form of argumentation, and therefore as baseless as it is unconvincing.

I'm seeing an emergent capability of encoding higher order logic, and the whole point of such abstractions is to not need to hardcode your weights with the minutiae of cats on Mars. LLMs today are only trained to predict text, so it's hardly surprising that they have some gaps in their understanding of Newtonian physics. But that doesn't mean the innate capability of grasping such logic isn't there, waiting for the right training regime to expose it to its own falling apples, so to speak.


I'm curious if future developments in LLMs will enable them to extract significant/noteworthy info from their context window and incorporate it into their underlying understanding by adjusting their weights accordingly. This could be an important step towards achieving AGI, since it closely mirrors how humans learn imo.

Humans continually update their foundational understanding by assimilating vital information from their "context window" and dumping irrelevant noise. If LLMs could emulate this, it would be a huge win.

Overall, very exciting area of research!


>the output is conditional on the input

Uh, isn't that how it should always be? If the output isn't conditional on the input, then it is basically noise, or worthless. How the conditioning is set up is the key, but I'm not really sure how this relates to the point you are making.


According to the parent comment anyone who says "LLMs are stochastic parrots" is ignorant of the effect that changing the input “filling the context with examples” has on the output.


That isn't a fair summary of what I'm saying.

People who say "LLMs are stochastic parrots" may well be aware of effects of conditioning the input.

This in itself does not completely describe the capabilities of a LLM, since their ability to learn and use those new facts to override their previous "beliefs" is not what you'd expect from mere stochastic behavior.

My point is that "stochastic parrot" believers are unaware of this.


They are aware. The thing about the "stochastic parrot" argument is that it can be pushed as far as one is willing to push it; any behavior by an LLM can be described in those terms with enough handwaving.

The practical limit seems to be where one would have to say that humans are also "stochastic parrots", presumably because that defeats the point of making such an argument in the first place.


You are completely disregarding fairness of judgement based on noted output. Roger was called a parrot because it repeated things apparently without checking; humans are not necessarily parrots, as some do not do that and reflect as expected.


Quite the opposite - I'm questioning fairness of judgment based on output from GPT-4 when solving complicated tasks etc, which is very hard to justify as "stochastic" unless you beg the question.


But the judgement is not relevant to a context of "«handwaving»" (BTW: nice concept). If you look at the "handwaving" it becomes a strawman to lose the focus on the objective.

If an engine outputs things like

  "Charles III is the current King of Britain. He was born in January 26, 1763"
, it seems to be conflating different notions without having used the necessary logic required for vetting the statement: the witness has reasons to say "this seems like stochastic parroting" - two memories are joined through an accidental link that slightly increases Bayesian values. That appears without need of handwaving. And a well developed human intellect will instead go "Centuries old?! Mh.", which makes putting engine and human in the same pot a futile claim - "handwaving".

Similarly for nice cases to be kept for historic divulgation, such as "You will not fool me: ten kilos of iron and half a kilo of feathers weigh the same".

When on the other hand the engine «solv[es] complicated tasks», what we want is to understand why. And we do that because, on the engineering side, we want reliable; and on the side of science, we want an increase of understanding, then knowledge, then again possible application.

We want to understand what happens in success cases especially because we see failures (evident parroting counts as failure). And it is engineering, so we want reliable - not just the Key Performance Indicator, but the /Critical/ KPI ("pass|fail") is reliability.

So the problem is not that sometimes they do not act like "stochastic parrots" (this is a "problem" in a different sense, theoretical), but that they even only sometimes do act like "stochastic parrots". You do not want an FPU that steers astray.

Normal practice should be to structure automated tests and see what works, what does not, and assess why.


That is a completely different issue, though. We were not talking about the usefulness of the models' behavior, but rather of its fundamental nature. Right now I wouldn't trust an LLM for anything where the result is not either for entertainment purposes or undergoes human vetting. But I also don't trust many humans, and for reasons that are fundamentally the same - you implicitly acknowledged it by talking about "well-developed human intellect". And I would still trust a random human more than GPT-4 - but, again, this has more to do with its ability to reason being inherently limited (by model size, context size etc), not because it's originally trained to predict tokens.


Output conditional on input (training data included) where input is, well, fucking huge and the training cycles ridiculous!


> Some of them -at least- understand that the output is conditional on the input

I don't get what you are trying to say. That's like a property of any useful system.


This looks like a potential NeurIPS submission.

But it will probably be rejected. The quality bar for NeurIPS is quite high.

Some reasons:

The experiments are very weak: There are just a few figures, basically figure 1 and figure 5, which show some results. There are no tables with numbers.

But more importantly: There are no comparisons (in terms of experiments/numbers) to similar models, like:

- Block-Recurrent Transformers (https://arxiv.org/abs/2203.07852) and related approaches to make the Transformer recurrent, so effectively getting infinite context length.

- All the work on sparse attention, or linear attention, like Longformer, etc, which should also allow for such context lengths.

I don't mean that they just mention them in related work (they partly do, although it looks very short, so I'm quite sure they leave out a lot of other related work, without looking too much into it). I mean that they really run experiments and show numbers. And also, to look at the model definitions of those alternatives, and compare and analyze the differences.

Analysis also seems to be a bit weak. Not sure about novelty.

So, while the presented approach could be interesting, it's hard to tell how good it really performs, and how it compares to alternatives.

(This is now a 10 min review by me. If I had to review this, I usually would spend more like 1-2h at least, so giving much more details. But I don't expect that my initial impression would change too much.)


The original architecture used in this model was accepted at last year's NeurIPS: https://proceedings.neurips.cc/paper_files/paper/2022/file/4...

That paper is written very differently.


I feel like this is a thing in ML land: everyone is in such a rush to publish something "revolutionary" because the whole field is already moving at warp speed. Under such pressure, it's easy to lose focus on metrics and comparisons.


A model of publishing in which the authors of related work are compensated (citations, appearing as coauthors,...) would allow new approaches and ideas to be disseminated easily. The main factor here is about novelty and possible applications of the new approaches.


Here's a list of tools for scaling up transformer context that have github repos:

* FlashAttention: In my experience, the current best solution for n² attention, but it's very hard to scale up beyond the low tens of thousands of tokens. Memory use is O(n) but compute is O(n²). Code: https://github.com/HazyResearch/flash-attention

* Heinsen Routing: In my experience, the current best solution for n×m attention, i.e., mapping n tokens to m tokens. It's like a souped-up version of attention. I've used it to pull up more than a million tokens as context. Memory use and compute are O(nm). It works, but in my (limited) experience, it doesn't work out-of-the-box as well as FlashAttention for n² attention. Code: https://github.com/glassroom/heinsen_routing

* RWKV: A sort-of-recurrent model which claims to have performance comparable to n² attention in transformers. In my (limited) experience, it doesn't. Others seem to agree: https://twitter.com/arankomatsuzaki/status/16390003799784038... . Code: https://github.com/BlinkDL/RWKV-LM

* RMT (this method): I'm skeptical that the recurrent connections will work as well as n² attention or n×m routing in practice, but I'm going to give it a try. Code: https://github.com/booydar/t5-experiments/tree/scaling-repor...

In addition, the group that developed FlashAttention is working on state-space models (SSMs) that look promising to me. The idea is to approximate n² attention dynamically using only O(n log n) compute. There's no code available, but here's a blog post about it: https://hazyresearch.stanford.edu/blog/2023-03-27-long-learn... [CORRECTION: Code is available. See comment by lucidrains below. I'm hopeful this will go to the top of my list.]
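Side note: if you just want to try plain n² attention with a fused, memory-efficient kernel and no extra dependency, PyTorch 2.x exposes one through scaled_dot_product_attention, which (as I understand it) can dispatch to a FlashAttention-style implementation on supported GPUs. A minimal sketch with arbitrary shapes:

  # Sketch: exact n^2 attention via PyTorch 2.x SDPA, which can use a fused
  # FlashAttention-style kernel on supported GPUs. Shapes are arbitrary.
  import torch
  import torch.nn.functional as F

  batch, heads, seq, dim = 1, 8, 16384, 64
  q = torch.randn(batch, heads, seq, dim, device="cuda", dtype=torch.float16)
  k = torch.randn_like(q)
  v = torch.randn_like(q)

  out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
  print(out.shape)  # torch.Size([1, 8, 16384, 64])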

If anyone here has other suggestions for working with long sequences (hundreds of thousands to millions of tokens), I'd love to learn about them.


Very nice list. I didn't know about Heinsen Routing; it looks very interesting.

From my tests, SSMs are a very promising line of research. In my (small) tests on S4, it really had better characteristics than transformers: it learned faster, handled a larger context, and needed a smaller dataset.


Agree on SSMs: they look promising. They're on my list of "things to explore more thoroughly." I've done very little with them so far. I'm still making my way through the related papers, trying to get a superficial but accurate/intuitive understanding of these models.


the code is here https://github.com/hazyresearch/safari you should try it and let us know your verdict.


Thank you. Somehow I missed that. I'm still making my way through the related papers, trying to get a superficial but accurate/intuitive understanding of these models. Embarrassingly, the work hasn't quite 'clicked' for me yet. Looking forward to tinkering with the code!


cs702, fantastic comment. I am sorta poking around this area too. I'd be curious what benchmark you're using to evaluate performance amongst these repos? If you're up for it, shoot me an email -- my email is in my profile.


Thank you!

Working on proprietary stuff. Not allowed to share details.

But I'll ask about connecting online :-)


This paper presents a novel approach to incorporating memory into a transformer. It does not demonstrate that this approach works in a useful manner. While the approach is interesting, I’m skeptical that the RNN has enough capacity to encode the memory in its output. I would have liked to see more detail on the synthetic benchmark they used. The memory component may be learning the benchmark rather than a generalized feature.


I really wish that CS would switch to discussing related literature before (or with) contributions. The authors only cite themselves until the very end of the paper, and the whole time you are thinking: "Hasn't this been done before? Transformer-XL (recurrent memory) is a years-old paper!"

And you need to read to the very end to find out that the contribution is actually: keep your architecture (but you still need to train it on specific tasks)...

It's like a blog-post recipe in research paper form


They cite the paper where the architecture was introduced. If you go to that paper, you'll see that it mostly consists of a very detailed and careful comparison with Transformer-XL.

In the new paper, they plug their memory system into vanilla BERT. This makes the resulting model essentially nothing like Transformer-XL, which was a strictly decoder-only generative language model.


I feel like it used to be much more common to have "Related Work" as section 2, or at least much earlier in papers.


I don't buy this. I don't think it's reasonable to set up the task such that the "fact" can be distinguished from the background without knowing the question. For example, if your fact is always "[name] went to the [room]", and question is always "where is [name]?" the background text should be other sentences of the same form about different names. If your background text is instead drawn from a different distribution (as in this paper), then the model can simply cheat by identifying the fact (which constitutes only a few bits of information) and ignoring everything else.

That isn't scaling to 1M tokens - the model should be expected to answer questions about any text within its context window.


this is a great point.

do you know of any benchmarks doing this today?

given the acute need to evaluate models on contextual factuality, we're exploring how to create a benchmark for this purpose but prefer existing benchmarks if possible.

openai's truthfulqa[0] is close but does not focus on contextual factuality and targets a much harder problem of absolute truth.

if none exist, and people are interested in contributing, please reach out.

[0] https://github.com/sylinrl/TruthfulQA


Cool. The idea of augmenting a transformer with an RNN isn't new, but I'll take a look.

See also the threads here, mentioning other methods for scaling up context length up to the millions of tokens:

https://news.ycombinator.com/item?id=35502187


I read the paper, but I’m not sure that I understand how the memory unit works. It seems like they reserve special tokens to read and write to this memory between layers, as guided by the self-attention heads. In this way, the model can learn to squirrel away important details from the input prompt for later recall in subsequent layers. A global chalkboard mechanism. Am I on the right track?


Wow! I don't know how the accuracy translates; I do see charts that look strong, but unless I'm missing something, this is incredible. Would be curious and also terrified to see an endpoint so I can play around with it. I thought we were stopping this kind of research?


> I thought we were stopping this kind of research?

I’m going to, hopefully correctly, assume that was a funny joke.


I’m just saying, it was more a hopeful nod in the direction of wishful thinking. Not a joke, just a bit of having my head in the sand.


> I’m just saying, it was more a hopeful nod in the direction of wishful thinking.

I still don’t understand why this is even wishful thinking, I don’t understand what the point of delaying progress is.


Because some people believe that sufficient progression in this field would result in a setback for humanity in the long run. Personally, I'm not sure what to believe.


What's your assessment of a world full of computers that can actually tell you "no" will actually be like?


Because AI development is the most dangerous thing humans have ever done? Respectfully, have you been under a rock?

"Progress" doesn't mean "every massive change to the world and humanity is good." There are undeveloped technologies that we are currently not capable of being responsible with.

Google differential technological research.


I’m using GPT-4 basically everyday multiple times. I’m following closely LLM developments.

Yet I have trouble seeing what people find so dangerous. It’s amazingly cool stuff that will create massive productivity gains.

I guess I just lack needed imagination to believe this “most dangerous thing humans have ever done” perspective.

It looks to me like the most dangerous thing is actually Gain of Function research on pathogens. That seems like the most dangerous thing. A close 2nd is nuclear weapons.

LLMs seems really nice and fuzzy and warm to me in comparison.


This convinces me:

> Intelligence is the only advantage we have over lions, who are otherwise much bigger and stronger and faster than we are. But we have total control over lions, keeping them in zoos [...]

The "intelligence" is a mouthful, but I think the advantage described above is, at its root, caused by a slightly better "predict" step in the observe-predict-act loop. That's it. And LLMs aren't exactly bad predictors, are they? And it looks like while our predictive abilities rise over centuries, theirs rise over decades.


An AI can theoretically ingest 1M tokens of data, then analyze, summarize, index, and store the condensed information on a hard drive, and later retrieve it based on the metadata for the specific situations that need it. At millions of tokens, it is basically a superhuman that can learn new things on the fly at a speed unimaginable to any organic life.

We are so close to something amazing, and scary.


Do we yet have a single example of a transformer based AI, i.e. LLM, learning something new that we didn’t teach it in the training data?

Maybe we do.

I’m not sure how to define it, but new should be a discovery or insight or even relationship that is not explicitly taught in the training data.

If we don’t, then it suggests that, without humans in the loop, superintelligence is not so close?


GPT can learn things "in context" E.g. you can teach it something by chatting with it, but it will eventually forget it after its context length is exceeded. It cannot continuously learn and remember like biological organisms since its weights are frozen.


So we still have catastrophic forgetting, just on a slightly longer scale.


Except it cannot conceptualize reasonably. It is utterly incapable of symbolic thinking in a general sense.


It’s incapable of thinking. But what it’s surprisingly good at is symbolic reasoning similar to human common sense.

Most people forget that there is no memory or thought independent of whatever the output it generates.

But when you ask it to follow a chain of thought and generate it in the output, the eventual conclusions it lands on is scarily human.

People asking ChatGPT for word counts and sighing at how wrong it is. Yeah of course, it has no counter variables to hold numbers and increment them. It has no memory.

But when you ask it to generate a numbered list of all the words in a passage and then output the word count, it gets it right every time. Because you basically gave it a memory by encoding the counter in the generated output.
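For example, a prompt along these lines (the wording is just illustrative) forces the counter to be built up in the visible output:

  # Illustrative prompt that externalizes the "counter" into the generated output.
  prompt = (
      "List every word of the following passage as a numbered list, "
      "one word per line, then state the total count on the final line.\n\n"
      "Passage: The quick brown fox jumps over the lazy dog."
  )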


> But when you ask it to follow a chain of thought and generate it in the output, the eventual conclusions it lands on is scarily human

I have tried this so many times for novel problems, and no matter what I do, it eventually recognizes what we're trying to solve and determines it's unsolvable. Since nothing new comes out, I can only attribute the times it does seem to do symbolic manipulation to regurgitation.


No reason why Toolformer+GPT couldn't use a symbolic reasoning program. Symbolic reasoning programs are much better than humans.

Probably bugs will happen in the interface between human language and the symbolic reasoning. But that happens with humans too.


I work in this field and have tried many things. Sure, GPT + something else can do symbolic reasoning almost on par with a middle schooler (it gets things wrong often, not in the way people do either, but just confidently incorrect). However, GPT + something is not a transformer model, which is what the original comment was about.


GPT-4 has toolformer capabilities as emergent properties, so using it in combination with toolformer is really unnecessary.


This is true too. Even chatGPT can learn some tool use in-context. See https://twitter.com/minosvasilias/status/1627076214639976449


GPT-4 writes some wicked complicated (but correct!) SQL if given a schema and a relevant task:

https://gist.github.com/int19h/428cea1d87dfc389b99de1b79727f...

What I found really amazing about this particular experiment is that the schema I gave it didn't contain any information that could be used to query for things like distances between places, and yet it came up with the idea of using settlements' culture as a proxy to determine "border fiefs", completely unprompted (and yes, it actually is a very effective proxy for this particular case!).

I wonder now what it would do with Prolog - or maybe Datalog for simplicity? - although that might depend on how much of it there is in the training data. Do you know if anyone tried it yet?


That's wild.


There are plenty of examples of "new" things a LLM can do.

A good example is all those toy examples of "Program a whatever in the style of Shakespeare and David Bowie's love child". This isn't a thing that it has seen in training data.


From my limited understanding of how LLMs work, I believe this behavior is enabled by embeddings. The model maps its ~50,000 token vocabulary into a lower-dimensional vector space. Each dimension in the vector adds some sort of meaning (or at least association) to the word.

I saw an example in a Numberphile video where they were able to take the vector of the word "prince", subtract the vector of the word "man", add the vector of the word "woman", and the resulting vector was closest to the word "princess". So in theory there could be a "gender" dimension and a "position of authority" dimension in that vector (or the model might be making other, stranger connections between words that we don't understand).

I think the same thing is happening in your example. The model identifies and produces output that keeps it in the general region in the vector space for the "Shakespearean" and "Bowiean" dimensions while still satisfying other requirements.
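A toy sketch of that arithmetic with made-up three-dimensional embeddings (real models learn these vectors in hundreds of dimensions; the numbers here are only for illustration):

  # Toy sketch: vec("prince") - vec("man") + vec("woman") lands nearest
  # vec("princess"). The vectors are invented, e.g. [royalty, maleness, adulthood].
  import numpy as np

  vocab = {
      "prince":   np.array([0.9,  0.8, 0.1]),
      "princess": np.array([0.9, -0.8, 0.1]),
      "man":      np.array([0.0,  0.8, 0.9]),
      "woman":    np.array([0.0, -0.8, 0.9]),
      "table":    np.array([-0.5, 0.0, 0.0]),
  }

  def cos(a, b):
      return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

  def nearest(vec, exclude=()):
      return max((w for w in vocab if w not in exclude), key=lambda w: cos(vocab[w], vec))

  target = vocab["prince"] - vocab["man"] + vocab["woman"]
  print(nearest(target, exclude={"prince", "man", "woman"}))  # -> "princess"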


I don't think that's really "new". That's combining two existing styles that the LLM has seen in its training data, and the creative idea to combine those styles has been supplied by the operating human.

It's phenomenally impressive, and may be a stepping stone to models that can come up with new ideas, but I don't think we're there yet.

LLMs seem to be able to capture the idea of Shakespeare and Bowie's styles, and intuit a combination of the two, but when I start asking it questions about what it thinks about the process I don't get the impression of any understanding. It can magic up text from prompts but it doesn't understand what it's doing.


Not saying it is not fascinating, but isn't this program just something 50% in the style of Shakespeare and 50% in the style of David Bowie?


"Figure 2: Recurrent memory mechanism. Memory is passed to Transformer along input sequence embeddings, and memory output is passed to the next segment. During training gradients flow from the current segment through memory to the previous segment."

The whole point of the 2017 paper "Attention is all you need" was to show that the recurrent structure previously used (with attention mechanisms) wasn't needed. It's pretty cool if it does turn out that both mechanisms are needed and powerful together. After all, the brain doesn't linearly process the entire book you're reading from page 1 to where you are, for every new word you read, it stashes the important parts into short/long term memory along the way.

Still, passing gradients back through several stages like that is typically a tricky thing to do in practice.
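For anyone wondering what the mechanism in that figure caption boils down to, here is a very rough, simplified sketch (not the authors' code): a few learned memory embeddings are prepended to each segment, and their output states are carried forward as the memory for the next segment, so gradients can flow back through earlier segments via the memory.

  # Rough sketch of RMT-style segment-level recurrence (simplified, not the authors' code).
  import torch
  import torch.nn as nn

  class ToyRMT(nn.Module):
      def __init__(self, d_model=256, n_mem=8, seg_len=128):
          super().__init__()
          self.mem_init = nn.Parameter(torch.randn(n_mem, d_model) * 0.02)
          layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
          self.encoder = nn.TransformerEncoder(layer, num_layers=2)
          self.n_mem, self.seg_len = n_mem, seg_len

      def forward(self, embeddings):                  # (batch, long_seq, d_model)
          memory = self.mem_init.expand(embeddings.size(0), -1, -1)
          outputs = []
          for seg in embeddings.split(self.seg_len, dim=1):
              x = torch.cat([memory, seg], dim=1)     # [memory tokens | segment tokens]
              y = self.encoder(x)
              memory = y[:, :self.n_mem]              # updated memory -> next segment
              outputs.append(y[:, self.n_mem:])
          return torch.cat(outputs, dim=1)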


1M tokens is around 750k words in English.

According to Wolfram Alpha that is:

Single-spaced document: 1500 pages

Double-spaced document: 3000 pages

Book: 1028 pages

So around 1-5 books.

I'm assuming they're using OpenAI's tiktoken tokenizer. (??)
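If it is tiktoken, counting is easy to check; a small sketch (cl100k_base is the encoding used by the GPT-3.5/GPT-4 chat models, which may or may not match what the paper used, and book.txt is a hypothetical file):

  # Sketch: counting tokens with OpenAI's tiktoken.
  import tiktoken

  enc = tiktoken.get_encoding("cl100k_base")
  text = open("book.txt").read()                    # hypothetical ~750k-word file
  n_tokens = len(enc.encode(text))
  print(n_tokens, "tokens ~", int(n_tokens * 0.75), "English words")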


Is any of this open source? I just recently started experimenting with summarization of youtube video audio that has been passed through openai/whisper and to my surprise I’m told that the maximum token lengths are in the 1k range.

I figured that I’d be able to roll my own summarization much like I was able to do so with transcription.

I’m hoping that the findings of this work are open source.

Does anyone have any extra insights?


Given the example in the paper:

Fact1: The hallway is east of the bathroom.

Fact2: The bedroom is west of the bathroom.

Question: What is the bathroom east of?

Answer: bedroom

Asking for better understanding: if the two facts given above were embedded in, say, the Bible, would the system be able to answer the question correctly? Does that mean it can track the state of an object and determine it at a certain point in time?


An alternative would be to enable LauRa fine tuning per customer / use case (?)


You probably mean LoRA (Low-Rank Adaptation)


Abstract

This technical report presents the application of a recurrent memory to extend the context length of BERT, one of the most effective Transformer-based models in natural language processing. By leveraging the Recurrent Memory Transformer architecture, we have successfully increased the model's effective context length to an unprecedented two million tokens, while maintaining high memory retrieval accuracy. Our method allows for the storage and processing of both local and global information and enables information flow between segments of the input sequence through the use of recurrence. Our experiments demonstrate the effectiveness of our approach, which holds significant potential to enhance long-term dependency handling in natural language understanding and generation tasks as well as enable large-scale context processing for memory-intensive applications.

Introduction

The Transformer model (Vaswani et al., 2017) has been widely adopted and used in various research areas and industrial applications. The most important issue of the model is the quadratic complexity of the attention operation, which makes large models increasingly difficult to apply to longer inputs.

In this report we show that a simple token-based memory mechanism introduced in (Bulatov et al., 2022) can be combined with pretrained transformer models like BERT (Devlin et al., 2019), and that with full attention and full precision operations it can be applied to sequences longer than 1 million tokens using a single Nvidia GTX 1080Ti GPU.

Contributions

1. We enhance BERT by incorporating token-based memory storage and segment-level recurrence with recurrent memory (RMT).

2. We demonstrate that the memory-augmented BERT can be trained to tackle tasks on sequences with lengths up to seven times its originally designed input length (512 tokens).

3. We discovered the trained RMT’s capacity to successfully extrapolate to tasks of varying lengths, including those exceeding 1 million tokens with linear scaling of computations required.

4. Through attention pattern analysis, we found the operations RMT employs with memory, enabling its success in handling exceptionally long sequences.

Discussion

The problem of long inputs in Transformers has been extensively researched since the popularization of this architecture. In this work, we demonstrate that applying Transformers to long texts does not necessarily require large amounts of memory. By employing a recurrent approach and memory, the quadratic complexity can be reduced to linear. Furthermore, models trained on sufficiently large inputs can extrapolate their abilities to texts orders of magnitude longer.

Synthetic tasks explored in this study serve as the first milestone for enabling RMT to generalize to tasks with unseen properties, including language modelling. In our future work, we aim to tailor the recurrent memory approach to the most commonly used Transformers to improve their effective context size.


Another roadblock to AGI starting to fall. WOW.



