Great news. No URLs for the models yet and nothing on https://huggingface.co/meta-llama . But the appendix has a really cool explanation of the RoPE algorithm and a visualization of what it means to rotate the angle of a token embedding's set of values. It really gives you an intuitive understanding of why the cosine similarity varies periodically, so that every $x tokens are more like each other. With the increased base frequency in this version of RoPE, that just means more spurious "similarity" between tokens every $x tokens along in the positional encoding. I've never seen anyone exploit this to achieve a desired result yet, but it seems straightforward: pre-tokenize your input and then massage it so the tokens you want most linked sit $x tokens apart.
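I couldn't resist sketching the periodicity. A toy numpy example (mine, not the paper's code): take a single 2-d RoPE pair, rotate the same vector to different positions, and watch the dot product cycle with the offset:

```python
# Toy illustration of why RoPE makes similarity between positions vary
# periodically. For one 2-d pair of embedding dims with frequency theta,
# RoPE rotates the pair by p*theta at position p, so the dot product between
# the same vector placed at two positions depends only on cos((p - q)*theta).
import numpy as np

def rope_rotate(vec2: np.ndarray, pos: int, theta: float) -> np.ndarray:
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    rot = np.array([[c, -s], [s, c]])
    return rot @ vec2

d, base, i = 128, 10_000.0, 0          # i=0 is the fastest-rotating pair
theta = base ** (-2 * i / d)
v = np.array([1.0, 0.0])

ref = rope_rotate(v, 0, theta)
for p in range(0, 40, 5):
    sim = float(ref @ rope_rotate(v, p, theta))
    print(f"offset {p:3d}: similarity {sim:+.3f}")   # cycles with period 2*pi/theta
```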
Meta research has been really putting great stuff out there. No wonder they are still rocking after all those years, despite the reputation Facebook has been racking up.
"With FLASHATTENTION (Dao et al., 2022), there is negligible GPU memory overhead as we increase the sequence length and we observe around 17% speed loss when increasing the sequence length from 4,096 to 16,384 for the 70B model."
"For the 7B/13B models, we use learning rate 2e−5 and a cosine learning rate schedule with 2000 warm-up steps. For the larger 34B/70B models, we find it important to set a smaller learning rate (1e−5) to get monotonically decreasing validation losses."
"In the training curriculum ablation study, models trained with a fixed context window of 32k from scratch required 3.783 × 10^22 FLOPs and achieved performance metrics like 18.5 F1 on NarrativeQA, 28.6 F1 on Qasper, and 37.9 EM on Quality."
"Continual pretraining from short context models can easily save around 40% FLOPs while imposing almost no loss on performance."
"Through early experiments at the 7B scale, we identified a key limitation of LLAMA 2’s positional encoding (PE) that prevents the attention module from aggregating information of distant tokens. We adopt a minimal yet necessary modification on the RoPE positional encoding (Su et al., 2022) for long-context modeling – decreasing the rotation angle."
Pretty exciting stuff. Getting close to GPT-4 hopefully soon!
This is very cool, and I was reading the paper when I noticed they call it an open-source model, but wouldn't a better term be "open-weight model", since to make the model weights you'd need the document sources and lots of compute? Or did they actually open source it fully so people can pull the same sources in to build the same weights and I didn't get the memo?
I find it confusing to release weights under a source-code license too: Apache 2.0, which is used for plenty of open models, spends a lot of time talking about code, which should be unrelated to weights (I am not a lawyer, though). But perhaps one should look at weights as "firmware" necessary to run a model? I am not sure.
Although I am getting tired of mentioning it at this point, LLaMa is not open by any reasonable definition. It is available. But as a sibling points out, it is rude of Facebook to call it something it blatantly is not, even if it is more open than, say, Clos^WOpenAI's offerings. In addition, it is very rude towards the people behind Mistral, GPT-J, etc., who do honour the tradition of open science and open source, to place your work next to theirs in both your papers and your marketing. There is a great word to describe what Facebook is doing: appropriation.
Nor does releasing LLaMa the way Facebook does honour their "commitment to open science", as its license also violates the principles of open science. This was objectively false when they stated it six or so months ago, and even after OSI and many others called them out they still have not adjusted their messaging, so I am frankly considering them bad-faith actors at this point.
The best argument in favour of Facebook doing what they are doing is that their model is more open than the closed ones. Which is fair (pun not intended). But, calling your dog a cat because it is more akin to a cat than a fish is still misleading and rightfully surprises and upsets cat lovers, even if dogs are fine pets in their own right.
Many clearly think of the weights as a source-like artifact.
Also, the model's architecture is sufficiently documented, & supported by open code, for others to fine-tune it & run it for generation, which to me justifies describing the model as "open", even if the entire process for creating a new set of high-quality weights isn't. (That's in contrast to "Open"AI's GPT4, about which many architectural details are undisclosed.)
Note also: as I understand it, there's so much parallelism, algorithmic randomization, & even floating-point-implementation instability in how such models are trained that even having the exact same training corpus wouldn't be enough to ensure the same final weights. That would require both Facebook & reproducers to do every calculation, in every stage of preparation, in identical order on identical hardware - a constraint that'd make typical parallel/distributed training optimizations (subject to all sorts of CPU/IO/network jitter) impossible.
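A concrete micro-example of the floating-point part, in case it sounds hand-wavy: summation order alone changes the result.

```python
# Floating-point addition is not associative, which is one reason parallel
# reductions performed in different orders give (slightly) different sums.
import random

random.seed(0)
xs = [random.uniform(-1e10, 1e10) for _ in range(100_000)]

forward = sum(xs)
backward = sum(reversed(xs))
print(forward == backward, forward - backward)   # usually False, with a tiny nonzero diff
```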
I suspect a lot of people like me are feeling quite a bit "nerd sniped" while trying to read the paper. I am just suggesting that "open weights" is a more fitting term than the "open source" I saw when trying to read the paper; I was not talking about EXACT reproducible builds, which I agree are probably a stretch with current tech (but obviously not impossible with new tech, just possibly slower and not obviously needed anyway). I also know about OpenLLaMA and others, but that's not Meta's work, and it's not the work in the paper. I'm pretty sure a lot of the core principles Llama uses are drawn from OpenAI's published research, and I agree that there has been a lack of openness recently, e.g. people having to speculate about even core things like how they presumably use a mixture of experts, but "open source" is not the right term for them just because OpenAI is not being open right now. I mostly agree that the model itself is somewhat open, but I really think "open weights" is a much better term; it's much more reasonable than saying open source.
> Many clearly think of the weights as a source-like artifact.
I believe releasing the weights is necessary for an open-source AI project, but don’t consider the weights to be source.
Indeed, open source is not just about the code. For instance, the OSI definition[0] states:
> The license must allow modifications and derived works
If a company released source code for a project, but it was written in a new language whose compiler they held private, then making derived works would be extremely hard. Someone would have to reverse engineer how the language works from source examples, and reimplement a compiler with a similar performance. In that case, in my view, while the code was open, the project would not be open-source.
The same is true of AI models: people could certainly reverse-engineer the training code and replicate a training run to get the weights, but that is prohibitively hard, so without the released weights the project is not open-source.
On the flip side, if the inference code and the weights are public but the training code is not, it is similar to an open-source project using a proprietary compiler (think C# before 2014). The project is open-source, but the training is not.
While I like the Open Source Initiative's early & principled stake-in-the-ground, as a matter of usage, many things get casually called 'open source' that don't fully fit the OSI 'Open Source Definition'.
And, with regard to the Llama models, it seems to me that all the actual computer-language "source code" to run & train them is available. The specific objection of the grandparent post, with regard to the non-availability of the full training-data document corpus of non-source-code-text, isn't clearly in violation of the OSI's 10-point definition.
There is of course a different problem, a part of Llama's licensing that does clearly violate the OSI Open Source Definition: its "Additional Commercial Terms" preventing just Meta's biggest competitors from using it – discrimination against persons or groups.
"Rude" is a strangely chosen word here. Rude towards whom? I could maybe accept "misleading", but as a person who don't believe OSI has an ownership of the term "open source", I don't think it's misleading either.
I used the word "rude" carefully here, because I'm not confident that I can make a case for it being illegal or even necessarily misleading (though I personally think it is) - but I'm happy to declare it "rude", partly because rudeness is in the eye of the beholder so I get to determine if I think something is rude or not myself.
I don't understand how it's economically viable to build an AI business. 32,768 tokens is nothing, and using OpenAI as a price reference (which I presume is heavily subsidised), just one chat hitting that will cost ~$4 a message if you truncate the chat history down to 32k. That's $4 for every single message in your chat, even for mundane fillers such as "thank you".
Loading the whole history into the context window is, indeed, dumb.
Most implementations I’ve seen recently involve using embeddings/RAG against the chat history (plus any other public, proprietary, or account-specific knowledge base you want to include) to only pull out the most relevant aspects of the history to include in the prompt envelope. Embeddings are cheap (don’t even need a transformer for them) and RAG lookup with a vector DB (e.g. pg-vector) is fast and efficient.
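For anyone curious what that looks like in miniature, here's a toy sketch of the idea (embed() here is a stand-in hashed bag-of-words so the example runs on its own; in practice you'd call a real embedding model and a vector DB):

```python
# Minimal sketch of RAG over chat history, not a production setup.
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    # Toy stand-in for a real embedding model.
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

chat_history = [
    "user: my order #1234 arrived damaged",
    "assistant: sorry to hear that, can you share a photo?",
    "user: sure, also I'd like to change my shipping address",
    "assistant: noted, what's the new address?",
]

def most_relevant(query: str, history: list[str], k: int = 2) -> list[str]:
    q = embed(query)
    scored = sorted(history, key=lambda m: float(q @ embed(m)), reverse=True)
    return scored[:k]

# Only the top-k relevant turns go into the prompt envelope,
# instead of the entire (possibly 32k-token) history.
print(most_relevant("what was wrong with the order?", chat_history))
```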
You also don’t need to always hit the 32k endpoint — you can save that for higher-tier accounts, and even then only use it for queries where the prompt length requires it. Yes, you’ll truncate some responses where prompt plus response > 32k, and if your use case typically results in those kinds of situations then your pricing should probably be relatively high/“enterprisey”.
Maybe the use case isn't chat, and people need to stop looking at evolving tech solely through the lens of status quo applications of earlier incarnations?
$4 to run two research papers through GPT-4, asking it to identify differences in methods in order to help direct focus during a meta-analysis, seems like a pretty amazing price point, even if that same price is expensive for a chatbot answering a customer query on a website.
Which itself may not be an expensive price point if the good in question is an automobile or a travel package, even if it is too expensive for laundry detergent.
Markets tend to find an equilibrium pretty quickly, and I've yet to see any AI products priced such that I can't think of any applications where they'd be welcome as long as performance was adequate.
This research takes us closer to Anthropic Claude-like models that can handle 100K tokens - being able to reference book-sized context is awesome, and being able to run that locally with similar benchmark results will be even better. What's the bottleneck to get to 100K? Is it just compute power/cost? It's funny that OpenAI doesn't offer an API that can do 100K tokens yet!
It’s been interesting watching the open source community get RoPE working well. I find it interesting that the method appears to be as good as retraining a large model.
To me, that reads like models can be much smaller, understanding human language at a fraction of the size. But we then need a technique (such as this) to expand the context window to interpret larger inputs.
This could dramatically democratize model development.
The thing is, GPT-4 isn't useful to me because it understands language. It's useful to me because it understands a wide variety of advanced, complex phenomena and is able to synthesize these insights into novel solutions to all sorts of problems.
A lot of knowledge could be taken out and put in a vector search system which dynamically loads context into the input, but some knowledge seems worth embedding into the model. The question becomes what knowledge is critical in order to create a useful, large-context language model which can run on the edge.
Long context is a mistake. You don’t remember 30,000 words when you read the next word in a book. All of that has been summarised down aggressively to a tiny working memory.
It should be a series of caches of increasing compression. The words just before should be cached exactly as they are. The sentences before that start focusing on essential phrases. The pages before that become topics.
Instead of training on 50,000 chars plus the current char, it would be 5,000 chars of cache. And if you do the cache right, it will hold far more knowledge than just storing the entire text verbatim.
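To make it concrete, here's a very rough sketch of what I mean by tiered caches (summarize() is a placeholder for whatever summariser, e.g. an LLM call, you'd actually use; the tier sizes are arbitrary):

```python
# Rough sketch of "caches of increasing compression", not a real memory system.
from collections import deque

def summarize(text: str, max_chars: int) -> str:
    # Placeholder: naive truncation stands in for real summarisation.
    return text[:max_chars]

class TieredMemory:
    def __init__(self):
        self.recent = deque(maxlen=20)   # verbatim recent sentences
        self.mid = deque(maxlen=50)      # compressed "essential phrases"
        self.old = deque(maxlen=200)     # heavily compressed "topics"

    def add(self, sentence: str):
        if len(self.recent) == self.recent.maxlen:
            evicted = self.recent[0]     # will be pushed out by the append below
            if len(self.mid) == self.mid.maxlen:
                self.old.append(summarize(self.mid[0], max_chars=20))
            self.mid.append(summarize(evicted, max_chars=80))
        self.recent.append(sentence)

    def context(self) -> str:
        # What would be fed to the model instead of the full raw history.
        return " ".join(list(self.old) + list(self.mid) + list(self.recent))
```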
Aggressive summarisation to a tiny working memory does not seem like the right model to describe human recollection. Obviously we don't remember 30,000 words individually, but we're way better at remembering random seemingly insignificant stuff than a current LLM asked to summarise into a tiny context could possibly manage. Absent a good model for implementing human memory, increasing the context window seems a better approach than trying to fit literally everything the model should know into a summary and accepting that once any information at all falls out of the summary it's gone forever.
Recently I've been doing some data modeling stuff that essentially boils down to aligning and identifying common concepts between several JSONSchemas, each of about 30,000-50,000 characters without whitespace. For example, "this field `$.run.acquisitionTime` in the first schema means the same thing as `$.operation.experiment_timestamp` in the second schema" - but generally for more esoteric concepts than datetime.
Naturally, I cannot remember most of the content and use text editors / home built data review tools to augment my very limited working memory. It is slow.
I would love to chuck them all into ChatGPT4 and have that magically align and cross reference them, which is just not possible with limited context. There are perhaps some clever solutions involving progressive abstraction and stuff, which I do not have the time to figure out.
So, TL;DR: maintaining a context much larger than human working memory is a value prop of computers in general, and it feels like it should be a value prop of LLMs too.
(...) experiments suggest that having abundant long texts in the pretrain dataset is not the key to achieving strong performance, and we empirically verify that long context continual pretraining is more efficient and similarly effective compared to pretraining from scratch with long sequences (...)
...to the extreme and arrange training into passes with increasing context length and decreasing number of samples?
Yes. Empirically, I've noticed that if you do not exchange that sequence-length gain for huge batch sizes up front, your overall performance will go down...
I'd imagine performance will suffer, but how does that relate to gains in overall training cost? I.e., with the same compute budget, which approach would produce better overall performance?
Expanding on this idea, do you think it makes sense to explore what could be called a "bootstrapping phase" or "progressive training" for foundational model training (a rough sketch of the kind of schedule I mean is after the list):
- starting with a small number of weights that is increased as training progresses
- arranging the training data with basics first: short sentences, logic, grammar, arithmetic, naive knowledge ("a bear is an animal", etc.), increasing in complexity as training progresses
- increasing the context length, ideally implicitly, based on increasing sample sizes
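Roughly this kind of schedule, with every number and data mix made up purely for illustration:

```python
# Hypothetical curriculum along the lines described above; the stage
# boundaries, sample counts and data mixes are invented for illustration.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    seq_len: int      # context length used for this stage
    num_samples: int  # fewer samples as sequences get longer
    data_mix: str     # which slice of the corpus to draw from

curriculum = [
    Stage("basics",       seq_len=512,    num_samples=2_000_000, data_mix="short sentences, simple facts"),
    Stage("general text", seq_len=4_096,  num_samples=500_000,   data_mix="web + books"),
    Stage("long context", seq_len=32_768, num_samples=50_000,    data_mix="long documents, code"),
]

for stage in curriculum:
    # train(model, stage.seq_len, stage.num_samples, stage.data_mix)  # pseudo-call
    tokens = stage.seq_len * stage.num_samples
    print(f"{stage.name}: {tokens/1e9:.1f}B tokens at seq_len={stage.seq_len}")
```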
This is still full attention, so it should do better on benchmarks, whereas sliding-window attention is more efficient, so it allows larger context sizes.
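Toy illustration of the trade-off (my own sketch): a full causal mask allows O(n^2) query-key pairs, while a sliding-window mask of width w only allows about O(n*w):

```python
# Full causal mask vs. sliding-window causal mask (window size w).
import numpy as np

def full_causal_mask(n: int) -> np.ndarray:
    # token i may attend to all tokens j <= i
    return np.tril(np.ones((n, n), dtype=bool))

def sliding_window_mask(n: int, w: int) -> np.ndarray:
    # token i may attend only to the last w tokens (j in [i-w+1, i])
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - w)

n, w = 8, 3
print(full_causal_mask(n).sum())        # 36 allowed pairs
print(sliding_window_mask(n, w).sum())  # 21 allowed pairs
```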
I think (hope?) Llama 3 will be a MoE architecture that shows >GPT-3.5 level performance. Interesting to think that Meta will probably continue to spearhead the open-source AI movement.