I would care more about the LLaMA architecture when I can actually get my hands on it; honestly this project is more interesting and lightning fast even on a 2060 laptop: https://github.com/BlinkDL/RWKV-LM
If you can tell them which university you're with and the formal name of your research project, that would be a start. They'll reach out to the admin to confirm.
You have to split it up, which slows it down a lot. The 14B model doesn't fit fully on a 3090, though the 7B fits easily and is very fast. Other replies may have either meant this or thought the original comment was about LLaMA.
It would be interesting to see a version of RWKV[1] that takes some of the improvements in LLaMA (e.g. the SwiGLU activation function and the rotary embeddings, although I think some versions of RWKV have already tried rotary embeddings) as well as the same dataset and see how it does.
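For reference, a minimal sketch of the LLaMA-style SwiGLU feed-forward block in PyTorch (the dimensions here are illustrative, not LLaMA's actual sizes):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SwiGLUFeedForward(nn.Module):
        # LLaMA-style FFN: down(SiLU(gate(x)) * up(x)); sizes are illustrative
        def __init__(self, dim: int, hidden_dim: int):
            super().__init__()
            self.gate = nn.Linear(dim, hidden_dim, bias=False)
            self.up = nn.Linear(dim, hidden_dim, bias=False)
            self.down = nn.Linear(hidden_dim, dim, bias=False)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.down(F.silu(self.gate(x)) * self.up(x))

    x = torch.randn(2, 16, 512)                 # (batch, seq, dim)
    ffn = SwiGLUFeedForward(dim=512, hidden_dim=1376)
    print(ffn(x).shape)                         # torch.Size([2, 16, 512])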
The dataset is interesting. It's not dissimilar to The Pile, which RWKV is already trained on, but it does seem to have quite a lot more preprocessing to increase the dataset quality.
Very interesting. Do you know of anything that would take advantage of the 128 cores of my Ryzen Threadripper even though I have a 2080 and a 3080 as well? (Or all three... lol)
This is not strictly true. GPUs fare better at these tasks for a few reasons:
* The largest contributor is the sheer number of cores.
* Memory bandwidth between the cores and VRAM.
* FP16 instructions.
128 cores isn't an insignificant fraction of the core count of a 1050 (about 640), and CPU cores are individually more powerful, so the core-count advantage alone is difficult to call. The top end of Genoa has 96 cores/192 threads, and you can put more than one of them on a single board.
AMD is throwing more and more memory into the CPU cache. That's very different to a direct path to GBs of HBM, but at some point the difference in performance might not matter to a novice/dabbler.
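If you want a rough feel for the gap on your own hardware, here's a quick (very unscientific) matmul timing sketch with PyTorch; numbers will vary wildly by machine:

    import time
    import torch

    def time_matmul(device: str, n: int = 4096, iters: int = 10) -> float:
        # Time an n x n fp32 matrix multiply, averaged over a few iterations.
        a = torch.randn(n, n, device=device)
        b = torch.randn(n, n, device=device)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            c = a @ b
        if device == "cuda":
            torch.cuda.synchronize()
        return (time.perf_counter() - start) / iters

    print(f"CPU : {time_matmul('cpu'):.3f} s per matmul")
    if torch.cuda.is_available():
        print(f"GPU : {time_matmul('cuda'):.3f} s per matmul")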
IMHO these chips designed for specialized workloads are looking more and more ridiculous. I expect GPU manufacturers to keep dragging them out for another decade or so, and Apple too as it explores offshoots of the M1. It all makes me feel very tired (the ultimate code smell).
A better design would be something like a 256+ core RISC-V with local memories in/by each core for data-locality and a content-addressable caching scheme for deduplication. Copy-on-write languages like Clojure and orchestrating processes under something like Docker would make it a breeze to program, although it would still support manually managed mutability like with Rust for innermost loops in games or whatever. It's fairly obvious how it would all work, but IP law and gatekeeping ensure that it will not happen anytime soon.
Then stuff like CUDA would be just another framework run on a symmetric multiprocessor, and we could get back to exploring alternatives like genetic algorithms, as we did in the 1990s. Thankfully nobody cares what I think, heck even I'm sick of reading my own complaints, so it's easy enough to just unsee this.
> Keep in mind that many of the GPU cores go unused, since they are dedicated to geometry or ray tracing or whatnot
I mean, “many” is not usually the case; an RTX 3090 has 10,496 CUDA cores, 328 Tensor cores, 82 RT (raytracing) cores, and 96 render output pipelines. ML apps will use the first and, depending on the software, the second set. The vast majority of the cores being CUDA cores is the norm.
I'm using a Ryzen 9 5950X to run some tests with Whisper (ASR), and since I have no GPU with more than 4 GB VRAM, I'm running it on the CPU. It is slow. It takes between 20 seconds and 2 minutes to transcribe 1 minute of audio using 8 cores. Adding more cores doesn't seem to improve the inferencing time.
ML is really something which should be left to a GPU.
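For reference, roughly what I'm running, using the openai-whisper package (the model size, thread count, and file name here are illustrative):

    import torch
    import whisper  # pip install openai-whisper

    torch.set_num_threads(8)                  # cap CPU threads; more didn't help me

    model = whisper.load_model("small", device="cpu")
    result = model.transcribe("meeting.wav")  # placeholder audio file
    print(result["text"])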
There is an example of multi-GPU use in the link. Outside of this, I have read about offloading to CPU/NVMe for 100GB+ models that don't fit in VRAM, though this comes at the expense of performance.
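For the offloading route, a rough sketch with Hugging Face transformers + accelerate (not from the linked repo; the model name and offload folder are placeholders):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # device_map="auto" spreads layers across available GPUs and spills the rest
    # to CPU RAM / disk; requires `pip install accelerate`.
    model_name = "EleutherAI/gpt-j-6B"   # placeholder model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",
        offload_folder="offload",        # NVMe/disk spill directory
    )

    inputs = tokenizer("Hello, world", return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))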
I don't really understand the benchmarking aspect researchers are touting. The public never cared about LLMs until they had a proper conversation with one. You can beat GPT3 at any benchmark you'd like, but if you can't get people that "feeling" when chatting with your model, is it worth anything?
In the future there's going to have to be a way to benchmark the "human-ness" or "intrigue" or "feistiness" of a model to show us if it's getting better at what we want.
Often articles show up on Hacker News that are meant for other researchers, not the general public. Not everything is a product.
Benchmarks are used as a way to show that a particular machine learning technique does better at some task. It’s a way for researchers to show they’re making progress that will be legible to other researchers. You can’t publish a paper saying “we tried it out and we think it’s better.”
The question is, where does this "human-ness" lie? In the initial neural network, the training data, or the supervised reinforcement?
In theory, a significantly smaller neural network that outputs at nearly as good a quality should be able to chew through training data, and its reinforcement process, much faster and cheaper, right? A more generalized, lower-parameter model is almost always preferred, as long as it works?
If the human feeling can all be bolted onto the neural net later, there is no reason to discount this approach as lacking the potential to exceed current models.
I think it's the feedback from RLHF that is mostly responsible, but it only works if the base model is large. I've never seen a small model hold a good conversation. They can do OK at classification and open-book question answering, but generating long-form coherent text is hard.
GPT-3 performed really well on synthetic benchmarks. It was later made palatable for general public consumption. You might say that an LLM needs to be good on synthetic benchmarks first before you can make public-facing chatbots with it.
The techniques to go from a basic GPT-like model to a conversational agent are largely published and should be reproducible; open-source base models are unencumbered, available starting points for that work.
This is important for researchers and implementers, not (immediately, at least) end users.
I guess one technique is to train a model on various language model outputs to classify these as good/bad/intriguing/robot-sounding/repetitive/etc. A human can tag the answers for the training dataset.
Then, we can use this model to compare different LLMs and optimize new models - could be with genetic algorithms or just a human tweaking the model - so that the LLM maximizes whatever we want.
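A rough sketch of the supervised version of that judge model with Hugging Face transformers (the labels, model choice, and examples are purely illustrative):

    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)
    from datasets import Dataset

    # Tiny illustrative dataset of human-tagged LLM outputs:
    # 1 = "human-like", 0 = "robot-sounding"
    data = Dataset.from_dict({
        "text": ["As an AI language model, I cannot...",
                 "Ha, good question, here's my take..."],
        "label": [0, 1],
    })

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=2)

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True,
                         padding="max_length", max_length=128)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="quality-judge", num_train_epochs=1),
        train_dataset=data.map(tokenize, batched=True),
    )
    trainer.train()
    # The trained judge can then score outputs from different LLMs, or serve as
    # a reward signal when tweaking/optimizing a new model.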
You just rediscovered RLHF, reinforcement learning from human feedback. That's the last stage of training for ChatGPT, but it uses RL instead of supervised learning.
What is the purpose of this? The model from Meta is not available to the public. Neither this open source "LLaMA-based ChatGPT" nor the "open source" LLaMA can be downloaded or actually used by the public, because that would require the actual trained model weights.
I'm as much a Meta hater as anyone, and their policies have consistently disappointed me in almost every aspect of their business, but I must say I'm happy with their stance on this LLaMA project, and it seems to mark a turn for the better.
If they follow through on their promise of making the weights available and sharing the source code, that is a big step in the right direction for democratising this technology.
The weights are non-commercial, and while their code is GPLv3, they've only released inference code and have removed anything that would give away the training methods :)
Yeah, but their history (and accumulated goodwill) is similar to Microsoft's. They may say it is 'open' or sprinkle appropriate-sounding corporate speak all over the press release, but the actions will, at best, only temporarily prevent them from going 'full evil ahead'.
And I like that announcement. I just don't think they will actually follow through on this.
I am very far from an expert on this, but I think domain specific conversational AI would be much more useful than these large models. It's fun to ask an AI to compose a fresh 600bpm hip hop song about the relationship between materials science and the breeding habits of mosquitoes, but an open-source medical AI, application support AI, or many other applications would be much more practical, if they could be accurate enough. And especially if they could run "standalone." They could also consult with each other, as a network of specialized AI. Is work inching closer to more specific, more accurate applications? Or is this just a big gimmick/distraction phase around a maybe not so great idea of AI?
Smaller models are typically dumber. Sure, you could fine-tune a smaller model on, say, the medical domain, and it might even perform better on some benchmarks, but it won't reason or generalize as well. Domain-finetuned large models >>> domain-finetuned small models.
And because competence in one area bleeds over to other areas, you often need much less domain-specific data to finetune on than the smaller models do.
You can see instances of this with Minerva, where the finetuned 540B version beats the finetuned 62B version despite being finetuned on only a quarter as much data.
They're claiming the 13b model beats GPT-3 175b which is an extraordinary claim requiring extraordinary evidence. If that's true, though, it'd be interesting to see if that also applies to fine-tuning. Since the claim is predicated on the 13b model being better trained (amongst other things?), I wonder if limited fine-tuning data handicaps the 13b model even if the base model can outperform GPT-3 Davinci, given your point about large models handling fine-tuning better with limited data.
I mean, the benchmarks are there; can't exactly fake that. It should apply to fine-tuning, since fine-tuning works off the back of the weights. That's why even small instruction-finetuned models like T5 converge much faster on any additional fine-tuning or training than their non-instruct counterparts, as per the Flan paper. Honestly, what I'm taking from this paper is that even Chinchilla is undertrained: the 13B model was trained on 1T tokens.
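A back-of-the-envelope check, assuming Chinchilla's roughly 20-tokens-per-parameter rule of thumb:

    # Assumes Chinchilla's ~20 tokens/parameter heuristic
    params = 13e9
    chinchilla_optimal = 20 * params              # ~2.6e11, i.e. ~260B tokens
    llama_13b_tokens = 1e12                       # ~1T tokens, per the paper
    print(llama_13b_tokens / chinchilla_optimal)  # ~3.8x past the "compute-optimal" point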
My opinion is entirely the opposite: we need conversational AI for exploring the multitude of possibilities and identifying what works, what doesn't, and what is or isn't useful given the obvious limitations on accuracy, truthfulness and interpretability; but once we can identify a specific narrow use case, we can fine-tune an ML system for it that isn't conversational but provides better results with whatever domain-specific structure is required (linking to specific sources, including external structured data, providing certainty metrics, filtering results by domain-specific criteria instead of the conversational political-correctness filters which fail some domains, flagging treatment info which was correct but has become outdated, etc.), all of which can be done better in non-conversational systems.
Not necessarily: you don't need to make a domain-specific model from a general model. You can definitely make a pretrained domain-specific model from scratch by training it only on domain-specific data, which can result in a smaller and more efficient model.
Furthermore, when making task-specific models, an 'encoder' architecture (similar to BERT) often works better than a 'decoder' architecture (similar to GPTx), so you might want to use a similar-but-different architecture than the general model intended to be conversational/generative.
If you want to build a domain-specific classifier that determines whether an image is a dog or a cat, and you have 50 labeled images of dogs and cats, it's much better to start with a large model pretrained on millions of images, and then specialize it by training on 50 images of dogs and cats.
Try to start with a fresh NN and only 50 images of dogs and cats, and it won't work very well.
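Something like this with torchvision, to make it concrete (the dataset path and hyperparameters are placeholders):

    import torch
    import torch.nn as nn
    from torchvision import datasets, models, transforms

    # Start from an ImageNet-pretrained backbone and retrain only the final
    # layer on the ~50 labeled dog/cat images.
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    for p in model.parameters():
        p.requires_grad = False
    model.fc = nn.Linear(model.fc.in_features, 2)   # 2 classes: dog, cat

    tfm = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])
    train = datasets.ImageFolder("dogs_vs_cats/train", transform=tfm)  # placeholder path
    loader = torch.utils.data.DataLoader(train, batch_size=8, shuffle=True)

    opt = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(5):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()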
Sure, that's correct, but that's absolutely unrelated to what we were talking about; your example is about the general concept of transfer learning to task-specific annotated data, not about domain-specific pretrained models.
For example, if you want a domain-specific model for the legal domain, then you can pre-train a large self-supervised model on every single legal-related document in the world you can get your hands on, instead of a general mix of news and fiction and blogs and everything else - and that might be a more efficient starting point for however many (or few) annotated examples you have for your task-specific classifier than the general model.
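A minimal sketch of that from-scratch domain pretraining with Hugging Face transformers (the corpus file is a placeholder, and I'm reusing a generic vocab to keep it short; a real run would train a domain tokenizer too):

    from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)
    from datasets import load_dataset

    # Freshly (randomly) initialized encoder, pretrained only on domain text.
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))

    corpus = load_dataset("text", data_files={"train": "legal_corpus.txt"})["train"]  # placeholder file
    tokenized = corpus.map(
        lambda b: tokenizer(b["text"], truncation=True, max_length=256), batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="legal-bert-scratch", num_train_epochs=1),
        train_dataset=tokenized,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15),
    )
    trainer.train()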
Legal-related documents are a minuscule fraction of the corpus the large model is trained on. The resulting model won't have the conceptual fluency that the large model has. It's like training a human baby with legal briefs and expecting her to be a good lawyer.
These models are only capable of coherent conversation in the first place because they are so large. As soon as you shrink one down to be 'domain specific', its ability to form coherent sentences even within its own domain is greatly reduced.
The way it's typically done, AFAIK, is that you train these big models on a breadth of information, hoping that they pick up on the generalities of the information; in the case of LLMs, things like basic inference, for example. You then take these big, general models and "fine-tune" them for specific applications, with specific bits of data. This way, you get things like basic inference and logic, while still having something that can answer specific questions.
There are definitely very solid attempts at least to make LLMs that encode biomedical knowledge such as BioGPT which is trained on Pubmed and other domain specific areas. Source: https://arxiv.org/abs/2210.10341
I think you would still need a large model trained on general-purpose knowledge, and then train on domain-specific things, for the specialized knowledge to be truly useful. For example, if someone wanted domain-specific language "translated" for a neophyte, I doubt the model would be able to do so without having also been trained on a general-purpose dataset.
You can already do that. Take your task first to GPT-3 and collect a bunch of outputs. Then fine-tune a small model on them. Works well, but you need to extend the dataset to cover all edge cases because the small model can't draw on the vast knowledge GPT-3 has.
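A rough sketch of that loop, assuming the old openai Python client and a GPT-2-sized student (the prompts, model names, and hyperparameters are placeholders; OPENAI_API_KEY is assumed to be set):

    import openai
    from datasets import Dataset
    from transformers import (AutoTokenizer, AutoModelForCausalLM, Trainer,
                              TrainingArguments, DataCollatorForLanguageModeling)

    # 1) Collect teacher outputs from GPT-3 for your task prompts.
    prompts = ["Summarize: ...", "Rewrite politely: ..."]   # placeholder prompts
    pairs = []
    for p in prompts:
        out = openai.Completion.create(model="text-davinci-003", prompt=p, max_tokens=128)
        pairs.append(p + out["choices"][0]["text"])

    # 2) Fine-tune a small model on those prompt+completion pairs.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    ds = Dataset.from_dict({"text": pairs}).map(
        lambda b: tokenizer(b["text"], truncation=True, max_length=512), batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="distilled-student", num_train_epochs=3),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()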
In what way is this a ChatGPT implementation or equivalent? It seems like a chatbot based on a different backend, so it has absolutely zero link to ChatGPT.
It is a different backend but it supposedly should be roughly comparable to ChatGPT. Also, looks like it's both open source and requires a lot less hardware to run and train.
It's not open source until the weights are available. I have the hardware I need to run it, but the required files are not available unless you receive special access.
You can't use what has been released unless you want to spend $500,000 on training.
With only a modicum of trolling here, I wonder what percentage of that training expense was used to identify and avoid "true things that must be muted because they offend someone".
Ignoring the subtext of "true things that must be muted because they offend someone", there's a whole section in the paper on how they didn't filter and the problems that causes. TL;DR:
> We observe that toxicity increases with the size of the model, especially for Respectful prompts.
It does outperform GPT3 slightly in terms of observed bias against protected groups (as in it is slightly less biased) but not substantially so.
It uses a different engine, so this is as related to ChatGPT as a Toyota Corolla is related to a BMW. This is an efficient and open-source chatbot, which is very good news, but the authors just wrote a clickbait title and they know it.
In formal analogies, : is pronounced "is to" and :: is pronounced "as".
The purpose is to use a known relationship to describe a relationship that is only partially known.
ChatGPT is to GPT3 as ChatLLaMa is to LLaMa. It uses the relationship between ChatGPT and GPT3 to extrapolate a relationship between an unknown and LLaMa.
Corolla:Toyota::3-Series:BMW. If you had heard of a Corolla, Toyota, and BMW, but not a 3-Series, you now roughly know that a 3-Series is BMW's equivalent of a Corolla.
I think I prefer the other commenter's point, referring to ChatGPT as a known learning paradigm for chatbots. But thanks for the little crash course on analogies ;)
Yes and no. You wouldn't say GPT to mean large language models or autoregressive language models. I would've thought the same to be true for ChatGPT instead of chatbots with RL from human feedback (RLHF); perhaps the field is moving towards adopting ChatGPT as a paradigm name. Note that the title doesn't say a ChatGPT-like model based on LLaMA; it outright says an open-source implementation of ChatGPT.
> You wouldn't say GPT to mean large language models or autoregressive language models.
In the analogy, that’s exactly what you are saying. Identical to Toyota and BMW meaning “the make of the car.”
Maybe reimplementation is a more precise word: a black-box re-engineering/cloning. In this case I inferred it by knowing it was a different LLM underneath, and that this group didn't have access to the ChatGPT source code.
The analogy is somewhat accurate, but also moot, since within the ML community "ChatGPT" can be used either as the product or the method (more specifically called RLHF) somewhat interchangeably. It's more like Google/Googling, where the largest/most popular provider becomes the de facto way to refer to a method.
As someone who develops DL models, the title seems quite apt.
Have we got any details on the benchmarks that show LLaMA's 13B architecture outperforming GPT-3? Because that seems kind of fantastical. Is it just a product of a very specific benchmark, or does it reflect real-world performance?
The GPT-3 they are comparing to is the one that was released in 2020. Since then, OpenAI has made a lot of improvements, and nowadays I believe GPT-3.5 is competitive with PaLM (540B). Still, LLaMA is in the same tier, with far fewer parameters.
You're right. Under equivalent scenarios the gap is smaller, about a 10-point difference. You can check the end of the Flan paper for some equivalent comparisons.
Just a heads-up on my comparison: under equivalent scenarios the gap is smaller; davinci-003 gets about 10 more points using five-shot (which is what the PaLM comparison does).
They list 7B, 13B, 33B, and 65B architectures. Presumably, they compare the 65B one to GPT-3 175B. The Chinchilla model, which is about 70B, outperformed the much larger GPT-3, so it's not that fantastical.
EDIT: I stand corrected. They do compare the 13B model with the large GPT-3 model, which is hard to believe without a bit more concrete evidence.
This whole debate (whether a 13B model can really be as good as GPT-3) would have been settled if we had a live demo. I am not sure their licence allows running public demos, even if you get the weights.
Can't have a Stable Diffusion moment if you refuse to release the weights to the general public. Stable Diffusion only got to where it is because 10,000 people with otherwise zero reputation were able to play around with the code and models.
Seeing the performance of implementations like FlexGen [1], I don't think it would be entirely unreasonable to run a 13B model on a single GPU for personal usage. You are not going to run a public service off of it, but it probably would be good enough to run your own ChatGPT or Copilot locally.
This obsession with locking up model weights behind a gate-keeping application form and calling it open source is weird. I don't know who the high priests are trying to fool.
If your model is really that good, unleash it into the open so that others can truly evaluate it, warts and all, and help improve it by identifying the flaws.
> This obsession with locking up model weights behind a gate-keeping application form and calling it open source is weird. I don't know who the high priests are trying to fool.
When they don't do it, people scream at them (see Galactica)
"Journalists" react like this:
> On November 15 Meta unveiled a new large language model called Galactica, designed to assist scientists. But instead of landing with the big bang Meta hoped for, Galactica has died with a whimper after three days of intense criticism. Yesterday the company took down the public demo that it had encouraged everyone to try out.
> Meta’s misstep—and its hubris—show once again that Big Tech has a blind spot about the severe limitations of large language models. There is a large body of research that highlights the flaws of this technology, including its tendencies to reproduce prejudice and assert falsehoods as facts.
> However, Meta and other companies working on large language models, including Google, have failed to take it seriously.
Not really: the LLaMA model is only available on request, and access is granted on a "case by case basis" [1], which for most of us is more or less as available as GPT-3 is.
I was mostly talking about access to the trained model weights. The OpenAI API is certainly better than nothing, but it is very restrictive and cost-prohibitive for many purposes. For instance, you have to adhere to the OpenAI usage policies, and while they offer fine-tuning services, that is likely not enough to implement techniques like RLHF, which is the basis for ChatGPT.
That said, if LLaMa can achieve performance competitive with GPT-3 with just 13B parameters, I imagine that it is only a matter of time until open source pre-trained models based on this architecture become available, which would render GPT-3 obsolete.
> LLaMA is creating a lot of excitement because it is smaller than GPT-3 but has better performance. For example, LLaMA's 13B architecture outperforms GPT-3 despite being 10 times smaller.
Exactly. Best part is that it is open-source.
That is worth getting excited about. Not an AI SaaS API owned by a so-called pseudo-non-profit company which struggles with API uptime and availability, just like GitHub.
This is the 'revolution' you are looking for that changes everything. Not ChatGPT.
Indeed.
All the weights for all the models will be available one way or the other very soon.
The proprietary nature of the weights is not going to be a bottleneck for more than a month, if I had to guess.
The other bottleneck to personal use, the hardware required to run (not train from scratch) the thing, is going to be gone within the year, I bet. I would assume some clever bloke is going to be able to prune the model or decrease the precision of the weights and discover you can get good-enough results with 1/10th of the memory.
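That precision trick is already starting to show up; here's a hedged sketch of 8-bit weight loading via transformers + bitsandbytes (the model name is a stand-in, since the LLaMA weights aren't generally available):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # load_in_8bit quantizes the weights to int8 at load time (needs bitsandbytes
    # and a CUDA GPU), roughly halving memory versus fp16.
    name = "facebook/opt-13b"   # stand-in for a 13B-class model
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name, device_map="auto", load_in_8bit=True)

    inputs = tokenizer("The quick brown fox", return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=30)[0]))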