Likewise. My day job is machine learning and I still, or maybe consequently, do a double-take every time I see the acronym with minimal context (like on the HN front page, where either usage would be normal).
It's still strange to me to work in a field of computer science where we say things like "we're not exactly sure how these numbers (hyper parameters) affect the result, so just try a bunch of different values and see which one works best."
> "we're not exactly sure how these numbers (hyper parameters) affect the result, so just try a bunch of different values and see which one works best."
Isn't it the same for anything that uses a Monte Carlo simulation to find a value? At times you'll end up at a local maximum (instead of the best/correct answer), but it works.
We cannot solve something using a closed-form formula, so we just do a billion (or whatever) random samplings and find what we're after.
I'm not saying it's the same for LLMs but "trying a bunch of different values and see which one works best" is something we do a lot.
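In its crudest form that's just random search. A toy sketch (the train_and_evaluate function here is a hypothetical stand-in for whatever you're actually tuning):

    import random

    def train_and_evaluate(lr, dropout):
        # Hypothetical stand-in: train a model with these hyperparameters
        # and return a validation score. Replace with a real training loop.
        return -((lr - 3e-4) ** 2) - (dropout - 0.1) ** 2  # toy objective

    best_score, best_params = float("-inf"), None
    for _ in range(100):  # "a billion (or whatever)" samples, scaled down
        params = {
            "lr": 10 ** random.uniform(-5, -2),  # sample learning rate log-uniformly
            "dropout": random.uniform(0.0, 0.5),
        }
        score = train_and_evaluate(**params)
        if score > best_score:
            best_score, best_params = score, params

    print(best_params, best_score)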
LLMs were very much engineered... the exact results they yield are hard to determine since they're large statistical models, but I don't think that categorizes the LLMs themselves as a 'discovery' (like, say, penicillin).
There’s an argument that all maths are discovered instead of invented or engineered. LLM hardware certainly is hard engineering but the numbers you put in it aren’t, once you have them; if you stumbled upon them by chance or they were revealed to you in your sleep it’d work just as well. (‘ollama run mixtral’ is good enough for a dream to me!)
I understand your distinction, I think, but I would say it is more engineering than ever. It's like the early days of the steam engine or firearms development. It's not a hard science, not formal analysis, it's engineering: tinkering, testing, experimenting, iterating.
I believe, from what I saw in mathematics, this is a matter of taste. Discovered or invented are two perspectives. Some people prefer to think that light is reaching into previously dark corners of knowledge that were waiting to be discovered. Others prefer to think that by force of genius they brought the thing into the world.
To me, personally, these are two sides of the same coin, without one having more proof than the other.
This can be laid at the feet of Minsky and others who dismissed perceptrons because they couldn't model nonlinear functions. LLMs were never going to happen until modern CPUs and GPUs came along, but that doesn't mean we couldn't have a better theoretical foundation in place. We are years behind where we should be.
When I worked in the games industry in the 1990s, it was "common knowledge" that neural nets were a dead end at best and a con job at worst. Really a shame to lose so much time because a few senior authority figures warned everyone off. We need to make sure that doesn't happen this time.
Answering the GP's point regarding why deep learning textbooks, articles, and blog posts are full of sentences that begin with "We think...", "We're not sure, but...", and "It appears that...": we have no theories of intelligence. We're like people in the 1500s trying to figure out why and how people get sick, with no concept of bacteria, germs, transmission, etc.
I haven't seen this key/buzzword mentioned yet, so I think part of it is the fact that we're now working on complex systems. This was already true (a social network is a complex system), but now we have the impenetrability of a complex system within the scope of a single process. It's hard to figure out generalizable principles about this kind of thing!
I mean, it’s kind of in the name isn’t it? Computer science. Science is empirical, often poorly understood and even the best theories don’t fully explain all observations, especially when a field gets new tools to observe phenomena. It takes a while for a good theory to come along and make sense of everything in science and that seems like more or less exactly where we are today.
Welcome to engineering. We don't sketch our controlled systems and forget all about systems theory. Instead we just fiddle with our controllers until the result is acceptable.
It's still not too clear to me when we should fine-tune versus use RAG.
I used to believe that finetuning was mostly for changing model behavior, but recently it seems that certain companies are also using finetuning for knowledge addition.
I think the main use case remains behavior changes: instruction finetuning, finetuning for classification, etc. Knowledge addition to the weights is best done via pretraining. Or, if you have an external database or documentation that you want to query during the generation, RAG as you mention.
PS: All winners of the NeurIPS 2023 LLM Efficiency Challenge (finetuning the "best" LLM in 24h on 1 GPU) used LoRA or QLoRA (quantized LoRA).
Fine tuning is better than RAG when the additional data isn't concise, or requires context. This is because too much context (or "unfocused" context) can dilute prompt following behavior, and RAG doesn't help the model with higher order token associations so you have to get lucky and pull what you need from the augmentation material, at which point it's not much better than a fancy search engine. Of course this is mostly an issue when you're dealing with a specialized corpus with its own micro-dialect that isn't well represented in public data sets, such as with government/big corporation internal documents.
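For what it's worth, stripped down to its bones that "fancy search engine" is roughly the following. This is only a toy sketch: real setups use a proper embedding model and a vector store rather than word overlap, and the example documents are made up:

    def embed(text):
        # Toy "embedding": a bag of lowercased words. A real system would
        # use a sentence-embedding model instead.
        return set(text.lower().split())

    def similarity(a, b):
        # Jaccard overlap as a stand-in for cosine similarity.
        return len(a & b) / max(len(a | b), 1)

    docs = [
        "Expense reports must be filed within 30 days of purchase.",
        "The VPN requires two-factor authentication for all employees.",
    ]

    def build_prompt(question, k=1):
        q = embed(question)
        ranked = sorted(docs, key=lambda d: similarity(q, embed(d)), reverse=True)
        context = "\n".join(ranked[:k])
        # The retrieved context is prepended to the question and handed to
        # the unmodified base model -- no weight updates involved.
        return f"Context:\n{context}\n\nQuestion: {question}"

    print(build_prompt("How long do I have to file an expense report?"))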
From what I gather, fine-tuning is unreasonably effective [0] because in-context learning really depends on how powerful the underlying model is and on just how you do RAG (process queries, retrieve embeddings, rank outcomes, etc. [1]). Per this paper I read, fine-tuning may add new domain knowledge (though, as another commenter pointed out, knowledge is better added via pre-training data) or boost specific knowledge, while RAG is limited to boosting only; nevertheless, both techniques turn out to be similarly capable, with different trade-offs [2].
These are autoregressive models. When you have a new type of sequence where future elements can be predicted from previous parts of the sequence, but in a different way than the models have seen before, it would make sense to finetune.
Admittedly, that's a pretty vague descriptor for how to decide what to do for a given data scenario, but it might be good enough as a rough heuristic. Now, whether knowledge addition falls under that might be a question of taste (without experiments).
Exactly this. If you have a model that's never seen JSON and you want JSON to come out, fine-tuning is probably not a bad idea. If you have a model trained on English documents and you want it to produce English documents related to your company, you don't need to fine-tune.
Nice article. I'm not in this field; however, my understanding of the original paper was that LoRA was applied only to the last dense layer, and not to all of them independently (maybe I misread it originally).
Digging a bit into why the implementation in the link is like this, I found that QLoRA used this too, and it seems to have some interesting effects; maybe adding a note on the QLoRA decision would be nice :)
I'm not sure I understand why it works, though. My neophyte view was that applying LoRA to the last layer made sense, but I can't wrap my mind around the rationale for applying it repeatedly to each linear layer. Can someone explain their intuition?
Like most things in ML, the answer to which layers to use comes down to empirical evidence more than theory. In a typical Lora training pipeline, you freeze the contents of the base model and just adjust the Lora layers. The more layers you convert to Lora layers, the more degrees of freedom you have for the optimization.
Some finetuning regimens recommend finetuning only the last layer, since it is theorized to hold the "highest order" representation of the inputs. Other training regimens will finetune all layers. It's largely data- and problem-dependent. Lora just mirrors this convention.
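Concretely, "freeze the base model and adjust the Lora layers" boils down to something like this minimal sketch (not any particular library's implementation; the rank and alpha values are arbitrary):

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        # Wraps a frozen nn.Linear and adds a trainable low-rank update B @ A.
        def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 16):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False  # freeze the pretrained weights
            self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, rank))
            self.scale = alpha / rank

        def forward(self, x):
            # Base output plus the scaled low-rank correction;
            # only A and B receive gradients.
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    layer = LoRALinear(nn.Linear(512, 512), rank=8)
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    total = sum(p.numel() for p in layer.parameters())
    print(f"{trainable} trainable out of {total} total parameters")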
Yeah, but if I remember the paper correctly, LoRA followed the logic that only the last layers of an LLM changed drastically during finetuning, and the earlier layers remained almost unchanged, so it made sense to alter only the last ones. Breaking this by adding a LoRA at each linear layer doesn't seem to follow the logic of why LoRA was created and why it works.
Well, Lora works just because it's a low-rank approximation of the full updates - much in the same way that SVD works, and that regular gradient updating works. It delivers good results both by acting as a regularizer and by allowing larger models to be updated with smaller memory footprints.
My point is that the original Lora paper choosing the last layer is just one choice. And it is likely the most common one because that layer's higher-level, more symbolic representation is typically all that's needed for good performance on downstream tasks.
Depending on the size of your finetuning job, I've personally seen updating more layers (or updating some only on a certain learning rate schedule) be more effective. Lora is just the mathematical technique for the update; it doesn't really have a hypothesis on the ideal training regimen.
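For instance, the "last layer only" vs. "all layers, with per-layer learning rates" regimens look roughly like this in plain PyTorch (a toy sketch with made-up layer sizes, independent of Lora itself):

    import torch
    import torch.nn as nn

    # Toy stand-in model; in practice these would be transformer blocks.
    model = nn.Sequential(
        nn.Linear(512, 512), nn.ReLU(),
        nn.Linear(512, 128), nn.ReLU(),
        nn.Linear(128, 10),
    )

    # Regimen A: adapt only the last layer, freeze everything else.
    for p in model.parameters():
        p.requires_grad = False
    for p in model[-1].parameters():
        p.requires_grad = True

    # Regimen B: adapt every layer, but give earlier layers a smaller learning rate.
    for p in model.parameters():
        p.requires_grad = True
    optimizer = torch.optim.AdamW([
        {"params": model[0].parameters(), "lr": 1e-5},
        {"params": model[2].parameters(), "lr": 1e-5},
        {"params": model[4].parameters(), "lr": 1e-4},
    ])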
Thanks, I'll meditate on that and re-read the paper with this view in mind.
The last sentence makes sense to me: if the finetuning job changes the weights of other layers significantly more than just the last one, it is kinda normal to use Lora on them. I had the impression that this was rarely the case, but I must be mistaken. I'll think about applications where it is.
I prefer the not-from-scratch, but from-configuration approach of Axolotl. Axolotl supports fine-tuning Mistral and Llama 2, with lots of the latest techniques - sample packing, flash attention, xformers.
I concentrate on collecting and curating the fine-tuning data and doing "data-centric" fine-tuning - not on learning LoRA from scratch.
During training, it's more efficient than full finetuning because you only update a fraction of the parameters via backprop.
During inference, it can ...
1) ... be theoretically a tad slower if you add the LoRA values dynamically during the forward pass (however, this is also an advantage if you want to keep a separate small weight set per customer, for example; you run only one large base model and can apply the different LoRA weights per customer on the fly)
2) ... have the exact same performance as the base model if you merge the LoRA weights back with the base model.
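For anyone wondering why 2) holds: the LoRA update is itself just a weight matrix of the same shape as the base weight, so you can fold it in once. A quick numeric sanity check (toy shapes and made-up values, using the usual W + scale * B @ A formulation):

    import torch

    out_features, in_features, rank = 512, 512, 8
    W = torch.randn(out_features, in_features)  # frozen base weight
    A = torch.randn(rank, in_features) * 0.01   # trained LoRA factors (illustrative values)
    B = torch.randn(out_features, rank) * 0.01
    scale = 16 / rank                           # alpha / r

    x = torch.randn(4, in_features)

    # Option 1: keep W and (A, B) separate -> swap adapters per customer at runtime.
    y_dynamic = x @ W.T + scale * (x @ A.T @ B.T)

    # Option 2: merge once; inference cost is then identical to the base model.
    W_merged = W + scale * (B @ A)
    y_merged = x @ W_merged.T

    print(torch.allclose(y_dynamic, y_merged, atol=1e-4))  # True, up to float error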
Yeah, the LoRA part is from scratch. The LLM backbone in this example is not; that's just to provide a concrete example. But you could apply the exact same from-scratch LoRA code to a pure PyTorch model if you wanted to:
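Something along these lines, say (a hypothetical sketch, not the code from the article):

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        # Minimal from-scratch LoRA wrapper around a frozen nn.Linear.
        def __init__(self, base, rank=4, alpha=8):
            super().__init__()
            self.base, self.scale = base, alpha / rank
            for p in self.base.parameters():
                p.requires_grad = False
            self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, rank))

        def forward(self, x):
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    # A plain PyTorch model, no LLM backbone involved.
    model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

    # Swap every nn.Linear for its LoRA-wrapped version.
    for i in range(len(model)):
        if isinstance(model[i], nn.Linear):
            model[i] = LoRALinear(model[i])

    print(model)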
If anyone is interested in a more 'pure' or 'scratch' implementation, check out https://github.com/michaelnny/QLoRA-LLM. (author here) It also supports 4-bit quantized LoRA, using only PyTorch and bitsandbytes, without any other tools.
Not to be confused with LoRa ("long range"), a radio communication protocol. At first I thought this could be about using LLMs to find optimal protocol parameters, but alas.
It's the first thing that comes to my mind too, but this is mentioned in every thread (and there are far more of them for LoRA than LoRa atm), and in this case there's unlikely to be much confusion because it starts by spelling out the acronym: 'LoRA, which stands for Low Rank Adaptation, [...]'.
Concur; or at least don't use a mix of lower- and upper-case, like the radio protocol does. I think there would be fewer mistaken assumptions if they had called it "LORA", "Lora", "lora", etc. "LoRA" is asking for trouble.