It's still not too clear to me when we should fine-tune versus use RAG.
In the past, I used to believe that finetuning is mostly for model behavioral change, but recently it seems that certain companies are also using fine-tuning for knowledge addition.
I think the main use case remains behavior changes: instruction finetuning, finetuning for classification, etc. Knowledge addition to the weights is best done via pretraining. Or, if you have an external database or documentation that you want to query during generation, use RAG, as you mention.
PS: All winners of the NeurIPS 2023 LLM Efficiency Challenge (finetuning the "best" LLM in 24h on 1 GPU) used LoRA or QLoRA (quantized LoRA).
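For anyone unfamiliar with why LoRA is so cheap, here is a minimal sketch of the core idea: the pretrained weight W stays frozen and only a low-rank update B·A is trained, scaled by alpha/r. The dimensions, the init scale, and the helper names (`matmul`, `lora_effective_weight`, `trainable_params`) are made up for illustration; this is not how any particular library implements it.

```python
import random

def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_effective_weight(W, A, B, alpha):
    """Return W + (alpha / r) * (B @ A), where r is the rank (number of rows of A)."""
    r = len(A)
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

def trainable_params(d_out, d_in, r):
    """Parameters in the adapter: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

random.seed(0)
d_out, d_in, r = 6, 8, 2  # toy sizes, chosen only for the example
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, zero init

# With B initialized to zero, the adapted layer starts out identical to the base layer.
W_eff = lora_effective_weight(W, A, B, alpha=16)
```

For a hypothetical 4096 x 4096 layer with r = 8, `trainable_params(4096, 4096, 8)` comes to 65,536 adapter parameters versus 16,777,216 in the full matrix, which is roughly why it fits a 1-GPU, 24h budget.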
Fine-tuning is better than RAG when the additional data isn't concise or requires context. Too much context (or "unfocused" context) can dilute prompt-following behavior, and RAG doesn't help the model with higher-order token associations, so you have to get lucky and pull exactly what you need from the augmentation material. At that point it's not much better than a fancy search engine. Of course, this is mostly an issue when you're dealing with a specialized corpus with its own micro-dialect that isn't well represented in public datasets, such as government or big-corporation internal documents.
From what I gather, fine-tuning is unreasonably effective [0] because in-context learning really depends on how powerful the underlying model is and on just how you do RAG (process queries, retrieve embeddings, rank outcomes, etc. [1]). Per this paper I read, fine-tuning may add new domain knowledge (though, as another commenter pointed out, knowledge is better learned from data at the pre-training stage) or boost specific knowledge, while RAG is limited to boosting only. Nevertheless, both techniques turn out to be similarly capable, with different trade-offs [2].
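To make the "process queries, retrieve embeddings, rank outcomes" steps concrete, here is a toy sketch of the retrieval half of RAG. It assumes a bag-of-words count vector as a stand-in for a real embedding model and plain cosine similarity for ranking. The documents, the query, and the function names are all hypothetical, and a real pipeline would swap in a learned embedder and probably a reranker.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, docs, k=2):
    """Rank documents by similarity to the query and keep the top k."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, docs, k=2):
    """Stuff the retrieved passages into the prompt that would go to the model."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Made-up internal documents, purely for illustration.
DOCS = [
    "Expense reports must be filed within 30 days of travel.",
    "The VPN client is required for remote access to internal wikis.",
    "Travel bookings go through the internal portal.",
]

prompt = build_prompt("How do I file expense reports?", DOCS, k=1)
```

Note that the model's weights never change here: answer quality depends entirely on whether the ranking step surfaces the right passage, which is the "get lucky" problem raised above.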
These are autoregressive models. When you have a new type of sequence where future elements can be predicted from earlier parts of the sequence, but in a way the models haven't seen before, it would make sense to finetune.
Admittedly, that's a pretty vague way to decide what to do for a given data scenario, but it might be good enough as a rough heuristic. Whether knowledge addition falls under that might be a question of taste (without experiments).
Exactly this. If you have a model that's never seen JSON and you want JSON to come out, fine-tuning is probably not a bad idea. If you have a model trained on English documents and you want it to produce English documents related to your company, you don't need to fine-tune.
What are the main use cases for fine tuning?