LLMs Unleashed: The Power of Fine-Tuning (lucaspauker.com)
174 points by lucaspauker on July 27, 2023 | 90 comments



> "The idea of fine-tuning has a strong research pedigree. The approach can be traced back to 2018, when two influential papers were published."

The article refers to the BERT and GPT papers as the source of the fine-tuning idea. However, we actually first demonstrated it for universal models in 2017 and published the ULMFiT (Howard and Ruder) paper in early 2018. Prior to that, Dai and Le demonstrated the technique for in-corpus datasets. So it would be more accurate to say the approach can be traced back to those two papers, rather than to BERT and GPT.

BERT and GPT showed the effectiveness of scaling up the amount of data and compute, and switching the model architecture to Transformers (amongst other things).


Yeah, my understanding is the ULMFiT paper was the “genesis” of fine-tuning in the way we mean it now.


As a researcher in the field, I agree with this characterization.

I think it's more accurate to say that GPT and then BERT massively popularized and simplified the idea/approach. Prior to ULMFiT/GPT/BERT, fine-tuning usually meant freezing most of a model and tuning a small layer on top of it. (ELMo also fits somewhere in here, being a kind of halfway step). ULMFiT was a relatively lesser-known work but to my knowledge one of the first to do 1) LM pretraining and 2) fine-tuning all the layers, albeit with some complexity (gradual unfreezing/different learning rates for layers).

GPT simplified this massively by simply tuning all weights: no nuance about it. (Also added a classifier on top). BERT took this a step further, benefiting from the larger size of BERT-large and bidirectional attention, which works really well on the NLU datasets of the time along with a classifier head. (It's not until T5 that seq2seq for general tasks became prominent again.) The idea that you tune all the weights of what was then considered a massive model was considered somewhat excessive at the time, but BERT so massively dominated every other architecture and approach at the time that everyone switched over to it. (Adapters (alongside PAL), one of the first parameter-efficient tuning methods in the Transformer era of NLP, came out shortly after.)


Jeremy, if we want to roll up our sleeves on this, I'm pretty sure we can trace this even further.

I'm pretty sure the earliest deep learning work from Bengio and Hinton (maybe even the unpublished circulating TRs from 2007) were basically:

* Let's pretrain autoencoders, layer by layer, to a large depth.

* Let's learn task specific fine-tuning through full backprop.

* (Oh btw this is better than learning a task-specific head on the whole network (And this is briefly noted since it's obvious to us since we work with this stuff so much).)

It was just that the entire deep learning approach was so radical overall that the whole fine-tuning vs. frozen features thing was buried in a ton of other new insights. And I mean radical. NeurIPS (then NIPS) wouldn't give Hinton, Bengio, and LeCun a workshop, so they organized a fucking satellite conference co-located at the same hotel as NIPS during the same time, on the topic of deep learning. And managed to dazzle everyone and steal the whole show.

With that said, I agree with your assessment that these 2018 works very squarely and in a laser-focused way put the idea of fine-tuning front and center. There is totally huge value and impact in correctly packaging and rigorously assessing a specific approach that expert practitioners knew to be good but that was sort of an afterthought in published work.

With that said, I'll self-cite here and acknowledge (take the blame for ?) turning the NLP community on to using large-pretrained models and NOT fine-tuning them at all in 2010. Since the existing practice of designing features using expert knowledge hadn't gotten the community very far, I noticed Collobert + Weston doing large-scale pretraining of features in their amazing early NLP work that everyone slept on, and was like: "Hmmm, maybe NLP people would actually try deep nets if they could just use the features on their existing models and NOT worry about learning how to fine tune a net."

Anyway, greetings from Berlin.


Joseph, your work on word vectors was really the key foundation behind modern NLP. ULMFiT and everything else builds on top of that.


We’ve found that 1-shot or few-shot methods with 3.5 Turbo or 4 are vastly simpler and exceed the quality of fine-tuned models from the GPT-3 era.
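
For anyone who hasn’t tried it, “few-shot” here just means packing a handful of worked examples into the prompt itself rather than training anything. A minimal sketch, assuming the pre-1.0 openai Python client that was current at the time (the labels and example text are made up):

    # Few-shot classification via prompting instead of fine-tuning.
    import openai

    examples = [
        ("The battery died after two days.", "negative"),
        ("Setup took thirty seconds, love it.", "positive"),
    ]
    shots = "\n".join(f"Review: {t}\nSentiment: {l}" for t, l in examples)
    prompt = f"{shots}\nReview: The screen cracked on day one.\nSentiment:"

    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    print(resp.choices[0].message["content"])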

We have some 100k context models too that can ingest entire documents.

So right now, I would say fine-tuning is probably only useful for a very narrow set of use cases.


Even 100k still seems very limiting for many applications naturally suited to LLMs.

What if I want an AI assistant that is specifically trained on a large codebase? Or all my product’s docs (which might easily exceed 100k characters for a big project). Or one that knows the exact details of the entire tax code? Or one that knows every line of Dostoyevsky’s novels so I can have angsty existential conversations with it? Or that can fully remember conversations with me that stretch on for years?

It seems like you’d need fine tuning for these kinds of use cases? Or am I missing something?


You use a database (ideally a vector database) and a framework of behind-the-scenes prompting that uses it to do a “research loop” to gather info to answer questions. You might also fine tune it on the source you want it to incorporate, but training (including fine tuning) is most useful (as I understand) to provide “understanding”, not exact recall, whereas a searchable auxiliary memory and a framework which accesses it as part of the process of generating user responses is a way to provide exact recall over a larger dataset.


Ok, I see what you're saying, but at least for the big codebase example, and I'd imagine for many other applications as well, are you really going to get good output by just loading in bits and pieces, even if they are intelligently selected? It seems like holistic "understanding" is exactly what you want in that case, so that the model can take into account the full architecture and know how every component and sub-component fits together.

I wouldn't look forward to a PR for a significant feature from a human developer that just skimmed through all of a project's files and only read a handful of them in depth, so I guess I'm also skeptical this would lead to good results from an LLM.


I think the suggested process is to ingest all of the code base into a vector database and provide prompts and APIs for the model to search and focus on relevant areas while developing a feature, which is analogous (identical, even) to how a human would approach the process - with long term and short term memory and applying general and specific experience.


You can also create hierarchical summaries that help the model select the right starting place in the code base.

Additionally fine tuning isn't a great fit for mutable knowledge like a code base.


But finetuning wouldn't give the LLM a holistic understanding. LLMs are finetuned to predict the next token. At best you give them a full file at a time, but more likely, you create batches of similar length sequences. So while finetuning will lead to the LLM seeing all the codebase, it won't necessarily know how all the files fit together (for example I'm pretty sure it won't see folder structure, unless you somehow include it via preprocessing)


While I grant that it doesn't have the ability to go into single-file or line-level granularity, GPT-4 can have a highly informed discussion on the architectures of big projects in its training set like PostgreSQL or the Linux kernel. That's what I mean by "holistic understanding". If you're a contributor for one of those projects (or want to be), it could definitely help to evaluate the tradeoffs of various design approaches for new features or modifications.

Getting into the nitty gritty of files and functions would be a step further, but it seems to me that just having this general knowledge of how a big project is structured would vastly improve any output related to that project compared to just loading in comparatively tiny pieces of it. Even if those tiny pieces are the most relevant to the prompt, it will be missing all the background context that it has for projects in its training set.


I expect that for Linux or PostgreSQL it was trained not just on the codebase, but also on mailing lists and tons of other documents that discuss various things about it, and that's what gives it these reasoning powers.


You basically can't. We haven't developed that kind of AI.

Some of us hope that just-in-time retrieval of data from an external source (usually a vector db) is going to work. Like, say, "retrieve chapter 3 of Dostoyevsky's CP and make an interesting comment about it". For the record, you can have plenty of interesting angsty existential conversations with both GPTs about D right now if you so desire. Just preface your dialogue with something slightly more sophisticated than "let us have an angsty existential conversation about Dostoyevsky".

This JIT retrieval might work for some cases, but my guess is it won't for large holistic pieces of work where you have to integrate and "understand" the entire edifice at once. I'm not completely sure if codebases merit such a distinction though, but you can imagine a domain like "law" might. We can push the context limit and this might bring some temporary relief, but I'm not convinced. The recent pushes to 100K seem to rely on "dropping" attention in an intelligent way. It sounds like cheating and while it might work OK for some cases, I think it'll drop the ball eventually.

The integrated understanding that results from, somehow, internalizing the relation between all the hundreds of thousands of seemingly unrelated datapoints is what makes us interesting. That's also what makes an LLM interesting. It's just that an LLM only has access to that level of development when it's training.

If I were to guess I think we somehow need to enable "always-training" and I'm not sure anyone has the faintest idea how. You can only go so far if you step out of school and never learn anything ever again, no matter how brilliant you are. GPT4 is quite the scholar, but we're stretching it.


> I think we somehow need to enable "always-training" and I'm not sure anyone has the faintest idea how.

To ask the naive question, why can't we just keep continually training it with new stuff? If it takes, say, a year to ingest approximately the whole internet, tacking on a single large codebase should probably take something like an additional few seconds if we consider what fraction that codebase is of the full training set. I know it's not that simple for a lot of reasons, but it doesn't seem obvious that fundamentally new methods are needed to do this.


This new piece of information can impact all previous information in unknown ways.

To give a dramatic and superficial example, let's say you just found out you have been living inside the Truman Show. This impacts, if not everything, then a whole lot of what you know about your world. Instantly and completely. This cannot be "tacked on". This has to be integrated somehow.

Current AI doesn't work like that. Instead of building a tower of knowledge it sort of creates abstract cognitive landscapes filled with predetermined averaged out paths that, when followed, give good answers. Useful for a lot of tasks, but it doesn't strike me as the type of structure that can be manipulated with any kind of precision. Maybe that will change and/or I am wrong. Certainly the last one is very probable.


You use a vector database with a query similarity search to find portions of documents that might be relevant to your prompt, rank them, take the top n, put them into the context and run through your LLM to generate an answer to your question.
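
A minimal sketch of that flow, assuming sentence-transformers for the embeddings and an OpenAI-style chat call (the toy corpus and model names are placeholders; a real setup would keep the vectors in a proper database):

    # Retrieval-augmented generation: embed, search, stuff the top-n chunks into the prompt.
    from sentence_transformers import SentenceTransformer, util
    import openai

    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    chunks = [
        "Product X supports 8-bit and 4-bit quantization.",
        "Product X exposes a REST API on port 8080.",
    ]
    chunk_vecs = embedder.encode(chunks, convert_to_tensor=True)

    def answer(question, top_n=2):
        q_vec = embedder.encode(question, convert_to_tensor=True)
        hits = util.semantic_search(q_vec, chunk_vecs, top_k=top_n)[0]
        context = "\n\n".join(chunks[h["corpus_id"]] for h in hits)
        prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
        resp = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message["content"]

The model never has to memorize the documents; it only ever sees the few chunks the similarity search pulls into its context window.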


This paradigm applies for much more than just answering questions - I've found the in-context learning capabilities to be very relevant as well. For example, you can query a vector database and input that information into an LLM and ask it to make a prediction across the input data.

https://zilliz.com/blog/ChatGPT-VectorDB-Prompt-as-code


Can you elaborate? I don’t understand how the article you linked is different from normal question answering with context.


That's pretty cool, thanks for sharing!


Finetuning is not useful for teaching new facts, you need RAG for that: https://zzbbyy.substack.com/p/why-you-need-rag-not-finetunin...


There is some debate about whether in-context learning is real or not, but there are many data points (and articles) showing that it is an emergent phenomenon, and that it emerges in models of that order of magnitude (gpt-3.5-turbo, gpt-4, and beyond).

References found here: https://www.hopsworks.ai/dictionary/in-context-learning-icl


What I could see fine-tuning being very useful for is efficiency, either getting GPT-4 Level performance out of a smaller model or pruning GPT-4 for your specific needs.

After all, if I just want to detect from text what color and brightness the user wants to adjust their lights to, it seems inefficient to use a model that's been trained on all of human knowledge, even if I'm sure it'll work just fine.


That depends on the scale and balance of inferencing vs. training.


This is basically an ad for fine-tuning as a service.

Can anyone offer an example of a free public-facing LLM which has been fine-tuned by adding much specific info about some narrow area? Say, one that knows all the PR about some car brand or fandom? Somebody must have tried that by now.


The only one known to me is LLaMA 2 fine-tuned with LoRA to remove censorship.


There are tons of them for other models, often Alpaca or LLaMA.

This is a popular tool for finetuning Alpaca:

https://github.com/tloen/alpaca-lora
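
For a sense of what that repo does under the hood, here is a rough sketch with the peft library (the base model and hyperparameters are illustrative, and 8-bit loading needs bitsandbytes):

    # LoRA: freeze the base model, train small low-rank adapters on the attention projections.
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, TaskType, get_peft_model

    model = AutoModelForCausalLM.from_pretrained(
        "decapoda-research/llama-7b-hf",  # placeholder LLaMA checkpoint
        load_in_8bit=True,
        device_map="auto",
    )
    lora = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=8, lora_alpha=16, lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],
    )
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # typically well under 1% of the base weights
    # ...then train with the usual transformers Trainer on an instruction dataset.

The repo is roughly this plus the Alpaca instruction dataset and a training loop.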


LLaMA or Alpaca aren't free outside research though, so it's better to focus on LLaMA 2.


Hm. Why aren't those things all over the place?


The people that make them often post them or comment in the sub-Reddit below. I've linked to a search for uncensored models in there. Usernames of main people involved are faldore and The-Bloke. Most had Wizard Uncensored in their names.

https://www.reddit.com/r/LocalLLaMA/search/?q=uncensored&res...

People have posted comparisons showing it works pretty well. Some foundational models have enough censorship built in that they still try to resist it. You have to prompt those cleverly to do what you want them to do.


They kinda are, HuggingFace hub is already full of them. Most of them were released like 3-4 days ago so that might be a reason why you don't see them. I already put one of them into production.


There are lots of guides out there, but most these days tend to be selling something under the covers. Here’s one a quick Kagi search turned up on llama 2:

https://towardsdatascience.com/fine-tune-your-own-llama-2-mo...


Not exactly what you're asking, but I fine-tuned llama 65b on my texts and it knows a LOT about me. When I ask it about specific situations where I've talked a lot about the subject, it brings in other seemingly unrelated details I don't think would show up in a vector search: https://airic.serveo.net (I scrubbed PII from it and it's only public-facing for a few days)


Take a look at https://replicate.com/blog/fine-tune-llama-to-speak-like-hom... (and the related blog posts) for a good example of this.


Well, OK. They put in 61,000 lines, the entire first 12 seasons of The Simpsons. Lines such as

    {'previous': 'Marge Simpson: Ooo, careful, Homer.',
     'character': 'Homer Simpson',
     'line': "There's no time to be careful."}
and then used it to generate more Simpsons dialogue. With a data set so well matched to the goal, the algorithm barely matters. Remember that recent paper, "Copy is all you need"? [1] This is the ideal case for that approach.

I'm thinking more in terms of loading in the detailed product descriptions and maybe manuals from a catalog, then letting users ask questions about how to do things with the products.

[1] https://news.ycombinator.com/item?id=36758233


> Fine tuning is better for complex tasks where the model’s generated output must be accurate and trusted.

uhhh. I understand what was intended there, but while fine tuning may reduce the rate of hallucinations and make hallucinations more plausible, it's not magic accurate-and-trustworthy dust.

Unfortunately many people think this stuff is magic and care should be taken to not encourage people to confuse improvements with resolving the issue.

One way of characterizing the LLM accuracy problem is that it often looks very accurate and convincing even when it is emitting nonsense. If you cast the problem in those terms-- as a problem of looking more trustworthy than it actually is-- fine tuning actually exacerbates the problem.


This is 2020-level stuff. These days with emergent abilities in LLMs trained with over 1T tokens like GPT-4 a single-shot chain-of-thought beats most fine-tunings. I did research on transformer adapters i.e. parameter-efficient fine-tuning and that stuff is now completely obsolete outside of some restricted domains where small models can still perform well.


I haven't seen any recent papers that show that fine-tuning is obsolete - I've only seen papers showing the opposite. I'd be very interested to see any papers that have demonstrated applications where fine-tuning is not effective nowadays, if you have any links.

Here's an example of a paper that shows good results from fine-tuning for instance: https://arxiv.org/abs/2305.02301


This Stanford seminar video can provide some references:

https://www.youtube.com/watch?v=tVtOevLrt5U


That video shows examples of new models with one-shot performance better than old models with fine-tuning. It doesn't compare new models with fine-tuning to new models without fine-tuning.


Right, page 28 of this article mentions "On some benchmarks, task-general models (not explicitly trained to perform a task) surpass prior state-of-the-art performance held by a task-specific model" but they don't provide data for fine-tuning of their very large models:

https://arxiv.org/pdf/2206.07682.pdf


What’s your take on the argument that fine tuning might be destroying model calibration and causing it to overfit to aspects of the finetuning dataset eg: causing it to appear more confident, rather than becoming more correct?

See https://twitter.com/animaanandkumar/status/16816906540601876...


That tweet is about a specific type of fine tuning: instruction tuning and RLHF. It causes the model to get better at the thing it's being fine tuned for, which is making users feel better about the answer. Whether that's actually desirable in all cases is the question -- but that's not really related to whether fine tuning is a useful tool in general.


I would agree with other commenters that fine tuning is very much not obsolete, and for another important reason: many people and domains do not have the resources or even desire to work with extremely large models like GPT-4. The world outside of OpenAI's monoliths is still very much important.


Yep, as soon as queries per second matter, and cost per query... Definitely going through that on some work now, where we use GPT-4 on a subset and GPT-3 or a tuned model on the rest. Only slow user-facing and other low-volume / high-latency work goes to GPT-4.


Gosh you are so wrong. Literally every bit of fine tuning and fine tuning related work is more important than ever. Being able to fine tune a giant model like GPT-4 would be a game changer. I don’t get why people like to come on here and tell blatant lies like this.


What would be my incentive to lie? I am getting absolutely breathtaking results from multi-task single shot prompts with one >1T model where I would need to fine-tune dozens of models to get anywhere close, and likely with less reliability. While fine-tuning might be important to bring some benefits to smaller models, in the really big models CoT is often much better.


General models are useful to the general public, and they're useful as a starting point to fine-tune to a smaller model. The purpose matters. Research and most products benefit heavily from having a much smaller model (speed) to do one thing very well on trillions of documents. General chatting from big models is mostly a gimmick, but they're still very useful to fine-tune.


The thing is that those super large models with CoT are often much more reliable than specialized fine-tuned smaller models you'd like to use on trillions of documents, as I am observing right now. You are not trading just speed but also quality with smaller models. The only relevant use case for fine-tuning for me recently was to apply wizard dataset to LLaMA 2 for a less censored model and that was done using LoRA.


What tasks?


For processing trillions of documents, NER for example can be done much better.


This tradeoff is ridiculous, even if it is "better" by .01% F-score. I would much rather have a dataset created in 1 day from BERT at 98% F-score than in 1000 years at 98.01% F-score from a 540B parameter model, or even a 33B parameter model. The performance of million-parameter models for NER is still excellent, and they work at speeds that are usable. Running things through OpenAI is also useless, as it would cost a few million $.


It's more like 100% accuracy vs 95% accuracy, and the super large models are now able to extract non-trivial derived info from regular human speech as well. While cost-wise it's not efficient right now, this will change over time (you skate to where the puck will be, not where it is now), making the current fine-tuning approach obsolete. Academically I am not thrilled as I built my research on fine-tuning, but as a producer of a product this solves so many issues at the same time, making me pretty happy.


It's really depressing that a handful of big corporations will be able to exert such control over labor and productivity


That's why the LLaMA 2 release is so significant. You can run the full 70B in 8-bit on two prosumer A6000 Ampere (cost around $10k together) which is within the reach of most companies and some devs. This could further accelerate all research to make it both efficient and available even to regular folks.
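
For reference, this is roughly what loading it looks like with transformers and bitsandbytes; device_map="auto" shards the layers across whatever GPUs are visible (assumes you have been granted access to the weights):

    # Llama 2 70B in 8-bit, automatically sharded across available GPUs.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-2-70b-hf"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        load_in_8bit=True,   # int8 weights: roughly 70 GB, so two 48 GB cards suffice
        device_map="auto",
    )
    inputs = tokenizer("Explain LoRA in one sentence.", return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))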


But it's still not comparable to GPT-4, nor will it likely be for some time at least. And by the time we have GPT-4 class open source models, I'd imagine there'd be significant advancements in closed source models, such as inference-time symbolic reasoning using MCTS, something Google is working on for Gemini... or just bigger/better architectures, data, etc.


Yeah, I don't really see a solution. Even Stanford can't keep up with the latest AI research. Still, having LLaMA 2 is better than not having it.


You are literally using trillions of documents? Or are you exaggerating?


chaxor above mentioned it so I quickly recalled a task I saw a super large LLM demolishing fine-tuned models on documents.


I’d like to generate Minecraft fan fic. GPT4 does a terrible job because it’s been lobotomized. Fine tuning a model with a corpus of fan fic generates better fan fic. If I could fine tune GPT4 then ok, I agree. But “general purpose lobotomized oracle” is only one use case for a language model, albeit a useful one, and the least creative one.


OK, understood. You might do that with uncensored versions of LLaMA 2 now. Role playing/fan fic is much better with those.


They tend to not have the right base of training materials - the models are constrained not just by the lobotomization but by some specialized writing not being part of the corpus. Training a lora on a specific corpus - fan fic, alt.sex.stories, or a person's specific writing - can help induce behavior or language choices specific to that corpus. You can also use context injection but I’ve found combining a model weight adjustment with priming the context to be the most effective.

I think over time it’ll become standard practice to train special purpose models on top of general purpose models. If I have a very specific task domain a model tuned to that domain will less often wander off and will be more likely to respond in the desired way.


Such a strong dismissal of another comment should come with some kind of objective evidence. Is there any way to prove that fine tuning is as valuable as you say it is?


It’s self-evident to anyone who's seriously worked in the field.


Finetuning is more relevant than ever now. People are fine tuning LLaMA every single day.


GPT-4 is a fine tuned model.


This claim is meaningless without being specific about the tasks.


The link to your terra cotta product, which I assume is the point of the article, is broken.


Hi, we checked and it works on our end. Could you try again here please: https://terra-cotta.ai/? If not, could you share the error?


Step one - "upload your data" and we'll do cool stuff

Ok. How do I upload my data? Can't find any link.

Sign up for our newsletter. Nope.

And we went to Stanford? I don't care.

I'm not trying to be nasty, though it probably comes off like that. Assume I want to get started with your product and invite me in...


I like that your article was well cited. Fun read. Nothing stands out as too inaccurate.

You should try a post on parameter efficient tuning next!


Are there any good tutorials on fine-tuning the quantized versions of the LLaMA models anywhere? I have a few NLP tasks I’d like to test out, with plenty of training data, but everything I’ve seen doesn’t seem generalizable enough or lacks necessary details.


Noob question: when folks talk about fine-tuning LLM, do they usually fine-tune the encoder (of the prompt), the decoder (that generates the text) or both ?


Both can be done. Fine-tuning the prompt is cheaper and can be done on consumer hardware. Fine-tuning the LLM model weights is expensive and needs cloud support.


Thanks for the reply, but when you mean "fine-tuning" for the prompt, do you mean fine-tuning of the LLM encoder of the prompt right ? (The thing that transforms the prompt into a sequence of embeddings?) But that is not cheap/easy to train ...

I know some systems also allow an extra fixed embedding parameter to the prompt, that can also be fine-tuned. But that is yet another thing that can be fine-tuned very cheaply.
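
That last thing is usually called prompt tuning (or soft prompts): a small matrix of "virtual token" embeddings is prepended to every input and trained while all of the model's own weights stay frozen, which is why it is so cheap. A minimal sketch with the peft library (the base model here is just a stand-in):

    # Prompt tuning: only the virtual-token embeddings are trainable.
    from transformers import AutoModelForCausalLM
    from peft import PromptTuningConfig, TaskType, get_peft_model

    model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in base model
    cfg = PromptTuningConfig(
        task_type=TaskType.CAUSAL_LM,
        num_virtual_tokens=20,  # 20 learned embeddings prepended to each prompt
    )
    model = get_peft_model(model, cfg)
    model.print_trainable_parameters()  # a tiny fraction of the full model

LoRA and adapters (mentioned elsewhere in the thread) are the other common parameter-efficient options; full fine-tuning of all the weights is what needs the big hardware.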


Fine tuning seems to me to be dangerously close to a new snake oil of AI these days.

The narrative goes, "look how awesome ChatGPT is, imagine how good it would be trained on just your company's documents".

Which 1000% misses the point. ChatGPT is what it is because (a) it is trained on almost nothing short of the entire corpus of human language ever created. At > 1 trillion parameters, it can have more than 100 parameters for every human on the planet. Let that sink in. And then (b) because it has been subjected to an unknown but likely massive amount of human reinforcement feedback.

The idea that you can meaningfully impact the output of the model towards factual accuracy or logical correctness just by doing a small amount of fully automated training using a tiny corpus of company documents is seductive, but super far from robustly demonstrated as far as I'm aware. Yet this is the pitch being sold very often.


What? Fine tuning has been a common technique for years now. Fine tuned BERT models were behind a lot of retrieval-based systems and they work well.

A more recent example is stable diffusion fine tuned on specific subjects.

Whether fine tuning can reduce hallucination is first of all a question which only pertains to decoders and which is highly dependent on how the model has been fine tuned.


Helpful. I was thinking today about when it makes sense to fine tune vs use embeddings to feed into the LLM prompt and this helped solidify my understanding.


Except that the article didn't cover that distinction at all. It looked at (manual) prompt engineering vs fine tuning. What you are describing is Retrieval Augmented Generation (RAG) which is creating embeddings from a knowledgebase, doing a similarity search using an embedding of the search query, and then programmatically generating a prompt from the search query and the returned content. IMO, this design pattern should be preferred to fine tuning in the vast majority of use cases. Fine tuning should be used to get the model to perform new tasks; RAG should be used instead to add knowledge.


Realistically this seems like a question that would be difficult to generalize an answer to without measuring it. Intuition is unlikely to yield a better result than actually trying it.


I've tried text generation on GPT-3 and it was very, very bad. Has anyone done so and gotten good results? Care to share the code?


What if we fine tune a model like LLaMA on all published research papers? Would it be able to produce new knowledge?


Depends on what you mean with "new knowledge". A lot of inventions are "just" novel combinations of things we already knew.


Finetuning is not useful for teaching new facts, the current solution for that is using RAG: https://zzbbyy.substack.com/p/why-you-need-rag-not-finetunin...


The value of research isn’t just in the ideas. It’s in the grounding of ideas in fact. It might be able to suggest interesting experiments, but those experiments still need to be carried out.


Can anyone provide a step-by-step ELI5 guide to fine tuning Llama? I still don't quite understand.


https://huggingface.co/docs/trl/sft_trainer and https://huggingface.co/docs/trl/using_llama_models, supervised fine-tuning and another llama example with rlhf
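
The first of those links boils down to something like this (the model and dataset are the docs' toy example, not a recommendation):

    # Supervised fine-tuning with TRL's SFTTrainer on a plain-text dataset.
    from datasets import load_dataset
    from trl import SFTTrainer

    dataset = load_dataset("imdb", split="train")

    trainer = SFTTrainer(
        "facebook/opt-350m",        # can also be an already-loaded model object
        train_dataset=dataset,
        dataset_text_field="text",  # column holding the raw training text
        max_seq_length=512,
    )
    trainer.train()

Swap in a LLaMA checkpoint (and usually a LoRA config via the peft_config argument) to get to the setup in the second link.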



Look up LoRA.



