Likewise. My day job is machine learning and I still, or maybe consequently, do a double-take every time I see the acronym with minimal context (like on the HN front page, where either usage would be normal).
It's still strange to me to work in a field of computer science where we say things like "we're not exactly sure how these numbers (hyper parameters) affect the result, so just try a bunch of different values and see which one works best."
> "we're not exactly sure how these numbers (hyper parameters) affect the result, so just try a bunch of different values and see which one works best."
Isn't it the same for anything that uses a Monte Carlo simulation to find a value? At times you'll end up at a local maximum (instead of the best/correct answer), but it works.
We cannot solve something using a closed-form formula, so we just do a billion (or whatever) random samplings and find what we're after.
I'm not saying it's the same for LLMs but "trying a bunch of different values and see which one works best" is something we do a lot.
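In its crudest form that's just random search. A toy sketch (the train_and_evaluate function here is a hypothetical stand-in for whatever you're actually tuning):

    import random

    def train_and_evaluate(lr, dropout):
        # Hypothetical stand-in: train a model with these hyperparameters
        # and return a validation score. Replace with a real training loop.
        return -((lr - 3e-4) ** 2) - (dropout - 0.1) ** 2  # toy objective

    best_score, best_params = float("-inf"), None
    for _ in range(100):  # "a billion (or whatever)" samples, scaled down
        params = {
            "lr": 10 ** random.uniform(-5, -2),  # sample learning rate log-uniformly
            "dropout": random.uniform(0.0, 0.5),
        }
        score = train_and_evaluate(**params)
        if score > best_score:
            best_score, best_params = score, params

    print(best_params, best_score)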
LLMs were very much engineered... the exact results they yield are hard to determine since they're large statistical models, but I don't think that categorizes the LLMs themselves as a 'discovery' (like, say, penicillin).
There’s an argument that all maths are discovered instead of invented or engineered. LLM hardware certainly is hard engineering but the numbers you put in it aren’t, once you have them; if you stumbled upon them by chance or they were revealed to you in your sleep it’d work just as well. (‘ollama run mixtral’ is good enough for a dream to me!)
I understand your distinction, I think, but I would say it is more engineering than ever. It's like the early days of the steam engine or firearms development. It's not a hard science, not formal analysis, it's engineering: tinkering, testing, experimenting, iterating.
I believe, from what I saw in mathematics, this is a matter of taste. Discovered or invented are two perspectives. Some people prefer to think that light is reaching into previously dark corners of knowledge that were waiting to be discovered. Others prefer to think that by force of genius they brought the thing into the world.
To me, personally, these are two sides of the same coin, without one having more proof than the other.
This can be laid at the feet of Minsky and others who dismissed perceptrons because they couldn't model nonlinear functions. LLMs were never going to happen until modern CPUs and GPUs came along, but that doesn't mean we couldn't have a better theoretical foundation in place. We are years behind where we should be.
When I worked in the games industry in the 1990s, it was "common knowledge" that neural nets were a dead end at best and a con job at worst. Really a shame to lose so much time because a few senior authority figures warned everyone off. We need to make sure that doesn't happen this time.
Answering the GP's point regarding why deep learning textbooks, articles, and blog posts are full of sentences that begin with "We think...", "We're not sure, but...", and "It appears that...": we have no theories of intelligence. We're like people in the 1500s trying to figure out why and how people get sick, with no concept of bacteria, germs, transmission, etc.
I haven't seen this key/buzzword mentioned yet, so I think part of it is the fact that we're now working on complex systems. This was already true (a social network is a complex system), but now we have the impenetrability of a complex system within the scope of a single process. It's hard to figure out generalizable principles about this kind of thing!
I mean, it’s kind of in the name isn’t it? Computer science. Science is empirical, often poorly understood and even the best theories don’t fully explain all observations, especially when a field gets new tools to observe phenomena. It takes a while for a good theory to come along and make sense of everything in science and that seems like more or less exactly where we are today.
Welcome to engineering. We don't sketch our controlled systems and forget all about systems theory. Instead we just fiddle with our controllers until the result is acceptable.
It's still not too clear to me when we should fine-tune versus use RAG.
I used to believe that finetuning was mostly for changing model behavior, but recently it seems that certain companies are also using finetuning for knowledge addition.
I think the main use case remains behavior changes: instruction finetuning, finetuning for classification, etc. Knowledge addition to the weights is best done via pretraining. Or, if you have an external database or documentation that you want to query during the generation, RAG as you mention.
PS: All winners of the NeurIPS 2023 LLM Efficiency Challenge (finetuning the "best" LLM in 24h on 1 GPU) used LoRA or QLoRA (quantized LoRA).
Fine tuning is better than RAG when the additional data isn't concise, or requires context. This is because too much context (or "unfocused" context) can dilute prompt following behavior, and RAG doesn't help the model with higher order token associations so you have to get lucky and pull what you need from the augmentation material, at which point it's not much better than a fancy search engine. Of course this is mostly an issue when you're dealing with a specialized corpus with its own micro-dialect that isn't well represented in public data sets, such as with government/big corporation internal documents.
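For what it's worth, stripped down to its bones that "fancy search engine" is roughly the following. This is only a toy sketch: real setups use a proper embedding model and a vector store rather than word overlap, and the example documents are made up:

    def embed(text):
        # Toy "embedding": a bag of lowercased words. A real system would
        # use a sentence-embedding model instead.
        return set(text.lower().split())

    def similarity(a, b):
        # Jaccard overlap as a stand-in for cosine similarity.
        return len(a & b) / max(len(a | b), 1)

    docs = [
        "Expense reports must be filed within 30 days of purchase.",
        "The VPN requires two-factor authentication for all employees.",
    ]

    def build_prompt(question, k=1):
        q = embed(question)
        ranked = sorted(docs, key=lambda d: similarity(q, embed(d)), reverse=True)
        context = "\n".join(ranked[:k])
        # The retrieved context is prepended to the question and handed to
        # the unmodified base model -- no weight updates involved.
        return f"Context:\n{context}\n\nQuestion: {question}"

    print(build_prompt("How long do I have to file an expense report?"))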
From what I gather, fine-tuning is unreasonably effective [0] because in-context learning really depends on how powerful the underlying model is and on just how you do RAG (process queries, retrieve embeddings, rank outcomes, etc. [1]). Per this paper I read, fine-tuning may add new domain knowledge (though, as another commenter pointed out, knowledge is better added via pre-training data) or boost specific knowledge, while RAG is limited to boosting only; nevertheless, both techniques turn out to be similarly capable, with different trade-offs [2].
These are autoregressive models. When you have a new type of sequence where future elements can be predicted from previous parts of the sequence, but in a different way than the models have seen before, it would make sense to finetune.
Admittedly, that's a pretty vague descriptor for how to decide what to do for a given data scenario, but it might be good enough as a rough heuristic. Now, whether knowledge addition falls under that might be a question of taste (without experiments).
Exactly this. If you have a model that's never seen JSON and you want JSON to come out, fine-tuning is probably not a bad idea. If you have a model trained on English documents and you want it to produce English documents related to your company, you don't need to fine-tune.
Nice article. I'm not in this field; however, my understanding of the original paper was that LoRA was applied only to the last dense layer, and not to all of them independently (maybe I misread it originally).
Digging a bit into why the implementation in the link is like this, I found that QLoRA used this too, and it seems to have some interesting effects; maybe adding a note on the QLoRA decision would be nice :)
I'm not sure I understand why it works, though. My neophyte view was that applying LoRA to the last layer made sense, but I can't wrap my mind around the rationale for applying it repeatedly to each linear layer. Can someone explain their intuition?
Like most things in ML, the answer to which layers to use comes down to empirical evidence more than theory. In a typical Lora training pipeline, you freeze the contents of the base model and just adjust the Lora layers. The more layers you convert to Lora layers, the more degrees of freedom you have for the optimization.
Some finetuning regimens recommend finetuning only the last layer, since it is theorized to hold the "highest order" representation of the inputs. Other training regimens will finetune all layers. It's largely data- and problem-dependent. Lora just mirrors this convention.
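Concretely, "freeze the base model and adjust the Lora layers" boils down to something like this minimal sketch (not any particular library's implementation; the rank and alpha values are arbitrary):

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        # Wraps a frozen nn.Linear and adds a trainable low-rank update B @ A.
        def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 16):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False  # freeze the pretrained weights
            self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, rank))
            self.scale = alpha / rank

        def forward(self, x):
            # Base output plus the scaled low-rank correction;
            # only A and B receive gradients.
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    layer = LoRALinear(nn.Linear(512, 512), rank=8)
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    total = sum(p.numel() for p in layer.parameters())
    print(f"{trainable} trainable out of {total} total parameters")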
Yeah, but if I remember the paper correctly, LoRA followed the logic that only the last layers of an LLM changed drastically during finetuning, and the earlier layers remained almost unchanged, so it made sense to alter only the last ones. Breaking this by adding a LoRA at each linear layer doesn't seem to follow the logic of why LoRA was created and why it works.
Well, Lora works just because it's a low-rank approximation of the full updates - much in the same way that SVD works, and that regular gradient updating works. It delivers good results both by acting as a regularizer and by allowing larger models to be updated with smaller memory footprints.
My point is that the original Lora paper choosing the last layer is just one choice. And it is likely the most common one because that layer's higher-level, more symbolic representation is typically all that's needed for good performance on downstream tasks.
Depending on the size of your finetuning job, I've personally seen updating more layers (or updating some only on a certain learning rate schedule) be more effective. Lora is just the mathematical technique for the update; it doesn't really have a hypothesis on the ideal training regimen.
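For instance, the "last layer only" vs. "all layers, with per-layer learning rates" regimens look roughly like this in plain PyTorch (a toy sketch with made-up layer sizes, independent of Lora itself):

    import torch
    import torch.nn as nn

    # Toy stand-in model; in practice these would be transformer blocks.
    model = nn.Sequential(
        nn.Linear(512, 512), nn.ReLU(),
        nn.Linear(512, 128), nn.ReLU(),
        nn.Linear(128, 10),
    )

    # Regimen A: adapt only the last layer, freeze everything else.
    for p in model.parameters():
        p.requires_grad = False
    for p in model[-1].parameters():
        p.requires_grad = True

    # Regimen B: adapt every layer, but give earlier layers a smaller learning rate.
    for p in model.parameters():
        p.requires_grad = True
    optimizer = torch.optim.AdamW([
        {"params": model[0].parameters(), "lr": 1e-5},
        {"params": model[2].parameters(), "lr": 1e-5},
        {"params": model[4].parameters(), "lr": 1e-4},
    ])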
Thanks, I'll meditate on that and re-read the paper with this view in mind.
The last sentence makes sense to me: if the finetuning job changes the weights of other layers significantly more than just the last one, it is kinda normal to use Lora on them. I had the impression that this was rarely the case, but I must be mistaken. I'll think about applications where it is.
I prefer the not-from-scratch, but from-configuration approach of Axolotl. Axolotl supports fine-tuning Mistral and Llama 2, with lots of the latest techniques - sample packing, flash attention, xformers.
I concentrate on collecting and curating the fine-tuning data and doing "data-centric" fine-tuning - not on learning LoRA from scratch.
During training, it's more efficient than full finetuning because you only update a fraction of the parameters via backprop.
During inference, it can ...
1) ... be theoretically a tad slower if you add the LoRA values dynamically during the forward pass (however, this is also an advantage if you want to keep a separate small weight set per customer, for example; you run only one large base model and can apply the different LoRA weights per customer on the fly)
2) ... have the exact same performance as the base model if you merge the LoRA weights back with the base model.
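For anyone wondering why 2) holds: the LoRA update is itself just a weight matrix of the same shape as the base weight, so you can fold it in once. A quick numeric sanity check (toy shapes and made-up values, using the usual W + scale * B @ A formulation):

    import torch

    out_features, in_features, rank = 512, 512, 8
    W = torch.randn(out_features, in_features)  # frozen base weight
    A = torch.randn(rank, in_features) * 0.01   # trained LoRA factors (illustrative values)
    B = torch.randn(out_features, rank) * 0.01
    scale = 16 / rank                           # alpha / r

    x = torch.randn(4, in_features)

    # Option 1: keep W and (A, B) separate -> swap adapters per customer at runtime.
    y_dynamic = x @ W.T + scale * (x @ A.T @ B.T)

    # Option 2: merge once; inference cost is then identical to the base model.
    W_merged = W + scale * (B @ A)
    y_merged = x @ W_merged.T

    print(torch.allclose(y_dynamic, y_merged, atol=1e-4))  # True, up to float error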
Yeah, the LoRA part is from scratch. The LLM backbone in this example is not; that's just to provide a concrete example. But you could apply the exact same from-scratch LoRA code to a pure PyTorch model if you wanted to:
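Something along these lines, say (a hypothetical sketch, not the code from the article):

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        # Minimal from-scratch LoRA wrapper around a frozen nn.Linear.
        def __init__(self, base, rank=4, alpha=8):
            super().__init__()
            self.base, self.scale = base, alpha / rank
            for p in self.base.parameters():
                p.requires_grad = False
            self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, rank))

        def forward(self, x):
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    # A plain PyTorch model, no LLM backbone involved.
    model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

    # Swap every nn.Linear for its LoRA-wrapped version.
    for i in range(len(model)):
        if isinstance(model[i], nn.Linear):
            model[i] = LoRALinear(model[i])

    print(model)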
If anyone is interested in a more 'pure' or 'scratch' implementation, check out https://github.com/michaelnny/QLoRA-LLM. (author here) It also supports 4-bit quantized LoRA, using only PyTorch and bitsandbytes, without any other tools.
Not to be confused with LoRa ("long range"), a radio communication protocol. At first I thought this could be about using LLMs to find optimal protocol parameters, but alas.
It's the first thing that comes to my mind too, but this is mentioned in every thread (and there are far more of them for LoRA than LoRa atm), and in this case there's unlikely to be much confusion because it starts by spelling out the acronym: 'LoRA, which stands for Low Rank Adaptation, [...]'.
Concur; or at least don't use a mix of lower- and upper-case, like the radio protocol does. I think there would be fewer mistaken assumptions if they had called it "LORA", "Lora", "lora", etc. "LoRA" is asking for trouble.