CURLoRA: Stable LLM Fine-Tuning and Catastrophic Forgetting Mitigation (zenodo.org)
62 points by mnoorfawi 67 days ago | 9 comments



I think there's one significant flaw in the analysis. Section 4.3.1 states:

>> In CURLoRA, we decompose the original weight matrix W as: W ≈ CUR

… but C and R are chosen with respect to the inverse of their contribution towards W, with lower-norm rows and columns more likely to be chosen. In some sense, the CUR of this fine-tuning is as far as possible from the full weight matrix W while still being drawn from it.

To the extent the method works, it might work by defining a very low-dimensional space for the fine-tuning that is vaguely compatible with the original weight matrix.

I also think that there's something a bit off in the results (section 7). Neither the test nor control (LoRA) results show improving accuracy on the first (MRPC) fine-tuning task with increasing rank, suggesting that this task was never constrained by the dimensionality available to the fine-tuning. The orders-of-magnitude difference in the number of trainable parameters also suggests that LoRA-1/2/4 could have been fairly considered; it's not obvious that CURLoRA-n is even approximately equivalent to LoRA-n.
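For a rough sense of that parameter gap, here is my own back-of-the-envelope arithmetic (not figures from the paper), assuming a square d x d weight matrix, that LoRA-n trains two rank-n adapter matrices, and that CURLoRA-n trains only an n x n U matrix while C and R stay frozen:

```python
# Back-of-the-envelope comparison of trainable parameters (my own arithmetic,
# not numbers from the paper). Assumes a square d x d weight matrix.
d = 4096  # hypothetical hidden size
for n in (1, 2, 4, 8, 16):
    lora_params = 2 * d * n   # A: n x d plus B: d x n, both trainable
    curlora_params = n * n    # only U (n x n) is trainable; C and R are frozen slices of W
    print(f"rank {n:2d}: LoRA ~{lora_params:6d} params vs CURLoRA {curlora_params:3d}")
```

Under those assumptions the counts differ by roughly three orders of magnitude at equal rank, which is why comparing CURLoRA-n against LoRA-n at the same n is not obviously apples-to-apples.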


Thanks a lot for your comment, I appreciate it. I believe it raises two points:

1. Choosing C and R:

Normally, C and R are chosen from the higher-norm rows and columns so that W, or an approximation of it, can be reconstructed from C@U@R. However, you stated it correctly: "C and R are chosen with respect to the inverse of their contribution towards W, with lower-norm rows and columns more likely to be chosen. In some sense, the CUR of this fine-tuning is as far as possible from the full weight matrix W while still being drawn from it." That is exactly the idea: mitigate catastrophic forgetting by constraining changes to the less critical parts of the original weights, which gives the original matrix some control over the fine-tuned matrices via C and R, since they are drawn from W rather than initialized randomly.
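To illustrate the contrast between the usual CUR sampling and the inverted variant, here is a small sketch of my own (the paper's exact probability formula may differ); it only shows column selection, and the same idea applies to rows:

```python
import torch

# Toy contrast between classic CUR column sampling (favoring HIGH-norm columns)
# and the inverted, CURLoRA-style sampling discussed above (favoring LOW-norm columns).
# My own illustration; not code from the paper or its repo.
W = torch.randn(6, 8)
col_norms_sq = W.norm(dim=0) ** 2

standard_p = col_norms_sq / col_norms_sq.sum()   # classic CUR: high-norm columns more likely
inverted_p = 1.0 / (col_norms_sq + 1e-8)
inverted_p = inverted_p / inverted_p.sum()       # inverted: low-norm columns more likely

r = 2
standard_cols = torch.multinomial(standard_p, r, replacement=False)
inverted_cols = torch.multinomial(inverted_p, r, replacement=False)
print("high-norm picks:", standard_cols.tolist())
print("low-norm picks: ", inverted_cols.tolist())
# Row selection for R works the same way; U is then initialized to zeros and trained.
```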

2. You are right about the results on the MRPC dataset. This is due to a couple of reasons: (a) the learning rate I chose was relatively low (2.5e-4), and (b) not enough text preprocessing was done to increase accuracy, since the core idea was to measure catastrophic forgetting more than accuracy.

I actually recently tried different learning rates, epochs, ranks, alphas, and different tasks (I plan to include them in the preprint later) and got different numbers but with the same conclusion: CURLoRA-n was always better at mitigating catastrophic forgetting, with far fewer trainable parameters, while achieving comparable and sometimes better accuracy than LoRA-n.

Here is the paper repo: https://github.com/mnoorfawi/curlora It has the implementation code ready for both LoRA and CURLoRA.

If you would like, feel free to give it a try on a specific task of yours with different parameters.

Thanks,


I didn’t want to read this, so I asked Claude to ELI5 it to an EE-graduate level. I hope this summary is useful:

CURLoRA is a new way to fine-tune large language models (LLMs) that aims to solve two main problems:

1. Catastrophic forgetting: When you fine-tune an LLM on a new task, it often "forgets" what it learned before.

2. Computational efficiency: Fine-tuning LLMs usually requires a lot of computing power.

Here's how CURLoRA works:

1. It uses a matrix decomposition method called CUR decomposition. This breaks down a big weight matrix (W) into three smaller matrices: C, U, and R.

2. The clever part is how they choose C and R:
   - They pick columns and rows that are less "important" in the original matrix.
   - This acts like a built-in regularization, preventing the model from changing too much.

3. They initialize the U matrix with all zeros and only train this matrix (a rough code sketch of the whole scheme follows this list).
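Putting the steps above together, here is a minimal sketch of what such an adapter layer could look like. This is my own illustration, not the paper's or the repo's implementation: I assume the adapter output C U R is added to the frozen layer's output, LoRA-style (with U = 0 at init, the adapter starts as a no-op), and the exact selection rule and any scaling in the paper may differ.

```python
import torch
import torch.nn as nn

class CURLoRALinearSketch(nn.Module):
    """Rough sketch of a CUR-based adapter on top of a frozen linear layer.

    Assumptions (mine, not necessarily the paper's exact formulation):
    - C and R are r columns/rows sampled from the frozen weight W, favoring low norms.
    - Only U (r x r) is trained; it starts at zero, so the adapter is a no-op at init.
    - The adapter term x @ (C U R)^T is added to the frozen layer's output, LoRA-style.
    """
    def __init__(self, base: nn.Linear, r: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # freeze the pretrained weights

        W = base.weight.data                        # shape (d_out, d_in)
        eps = 1e-8
        col_p = 1.0 / (W.norm(dim=0) ** 2 + eps)    # favor low-norm columns
        row_p = 1.0 / (W.norm(dim=1) ** 2 + eps)    # favor low-norm rows
        col_idx = torch.multinomial(col_p / col_p.sum(), r, replacement=False)
        row_idx = torch.multinomial(row_p / row_p.sum(), r, replacement=False)

        self.register_buffer("C", W[:, col_idx])    # (d_out, r), frozen
        self.register_buffer("R", W[row_idx, :])    # (r, d_in),  frozen
        self.U = nn.Parameter(torch.zeros(r, r))    # the only trainable part

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta_W = self.C @ self.U @ self.R          # (d_out, d_in) adapter weight
        return self.base(x) + x @ delta_W.t()
```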

The benefits:

1. It helps prevent catastrophic forgetting because the changes are constrained to less important parts of the original weights.

2. It's more memory-efficient than full fine-tuning or even LoRA (another popular fine-tuning method) because you're only training the U matrix.

3. It maintains the model's general language understanding better than LoRA, as shown by perplexity scores on a general language task (a sketch of such a perplexity check follows this list).
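For concreteness, one generic way to run such a forgetting check is to measure perplexity on held-out general text before and after fine-tuning and compare the two numbers. The snippet below is my own illustration with a placeholder model and a toy sentence, not the paper's evaluation code:

```python
# Generic perplexity check on general text (my own illustration of the kind of
# measurement referred to above; not the paper's evaluation setup).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper uses its own base models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

text = "The quick brown fox jumps over the lazy dog."  # stand-in for a general corpus
enc = tok(text, return_tensors="pt")
with torch.no_grad():
    loss = model(**enc, labels=enc["input_ids"]).loss  # mean cross-entropy
print("perplexity:", math.exp(loss.item()))
```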

The experiments showed that CURLoRA:
   - Performed better on specific tasks (like sentiment analysis) compared to LoRA.
   - Maintained performance across multiple tasks better than LoRA (showing less forgetting).
   - Kept its general language understanding intact, while LoRA's understanding degraded significantly.

In essence, CURLoRA is like giving the model a very selective memory upgrade. It allows the model to learn new tasks efficiently while keeping most of its original knowledge intact.


The reuse of significant acronyms is concerning.

"Just" marketing, or heightened malicious intent?


No, it's a LoRA variant, so there's nothing strange about LoRA being in the name.



I take it from the downvotes that you really didn't know.


I haven't downvoted anybody, and I do know about cURL and LoRaWAN, but LoRA is already 'long' established as a DL term.

I didn't connect this acronym to cURL though.


My apologies, then.

I guess we established, once again, why naming stuff is a hard problem (not just in CS).



