
I think there's one significant flaw in the analysis. Section 4.3.1 states:

>> In CURLoRA, we decompose the original weight matrix W as: W ≈ CUR

… but C and R are chosen with respect to the inverse of their contribution towards W, with lower-norm rows and columns more likely to be chosen. In some sense, the CUR of this fine-tuning is as far as possible from the full weight matrix W while still being drawn from it.

To the extent the method works, it might work by defining a very low-dimensional space for the fine-tuning that is vaguely compatible with the original weight matrix.

I also think that there's something a bit off in the results (section 7). Neither the test nor control (LoRA) results show improving accuracy on the first (MRPC) fine-tuning task with increasing rank, suggesting that this task was never constrained by the dimensionality available to the fine-tuning. The orders-of-magnitude difference in the number of trainable parameters also suggests that LoRA-1/2/4 could have been fairly considered; it's not obvious that CURLoRA-n is even approximately equivalent to LoRA-n.




Thanks a lot for your comment; I appreciate it. I believe it raises two points:

1. Choosing C and R:

Normally, C and R are chosen from the higher-norm rows and columns so that W (or an approximation of it) can be reconstructed from C@U@R. However, you stated it correctly here: "but C and R are chosen with respect to the inverse of their contribution towards W, with lower-norm rows and columns more likely to be chosen. In some sense, the CUR of this fine-tuning is as far as possible from the full weight matrix W while still being drawn from it." So the whole idea is to mitigate catastrophic forgetting by constraining changes to less critical parts of the original weights, giving the original matrix some control over the fine-tuned matrices via C and R, which are drawn from W rather than initialized randomly.
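To make the selection scheme concrete, here is a minimal NumPy sketch of inverted-probability sampling as described above. This is a hypothetical illustration, not the paper's exact implementation: the helper name, the squared-norm probabilities, and the epsilon smoothing are my assumptions; only U is trainable, and it starts at zero so fine-tuning begins exactly at the original W.

```python
import numpy as np

def inverted_cur_select(W, rank, seed=0):
    # Hypothetical sketch of inverted-probability CUR selection:
    # rows/columns with LOWER norms are MORE likely to be chosen,
    # the inverse of the usual norm-based CUR sampling.
    rng = np.random.default_rng(seed)
    col_norms = np.linalg.norm(W, axis=0) ** 2   # per-column squared norms
    row_norms = np.linalg.norm(W, axis=1) ** 2   # per-row squared norms
    # Invert the usual norm-proportional probabilities (epsilon avoids 1/0).
    inv_col = 1.0 / (col_norms + 1e-8)
    inv_row = 1.0 / (row_norms + 1e-8)
    p_col = inv_col / inv_col.sum()
    p_row = inv_row / inv_row.sum()
    col_idx = rng.choice(W.shape[1], size=rank, replace=False, p=p_col)
    row_idx = rng.choice(W.shape[0], size=rank, replace=False, p=p_row)
    C = W[:, col_idx]             # fixed, drawn from W
    R = W[row_idx, :]             # fixed, drawn from W
    U = np.zeros((rank, rank))    # the only trainable matrix, zero-initialized
    return C, U, R

W = np.arange(12.0).reshape(3, 4)
C, U, R = inverted_cur_select(W, rank=2)
```

Because U starts at zero, the initial update C@U@R is the zero matrix, so the adapted weight equals W before any training step; gradients then flow only through the rank×rank matrix U.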

2. You are right about the results on the MRPC dataset. This is due to a couple of reasons: (a) the learning rate I chose was relatively low (2.5e-4), and (b) not much text preprocessing was done to increase accuracy, since the core idea was to measure catastrophic forgetting more than accuracy.

I actually recently tried different learning rates, epochs, ranks, alphas, and tasks (I plan to add these experiments to the preprint later) and got different numbers but the same conclusion: CURLoRA-n was always better at mitigating catastrophic forgetting, with far fewer trainable parameters, while achieving good and sometimes better accuracy than LoRA-n.

Here is the paper's repo: https://github.com/mnoorfawi/curlora. It has ready-to-run implementation code for both LoRA and CURLoRA.

If you would like, feel free to give it a try on a specific task of yours with different parameters.

Thanks,



