
I think there's one significant flaw in the analysis. Section 4.3.1 states:

>> In CURLoRA, we decompose the original weight matrix W as: W ≈ CUR

… but C and R are chosen with respect to the inverse of their contribution towards W, with lower-norm rows and columns more likely to be chosen. In some sense, the CUR of this fine-tuning is as far as possible from the full weight matrix W while still being drawn from it.

To the extent the method works, it might work by defining a very low-dimensional space for the fine-tuning that is vaguely compatible with the original weight matrix.

I also think that there's something a bit off in the results (section 7). Neither the test nor control (LoRA) results show improving accuracy on the first (MRPC) fine-tuning task with increasing rank, suggesting that this task was never constrained by the dimensionality available to the fine-tuning. The orders-of-magnitude difference in the number of trainable parameters also suggests that LoRA-1/2/4 could have been fairly considered; it's not obvious that CURLoRA-n is even approximately equivalent to LoRA-n.




Thanks a lot for your comment; I appreciate it. I believe it raises two points:

1. Choosing C and R:

Normally, C and R are chosen from the higher-norm rows and columns so that W (or an approximation of it) can be reconstructed from C@U@R. However, you stated it correctly here: "but C and R are chosen with respect to the inverse of their contribution towards W, with lower-norm rows and columns more likely to be chosen. In some sense, the CUR of this fine-tuning is as far as possible from the full weight matrix W while still being drawn from it." So the whole idea is to mitigate catastrophic forgetting by constraining changes to less critical parts of the original weights, giving the original matrix some control over the fine-tuned matrices via C and R, which are drawn from W rather than initialized randomly.
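To make the selection scheme concrete, here is a minimal NumPy sketch of inverted-probability sampling as described above. This is a hypothetical illustration, not the paper's exact implementation: the helper name, the squared-norm probabilities, and the epsilon smoothing are my assumptions; only U is trainable, and it starts at zero so fine-tuning begins exactly at the original W.

```python
import numpy as np

def inverted_cur_select(W, rank, seed=0):
    # Hypothetical sketch of inverted-probability CUR selection:
    # rows/columns with LOWER norms are MORE likely to be chosen,
    # the inverse of the usual norm-based CUR sampling.
    rng = np.random.default_rng(seed)
    col_norms = np.linalg.norm(W, axis=0) ** 2   # per-column squared norms
    row_norms = np.linalg.norm(W, axis=1) ** 2   # per-row squared norms
    # Invert the usual norm-proportional probabilities (epsilon avoids 1/0).
    inv_col = 1.0 / (col_norms + 1e-8)
    inv_row = 1.0 / (row_norms + 1e-8)
    p_col = inv_col / inv_col.sum()
    p_row = inv_row / inv_row.sum()
    col_idx = rng.choice(W.shape[1], size=rank, replace=False, p=p_col)
    row_idx = rng.choice(W.shape[0], size=rank, replace=False, p=p_row)
    C = W[:, col_idx]             # fixed, drawn from W
    R = W[row_idx, :]             # fixed, drawn from W
    U = np.zeros((rank, rank))    # the only trainable matrix, zero-initialized
    return C, U, R

W = np.arange(12.0).reshape(3, 4)
C, U, R = inverted_cur_select(W, rank=2)
```

Because U starts at zero, the initial update C@U@R is the zero matrix, so the adapted weight equals W before any training step; gradients then flow only through the rank×rank matrix U.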

2. You are right about the results on the MRPC dataset. This is due to a couple of reasons: (a) the learning rate I chose was relatively low (2.5e-4), and (b) not much text preprocessing was done to increase accuracy, since the core idea was to measure catastrophic forgetting more than accuracy.

I actually recently tried different learning rates, epochs, ranks, alphas, and tasks (I plan to add these experiments to the preprint later) and got different numbers but the same conclusion: CURLoRA-n was always better at mitigating catastrophic forgetting, with far fewer trainable parameters, while achieving good and sometimes better accuracy than LoRA-n.

Here is the paper's repo: https://github.com/mnoorfawi/curlora. It has ready-to-run implementation code for both LoRA and CURLoRA.

If you would like, feel free to give it a try on a specific task of yours with different parameters.

Thanks,



