
One of the most significant quantization papers of the last year [1] found precisely that these outliers only start occurring in LLMs at 6.7B parameters and above.

One of the most important keys to the success of deep learning in the last couple of years has been that emergent features appear only beyond certain scales, so I wouldn't be too quick to dismiss things that don't help at smaller scales, nor would I be certain that all the tricks that help in small data/parameter regimes will necessarily help in larger models. Unfortunately!

[1] https://timdettmers.com/2022/08/17/llm-int8-and-emergent-fea...

Looking at that paper, they appear to be saying that 6.7B is where the problem becomes severe enough that no single quantization method can keep up. From what I gather, the paper claims that such outliers start occurring in models as small as 125M parameters, that at around 1.3B they begin to affect the FFN layers, and that at around 6.7B the issue really becomes apparent because "100% of layers use the same dimension for outliers."

So while you obviously wouldn't be able to conclusively prove that the idea fixes the issue in larger models, if you know what you're looking for, you should be able to validate that the method works in general down to very small models.
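
For what it's worth, the thing to look for is fairly easy to eyeball. A rough sketch (not the paper's code): count hidden-state dimensions whose activation magnitude crosses the ~6.0 cutoff the paper uses as its outlier criterion, assuming a PyTorch + Hugging Face setup and a small OPT checkpoint as a stand-in:

    # Sketch: count per-layer hidden dimensions with outlier-sized activations.
    # Assumes transformers + torch; the model name is just a small stand-in.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "facebook/opt-125m"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)

    inputs = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)

    threshold = 6.0  # magnitude cutoff used as the outlier criterion in the paper
    for i, h in enumerate(out.hidden_states):        # h: (batch, seq, hidden)
        outlier_dims = h.abs().amax(dim=(0, 1)) > threshold
        print(f"layer {i}: {int(outlier_dims.sum())} outlier dims")

Run that over a handful of prompts and model sizes and you'd expect the count (and how consistently the same dimensions show up across layers) to grow with scale.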

That said, consumer-grade cards should be able to train an 8B model with quantization, so you might as well train the whole thing.
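
In practice, "train with quantization" on a single consumer card usually means keeping the base weights in int8 and training low-rank adapters on top rather than updating every weight. A minimal sketch, assuming the transformers/bitsandbytes/peft stack and a placeholder ~8B checkpoint name:

    # Sketch: 8-bit base weights + LoRA adapters so an ~8B model fits on one GPU.
    # Assumes transformers, bitsandbytes, and peft are installed; the checkpoint
    # name is a placeholder.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    model = AutoModelForCausalLM.from_pretrained(
        "some-org/some-8b-model",                    # placeholder checkpoint
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        device_map="auto",
        torch_dtype=torch.float16,
    )
    model = prepare_model_for_kbit_training(model)   # grad checkpointing + dtype fixes
    lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()               # only the adapters are trainable

From there it's a normal training loop (or Trainer) over the adapter parameters; the int8 base weights stay frozen.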
