
Have you tried quantization? It's often a cheap and simple way to reduce the VRAM requirements (rough sketch below).

What hardware are you using? (CPU, RAM, GPU, VRAM)

Have you considered using llama.cpp for mixed CPU+GPU execution, if you have enough RAM? (See the second sketch below.)
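
Not your exact setup, but a minimal sketch of what 8-bit loading looks like with transformers + bitsandbytes; the model name is a placeholder, and it assumes bitsandbytes and accelerate are installed:

    # Rough sketch: load a causal LM in 8-bit to cut VRAM roughly in half vs fp16.
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_name = "your/model-name"  # placeholder: whatever model you're fine-tuning

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        device_map="auto",  # accelerate places layers on the GPU first, spilling to CPU if needed
    )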
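
And for the llama.cpp route, the llama-cpp-python bindings let you offload only some layers to VRAM and keep the rest in system RAM. A sketch, assuming you have a quantized GGUF conversion of the model and a GPU-enabled build of the bindings:

    # Sketch: mixed CPU+GPU inference via llama-cpp-python.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./model.gguf",  # placeholder: a quantized GGUF conversion of the model
        n_gpu_layers=20,            # offload 20 layers to VRAM, keep the rest in system RAM
        n_ctx=2048,                 # context window size
    )

    out = llm("Q: What does quantization do? A:", max_tokens=64)
    print(out["choices"][0]["text"])

Note this only covers inference; it won't speed up the fine-tuning itself.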




Yeah, I am using the default training script with int8 quantisation. It uses PEFT with LoRA, but this still requires 26 GB of VRAM.
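
For context, roughly what that kind of setup looks like. This is a sketch, not the actual script; the base model name and the LoRA target modules are placeholders and depend on the architecture:

    # Rough sketch of int8 + LoRA fine-tuning with transformers + peft + bitsandbytes.
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    model = AutoModelForCausalLM.from_pretrained(
        "your/model-name",                           # placeholder
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        device_map="auto",
    )
    model = prepare_model_for_kbit_training(model)   # casts norms/head to fp32, enables input grads

    lora_config = LoraConfig(
        r=8,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],         # placeholder: depends on the architecture
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()               # only a small fraction of weights are trainable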


I'm not sure about this model specifically, but training with 4-bit quantization has been a thing with LLaMA for a while now, although the setup involves manual hacks of various libraries.
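
With current transformers/bitsandbytes versions the 4-bit path is exposed directly (QLoRA-style), so depending on the model the manual patching may no longer be needed. A sketch, with the model name as a placeholder:

    # Sketch: 4-bit (QLoRA-style) loading; pairs with a LoRA setup like the one above.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat
        bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
        bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 while weights stay 4-bit
    )

    model = AutoModelForCausalLM.from_pretrained(
        "your/model-name",                      # placeholder
        quantization_config=bnb_config,
        device_map="auto",
    )

The 4-bit base weights stay frozen; only the LoRA adapters receive gradients, which is what keeps the memory footprint down.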


Is it possible to offload some layers to CPU and still train in a reasonable amount of time?
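
Mechanically it's possible by capping GPU memory and letting accelerate place the overflow on the CPU; whether the speed stays reasonable is the real question, since weights get shuttled over PCIe every step. A sketch, with the memory limits and model name as placeholders:

    # Sketch: cap GPU usage and let accelerate spill the remaining layers to CPU RAM.
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "your/model-name",                        # placeholder
        device_map="auto",
        max_memory={0: "20GiB", "cpu": "64GiB"},  # cap GPU 0 at 20 GiB; overflow layers go to CPU RAM
    )

For training specifically, something like DeepSpeed's ZeRO-Offload is the more common way to push optimizer state and parameters into CPU RAM.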


There’s also that pruning tool that was on HN in the last couple of weeks. It seemed to work really well on the larger models and could reduce model size by 30-50%.


You probably don't want to fine-tune a quantized model. They are fine for inference but not great for training.



