I'm not sure about this model specifically, but training with 4-bit quantization has been a thing with LLaMA for a while now, although the setup involves manual hacks of various libraries.
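For reference, the usual route these days is the QLoRA-style stack: load the base model in 4-bit with bitsandbytes and train small LoRA adapters on top with PEFT. A rough sketch below, where the model id and hyperparameters are just placeholders:

    # Minimal 4-bit (QLoRA-style) fine-tuning setup -- model id is a placeholder.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # keep base weights in 4-bit
        bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
        bnb_4bit_compute_dtype=torch.bfloat16,  # do the math in bf16
    )

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",             # placeholder model id
        quantization_config=bnb_config,
        device_map="auto",
    )
    model = prepare_model_for_kbit_training(model)

    # Only the small LoRA adapters get trained; the 4-bit base stays frozen.
    lora_config = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()

That avoids most of the manual library hacks the earlier setups needed.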
There’s also that pruning tool that was on HN in the last couple of weeks. It seemed to work really well on the larger models and could reduce size by 30-50%.
What hardware are you using? (CPU, RAM, GPU, VRAM)
Have you considered using llama.cpp for mixed CPU+GPU inference (if you have enough system RAM)? See the sketch below.
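The knob you want there is the layer-offload setting (-ngl on the CLI, n_gpu_layers in the bindings): put as many layers as fit in VRAM on the GPU and let the rest run from system RAM. A minimal sketch via the llama-cpp-python bindings, with the path and layer count as placeholders:

    # CPU+GPU split with llama-cpp-python -- path and layer count are placeholders.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/llama-7b.Q4_K_M.gguf",  # quantized GGUF file
        n_gpu_layers=20,   # offload this many layers to VRAM; the rest stay in RAM
        n_ctx=2048,        # context window
    )

    out = llm("Q: What does 4-bit quantization do? A:", max_tokens=64)
    print(out["choices"][0]["text"])

Tune n_gpu_layers up until you run out of VRAM; anything that doesn't fit just runs on the CPU.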