
You can quantize the model to 8-bit integer tensors instead of 16-bit bfloats or 32-bit floats. Nvidia has dedicated hardware in their latest series of GPUs for doing inference with 8-bit quantization quickly, and it shrinks the model's memory footprint to roughly 1/2-1/4 of the original. There are other tricks too, like sparse tensors, which have been applied to language models and can reduce the memory overhead 10-100x.
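For a rough sense of where the memory savings come from, here's a minimal sketch of symmetric per-tensor int8 quantization in PyTorch. The quantize_int8/dequantize_int8 helpers are illustrative names, not Nvidia's actual hardware path (TensorRT etc.); the point is just that each parameter drops from 4 bytes (fp32) or 2 bytes (bf16) to 1 byte, at the cost of a small rounding error.

    import torch

    def quantize_int8(w: torch.Tensor):
        # Symmetric per-tensor int8: map floats onto [-127, 127] with one scale.
        scale = w.abs().max() / 127.0
        q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
        return q, scale

    def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        # Recover approximate floats for computation.
        return q.to(torch.float32) * scale

    w = torch.randn(4096, 4096)                         # fake fp32 weight matrix
    q, scale = quantize_int8(w)
    print(w.element_size(), "->", q.element_size())     # 4 -> 1 bytes per parameter
    print((w - dequantize_int8(q, scale)).abs().max())  # worst-case rounding error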

See also: "From Dense to Sparse: Contrastive Pruning for Better Pre-trained Language Model Compression"



