BMList – A list of big pre-trained models (GPT-3, DALL-E2...) (github.com/openbmb)
55 points by fishingboy on July 30, 2022 | 3 comments



I can think of many specialized applications where this versatility is superfluous, but the model's size prohibits inference on the edge.

Do you know if there are any methods available for shrinking a fine-tuned derivative of such a big model?

Besides generating a specialized corpus with the big model and then training a smaller model on it, is there a more direct way to reduce the matrix dimensions while optimizing for a more specific inference problem? How far can we scale down before we need a different network topology?


You can quantize the model to 8-bit integer tensors instead of 16-bit bfloats or 32-bit floats. Nvidia's latest GPUs have dedicated hardware for fast inference with 8-bit quantization, and it shrinks the model's memory footprint by 2-4x. There are other tricks, like sparse tensors, which have been applied to language models and can reduce the memory overhead by 10-100x.
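
As a concrete illustration, here is a minimal sketch of post-training dynamic quantization in PyTorch (the toy model and layer sizes are placeholders, not from any particular pre-trained model):

    import torch
    import torch.nn as nn

    # Toy stand-in for a trained transformer's feed-forward layers.
    model = nn.Sequential(
        nn.Linear(768, 3072), nn.ReLU(), nn.Linear(3072, 768)
    )

    # Replace nn.Linear weights with int8 tensors; activations are
    # quantized on the fly at inference time.
    quantized = torch.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )

    x = torch.randn(1, 768)
    print(quantized(x).shape)  # same interface, ~4x smaller weights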

See also: "From Dense to Sparse: Contrastive Pruning for Better Pre-trained Language Model Compression"
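
If you want to experiment with sparsity, PyTorch also ships magnitude-pruning utilities. A rough sketch (the single layer and the 90% sparsity level are arbitrary choices for illustration, not the paper's method):

    import torch.nn as nn
    import torch.nn.utils.prune as prune

    layer = nn.Linear(768, 768)

    # Zero out the 90% of weights with the smallest L1 magnitude.
    prune.l1_unstructured(layer, name="weight", amount=0.9)

    # Fold the pruning mask into the weight tensor permanently.
    prune.remove(layer, "weight")

Note that the tensor stays dense (just full of zeros); realizing the 10-100x memory savings requires sparse storage formats or kernels on top of this.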


As far as I know, there are many ways to compress a model, such as quantization, pruning, and knowledge distillation.
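
Knowledge distillation is essentially what you described: train a small student to match a big teacher. A minimal sketch of the standard Hinton-style distillation loss in PyTorch (the temperature T and mixing weight alpha are illustrative defaults):

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels,
                          T=2.0, alpha=0.5):
        # Soft targets: match the teacher's softened output distribution.
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)  # rescale to keep gradient magnitudes comparable
        # Hard targets: ordinary cross-entropy on the true labels.
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard

    # Example: batch of 8, vocabulary of 1000 classes.
    student = torch.randn(8, 1000)
    teacher = torch.randn(8, 1000)
    labels = torch.randint(0, 1000, (8,))
    print(distillation_loss(student, teacher, labels))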

By the way, while browsing the OpenBMB repo I found a package called BMCook, which implements several compression algorithms and compares them with other model compression packages. Hope this helps.



