Mixed precision is the default approach for pretraining and full fine-tuning right now. It is especially effective for transformers, because their memory bottleneck is in the activations (outputs of intermediate layers stored for backprop), and running the forward pass in fp16/bf16 cuts that VRAM usage roughly in half (and speeds up the forward pass as well).
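To make that concrete, here is a minimal PyTorch sketch (the `model` and `batch` below are just stand-ins for your own objects): running the forward pass under bf16 autocast means most saved activations are stored in half precision.

```python
import torch

model = torch.nn.Linear(4096, 4096).cuda()   # stand-in for a transformer block
batch = torch.randn(8, 4096, device="cuda")

# Forward pass under autocast: matmuls run in bf16 and the activations saved
# for backprop are (mostly) bf16, roughly halving activation memory vs fp32.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(batch)
    loss = out.float().mean()   # cast to fp32 for the reduction, to be safe

loss.backward()
```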
I wonder about that too. With such low precision, parameter updates might be too small to have an effect (would some sort of probabilistic update help in that case?). Unfortunately, I haven’t found any resources describing the feasibility of full fp16 or bf16 training.
You are correct: training solely in fp16/bf16 can lead to imprecise weight updates or even gradients flushing to zero. Because of that, mixed precision is used instead. In mixed precision training, we keep a copy of the weights in fp32 (the master model) and the training loop looks like this (a minimal code sketch follows the list):
compute the output with the fp16 model, then the loss
-> back-propagate the gradients in half-precision
-> copy the gradients to fp32 precision
-> do the update on the master model (in fp32 precision)
-> copy the master model back into the fp16 model.
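Here is a rough PyTorch sketch of that loop under some assumptions of mine (a toy linear `model_fp16`, plain SGD, and hand-rolled fp32 master weights); in practice libraries like torch.amp or DeepSpeed handle this bookkeeping for you:

```python
import torch

# Toy setup: an fp16 "working" model plus an fp32 master copy of its weights.
model_fp16 = torch.nn.Linear(512, 512).half().cuda()
master_params = [p.detach().clone().float().requires_grad_(True)
                 for p in model_fp16.parameters()]
optimizer = torch.optim.SGD(master_params, lr=1e-3)

x = torch.randn(32, 512, device="cuda", dtype=torch.float16)

# 1. compute the output with the fp16 model, then the loss
loss = model_fp16(x).float().mean()
# 2. back-propagate the gradients in half precision
loss.backward()
# 3. copy the gradients to fp32 precision (onto the master params)
for master, p in zip(master_params, model_fp16.parameters()):
    master.grad = p.grad.detach().float()
# 4. do the update on the master model (in fp32 precision)
optimizer.step()
# 5. copy the master model back into the fp16 model
with torch.no_grad():
    for master, p in zip(master_params, model_fp16.parameters()):
        p.copy_(master)          # copy_ casts fp32 -> fp16 for us
optimizer.zero_grad()
model_fp16.zero_grad()
```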
We also do loss scaling, which means multiplying the output of the loss function by some scalar before backprop so that small gradients don’t underflow to zero in fp16’s narrow exponent range; the gradients are unscaled again before the update (necessary in fp16 but not required in bf16, whose exponent range matches fp32).
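For reference, a sketch of how loss scaling usually looks with PyTorch's built-in GradScaler (again with placeholder model/optimizer/batch; the scale factor is chosen and adjusted automatically):

```python
import torch

model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()
batch = torch.randn(32, 512, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = model(batch).mean()

scaler.scale(loss).backward()   # loss is multiplied by the scale factor before backprop
scaler.step(optimizer)          # unscales the grads, skips the step if they overflowed
scaler.update()                 # adjusts the scale factor for the next iteration
optimizer.zero_grad()
```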