
Pure 16-bit is horrible for training, sorry.



Doesn't using bf16 alleviate the problem? At least I've had success training a BERT-like model from scratch.


Mixed precision is the default method for pretraining and full fine-tuning right now. It works especially well for transformers, because their memory bottleneck is in activations (outputs of intermediate layers stored for backprop), and running the forward pass in fp16/bf16 cuts that VRAM by almost half (and speeds up the forward pass as well).
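
Rough sketch of what that looks like with PyTorch autocast (toy model, optimizer, and data just for illustration, not a real training script):

    import torch

    # Toy stand-in for a transformer block; real training would use a full model.
    model = torch.nn.Linear(1024, 1024).cuda()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(32, 1024, device="cuda")
    target = torch.randn(32, 1024, device="cuda")

    # Parameters stay in fp32; the forward pass (and hence the saved
    # activations) runs in bf16, which is where transformers spend most VRAM.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = torch.nn.functional.mse_loss(model(x), target)

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()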


I wonder about that too. At such low precision, parameter updates might be too small to have any effect (could some sort of probabilistic update be used in that case?). Unfortunately, I haven't found any resources describing the feasibility of full fp16 or bf16 training.
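
For a concrete sense of the underflow issue, here's a tiny PyTorch example in plain fp16:

    import torch

    # fp16 has ~10 mantissa bits, so near 1.0 the spacing between representable
    # values is about 0.001. A smaller update gets rounded away entirely.
    w = torch.tensor(1.0, dtype=torch.float16)
    update = torch.tensor(1e-4, dtype=torch.float16)

    print(w + update == w)                  # tensor(True): the update is lost
    print(torch.finfo(torch.float16).eps)   # 0.0009765625, the spacing at 1.0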


You are correct: training solely in fp16/bf16 can lead to imprecise weight updates or even gradients underflowing to zero. Because of that, mixed precision is used. In mixed-precision training, we keep a copy of the weights in fp32 (the master model), and the training loop looks like this: compute the output with the fp16 model, then the loss -> back-propagate the gradients in half precision -> copy the gradients to fp32 -> do the update on the master model (in fp32) -> copy the master weights back into the fp16 model. We also do loss scaling, which means multiplying the output of the loss function by some scalar before backprop (necessary in fp16 but not required with bf16).

Check out the fastai docs for more details: https://docs.fast.ai/callback.fp16.html
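
If it helps, here's a hand-rolled sketch of that loop in PyTorch (purely illustrative; in practice you'd use torch.cuda.amp or fastai's to_fp16 rather than doing this by hand):

    import copy
    import torch

    master = torch.nn.Linear(512, 512).cuda()   # fp32 master weights
    model = copy.deepcopy(master).half()        # fp16 working copy
    optimizer = torch.optim.SGD(master.parameters(), lr=1e-3)
    loss_scale = 1024.0  # static scale for illustration; needed for fp16, usually not for bf16

    def train_step(x, target):
        # 1. forward pass and loss with the fp16 model
        loss = torch.nn.functional.mse_loss(model(x.half()), target.half())
        # 2. multiply the loss by the scale, then backprop in half precision
        (loss * loss_scale).backward()
        # 3. copy the gradients to the fp32 master weights, undoing the scale
        for p32, p16 in zip(master.parameters(), model.parameters()):
            p32.grad = p16.grad.float() / loss_scale
        # 4. do the update on the master model in fp32
        optimizer.step()
        optimizer.zero_grad()
        model.zero_grad()
        # 5. copy the updated fp32 weights back into the fp16 model
        with torch.no_grad():
            for p32, p16 in zip(master.parameters(), model.parameters()):
                p16.copy_(p32)
        return loss.item()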


Ah, my bad. I was using mixed precision training in my previous comment.

You might find this paper interesting: https://arxiv.org/pdf/2010.06192.pdf


Hmm, what do you mean? I thought bf16 is used extensively for LLM training.





