
Smaller models are typically dumber. Sure, you could fine-tune a smaller model on, say, the medical domain, and it might even perform better on some benchmarks, but it won't reason or generalize as well. Domain-finetuned large models >>> domain-finetuned small models. And because competence in one area bleeds over into other areas, large models often need much less domain-specific data to fine-tune on than smaller models do.

You can see an instance of this with Minerva, where the fine-tuned 540B version beats the fine-tuned 62B version despite being fine-tuned on only about a quarter of the data the 62B version was fine-tuned on.




They're claiming the 13B model beats GPT-3 175B, which is an extraordinary claim requiring extraordinary evidence. If it's true, though, it'd be interesting to see whether it also applies to fine-tuning. Since the claim is predicated on the 13B model being better trained (among other things?), I wonder whether limited fine-tuning data handicaps the 13B model even if the base model can outperform GPT-3 Davinci, given your point about large models handling fine-tuning better with limited data.


I mean, the benchmarks are there; you can't exactly fake that. It should apply to fine-tuning, since fine-tuning works off the back of the weights. That's why instruction-finetuned models, even small ones like T5, converge much faster on any additional fine-tuning or training than their non-instruct counterparts, per the FLAN paper. Honestly, what I'm taking from this paper is that even Chinchilla is undertrained: the 13B model was trained on 1T tokens.
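
A rough back-of-the-envelope check of that last point, using the ~20-tokens-per-parameter heuristic from the Chinchilla paper (Hoffmann et al., 2022) and the token counts reported for each model. The 20x ratio is only an approximation, so treat this as a sketch rather than a precise comparison:

    # Compare reported training-token counts against the Chinchilla
    # heuristic of roughly 20 tokens per parameter (an approximation).
    CHINCHILLA_TOKENS_PER_PARAM = 20

    models = {
        # name: (parameters, training tokens as reported in each paper)
        "Chinchilla 70B": (70e9, 1.4e12),
        "LLaMA 13B": (13e9, 1.0e12),
        "GPT-3 175B": (175e9, 0.3e12),
    }

    for name, (params, tokens) in models.items():
        optimal = params * CHINCHILLA_TOKENS_PER_PARAM
        print(f"{name}: {tokens / 1e12:.1f}T tokens trained, "
              f"~{tokens / optimal:.1f}x the ~{optimal / 1e9:.0f}B-token "
              f"Chinchilla-optimal budget")

By that yardstick, LLaMA 13B saw nearly 4x its "optimal" budget of ~260B tokens, while GPT-3 175B saw only about a tenth of its ~3.5T, which is the sense in which the smaller model is far better trained.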




