
Smaller models are typically dumber. Sure, you could fine-tune a smaller model on, say, the medical domain, and it might even perform better on some benchmarks, but it won't reason or generalize as well. Domain-finetuned large models >>> domain-finetuned small models. And because competence in one area bleeds over into other areas, large models often need much less domain-specific data to fine-tune on than smaller models do.

You can see an instance of this with Minerva, where the fine-tuned 540B version beats the fine-tuned 62B version despite being fine-tuned on only about a quarter of the data the 62B version was fine-tuned on.




They're claiming the 13B model beats GPT-3 175B, which is an extraordinary claim requiring extraordinary evidence. If it's true, though, it'd be interesting to see whether it also applies to fine-tuning. Since the claim is predicated on the 13B model being better trained (among other things?), I wonder whether limited fine-tuning data handicaps the 13B model even if the base model can outperform GPT-3 Davinci, given your point about large models handling fine-tuning better with limited data.


I mean, the benchmarks are there; you can't exactly fake that. It should apply to fine-tuning, since fine-tuning works off the back of the weights. That's why instruction-finetuned models, even small ones like T5, converge much faster on any additional fine-tuning or training than their non-instruct counterparts, per the FLAN paper. Honestly, what I'm taking from this paper is that even Chinchilla is undertrained: the 13B model was trained on 1T tokens.
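
A rough back-of-the-envelope check of that last point, using the ~20-tokens-per-parameter heuristic from the Chinchilla paper (Hoffmann et al., 2022) and the token counts reported for each model. The 20x ratio is only an approximation, so treat this as a sketch rather than a precise comparison:

    # Compare reported training-token counts against the Chinchilla
    # heuristic of roughly 20 tokens per parameter (an approximation).
    CHINCHILLA_TOKENS_PER_PARAM = 20

    models = {
        # name: (parameters, training tokens as reported in each paper)
        "Chinchilla 70B": (70e9, 1.4e12),
        "LLaMA 13B": (13e9, 1.0e12),
        "GPT-3 175B": (175e9, 0.3e12),
    }

    for name, (params, tokens) in models.items():
        optimal = params * CHINCHILLA_TOKENS_PER_PARAM
        print(f"{name}: {tokens / 1e12:.1f}T tokens trained, "
              f"~{tokens / optimal:.1f}x the ~{optimal / 1e9:.0f}B-token "
              f"Chinchilla-optimal budget")

By that yardstick, LLaMA 13B saw nearly 4x its "optimal" budget of ~260B tokens, while GPT-3 175B saw only about a tenth of its ~3.5T, which is the sense in which the smaller model is far better trained.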




