"The Llama jumped over the ______!" (Fence? River? Wall? Synagogue?)
With 1-hot encoding, the answer is "wall", with 100% probability. Oh, you gave plausibility to "fence" too? WRONG! ENJOY MORE PENALTY, SCRUB!
I believe this unforgiving dynamic is why model distillation works well. The original teacher model had to learn via the "hot or cold" game on text answers. But when the child instead imitates the teacher's predictions, it learns semantically rich answers. That strikes me as vastly more compute-efficient. So to me, it makes sense why these Llama 3.2 edge models punch so far above their weight(s). But it still blows my mind thinking how far models have advanced from a year or two ago. Kudos to Meta for these releases.
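For concreteness, here's a minimal sketch of the two training signals (NumPy, all numbers hypothetical): cross-entropy against a one-hot target, which only rewards mass on "wall", versus cross-entropy against a teacher's soft distribution, which is the standard distillation objective up to a constant (the teacher's entropy):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

vocab = ["wall", "fence", "river", "other"]
student_logits = np.array([2.0, 1.2, -0.5, -3.4])  # hypothetical student outputs
p = softmax(student_logits)                        # student's next-token distribution

# 1) One-hot target: only p("wall") matters; plausibility on "fence" is pure penalty.
one_hot = np.array([1.0, 0.0, 0.0, 0.0])
loss_hard = -np.sum(one_hot * np.log(p))      # = -log p("wall")

# 2) Soft target: imitate the teacher's whole distribution (distillation).
teacher = np.array([0.65, 0.25, 0.03, 0.07])  # hypothetical teacher output
loss_soft = -np.sum(teacher * np.log(p))      # cross-entropy against soft labels

print(f"hard-target loss: {loss_hard:.3f}, soft-target loss: {loss_soft:.3f}")
```

Minimising the soft-target loss pulls the student toward the teacher's entire distribution, so plausibility on "fence" is rewarded instead of punished.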
Is that true tho? During training, the model predicts {"wall": 0.65, "fence": 0.25, "river": 0.03}. Then backprop modifies the weights such that it produces {"wall": 0.67, "fence": 0.24, "river": 0.02} next time.
But it does that with much richer feedback than WRONG!, because we're also, indirectly, telling the model how much more plausible "fence" is than "river". It's likely that most of the neurons that supported "wall" also supported "fence", so the average neuron that supported only "river" gets penalised much more than a neuron that supported "fence".
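A toy gradient check makes this concrete (NumPy, hypothetical weights): the gradient of softmax cross-entropy with respect to the logits is just p - one_hot, and it reaches each hidden unit through the output weights that unit shares across tokens:

```python
import numpy as np

p = np.array([0.65, 0.25, 0.03, 0.07])    # prediction for (wall, fence, river, other)
one_hot = np.array([1.0, 0.0, 0.0, 0.0])  # ground-truth next token: "wall"
grad_logits = p - one_hot                 # [-0.35, 0.25, 0.03, 0.07]

# Hypothetical output weights: rows map three hidden units to the four logits above.
W = np.array([
    [1.0, 0.0, 0.0],  # wall logit: driven by unit 0
    [0.8, 0.0, 0.0],  # fence logit: also driven by unit 0 (shared feature)
    [0.0, 1.0, 0.0],  # river logit: driven only by unit 1
    [0.0, 0.0, 1.0],  # other logit: driven only by unit 2
])

# Gradient w.r.t. hidden activations; positive = pushed down, negative = pushed up.
grad_hidden = W.T @ grad_logits
print(grad_hidden)  # [-0.15  0.03  0.07]
# unit 0 (wall+fence): -0.35*1.0 + 0.25*0.8 = -0.15 -> still rewarded on net
# unit 1 (river only): +0.03 -> penalised, with no offsetting reward from "wall"
```

So a unit whose feature supports both "wall" and "fence" comes out net-rewarded, while a "river"-only unit has nothing to offset its penalty.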
I agree that distillation is more efficient for exactly the same reason, but I think even models as old as GPT-3 use this trick to work as well as they do.
They don't; Meta is playing "hide the numbers" a bit. Llama 3.2 3B is definitively worse than Phi-3 from May, both on any given metric and after an hour of playing with the two, trying to justify moving to Llama 3.2 at 3B given that I'm already adding Llama 3.2 at 1B.