"The Llama jumped over the ______!" (Fence? River? Wall? Synagogue?)
With 1-hot encoding, the answer is "wall", with 100% probability. Oh, you gave plausibility to "fence" too? WRONG! ENJOY MORE PENALTY, SCRUB!
I believe this unforgiving dynamic is why model distillation works well. The original teacher model had to learn via the "hot or cold" game on text answers. But when the child instead imitates the teacher's predictions, it learns semantically rich answers. That strikes me as vastly more compute-efficient. So to me, it makes sense why these Llama 3.2 edge models punch so far above their weight(s). But it still blows my mind thinking how far models have advanced from a year or two ago. Kudos to Meta for these releases.
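For concreteness, here's a minimal sketch of the two training signals (NumPy, all numbers hypothetical): cross-entropy against a one-hot target, which only rewards mass on "wall", versus cross-entropy against a teacher's soft distribution, which is the standard distillation objective up to a constant (the teacher's entropy):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

vocab = ["wall", "fence", "river", "other"]
student_logits = np.array([2.0, 1.2, -0.5, -3.4])  # hypothetical student outputs
p = softmax(student_logits)                        # student's next-token distribution

# 1) One-hot target: only p("wall") matters; plausibility on "fence" is pure penalty.
one_hot = np.array([1.0, 0.0, 0.0, 0.0])
loss_hard = -np.sum(one_hot * np.log(p))      # = -log p("wall")

# 2) Soft target: imitate the teacher's whole distribution (distillation).
teacher = np.array([0.65, 0.25, 0.03, 0.07])  # hypothetical teacher output
loss_soft = -np.sum(teacher * np.log(p))      # cross-entropy against soft labels

print(f"hard-target loss: {loss_hard:.3f}, soft-target loss: {loss_soft:.3f}")
```

Minimising the soft-target loss pulls the student toward the teacher's entire distribution, so plausibility on "fence" is rewarded instead of punished.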
Is that true tho? During training, the model predicts {"wall": 0.65, "fence": 0.25, "river": 0.03}. Then backprop modifies the weights such that it produces {"wall": 0.67, "fence": 0.24, "river": 0.02} next time.
But it does that with much richer feedback than WRONG!, because we're also, indirectly, telling the model how much more plausible "fence" is than "river". It's likely that most of the neurons that supported "wall" also supported "fence", so the average neuron that supported only "river" gets penalised much more than a neuron that supported "fence".
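A toy gradient check makes this concrete (NumPy, hypothetical weights): the gradient of softmax cross-entropy with respect to the logits is just p - one_hot, and it reaches each hidden unit through the output weights that unit shares across tokens:

```python
import numpy as np

p = np.array([0.65, 0.25, 0.03, 0.07])    # prediction for (wall, fence, river, other)
one_hot = np.array([1.0, 0.0, 0.0, 0.0])  # ground-truth next token: "wall"
grad_logits = p - one_hot                 # [-0.35, 0.25, 0.03, 0.07]

# Hypothetical output weights: rows map three hidden units to the four logits above.
W = np.array([
    [1.0, 0.0, 0.0],  # wall logit: driven by unit 0
    [0.8, 0.0, 0.0],  # fence logit: also driven by unit 0 (shared feature)
    [0.0, 1.0, 0.0],  # river logit: driven only by unit 1
    [0.0, 0.0, 1.0],  # other logit: driven only by unit 2
])

# Gradient w.r.t. hidden activations; positive = pushed down, negative = pushed up.
grad_hidden = W.T @ grad_logits
print(grad_hidden)  # [-0.15  0.03  0.07]
# unit 0 (wall+fence): -0.35*1.0 + 0.25*0.8 = -0.15 -> still rewarded on net
# unit 1 (river only): +0.03 -> penalised, with no offsetting reward from "wall"
```

So a unit whose feature supports both "wall" and "fence" comes out net-rewarded, while a "river"-only unit has nothing to offset its penalty.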
I agree that distillation is more efficient for exactly the same reason, but I think even models as old as GPT-3 use this trick to work as well as they do.
They don't; Meta is playing "hide the numbers" a bit. Llama 3.2 3B is definitively worse than Phi-3 from May, both on any given metric and after an hour of playing with the two, trying to justify moving to Llama 3.2 at 3B given that I'm already adding Llama 3.2 at 1B.