
A really important nuance here is that they are building on top of Llama-2, the pretrained model, and not Llama-2-chat.

I really think the entire field is doing more damage with chat fine tuning than might be expected, because that chat instruction regularly includes an emphasis on the model identifying itself as an LLM.

The problem with this is that nearly all of the training data it performs next token prediction on is text generated by humans, so pushing the model to respond as a self-identified LLM pulls it away from the distribution it actually learned.

So most of the fine tuning I've seen inherently narrows the model's scope. While pretrained models are harder to use, I regularly prefer them over chat models when both are available: even at similar temperatures, the quality and variety of language is much better in the pretrained model than in the chat model.

This fine tuning only introduced a bias towards logical step-by-step analysis and problem-solving techniques, and the results are great. But I'm willing to bet that an identical fine tuning on top of the chat model would have done much worse on the evaluations - not just the compounding of the typical few-percent loss from each fine tuning, but more like a double-digit relative difference.

It's quite frustrating that anxiety over model safety is likely throwing out tens of millions of dollars' worth of data in the pretrained model when only chat models are available at the SotA. I hope that in the future a lighter touch is taken with fine tuning the pretrained model: instead of making safety inherent to the model, set it behind a safety-oriented discriminator or 'editor' that filters or modifies responses accordingly.
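
Something like the following, just to sketch what I mean (purely my own illustration, not anything from the paper; the generation pipeline is the standard Hugging Face API, but the moderation model name is a placeholder):

    from transformers import pipeline

    # Raw pretrained model, left untouched by safety fine tuning.
    generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-hf")
    # Hypothetical standalone safety classifier acting as the 'editor'.
    moderator = pipeline("text-classification", model="some-org/safety-moderator")

    def safe_generate(prompt, max_new_tokens=256):
        draft = generator(prompt, max_new_tokens=max_new_tokens)[0]["generated_text"]
        completion = draft[len(prompt):]    # score only what the model added
        verdict = moderator(completion)[0]  # {"label": ..., "score": ...}
        if verdict["label"] == "unsafe" and verdict["score"] > 0.9:
            return "[withheld by the safety filter]"
        return draft

The point being that the generator keeps the full breadth of the pretraining data, and all the safety behaviour lives in a separate, swappable component.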

I'd happily take a 2-3x increase in API cost for a much more broadly capable and performant model with similar safety characteristics but without the handicaps that safety fine tuning brings with it.

So while a lot of the gains here might be due to the fine tuning, I expect at least part comes from shrugging off the baggage of the chat/safety fine tuning as well. Even in the first detailed example, we can see that while Llama-2 goes off rambling later on, its statement of what John knows is much clearer and better connected between initial conditions and result than Llama-2-chat's, particularly regarding theory of mind (i.e. "he assumed" vs the latter's "it must be in").




Adding to this - the safety findings that *are* in this paper are really interesting. Such as:

> We probe some of the categories where we see a larger difference (e.g., violent) and observe that Orca 2 tends to counter the harmful positions more often (which is penalized by the metric), while models that have gone through RLHF safety training tend to decline to respond more often (which is rewarded by the metric).

Or the fact that Orca 2 is less likely to extend hate speech than Llama-2-chat, even though Llama-2-chat theoretically went through safety fine tuning and Orca 2 had no explicit safety fine tuning at all.

Research over the past year has really demonstrated (a) just how impactful fine tuning can be - to the point of transmitting capabilities from larger models to smaller ones, and (b) that we're still clumsily wading through that process with only partial clarity on best practices, even as the foundational pretrained models get better at astounding rates.
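
As a very rough sketch of that capability-transfer recipe (my own illustration, not Microsoft's code - model names, prompts, and hyperparameters are all placeholders): have a larger teacher write step-by-step answers, then fine-tune a smaller student on those traces.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    teacher_name = "meta-llama/Llama-2-70b-hf"  # stand-in for a stronger model
    student_name = "meta-llama/Llama-2-7b-hf"

    tok = AutoTokenizer.from_pretrained(student_name)
    tok.pad_token = tok.eos_token

    # 1. Collect step-by-step reasoning traces from the teacher.
    teacher = AutoModelForCausalLM.from_pretrained(teacher_name, device_map="auto")
    prompts = ["Q: <some problem>. Think through it step by step.\nA:"]  # hypothetical
    traces = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(teacher.device)
        out = teacher.generate(**ids, max_new_tokens=512)
        traces.append(tok.decode(out[0], skip_special_tokens=True))

    # 2. Fine-tune the student on those traces (bare-bones loop, no batching/eval).
    student = AutoModelForCausalLM.from_pretrained(student_name).train()
    opt = torch.optim.AdamW(student.parameters(), lr=1e-5)
    for text in traces:
        batch = tok(text, return_tensors="pt", truncation=True, max_length=1024)
        loss = student(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        opt.step()
        opt.zero_grad()

The real pipelines obviously add data curation, prompt erasing, batching, and so on, but the core transfer mechanism is just supervised fine tuning on the teacher's outputs.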



