We need to wait for LMSYS Chatbot Arena to actually see the performance of the m...

tosh · 2024-05-12T19:12:26

I had good results with the previous Yi-34b and its fine tunes like Nous-Capybara-34B. Will be interesting to see what Chatbot Arena thinks but my expectations are high.

https://huggingface.co/NousResearch/Nous-Capybara-34B

zone411 · 2024-05-12T22:48:59

No, LMSYS is just another very obviously flawed benchmark.

CuriouslyC · 2024-05-12T23:48:27

Flawed in some ways but still fairly hard to game and useful.

aubanel · 2024-05-12T23:22:31

Please elaborate on this: how is it flawed?

BoorishBears · 2024-05-13T04:50:33

It's horribly useless for most use cases since half of it is people probing for riddles that don't transfer to any useful downstream task, and the other half is people probing for morality. Some tiny portion is people asking for code, but every model has its own style of prompting and clarification that works best, so you're not going to be able to use a side-by-side view to get the best result.

The "will it tell me how to make meth" stuff is a huge source of noise. You could argue it's digging for refusals, which can be annoying and which the benchmark claims to filter out, but in reality a bunch of the refusals are soft refusals that don't get caught, and people end up downvoting the model that's deemed "corporate".

Honestly, the fact that any closed source model with guardrails can even place is a miracle. In a proper benchmark, the honest-to-goodness gap between most closed source models and open source models would be so large it'd break most graphs.

GaggiX · 2024-05-13T08:39:16

This is so nonsensical it's hilarious: "corporate" models have always been at the top of the leaderboard.

BoorishBears · 2024-05-13T17:02:51

Maybe just more nuanced a comment than you're used to. "Corporate" models are interspersed in a way that doesn't reflect their real world performance.

There aren't nearly as many 3.5-level models as the leaderboard implies, for example.