I had good results with the previous Yi-34b and its fine tunes like Nous-Capybara-34B. Will be interesting to see what Chatbot Arena thinks but my expectations are high.
It's horribly useless for most use cases since half of it is people probing for riddles that don't transfer to any useful downstream task, and the other half is people probing for morality. Some tiny portion is people asking for code, but every model has its own style of prompting and clarification that works best, so you're not going to be able to use a side-by-side view to get the best result.
The "will it tell me how to make meth" stuff is a huge source of noise, which you could argue is digging for refusals which can be annoying, and the benchmark claims to filter out... but in reality a bunch of the refusals are soft refusals that don't get caught, and people end up downvoting the model that's deemed "corporate".
Honestly the fact that any closed source model with guardrails can even place is a miracle, in a proper benchmark the honest to goodness gap between most closed source models and open source models would be so large it'd break most graphs.