Thanks. There have been a few substantial and laudable efforts, which are much appreciated, but what I'm suggesting is actual continuous infrastructure: the way benchmarking sites ship software for people to run on their machines that phones home, so that anyone who writes a new benchmark or a new variation can submit it and help refine the results.
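To make the "phone home" part concrete, here's a minimal sketch of what a client-side submitter could look like. The endpoint URL, payload fields, and benchmark name are all made up for illustration; this isn't any existing project's API.

```python
# Hypothetical "phone home" client: run a benchmark locally, then upload the
# scores plus some machine metadata to a central aggregator.
import json
import platform
import urllib.request

SUBMIT_URL = "https://example.org/api/v1/results"  # hypothetical endpoint


def submit_results(benchmark_id: str, scores: dict) -> None:
    """Package local benchmark scores with machine metadata and POST them."""
    payload = {
        "benchmark_id": benchmark_id,
        "scores": scores,  # e.g. {"accuracy": 0.95}
        "machine": {
            "platform": platform.platform(),
            "python": platform.python_version(),
        },
    }
    req = urllib.request.Request(
        SUBMIT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:  # aggregator records the run
        print("server responded:", resp.status)


if __name__ == "__main__":
    submit_results("winograd-persian-v1", {"accuracy": 0.95})
```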
For instance, are any of your prompting tests in, say, Korean? What about Winograd schema challenges in languages other than English? Japanese, for instance, comes with its own unique set of context ambiguities that don't appear in English, and I'm sure dozens of languages are similar. It'd be nice to have user-contributable tests to cover the breadth of use cases here.
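Rough sketch of what a contributed test case might look like; the field names (lang, premise, options, answer) are invented here for illustration, not an existing schema. The example uses the classic English Winograd sentence; contributors would add parallel cases in their own languages.

```python
from dataclasses import dataclass


@dataclass
class WinogradCase:
    lang: str         # BCP-47 language tag, e.g. "ko", "ja", "fa"
    premise: str      # sentence containing the ambiguous reference
    question: str     # what the ambiguous word refers to
    options: list     # candidate referents
    answer: int       # index of the correct referent


example = WinogradCase(
    lang="en",
    premise="The trophy doesn't fit in the suitcase because it is too big.",
    question="What is too big?",
    options=["the trophy", "the suitcase"],
    answer=0,
)
# Contributors could submit parallel cases tagged "ko", "ja", "fa", etc.,
# including ambiguities that only exist in those languages.
```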
A great optimization that moves a score from, say, 95% to 5% on "winograd-persian" may be fine or may be a show-stopper, depending on what you care about.
That's why it's gotta be normalized, future-proof, and crowdsourced.
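Here's a toy sketch of why normalization plus user-chosen weights matters: the same raw scores read as fine or as a show-stopper depending on which benchmarks you weight. The benchmark names, scores, and weights are made up.

```python
def weighted_score(scores: dict, weights: dict) -> float:
    """Combine per-benchmark scores (already normalized to [0, 1]) with user-chosen weights."""
    total_weight = sum(weights.get(name, 0.0) for name in scores)
    if total_weight == 0:
        return 0.0
    return sum(scores[name] * weights.get(name, 0.0) for name in scores) / total_weight


# Scores after the hypothetical "great optimization":
scores = {"winograd-english": 0.93, "winograd-persian": 0.05}

# If you don't care about Persian, the model looks fine:
print(weighted_score(scores, {"winograd-english": 1.0, "winograd-persian": 0.0}))  # 0.93
# If you weight both equally, it's a show-stopper:
print(weighted_score(scores, {"winograd-english": 1.0, "winograd-persian": 1.0}))  # 0.49
```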
Hey Daniel, I would love to help out on this. I'm learning about LLMs and this benchmarking project sounds like a fun way to further my knowledge and skills. I sent you a message on LinkedIn.