Evaluating 55 LLMs with GPT-4 (llmonitor.com)
36 points by vincelt on Oct 8, 2023 | 8 comments



How is this benchmark not inherently biased towards GPT?

If I did the same sort of thing but used Claude to grade the tests, would I get similar results? Or would that be inherently biased towards Claude scoring high?
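(A minimal sketch of that cross-check, assuming generic judge callables wrapping the two APIs; the scoring prompt and the 0-10 scale are placeholders, not from the article:)

    # Hypothetical sketch: grade the same answers with two different
    # judge models and compare average scores. judge_a and judge_b are
    # placeholder callables wrapping real LLM APIs (e.g. GPT-4, Claude).
    from statistics import mean

    def grade(judge, question, answer):
        prompt = (f"Score the following answer from 0 to 10. "
                  f"Reply with only the number.\nQ: {question}\nA: {answer}")
        return float(judge(prompt))

    def judge_gap(judge_a, judge_b, samples):
        # samples: list of (question, answer) pairs from one model's output
        a = mean(grade(judge_a, q, ans) for q, ans in samples)
        b = mean(grade(judge_b, q, ans) for q, ans in samples)
        return a - b  # a persistent gap suggests judge self-preference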


I always find evals of this flavor off-putting, given that 3.5 and 4 likely share preference models (or at least feedback data).


Should be evaluating each prompt multiple times to see how much variance there is in the scores. Even GPT-4 grading GPT-4 should probably be done several times.
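A sketch of that repeated-grading idea, assuming a judge callable that returns a numeric score (the function name and the n=5 default are illustrative, not from the benchmark):

    from statistics import mean, stdev

    def score_with_spread(judge, prompt, answer, n=5):
        # LLM judges are stochastic at nonzero temperature, so grade
        # the same answer n times and report the mean and spread.
        scores = [judge(prompt, answer) for _ in range(n)]
        return mean(scores), stdev(scores)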


Why no multi-turn evaluation? A lot of these benchmarks fail to capture the strength of ghost attention used in Llama 2 chat models.
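For illustration, a multi-turn probe could check whether a system instruction survives several turns; the OpenAI-style message format and the French example here are assumptions, not the benchmark's:

    # Hypothetical multi-turn test case: ghost attention in Llama 2
    # chat models is meant to keep an instruction like this one "live"
    # across turns, which single-turn benchmarks never exercise.
    messages = [
        {"role": "system", "content": "Always answer in French."},
        {"role": "user", "content": "What is the capital of Japan?"},
        {"role": "assistant", "content": "La capitale du Japon est Tokyo."},
        {"role": "user", "content": "And what is its population?"},
        # a grader would check that the next reply is still in French
    ]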


Any reason why PaLM or Cohere models are not here?


PaLM 2 is tied for #10.


GPT-4-0314 is top of the league table (i.e., not the latest version, but the version released in March).

Is this our Concorde moment?


Really cool, thanks!




