How is this benchmark not inherently biased towards GPT?
If I did the same sort of thing but used Claude to grade the tests, would I get similar results? Or would that be inherently biased towards Claude scoring high?
You should be evaluating each prompt multiple times to see how much variance there is in the scores. Even GPT-4 grading GPT-4 should probably be run several times.
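A minimal sketch of what measuring that variance could look like, assuming a hypothetical list of scores collected from repeated grading runs of the same prompt (the numbers here are made up for illustration):

```python
import statistics

def grade_variance(scores):
    """Summarize repeated grading runs of the same prompt/answer pair."""
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, stdev

# Five hypothetical grading runs of the same answer
runs = [8, 7, 9, 8, 6]
mean, stdev = grade_variance(runs)
print(f"mean={mean:.2f} stdev={stdev:.2f}")  # mean=7.60 stdev=1.14
```

If the standard deviation across runs is large relative to the gaps between models being compared, a single grading pass isn't telling you much.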