godelski 10 months ago | on: DeepSeekMath 7B achieved 51.7% on MATH benchmark

I find this argument weird. I'm not saying you can trust the open evals, I'm just saying you can know their limits. With closed evals you're a lot more blind.