
It's time to design a public benchmark for these types of systems so that versions can be compared. Of course, any vendor who trains on the benchmark should face extreme contempt, but we'd also need to keep generating novel questions of equal complexity.

Alternatively, there should be a trusted auditor who uses a secret benchmark.




But the problem is that this is the same version: it changes without any change to the version number.


Well, people suspect it isn't the same. We can't see the internal version designation, and we wouldn't much care about it anyway if it performed identically from day to day.

Indeed, you could do better or worse with the exact same raw checkpoint, just depending on inference-optimizing tricks.


So the version number is the day the benchmark is run: version yyyy-mm-dd.
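A rough sketch of what date-as-version could look like, assuming each benchmark run produces a single score. The function names (`run_label`, `record_run`, `drift`) and the scores are made up for illustration; nothing here comes from a real benchmark.

```python
from datetime import date


def run_label(run_day: date) -> str:
    """Use the day the benchmark was run as the version label (yyyy-mm-dd)."""
    return run_day.isoformat()


def record_run(history: dict[str, float], run_day: date, score: float) -> None:
    """Store one benchmark score under its date-based version label."""
    history[run_label(run_day)] = score


def drift(history: dict[str, float], earlier: date, later: date) -> float:
    """Score change between two dated runs of the nominally same model."""
    return history[run_label(later)] - history[run_label(earlier)]


# Hypothetical scores, purely illustrative.
history: dict[str, float] = {}
record_run(history, date(2024, 10, 1), 0.75)
record_run(history, date(2024, 10, 2), 0.5)

print(run_label(date(2024, 10, 1)))
print(drift(history, date(2024, 10, 1), date(2024, 10, 2)))
```

Keying on the run date sidesteps the vendor's version string entirely, so a silent change shows up as drift between two dated entries.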



