It's time to design a public benchmark for these types of systems so that versions can be compared. Of course, any vendor who trains on the benchmark should face extreme contempt, but we'd also need to generate novel questions of equal complexity.
Alternatively, there should be a trusted auditor who uses a secret benchmark.
Well, people suspect it isn't, and it's not like we can see the internal version designation. We wouldn't even care much about that if it performed identically from day to day.
Indeed, you could do better or worse with the exact same raw checkpoint, just depending on inference-optimizing tricks.
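For what it's worth, here is a rough sketch of how one might check whether two runs of a fixed question set differ by more than noise. It assumes you already have pass counts from two days; the function name `pass_rate_shift` and the numbers in the usage lines are made up for illustration. It treats the two runs as independent samples, which is a simplification: since the questions are the same, a paired test would be sharper.

```python
import math

def pass_rate_shift(passed_a: int, passed_b: int, n: int) -> tuple[float, float]:
    """Two-proportion z-test for two runs that each answered n questions.

    Returns (z, two_sided_p_value). Hypothetical helper, not from any library.
    """
    p_a = passed_a / n
    p_b = passed_b / n
    # Pooled pass rate under the null hypothesis that nothing changed.
    pooled = (passed_a + passed_b) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * (2 / n))
    if se == 0:
        # Both runs passed everything or failed everything: no evidence of a shift.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value under the normal approximation.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Illustrative numbers only: 90/100 passed on one day, 60/100 on another.
z, p = pass_rate_shift(90, 60, 100)
print(f"z = {z:.2f}, p = {p:.2g}")
```

A small p-value only says the scores moved. It can't tell you whether the checkpoint changed or just the inference setup around it.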