Model releases without comprehensive coverage of benchmarks make me deeply skeptical.
The worst was the GPT-4o update in November: basically a two-liner on what it's better at, when in reality it regressed on multiple benchmarks.
Here we just get MMLU, which is widely known to be saturated, and since they trained on synthetic data, we have no idea how much "weight" was given to MMLU-like training data.
Benchmarks are not perfect, but they give me context to build upon.
edit: the benchmarks are covered in the paper: https://arxiv.org/pdf/2412.08905