1. Ignore the benchmarks. I've been A/Bing 11B today with Molmo 72B [1], which itself has an ELO neck-and-neck with GPT4o, and it's even. Because everyone in open source tends to train on validation benchmarks, you really can not trust them.
2. The method of tokenization/adapter is novel and uses many fewer tokens than all comparable CLIP/SigLIP-adapter models, making it _much_ faster. Attention is O(n^2) on memory/compute per sequence length.
It’s not just open source that trains on the validation set. The big labs have already forgotten more about gaming MMLU down to the decimal than the open source community ever knew. Every once in a while they get sloppy and Claude does a faux pas with a BIGBENCH canary string or some other embarrassing little admission of dishonesty like that.
A big lab gets exactly the score on any public eval that they want to. They have their own holdouts for actual ML work, and they’re some of the most closely guarded IP artifacts, far more valuable than a snapshot of weights.
2. The method of tokenization/adapter is novel and uses many fewer tokens than all comparable CLIP/SigLIP-adapter models, making it _much_ faster. Attention is O(n^2) on memory/compute per sequence length.
[1] https://molmo.allenai.org/blog