1. Ignore the benchmarks. I've been A/Bing 11B today with Molmo 72B [1], which i...

espadrine · 2024-09-26T12:40:26 1727354426

> I've been A/Bing 11B today with Molmo 72B

How are you testing Molmo 72B? If you are interacting with https://molmo.allenai.org/, they are using Molmo-7B-D.

benreesman · 2024-09-26T04:15:24 1727324124

It’s not just open source that trains on the validation set. The big labs have already forgotten more about gaming MMLU down to the decimal than the open source community ever knew. Every once in a while they get sloppy and Claude commits a faux pas with a BIG-bench canary string or some other embarrassing little admission of dishonesty like that.

A big lab gets exactly the score on any public eval that they want to. They have their own holdouts for actual ML work, and they’re some of the most closely guarded IP artifacts, far more valuable than a snapshot of weights.

sumedh · 2024-09-26T09:54:47 1727344487

I tried some OCR use cases; Claude Sonnet just blows Molmo.

knicholes · 2024-09-26T12:55:55 1727355355

When you say "blows," do you mean in a subservient sense or more like, "it blows it out of the water?"

grahamj · 2024-09-26T16:07:34 1727366854

yeah does it suck or does it suck?

GaggiX · 2024-09-26T01:43:41 1727315021

How does its performance compare to Qwen-2-72B tho?

f38zf5vdt · 2024-09-26T02:59:33 1727319573

Refer to the blog post I linked. Molmo is ahead of Qwen2 72B.