Llama 3.2 vision models don't seem that great if they have to compare them to Cl...

f38zf5vdt · 2024-09-26T00:55:44 1727312144

1. Ignore the benchmarks. I've been A/Bing 11B today with Molmo 72B [1], which itself has an ELO neck-and-neck with GPT4o, and it's even. Because everyone in open source tends to train on validation benchmarks, you really can not trust them.

2. The method of tokenization/adapter is novel and uses many fewer tokens than all comparable CLIP/SigLIP-adapter models, making it _much_ faster. Attention is O(n^2) on memory/compute per sequence length.

[1] https://molmo.allenai.org/blog

espadrine · 2024-09-26T12:40:26 1727354426

> I've been A/Bing 11B today with Molmo 72B

How are you testing Molmo 72B? If you are interacting with https://molmo.allenai.org/, they are using Molmo-7B-D.

benreesman · 2024-09-26T04:15:24 1727324124

It’s not just open source that trains on the validation set. The big labs have already forgotten more about gaming MMLU down to the decimal than the open source community ever knew. Every once in a while they get sloppy and Claude does a faux pas with a BIGBENCH canary string or some other embarrassing little admission of dishonesty like that.

A big lab gets exactly the score on any public eval that they want to. They have their own holdouts for actual ML work, and they’re some of the most closely guarded IP artifacts, far more valuable than a snapshot of weights.

sumedh · 2024-09-26T09:54:47 1727344487

I tried some OCR use cases, Claude Sonnet just blows Molmo.

knicholes · 2024-09-26T12:55:55 1727355355

When you say "blows," do you mean in a subservient sense or more like, "it blows it out of the water?"

grahamj · 2024-09-26T16:07:34 1727366854

yeah does it suck or does it suck?

GaggiX · 2024-09-26T01:43:41 1727315021

How about its performance compare to Qwen-2-72B tho?

f38zf5vdt · 2024-09-26T02:59:33 1727319573

Refer to the blog post I linked. Molmo is ahead of Qwen2 72b.

dannyobrien · 2024-09-26T00:26:13 1727310373

What interface do you use for a locally-run Qwen2-VL-7B? Inspired by Simon Willison's research[1], I have tried it out on Hugging Face[2]. Its handwriting recognition seems fantastic, but I haven't figured out how to run it locally yet.

[1] https://simonwillison.net/2024/Sep/4/qwen2-vl/ [2] https://huggingface.co/spaces/GanymedeNil/Qwen2-VL-7B

Eisenstein · 2024-09-26T02:52:05 1727319125

MiniCPM-V 2.6 is based on Qwen 2 and is also great at handwriting. It works locally with KoboldCPP. Here are the results I got with a test I just did.

Image:

* https://imgur.com/wg0kdQK

Output:

* https://pastebin.com/RKvYQasi

OCR script used:

* https://github.com/jabberjabberjabber/LLMOCR/blob/main/llmoc...

Model weights: MiniCPM-V-2_6-Q6_K_L.gguf, mmproj-MiniCPM-V-2_6-f16.gguf

Inference:

* https://github.com/LostRuins/koboldcpp/releases/tag/v1.75.2

jona-f · 2024-09-26T10:47:38 1727347658

Should the line "p.o. 5rd w/ new W5 533" say "p.o. 3rd w/ new WW 5W .533R"?

What does p.o. stand for? I can't make out the first letter. It looks more like the f, but the nodge on the upper left only fits the p. All the other p's look very different though.

Eisenstein · 2024-09-26T15:03:18 1727362998

'Replaced R436, R430 emitter resistors on right-channel power output board with new wire-wound 5watt .33ohm 5% with ceramic lead insulators'

jona-f · 2024-09-26T20:28:06 1727382486

Thx :). I thought the 3 looked like a b but didn't think brd would make any sense. My reasoning has led me astray.

Eisenstein · 2024-09-26T21:13:32 1727385212

Yeah. If you realize that a large part of the llm's 'ocr' is guessing due to context (token prediction) and not actually recognizing the characters exactly, you can see that it is indeed pretty impressive because the log it is reading uses pretty unique terminology that it couldn't know from training.

jona-f · 2024-09-27T06:57:20 1727420240

I'd say as an llm it should know this kind of stuff from training, contrary to me, for whom this is out of domain data. Anyhow I don't think the AI did a great job on that line. Would require better performance for it to be useful for me. I think larger models might actually be better at this than I am, which would be very useful.

Eisenstein · 2024-09-27T08:52:35 1727427155

Be aware that a lot of this also has to do with prompting and sampler settings. For instance changing the prompt from 'write the text on the image verbatim' to something like 'this is an electronics repair log using shorthand...' and being specific about it will give the LLM context in which to make decisions about characters and words.

hansoolo · 2024-09-26T10:16:43 1727345803

Thanks for the hint. Will try the out!