
Exactly... I've found GPT-4o to be good at OCR for instance... doesn't seem "blind" to me.



Well maybe not blind but the analogy with myopia might stand.

For example, in the case of OCR, a person with myopia will usually be able to make out letters and words even without his glasses, based on his expectation (similar to VLM training) of seeing letters and words in, say, a sign. He might not see them all clearly and make some errors, but he might recognize some letters easily and fill in the rest based on context, word recognition, etc. Basically experience.

I also have a funny anecdote about my partner, who has severe myopia. She once found herself outside her house without her glasses on and saw something on the grass right in front of her. She told her then brother-in-law, "Look, a squirrel!" Only for the "squirrel" to take off while letting out its typical caws. It was a crow. This is typical of VLM hallucinations.


I know that GPT-4o is fairly poor at recognizing sheet music and notes. Totally off the mark, more often than not; even the first note in a first-week solfège book is not recognized.

Unless I missed something, as far as I am concerned they are optimized for benchmarks.

So while I enjoy gen AI, image-to-text is highly subpar.


Most adults with 20/20 vision will also fail to recognize the first note in a first-week solfège book.


Useful to know, thank you!


You don't really need an LLM for OCR. Hell, I suspect they just run a Python script in its VM and rephrase the output.

At least that's what I would do. Perhaps the script would be a "specialist model" in a sense.


It's not that you need an LLM for OCR but the fact that an LLM can do OCR (and handwriting recognition which is much harder) despite not being made specifically for that purpose is indicative of something. The jump from knowing "this is a picture of a paper with writing on it" like what you get with CLIP to being able to reproduce what's on the paper is, to me, close enough to seeing that the difference isn't meaningful anymore.


GPT-4V is provided with OCR


That's a common misconception.

Sometimes if you upload an image to ChatGPT and ask for OCR it will run Python code that executes Tesseract, but that's effectively a bug: GPT-4 vision works much better than that, and it will use GPT-4 vision if you tell it "don't use Python" or similar.


No reason to believe that. Open source VLMs can do OCR.[1]

[1] https://huggingface.co/spaces/opencompass/open_vlm_leaderboa...



