
There are quite a few "AI apologists" in the comments, but I think the title is fair when these models are marketed towards low-vision people ("Be My Eyes" https://www.youtube.com/watch?v=Zq710AKC1gg) as the equivalent of human vision. These models are implied to be human-level equivalents when they are not.

This paper demonstrates that there are still some major gaps where simple problems confound the models in unexpected ways. This is important work to elevate; otherwise, people may start to believe that these models are suitable for general application when they still need safeguards and copious warnings.




If we're throwing "citation needed" tags on stuff, how about the first sentence?

"Large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini-1.5 Pro are powering countless image-text processing applications"

I don't know how many a "countless" is, but I think we've gotten really sloppy about what counts, for LLMs, as a demonstrated, durable win on a concrete task attached to well-measured outcomes that holds up over even modest periods of time.

This stuff is really promising, and lots of builders are making lots of nifty things, so if that counts as an application then maybe we're at countless. But in the enterprise, in government, and in the refereed academic literature, we seem to be at the proof-of-concept phase. Impressive chatbots as a use case are pretty dialed in, and enough people claim that they help with coding that I tend to believe it's a real thing (though I never seem to come out ahead of going directly to the source, StackOverflow).

The amount of breathless press on this seems "countless", so maybe I missed the totally rigorous case study on how X company became Y percent more profitable by doing Z thing with LLMs (or similar), and if so I'd be grateful for citations, but neither Google nor any of the big models seem to know about it.


"maybe I missed the totally rigorous case study on how X company became Y percent more profitable by doing Z thing with LLMs (or similar), and if so I'd be grateful for citations, but neither Google nor any of the big models seem to know about it."

Goldman Sachs recently issued a report.

https://www.goldmansachs.com/intelligence/pages/gs-research/...

"We estimate that the AI infrastructure buildout will cost over $1tn in the next several years alone, which includes spending on data centers, utilities, and applications. So, the crucial question is: What $1tn problem will AI solve? Replacing low- wage jobs with tremendously costly technology is basically the polar opposite of the prior technology transitions I’ve witnessed in my thirty years of closely following the tech industry"


Yeah, really, if you look at human learning/seeing/acting, there is a feedback loop that an LLM, for example, isn't able to complete and train on.

You see an object. First you have to learn how to control all your body functions to move toward it and grasp it. This teaches you about the three-dimensional world and things like gravity. You may not know the terms, but it is baked into your learning model. After you get an object you start building a classification list: "hot", "sharp", "soft and fuzzy", "tasty", "slick". Your learning model builds up a list of properties of objects and "expected" properties of objects.

Once you have this 'database' you create as a human, you can apply the logic to achieve tasks: "Walk 10 feet forward, but avoid the sharp glass just to the left." You have to have spatial awareness, object awareness, and prediction ability.

Models 'kind of' have this, but it's seemingly haphazard, kind of like a child that doesn't know how to put all the pieces together yet. I think a lot of embodied robot testing, where the embodied model feeds training back to the LLM/vision model, will have to occur before this is even somewhat close to reliable.


Embodied is useful, but I think not necessary, even if you need learning in a 3D environment. Synthesized embodiment should be enough. While in some cases[0] it may have problems with fidelity, simulating embodied experience in silico scales much better, and more importantly, we have control over the flow of time. Humans always learn in real-time, while with simulated embodiment we could cram years of subjective-time experiences into a model in seconds, and then, for novel scenarios, spend an hour per second of subjective time running a high-fidelity physics simulation[1].

--

[0] - Like if you plugged a 3D game engine into the training loop.

[1] - Results of which we could hopefully reuse in training later. And yes, a simulation could itself be a recording of a carefully executed experiment in the real world.


> Humans always learn in real-time

In the sense that we can't fast-forward our offline training, sure, but humans certainly "go away and think about it" after gaining IRL experience. This process seems to involve training on that data both consciously and subconsciously. People often consciously think about recent experiences, run through imagined scenarios to simulate the outcomes, plan approaches for next time, etc., and even if they don't, they'll often perform better at a task after a break than they did at the start of it. If this process of replaying experiences and simulating variants of them isn't "controlling the flow of (simulated) time", I don't know what else you'd call it.


> Like if you plugged a 3D game engine into the training loop

Isn't this what synthesized embodiment basically always is? As long as the application of the resulting technology is in a restricted, well-controlled environment, as is the case for, say, an assembly-line robot, this is a great strategy. But I expect fidelity problems will ultimately make this technique a bad idea for anything that's supposed to interact with humans. Like self-driving cars, for example. Unless, again, those self-driving cars are segregated in dedicated lanes.


The paper I linked should hopefully mark me out as far from an AI apologist; it's actually really bad news for GenAI if correct. All I mean to say is that the clickbait conclusion and the evidence do not match up.


We have started the era of AI.

It really doesn't matter how good current LLMs are.

They have been good enough to start this era.

And no, it's not, and never has been, just LLMs. Look at what Nvidia is doing with ML.

Whisper: a huge advance. Segment Anything: again, huge. AlphaFold 2: again, huge.

All the robot announcements -> huge

I doubt we will reach AGI just through LLMs. We will reach AGI through multimodal models, mixture of experts, some kind of feedback loop, etc.

But the stone has started to roll.

And you know, I prefer to hear about AI advances for the next 10-30 years. That's a lot better than the crypto shit we had for the last 5 years.


We won't reach AGI in our lifetimes.


Be My Eyes user here. I disagree with your uninformed opinion. Be My Eyes is more often than not more useful than a human. And I am reporting from personal experience. What experience do you have?


"Simple" is a relative term. There are vision problems where monkeys are far better than humans. Some may look at human vision and memory and think that we lack basic skills.

With AI we are creating intelligence, but with different strengths and weaknesses. I think we will continue to be surprised at how well they work on some problems and how poorly they do at some "simple" ones.


I don’t see Be My Eyes or other similar efforts as “implied” to be equivalent to humans at all. They’re just new tools which can be very useful for some people.

“These new tools aren’t perfect” is the dog bites man story of technology. It’s certainly true, but it’s no different than GPS (“family drives car off cliff because GPS said to”).


Based take


I disagree. I think the title, abstract, and conclusion not only misrepresent the state of the models but also misrepresent their own findings.

They have identified a class of problems that the models perform poorly at and have given a good description of the failure. They portray this as a representative example of the behaviour in general. This has not been shown and is probably not true.

I don't think that models have been portrayed as equivalent to humans. Like most AI, it has been shown as vastly superior in some areas and profoundly ignorant in others. Media can overblow things and enthusiasts can talk about future advances as if they have already arrived, but I don't think these are typical portrayals by the AI field in general.


Exactly... I've found GPT-4o to be good at OCR for instance... doesn't seem "blind" to me.


Well, maybe not blind, but the analogy with myopia might stand.

For example, in the case of OCR, a person with myopia will usually be able to make out letters and words even without his glasses, based on his expectation (similar to VLM training) of seeing letters and words in, say, a sign. He might not see them all clearly and may make some errors, but he might recognize some letters easily and fill in the rest based on context, word recognition, etc. Basically, experience.

I also have a funny anecdote about my partner, who has severe myopia, and who once found herself outside her house without her glasses on and saw something on the grass right in front of her. She told her then brother-in-law, "Look, a squirrel!", only for the "squirrel" to take off with its typical caws. It was a crow. This is typical of VLM hallucinations.


I know that GPT-4o is fairly poor at recognizing sheet music and notes. Totally off the mark; more often than not, even the first note in a first-week solfège book is not recognized.

So unless I missed something, as far as I am concerned, they are optimized for benchmarks.

So while I enjoy gen AI, its image-to-text is highly subpar.


Most adults with 20/20 vision will also fail to recognize the first note in a first-week solfège book.


Useful to know, thank you!


You don't really need an LLM for OCR. Hell, I suppose they just run a Python script in its VM and rephrase the output.

At least that's what I would do. Perhaps the script would be a "specialist model" in a sense.


It's not that you need an LLM for OCR; it's that an LLM can do OCR (and handwriting recognition, which is much harder) despite not being made specifically for that purpose, which is indicative of something. The jump from knowing "this is a picture of a paper with writing on it", like what you get with CLIP, to being able to reproduce what's on the paper is, to me, close enough to seeing that the difference isn't meaningful anymore.


GPT-4V is provided with OCR


That's a common misconception.

Sometimes, if you upload an image to ChatGPT and ask for OCR, it will run Python code that executes Tesseract, but that's effectively a bug: GPT-4 vision works much better than that, and it will use GPT-4 vision if you tell it "don't use Python" or similar.
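
For the curious, a minimal sketch of the kind of script that tool-use path might generate. This is an assumption on my part, not ChatGPT's actual code; it uses the standard pytesseract wrapper, and the filename is made up:

    # Hypothetical reconstruction of such a tool-use script.
    # Assumes Tesseract plus the pytesseract and Pillow packages
    # are installed (pip install pytesseract pillow).
    from PIL import Image
    import pytesseract

    image = Image.open("uploaded_image.png")   # the user's upload
    text = pytesseract.image_to_string(image)  # run Tesseract OCR
    print(text)                                # the model then rephrases this output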


No reason to believe that. Open source VLMs can do OCR.[1]

[1] https://huggingface.co/spaces/opencompass/open_vlm_leaderboa...


I think the conclusion of the paper is far more mundane. It's curious that a VLM can recognize complex novel objects in a trained category but cannot perform basic visual tasks that human toddlers can (e.g. recognizing when two lines intersect or when two circles overlap). Nevertheless, I'm sure these models can explain in great detail what intersecting lines are, and even what they look like. So while LLMs might have image-processing capabilities, they clearly do not see the way humans see. That, I think, would be a more apt title for their abstract.
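
For reference, both tasks are trivial to check programmatically, which is part of what makes the failures striking. A quick sketch of the underlying geometry (my own illustration, not from the paper):

    import math

    # Two circles overlap iff the distance between their centers
    # is less than the sum of their radii.
    def circles_overlap(x1, y1, r1, x2, y2, r2):
        return math.hypot(x2 - x1, y2 - y1) < r1 + r2

    # Orientation (cross product) of point (bx, by) relative to the
    # directed line from (ox, oy) to (ax, ay).
    def cross(ox, oy, ax, ay, bx, by):
        return (ax - ox) * (by - oy) - (ay - oy) * (bx - ox)

    # Segments p1-p2 and p3-p4 properly cross iff each segment's
    # endpoints lie on opposite sides of the other segment
    # (collinear/touching cases ignored for brevity).
    def segments_intersect(p1, p2, p3, p4):
        d1 = cross(*p3, *p4, *p1)
        d2 = cross(*p3, *p4, *p2)
        d3 = cross(*p1, *p2, *p3)
        d4 = cross(*p1, *p2, *p4)
        return d1 * d2 < 0 and d3 * d4 < 0

    print(circles_overlap(0, 0, 1, 1.5, 0, 1))                 # True: 1.5 < 2
    print(segments_intersect((0, 0), (2, 2), (0, 2), (2, 0)))  # True: cross at (1, 1)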


Ah yes, the blind person who constantly needs to know if two lines intersect.

Let's just ignore what a blind person normally needs to know.

You know what blind people ask? Sometimes their daily routine is broken because there is some type of construction, and models can tell you this.

Sometimes they need to read a basic sign and models can do this.

Those models help people already and they will continue to get better.

I'm not sure if I'm more frustrated by how condescending the authors are or by your ignorance.

Valid criticism doesn't need to be shitty


As an aside... this is what a valid use case for a blind person with an app looked like back in 2016.

Seeing AI 2016 Prototype - A Microsoft research project - https://youtu.be/R2mC-NUAmMk

https://www.seeingai.com has the actual working apps.

I recall showing the 2016 version (pun not intended) to a coworker who had some significant vision impairments, and he was really excited about what it could do back then.

---

I still remain quite impressed with its ability to parse the picture and the likely reasoning behind it: https://imgur.com/a/JZBTk2t



