Hacker News | sweezyjeezy's comments

Not particularly; the only thing I can think of is if we analysed it and saw there was some bias in the digits, but no one expects that (pi should be a 'normal number' [1]). I think they did it as a flex of their hardware.

[1] https://en.wikipedia.org/wiki/Normal_number


Isn't there a non-zero chance that, given an infinite number of digits, the probability of finding repeats of pi, each a bit longer, increases until a perfect, endless repeat of pi is eventually found, thus nullifying pi's own infinity?


No, because it would create a contradiction. If a "perfect, endless repeat of pi" were eventually found (say, starting at the nth digit), then you can construct a rational number (a fraction with an integer numerator and denominator) that precisely matches it. However, pi is provably irrational, meaning no such pair of integers exists. That produces a contradiction, so the initial assumption that a "perfect, endless repeat of pi" exists cannot be true.


Yes, and that contradiction is already present in my premise, which is the point. Pi, if an infinite stream of digits and with the prime characteristic it is normal/random, will, at some point include itself, by chance. Unless, not random...

This applies to every normal, "irrational" number, the name with which I massively agree, because the only way they can be not purely random suggests they are compressible further and so they have to be purely random, and thus... can't be.

It is a completely irrational concept, thinking rationally.


> Pi, if an infinite stream of digits and with the prime characteristic it is normal/random, will, at some point include itself, by chance.

What you are essentially saying is that pi = 3.14....pi...........

If that were the case, wouldn't it mean that the digits of pi are not countably infinite but instead form a continuum? So you wouldn't be able to put the digits of pi in one-to-one correspondence with the natural numbers. But obviously we can, so shouldn't our default be to assume our premise was wrong?

> It is a completely irrational concept, thinking rationally.

It is definitely interesting to think about.


The belief that a normal number must eventually contain itself arises from extremely flawed thinking about probability. Like djkorchi mentioned above, if we knew pi = 3.14....pi..., that would mean pi = 3.14... + 10^-n pi for some n, meaning (1 - 10^-n) pi = 3.14... and pi = (3.14...) / (1 - 10^-n), aka a rational number.
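
To spell out that algebra as a worked equation (a sketch; here A denotes the terminating decimal formed by the n digits before the hypothetical repeat):

  \pi = A + 10^{-n}\pi
  \;\Rightarrow\; (1 - 10^{-n})\,\pi = A
  \;\Rightarrow\; \pi = \frac{A}{1 - 10^{-n}} = \frac{10^n A}{10^n - 1}

Since 10^n A and 10^n - 1 are both integers, pi would be rational, contradicting its known irrationality.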


> The belief that a normal number must eventually contain itself arises from extremely flawed thinking about probability.

Yes. There is an issue with the premise as it leads to a contradiction.

> Like djkorchi mentioned above, if we knew pi = 3.14....pi..., that would mean pi = 3.14... + 10^-n pi for some n, meaning (1 - 10^-n) pi = 3.14... and pi = (3.14...) / (1 - 10^-n), aka a rational number.

Yes. If pi = 3.14...pi (pi repeats at the end), then it is rational, as the ending pi would itself contain an ending pi and it would repeat forever (hence a rational number). I thought the guy was talking about pi containing pi somewhere within itself.

Say pi = 3.14...pi... (where the second ... represents an infinite series of digits). Then we would never reach the second set of ... and the digits of pi would not be enumerable.

So if pi cannot be contained within (anywhere in the middle of pi) and pi cannot be contained at the end, then pi must not contain pi.


> If that were the case, wouldn't it mean that the digits of pi are not countably infinite but instead form a continuum?

No; combining two countably infinite sets doesn't increase the cardinality of the result (because two is finite). Combining one finite set with one countably infinite set won't give you an uncountable result either. The digits would still be countably infinite.

Looking at this from another direction, it is literally true that, when x = 1/7, x = 0.142....x.... , but it is obviously not true that the decimal expansion of 1/7 contains uncountably many digits.
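
A quick check of that 1/7 identity, as a minimal Python sketch using exact fractions:

  from fractions import Fraction

  x = Fraction(1, 7)
  prefix = Fraction(142857, 10**6)  # 0.142857, the first six digits
  # 1/7 = 0.142857 + 10^-6 * (1/7): the expansion "contains itself" after six
  # digits, yet it still has only countably many digits.
  assert x == prefix + x / 10**6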


> No; combining two countably infinite sets doesn't increase the cardinality of the result (because two is finite).

Agreed. But pi = 3.14...pi... isn't combining 2 infinite sets. It's 'combining' infinitely many infinite sets, and not in a linear fashion either.

You have to keep in mind the 2nd pi in the equation can be expanded to 3.14...pi...

pi = 3.14...pi... when expanded is pi = 3.14...(3.14...pi...)...

and you can keep expanding the inner pi forever.

> The digits would still be countably infinite.

How can you ever reach the first number after the inner pi in (pi = 3.14...pi...)? Or, put another way, how do you get to the 4th '.'? You can't.

This is a classic example of the difference between a countable infinity and a continuum.


> Pi, if an infinite stream of digits and with the prime characteristic it is normal/random, will, at some point include itself, by chance.

A normal number would mean that every finite sequence of digits is contained within the number. It does not follow that the number contains every infinite sequence of digits.

In general, something that holds for all finite x does not necessarily hold for infinite x as well.
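
A concrete example may help here (a standard one: the base-10 Champernowne constant, formed by concatenating the positive integers):

  C_{10} = 0.1\,2\,3\,4\,5\,6\,7\,8\,9\,10\,11\,12\,\ldots

C_10 is normal in base 10, so it contains every finite digit string, yet it cannot contain the infinite string 000... (its tail would then be all zeros, making it rational, and it is irrational), and by the same tail argument it cannot contain itself.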


Exactly - and when you remove the assumptions, what's left?

Pi is assumed to be infinite, random, and normal. The point here is not that these assumptions may be wrong. Underneath them may sit a greater point: that irrationality is defined in a contradictory way - which may be correct, or not, or both.

Given that the proof pi is infinite rests on irrationality, it is rather an important issue. Pi may not be infinite, and a great place to observe that may be Planck.


> A normal number would mean that every finite sequence of digits is contained within the number.

Is that true? I don't see how that could be true. The sequence 0-9 repeated infinitely is, by definition, a normal number (in that the distribution of digits is uniform)

...and yet nowhere in that sequence does "321" appear ...or "654" ...or "99"

There are an infinite number of combinations of digits that do not appear in that normal number I've just described. So, I don't think your statement is true.


> I don't see how that could be true. The sequence 0-9 repeated infinitely is, by definition, a normal number (in that the distribution of digits is uniform)

Well, your first problem is that you don't know the definition of a normal number. Your second problem is that this statement is clearly false.

Here's Wolfram Alpha:

> A normal number is an irrational number for which any finite pattern of numbers occurs with the expected limiting frequency in the expansion in a given base (or all bases). For example, for a normal decimal number, each digit 0-9 would be expected to occur 1/10 of the time, each pair of digits 00-99 would be expected to occur 1/100 of the time, etc. A number that is normal in base-b is often called b-normal.

Your "counterexample" is not a normal number in any sense, most obviously because it isn't irrational, but only slightly less obviously because, as you note yourself, the sequences "321", "654", and "99" do not ever appear.


> Your "counterexample" is not a normal number in any sense, most obviously because it isn't irrational, but only slightly less obviously because, as you note yourself, the sequences "321", "654", and "99" do not ever appear.

lol. Your counterargument is a tautology because it contains "the sequences "321", "654", and "99" do not ever appear."

It's like if you claim, "A has the property B" then I say, "based on this definition, I don't think A has property B"

Then you say, "if it doesn't have property B, then it's not A"

...okay, but my point is, the definition that I had (from wikipedia) doesn't imply B. So for you to say, "if it doesn't have B, then it's not A" is just circular.

Now, you can point out that the definition I got from wikipedia is different from the one you got from wolfram. That's fine. That's also true. And you can argue that the definition you used does indeed imply B.

But what you cannot do is use B as part of the definition, when that's the thing I'm asking you to demonstrate.

You: all christians are pro-life

Me: I don't see how that's true. Here's the definition of christianity. I don't see how it necessarily implies being against abortion.

You: your """"counterexample"""" (sarcastic quotes to show how smart I am) is obviously wrong because, as you note yourself, that person is pro-choice, therefore, not a christian.

^^^^^ do you see how this exchange inappropriately uses the thing you're being asked to prove, which is that christians are pro-life, as a component of the argument?

Again, it's totally cool if you find a different definition of christian that explicitly requires they be pro-life. But given that I didn't use that definition, that doesn't make it the slam dunk you imagine.


> But given that I didn't use that definition, that doesn't make it the slam dunk you imagine.

You might have a better argument if there were more than one relevant definition of a normal number. As you should have read in the other responses to your comment, the definition given on wikipedia does not differ from the one given on Wolfram Alpha.

> And you can argue that the definition you used does indeed imply B.

Given that the implication of "B" is stated directly within the definition ("For example, ..."), this seemed unnecessary.

> but my point is, the definition that I had (from wikipedia) doesn't imply B. So for you to say, "if it doesn't have B, then it's not A" is just circular.

Look at it this way:

1. You provided a completely spurious definition, which you obviously did not get from wikipedia.

2. You provided a number satisfying your spurious definition, which - not being normal - didn't have the properties of a normal number.

3. I responded that you weren't using the definition of a normal number.

4. And I also responded that it's easy to see that the number you provided is not normal, because it doesn't have the properties that a normal number must have.

Try to identify the circular part of the argument.

And, consider whether it's cause for concern that you believe you got a definition of "normal number" from wikipedia when that definition of "normal number" is not available on wikipedia.


> Try to identify the circular part of the argument.

I did. Should I repeat it?


It depends on your definition of "normal number". You seem to be using what wikipedia[1] calls "simply normal", which is that every digit appears with equal probability.

What people usually call "normal number" is much stronger: a number is normal if, when you write it in any base b, every n-digit sequence appears with the same probability 1/b^n.

[1] https://en.wikipedia.org/wiki/Normal_number


IIRC the property ‘each single digit has the same density’ is the definition for a ‘simply normal number’ (in a given base), while ‘each finite string of a particular length has the same density as all other strings of that length’ is the definition for a ‘normal number’ (in a given base). And then ‘normal in all bases’ is sometimes called ‘absolutely normal’, or just ‘normal’ without reference to a base.
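
For reference, the standard definitions in symbols (a sketch, writing N(w, x, k) for the number of occurrences of the digit string w among the first k base-b digits of x):

  \text{simply normal in base } b:\quad \lim_{k\to\infty} \frac{N(d,x,k)}{k} = \frac{1}{b} \text{ for every digit } d
  \text{normal in base } b:\quad \lim_{k\to\infty} \frac{N(w,x,k)}{k} = b^{-|w|} \text{ for every finite string } w
  \text{absolutely normal:}\quad \text{normal in every base } b \ge 2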


The chance of that loop repeating forever is 0.


  Infinity has entered the chat.


In this case, the infinite sum

  0+0+0+0+…
is still zero.


You're using "AI" quite broadly here. Here's a perspective from computer vision (my field).

For decades, CV was focused on trying to 'understand' how to do the task. This meant a lot of hand-crafting of low-level features that are common in images, and finding clever ways to make them invariant to typical 3D transformations. This works well for some tasks, and is still used today in things like robotics, SLAM, etc. However, when we then want to add an extra level of complexity - e.g. to try and model an abstract concept like "cat" - we hit a bit of a brick wall. This happens to be a task where feeding a large dataset into a (mostly) unconstrained machine learning model does very well.

> The classical approach was to understand how genes transcribe to mRNA, and how mRNA translates to polypeptides; how those are cleaved by the cell, and fold in 3D space; and how those 3D shapes results in actual biological function.

I don't have the expertise to critique this, but it does sound like we're in the extreme 'high complexity' zone to me. Some questions for you:

- how accurate does each stage of this need to be to get useful performance? Are you sure there are no brick walls here? How long do you think this approach will take to deliver results?

- do you not have to validate a surprising classical finding in the same way that you would an AI model - i.e. how much does the "why" matter? "the AI can never provide a guarantee of correctness" is true, but what if it was merely extremely accurate, in the same way that many computer vision models are?


> do you not have to validate a surprising classical finding in the same way that you would an AI model - i.e. how much does the "why" matter? "the AI can never provide a guarantee of correctness" is true, but what if it was merely extremely accurate, in the same way that many computer vision models are?

The lack of asking "why" is one of my biggest frustrations in much of the research I have seen in biology and genetics today. The why is hugely important, without knowing why something happens or how it works we're left only with knowing what happened. When we go to use that as knowledge we have no idea what unintended side effects may occur and no real information telling us where to look or how to identify side effects should they occur.

Researching what happens when we throw crap at the wall can occasionally lead to a sellable product but is a far cry from the scientific method.


I mean - it's more than a sellable product; the reason we're doing this is to be able to advance medicine. A good understanding of the "why" would be great, but if we can advance medicine quicker in the here and now without it, I think that's worth doing?

> When we go to use that as knowledge we have no idea what unintended side effects may occur and no real information telling us where to look or how to identify side effects should they occur.

Alright and what if this is also a lot quicker to solve with AI?


> I mean - it's more than a sellable product, the reason we're doing this is to be able to advance medicine

I get this approach for trauma care, but that's not really what we're talking about here. With medicine, how do we know we aren't making things worse without knowing how and why it works? We can focus on immediate symptom relief, but that's a very narrow window with regards to unintended harm.

> Alright and what if this is also a lot quicker to solve with AI?

Can we really call it solved if we don't know how or why it works, or what the limitations are?

It's extremely important to remember that we don't have Artificial Intelligence today; we have LLMs and similar tools designed to mimic human behaviors. An LLM will never invent a medical treatment or medication, or more precisely it may invent one by complete accident and it will look exactly like all the wrong answers it gave along the way. LLMs are tasked with answering questions in a way that statistically matches what humans might say, with variance based on randomness factors and a few other control knobs.

If we do get to actual AI that's a different story. It takes intelligence to invent these new miracle cures we hope they will invent. The AI has to reason about how the human body works, complex interactions between the body, environment, and any interventions, and it has to reason through the necessary mechanisms for a novel treatment. It would also need to understand how to model these complex systems in ways that humans have yet to figure out; if we could already model the human body in a computer algorithm we wouldn't need AI to do it for us.

Even at that point, let's say an AI invents a cure for cancer. Is that really worth all the potential downsides of all the dangerous things such a powerful AI could do? Is a cure for cancer worth knowing that the same AI could also be used to create bioweapons on a level that no human would be able to create? And that doesn't even get into the unknown risks of what an AI would want to do for itself, what its motivations would be, or what emotions and consciousness would look like when they emerge in an entirely new evolutionary system separate from biological life.


> how much does the "why" matter? [...] merely extremely accurate, in the same way that many computer vision models are?

Because without a "why" (causal reasoning) they cannot generalize, and their accuracy is always liable to tank when they encounter out-of-(training)-distribution samples. And when an ML system is deployed among other live actors, they are highly incentivized to figure out how to perturb inputs to exploit the system. Adversarial examples in computer vision, adversarial prompts / jailbreaks for large language models, etc.


Entertaining, but I think the conclusion is way off.

> their vision is, at best, like that of a person with myopia seeing fine details as blurry

is a crazy thing to write in an abstract. Did they try to probe that hypothesis at all? I could (well actually I can't) share some examples from my job of GPT-4v doing some pretty difficult fine-grained visual tasks that invalidate this.

Personally, I rate this paper [1], which makes the argument that these huge GenAI models are pretty good at things - assuming they have seen a LOT of that type of data during training (which is true of a great many things). If you make up tasks like this, then yes, they can be REALLY bad at them, and initial impressions of AGI get harder to justify. But in practice, we aren't just making up tasks to trip up these models. They can be very performant on some tasks, and the authors have not presented any real evidence about these two modes.

[1] https://arxiv.org/abs/2404.04125


There are quite a few "ai apologists" in the comments but I think the title is fair when these models are marketed towards low vision people ("Be my eyes" https://www.youtube.com/watch?v=Zq710AKC1gg) as the equivalent to human vision. These models are implied to be human level equivalents when they are not.

This paper demonstrates that there are still some major gaps where simple problems confound the models in unexpected ways. This is important work to elevate, otherwise people may start to believe that these models are suitable for general application when they still need safeguards and copious warnings.


If we're throwing "citation needed" tags on stuff, how about the first sentence?

"Large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini-1.5 Pro are powering countless image-text processing applications"

I don't know how many a "countless" is, but I think we've gotten really sloppy in terms of what counts for LLMs as a demonstrated, durable win in a concrete task attached to well-measured outcomes and holding up over even modest periods of time.

This stuff is really promising and lots of builders are making lots of nifty things, so if that counts as an application then maybe we're at countless, but in the enterprise and in government and in refereed academic literature we seem to be at the proof-of-concept phase. Impressive chat bots as a use case are pretty dialed in; enough people claim that they help with coding that I tend to believe it's a real thing (I never seem to come out ahead of going directly to the source, StackOverflow).

The amount of breathless press on this seems "countless", so maybe I missed the totally rigorous case study on how X company became Y percent more profitable by doing Z thing with LLMs (or similar), and if so I'd be grateful for citations, but neither Google nor any of the big models seem to know about it.


"maybe I missed the totally rigorous case study on how X company became Y percent more profitable by doing Z thing with LLMs (or similar), and if so I'd be grateful for citations, but neither Google nor any of the big models seem to know about it."

Goldman Sachs recently issued a report.

https://www.goldmansachs.com/intelligence/pages/gs-research/...

"We estimate that the AI infrastructure buildout will cost over $1tn in the next several years alone, which includes spending on data centers, utilities, and applications. So, the crucial question is: What $1tn problem will AI solve? Replacing low- wage jobs with tremendously costly technology is basically the polar opposite of the prior technology transitions I’ve witnessed in my thirty years of closely following the tech industry"


Yea, really, if you look at human learning/seeing/acting there is a feedback loop that an LLM, for example, isn't able to complete and train on.

You see an object. First you have to learn how to control all your body functions to move toward it and grasp it. This teaches you about the 3 dimensional world and things like gravity. You may not know the terms, but it is baked in your learning model. After you get an object you start building a classification list "hot", "sharp", "soft and fuzzy", "tasty", "slick". Your learning model builds up a list of properties of objects and "expected" properties of objects.

Once you have this 'database' you create as a human, you can apply the logic to achieve tasks. "Walk 10 feet forward, but avoid the sharp glass just to the left". You have to have spatial awareness, object awareness, and prediction ability.

Models 'kind of' have this, but it's seemingly haphazard, kind of like a child that doesn't know how to put all the pieces together yet. I think a lot of embodied robot testing, where the embodied model feeds back training to the LLM/vision model, will have to occur before this is even somewhat close to reliable.


Embodied is useful, but I think not necessary even if you need learning in a 3D environment. Synthesized embodiment should be enough. While in some cases[0] it may have problems with fidelity, simulating embodied experience in silico scales much better, and more importantly, we have control over time flow. Humans always learn in real-time, while with simulated embodiment, we could cram years of subjective-time experiences into a model in seconds, and then for novel scenarios, spend an hour per each second of subjective time running a high-fidelity physics simulation[1].

--

[0] - Like if you plugged a 3D game engine into the training loop.

[1] - Results of which we could hopefully reuse in training later. And yes, a simulation could itself be a recording of carefully executed experiment in real world.


> Humans always learn in real-time

In the sense that we can't fast-forward our offline training, sure, but humans certainly "go away and think about it" after gaining IRL experience. This process seems to involve both consciously and subconsciously training on this data. People often consciously think about recent experiences, run through imagined scenarios to simulate the outcomes, plan approaches for next time etc. and even if they don't, they'll often perform better at a task after a break than they did at the start of the break. If this process of replaying experiences and simulating variants of them isn't "controlling the flow of (simulated) time" I don't know what else you'd call it.


> Like if you plugged a 3D game engine into the training loop

Isn't this what synthesized embodiment basically always is? As long as the application of the resulting technology is in a restricted, well controlled environment, as is the case for example for an assembly-line robot, this is a great strategy. But I expect fidelity problems will make this technique ultimately a bad idea for anything that's supposed to interact with humans. Like self-driving cars, for example. Unless, again, those self-driving cars are segregated in dedicated lanes.


The paper I linked should hopefully mark me out as far from an AI apologist, it's actually really bad news for GenAI if correct. All I mean to say is the clickbait conclusion and the evidence do not match up.


We have started the era of AI.

It really doesn't matter how good current LLMs are.

They have been good enough to start this era.

And no, it's not and never has been just LLMs. Look at what Nvidia is doing with ML.

Whisper: huge advance. Segment Anything: again huge. AlphaFold 2: again huge.

All the robot announcements -> huge

I doubt we will reach AGI just through LLMs. We will reach AGI through multi-modal, mixture of experts, some kind of feedback loop, etc.

But the stone has started to roll.

And you know, I prefer to hear about AI advances for the next 10-30 years. That's a lot better than the crypto shit we had the last 5 years.


We won't reach agi in our lifetimes.


Be My Eyes user here. I disagree with your uninformed opinion. Be My Eyes is more often than not more useful than a human. And I am reporting from personal experience. What experience do you have?


'Simple' is a relative term. There are vision problems where monkeys are far better than humans. Some may look at human vision and memory and think that we lack basic skills.

With AI we are creating intelligence, but with different strengths and weaknesses. I think we will continue to be surprised at how well these systems work on some problems and how poorly they do at some “simple” ones.


I don’t see Be My Eyes or other similar efforts as “implied” to be equivalent to humans at all. They’re just new tools which can be very useful for some people.

“These new tools aren’t perfect” is the dog bites man story of technology. It’s certainly true, but it’s no different than GPS (“family drives car off cliff because GPS said to”).


Based take


I disagree. I think the title, abstract, and conclusion not only misrepresent the state of the models but also misrepresent their own findings.

They have identified a class of problems that the models perform poorly at and have given a good description of the failure. They portray this as a representative example of the behaviour in general. This has not been shown and is probably not true.

I don't think that models have been portrayed as equivalent to humans. Like most AI, they have been shown to be vastly superior in some areas and profoundly ignorant in others. Media can overblow things and enthusiasts can talk about future advances as if they have already arrived, but I don't think these are typical portrayals by the AI field in general.


Exactly... I've found GPT-4o to be good at OCR for instance... doesn't seem "blind" to me.


Well, maybe not blind, but the analogy with myopia might stand.

For example, in the case of OCR, a person with myopia will usually be able to make out letters and words even without his glasses, based on his expectation (similar to VLM training) of seeing letters and words in, say, a sign. He might not see them all clearly and might make some errors, but he might recognize some letters easily and make up the rest based on context, word recognition, etc. Basically experience.

I also have a funny anecdote about my partner, who has severe myopia and who once found herself outside her house without her glasses on, and saw something on the grass right in front. She told her then brother-in-law, "look, a squirrel", only for the "squirrel" to take off while shouting its typical caws. It was a crow. This is typical of VLM hallucinations.


I know that GPT-4o is fairly poor at recognizing sheet music and notes. It is totally off the mark more often than not - even the first note in a first-week solfège book is not recognized.

So unless I missed something, as far as I am concerned, they are optimized for benchmarks.

So while I enjoy gen AI, image-to-text is highly subpar.


Most adults with 20/20 vision will also fail to recognize the first note on a first week solfege book.


Useful to know, thank you!


You don't really need an LLM for OCR. Hell, I suppose they just run a python script in its VM and rephrase the output.

At least that's what I would do. Perhaps the script would be a "specialist model" in a sense.
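
Something like this minimal sketch, say (assuming a local Tesseract install; this is speculation about what such a 'specialist' script could look like, not what ChatGPT actually runs, and the filename is made up):

  from PIL import Image
  import pytesseract

  # OCR the uploaded image, then hand the raw text back to the LLM to rephrase.
  text = pytesseract.image_to_string(Image.open("upload.png"))
  print(text)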


It's not that you need an LLM for OCR but the fact that an LLM can do OCR (and handwriting recognition which is much harder) despite not being made specifically for that purpose is indicative of something. The jump from knowing "this is a picture of a paper with writing on it" like what you get with CLIP to being able to reproduce what's on the paper is, to me, close enough to seeing that the difference isn't meaningful anymore.


GPT-4v is provided with OCR


That's a common misconception.

Sometimes if you upload an image to ChatGPT and ask for OCR it will run Python code that executes Tesseract, but that's effectively a bug: GPT-4 vision works much better than that, and it will use GPT-4 vision if you tell it "don't use Python" or similar.


No reason to believe that. Open source VLMs can do OCR.[1]

[1] https://huggingface.co/spaces/opencompass/open_vlm_leaderboa...


I think the conclusion of the paper is far more mundane. It's curious that VLM can recognize complex novel objects in a trained category, but cannot perform basic visual tasks that human toddlers can perform (e.g. recognizing when two lines intersect or when two circles overlap). Nevertheless I'm sure these models can explain in great detail what intersecting lines are, and even what they look like. So while LLMs might have image processing capabilities, they clearly do not see the way humans see. That, I think, would be a more apt title for their abstract.


Ah yes the blind person who constantly needs to know if two lines intersect.

Let's just ignore what a blind person normally needs to know.

You know what blind people ask? Sometimes their daily routine is broken because there is some type of construction, and models can tell you this.

Sometimes they need to read a basic sign and models can do this.

Those models help people already and they will continue to get better.

I'm not sure if I'm more frustrated how condescending the authors are or your ignorance.

Valid criticism doesn't need to be shitty


As an aside... from 2016 this is what was a valid use case for a blind person with an app.

Seeing AI 2016 Prototype - A Microsoft research project - https://youtu.be/R2mC-NUAmMk

https://www.seeingai.com are the actual working apps.

The version from 2016 I recall showing (pun not intended) to a coworker who had some significant vision impairments and he was really excited about what it could do back then.

---

I still remain quite impressed with its ability to parse the picture and the likely reason behind it https://imgur.com/a/JZBTk2t


Entertaining is indeed the right word. Nice job identifying corner cases of models' visual processing; curiously, they're not far conceptually from some optical illusions that reliably trip humans up. But to call the models "blind" or imply their low performance in general? That's trivially invalidated by just taking your phone out and feeding a photo to ChatGPT app.

Like, seriously. One poster below whines about "AI apologists" and BeMyEyes, but again, it's all trivially testable with your phone and $20/month subscription. It works spectacularly well on real world tasks. Not perfectly, sure, but good enough to be useful in practice and better than alternatives (which often don't exist).


> their vision is, at best, like that of a person with myopia seeing fine details as blurry

It's not that far from reality; most models see images in very low resolution with limited colors, so it's not so far from this description.


They didn't test that claim at all though. Vision isn't some sort of 1D sliding scale with every vision condition lying along one axis.

First of all, myopia isn't 'seeing fine details as blurry' - it's nearsightedness - and whatever else this post tested, it definitely didn't test depth perception.

And second - inability to see fine details is a distinct/different thing from not being able to count intersections and the other things tested here. That hypothesis, if valid, would imply that improving the resolution of the image that the model can process would improve its performance on these tasks even if reasoning abilities were the same. That - does not make sense. Plenty of the details in these images that these models are tripping up on are perfectly distinguishable at low resolutions. Counting rows and columns of blank grids is not going to improve with more resolution.

I mean, I'd argue that the phrasing of the hypothesis ("At best, like that of a person with myopia") doesn't make sense at all. I don't think a person with myopia would have any trouble with these tasks if you zoomed into the relevant area, or held the image close. I have a very strong feeling that these models would continue to suffer on these tasks if you zoomed in. Nearsighted != unable to count squares.


It seems to me they've brought up myopia only to make it more approachable to people how blurry something is, implying they believe models work with a blurry image just like a nearsighted person sees blurry images at a distance.

While myopia is common, it's not the best choice of analogy and "blurry vision" is probably clear enough.

Still, I'd only see it as a bad choice of analogy — I can't imagine anyone mistaking optical focus problems for static image processing problems — so in the usual HN recommendation, I'd treat their example in the most favourable sense.


My thoughts as well. I too would have trouble with the overlapping lines tests if all the images underwent convolution.


> these huge GenAI models are pretty good at things

Is this the sales pitch though? Because 15 years ago, I had a scanner with an app that could scan a text document and produce the text on Windows. The machine had something like 256MB of RAM.

Tech can be extremely good at niches in isolation. You can have an OCR system 10 years ago and it'll be extremely reliable at the single task it's configured to do.

AI is supposed to bring a new paradigm, where the tech is not limited to the specific niche the developers have scoped it to. However, if it reliably fails to detect simple things a regular person should not get wrong, then the whole value proposition is kicked out of the window.


>I could (well actually I can't)

I like the idea that these models are so good at some sort of specific and secret bit of visual processing that things like “counting shapes” and “beating a coin toss for accuracy” shouldn’t be considered when evaluating them.


LLMs are bad at counting things just in general. It’s hard to say whether the failures here are vision based or just an inherent weakness of the language model.


Those don't really have anything to do with fine detail/nearsightedness. What they measured is valid/interesting - what they concluded is unrelated.


> Did they try to probe that hypothesis at all?

I think this is a communication issue and you're being a bit myopic in your interpretation. It is clearly an analogy meant for communication and is not an actual hypothesis. Sure, they could have used a better analogy and they could have done other tests, but the paper still counters quite common claims (from researchers) about VLMs.

> I could (well actually I can't) share some examples from my job of GPT-4v doing some pretty difficult fine-grained visual tasks that invalidate this.

I find it hard to believe that there is no example you can give. It surely doesn't have to be exactly your training data. If it is this good, surely you can create an example no problem. If you just don't want to, that's okay, but then don't say it.

But I have further questions. Do you have complicated prompting? Or any prompt engineering? It sure does matter how robust these models are to prompting. There's a huge difference between a model being able to accomplish a task and a model being able to perform a task in a non-very-specific environment. This is no different than something working in a tech demo and not in the hands of the user.

> But in practice, we aren't just making up tasks to trip up these models.

I see this sentiment quite often and it is baffling to me.

First off, these tasks are not clearly designed to trick these models. A model failing at a task does not mean the task was "designed to trick the model." It's common with the river crossing puzzles, where they're rewritten to be like "all animals can fit in the boat." If that is "designed to trick a model", then the model must be a stochastic parrot and not a generalist. It is very important that we test things where we do know the answer, because, unfortunately, we're not clairvoyant and can't test questions we don't know the answer to - which is the common case in real-world usage.

Second, so what if a test was designed to trick a model? Shouldn't we be determining when and where models fail? Is that not a critical question in understanding how to use them properly? This seems doubly important if they are tasks that humans don't have challenges with.

> They can be very performant on some tasks and the authors have not presented any real evidence about these two modes.

I don't think people are claiming that large models can't be performant on some tasks. If they are, they're rejecting trivially verifiable reality. But not every criticism has to also contain positive points. There's plenty of papers and a lot of hype already doing that. And if we're going to be critical of anything, shouldn't it be that the companies creating these models -- selling them, and even charging researchers to perform these types of experiments that can be and are used to improve their products -- should be much more clear about the limitations of their models? If we need balance, then I think there's bigger fish to fry than Auburn and Alberta Universities.


> Second, so what if a test was designed to trick a model? Shouldn't we be determining when and where models fail? Is that not a critical question in understanding how to use them properly?

People are rushing to build this AI into all kinds of products, and they actively don’t want to know where the problems are.

The real world outside is designed to trip up the model. Strange things happen all the time.

Because software developers have no governing body, no oaths of ethics and no spine someone will end up dead in a ditch from malfunctioning AI.


> The real world outside is designed to trip up the model. Strange things happen all the time.

Counterpoint: real world is heavily sanitized towards things that don't trip human visual perception up too much, or otherwise inconvenience us. ML models are trained on that, and for that. They're not trained for dealing with synthetic images, that couldn't possibly exist in reality, and designed to trip visual processing algorithms up.

Also:

> People are rushing to build this AI into all kinds of products, and they actively don’t want to know where the problems are.

Glass half-full (of gasoline) take: those products will trip over real-world problems, identifying them in the process, and the models will get better walking over the corpses of failed AI-get-rich-quick companies. The people involved may not want to know where the problems are, but by deploying the models, they'll reveal those problems to all.

> Because software developers have no governing body, no oaths of ethics and no spine someone will end up dead in a ditch from malfunctioning AI.

That, unfortunately, I 100% agree with. Though AI isn't special here - not giving a fuck kills people regardless of the complexity of software involved.


> They're not trained for dealing with synthetic images, that couldn't possibly exist in reality, and designed to trip visual processing algorithms up

Neither of these claims is true. ML is highly trained on synthetic images. In fact, synthetic data generation is the way forward for the 'scale is all you need' people. And there are also loads of synthetic images out in the wild. Everything from line art to abstract nonsense. Just take a walk downtown near the bars.

> not giving a fuck kills people regardless of the complexity of software involved.

What has me the most frustrated is that this "move fast break things and don't bother cleaning up" attitude is not only common in industry but also in academia. But these two are incredibly intertwined these days and it's hard to publish without support from industry because people only evaluate on benchmarks. And if you're going to hack your benchmarks, you just throw a shit ton of compute at it. Who cares where the metrics fail?


> Because software developers have no governing body, no oaths of ethics and no spine someone will end up dead in a ditch from malfunctioning AI.

The conclusion and the premise are both true, but not the causality. On AI, the Overton window is mostly filled with people going "this could be very bad if we get it wrong".

Unfortunately, there's enough people who think "unless I do it first" (Musk, IMO) or "it can't possibly be harmful" (LeCun) that it will indeed kill more people than it already has.

The number who are already (and literally) "dead in a ditch" is definitely above zero if you include all the things that used to be AI when I was a kid e.g. "route finding": https://www.cbsnews.com/news/google-sued-negligence-maps-dri...


> I think this is a communication issue and you're being a bit myopic in your interpretation. It is clearly an analogy meant for communication and is not an actual hypothesis.

I don't know, words have meanings. If that's a communication issue, it's on the part of the authors. To me, this wording in what is supposed to be a research paper abstract clearly suggests insufficient resolution as the cause. How else should I interpret it?

> The shockingly poor performance of four state-of-the-art VLMs suggests their vision is, at best, like that of a person with myopia seeing fine details as blurry

And indeed, increasing the resolution is expensive, and the best VLMs have something like 1000x1000. But the low resolution is clearly not the issue here, and the authors don't actually talk about it in the paper.

>I find it hard to believe that there is no example you can give.

I'm not the person you're answering to, but I actually lazily tried two of the authors' examples in a less performant VLM (CogVLM), and was surprised it passed them, making me wonder whether I can trust their conclusions until I reproduce their results. LLMs and VLMs have all kinds of weird failure modes; it's not a secret that they fail at some trivial tasks and their behavior is still not well understood. But working with these models and narrowing it down is notoriously like trying to nail jelly to the wall. If I was able to do this in a cursory check, what else is there? More than one research paper in this area is wrong from the start.


> I don't know, words have meanings.

That's quite true. Words mean exactly what people agree upon them meaning. Which does not require everyone, or else slang wouldn't exist. Nor the dictionary, which significantly lags. Regardless, I do not think this is even an unusual use of the word, though I agree the mention of myopia is. The usage makes sense if you consider that both myopic and resolution have more than a singular meaning.

  Myopic:
  lacking in foresight or __discernment__ : narrow in perspective and without concern for broader implications

  Resolution:
  the process or capability of making distinguishable the individual parts of an object, closely adjacent optical images, or sources of light
I agree that there are far better ways to communicate. But my main gripe is that they said it was "their hypothesis." If reading the abstract as a whole, I find it an odd conclusion to come to. It doesn't pair with the words that follow with blind guessing (and I am not trying to defend the abstract. It is a bad abstract). But if you read the intro and look at the context of their landing page, I find it quite difficult to come to this conclusion. It is poorly written, but it is still not hard to decode the key concepts the authors are trying to convey.

I feel the need to reiterate that language has 3 key aspects to it: the concept attempted to be conveyed, the words that concept is lossy encoded into, and the lossy decoding of the person interpreting it. Communication doesn't work by you reading/listening to words and looking up those words in a dictionary. Communication is a problem where you use words (context/body language/symbols/etc) to decrease the noise and get the reciever to reasonably decode your intended message. And unfortunately we're in a global world and many different factors, such as culture, greatly affect how one encodes and/or decodes language. It only becomes more important to recognize the fuzziness around language here. Being more strict and leaning into the database view of language only leads to more errors.

> But the low resolution is clearly not the issue here, and the authors don't actually talk about it in the paper.

Because they didn't claim that image size and sharpness was an issue. They claimed the VLM cannot resolve the images "as if" they were blurry. Determining what the VLM actually "sees" is quite challenging. And I'll mention that arguably they did test some factors that relate to blurriness. Which is why I'm willing to overlook the poor analogy.

> I actually lazily tried two of authors' examples in a less performant VLM (CogVLM), and was surprised it passed those

I'm not. Depending on the examples you pulled, 2 random ones passing isn't unlikely given the results.

Something I generally do not like about these types of papers is that they often do not consider augmentations. Since these models tend to be quite sensitive to both the text (prompt) inputs and image inputs. This is quite common in generators in general. Even the way you load in and scale an image can have significant performance differences. I've seen significant differences in simple things like loading an image from numpy, PIL, tensorflow, or torch have different results. But I have to hand it to these authors, they looked at some of this. In the appendix they go through with confusion matrices and look at the factors that determine misses. They could have gone deeper and tried other things, but it is a more than reasonable amount of work for a paper.


I think GPT-4o is probably doing some OCR as preprocessing. It's not really controversial to say that VLMs today don't pick up fine-grained details - we all know this. You can just look at the output of a VAE to know this is true.


If so, it's better than any other ocr on the market.

I think they just train it on a bunch of text.

Maybe counting squares in a grid just wasn't considered important enough to train for.


Why do you think it's probable? The much smaller LLaVA that I can run on my consumer GPU can also do "OCR", yet I don't believe anyone has hidden an OCR engine inside llama.cpp.


There's definitely something interesting to be learned from the examples here - it's valuable work in that sense - but "VLMs are blind" isn't it. That's just clickbait.


Yeah I think their findings are def interesting but the title and the strong claims are a tad hyperbolic.


Edit: previous title was "Leetcode for ML" or somesuch...

I like the idea and might try some! But as a warning: leetcode is specifically aimed at prepping for interviews, and I've never seen questions like these in an interview (I'm somewhere between an MLE and ML researcher FWIW). The most common kinds of ML-specific things in my experience are:

- ML system design (basically everyone does this)

- ML knowledge questions ("explain ADAM etc.")

- probability + statistics knowledge

- ML problem solving in a notebook (quite rare, but some do it)


Probably should have titled it something else; I made it more as a learning platform for people to get better at ML by implementing algorithms from scratch. I’m currently a data scientist but want to become a machine learning researcher or engineer, and I thought these types of questions would help.


I saw the k-means one a couple times
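
For anyone wondering what "from scratch" means in practice here, a minimal NumPy sketch of Lloyd's k-means (illustrative only, not the site's reference solution):

  import numpy as np

  def kmeans(X, k, iters=100, seed=0):
      rng = np.random.default_rng(seed)
      # Initialise centroids as k distinct random data points.
      centroids = X[rng.choice(len(X), size=k, replace=False)]
      for _ in range(iters):
          # Assign each point to its nearest centroid.
          dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
          labels = dists.argmin(axis=1)
          # Move each centroid to the mean of its assigned points.
          new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                          else centroids[j] for j in range(k)])
          if np.allclose(new, centroids):
              break
          centroids = new
      return centroids, labels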


Not sure I understand your argument - from a customer perspective why is the service worse, and why do you believe that it has BECOME worse?


If it were as good or better than before, nobody would want to pay attention to when to schedule deliveries based on the deliverator. Since this person specifically wants some deliverators over others, that means that the quality of the service is less good. They need to spend time considering scheduling, where they did not before.

I don't worry about which USPS mail carrier delivers my mail -- I know it will be consistent and good enough. I happen to know who my usual carrier is, because I work from home and she likes to say hi to cats if they are in the front window. I also know the face of the usual UPS driver and the usual FedEx driver; they aren't here 6 days a week, but often enough that I recognize them.

In none of those cases do I expect a quality change based on the driver. I expect competence, and I get it so often that the exceptions really stand out.

From the Shipt workers' perspective, they now need to worry about customers discriminating among them rather than just getting the job done.


This “Shipt,” though, involves an opportunity for some degree of relationship to make a difference, right? Your mail carrier must deliver your package, the package is the package, it’s either delivered or not. Maybe there’s a small margin around the edge where one carrier is nice to the cats and the other isn’t.

These Shipt people, though, have to interpret your preferences and essentially act as your agent as they decide what to pick from the store shelves on your behalf. Sometimes they make decisions that you probably would have made, sometimes less so; sometimes they’re confident that you understand each other, sometimes they’re nervous and want to hassle you about each of 10 different little decision points. When you find somebody you work well with, isn’t it a positive that you get to try to keep that relationship for future transactions? Isn’t this the same dynamic underpinning virtually every in-person service, from your hair cutting human to the tradies who do work on your house to the dry cleaner?

For that matter, doesn’t it create a perverse incentive if the worker doesn’t believe that trying to understand my preferences will ever pay off? That it’s a one-off game rather than an iterated series of games, and effort to excel and bring human judgment to bear is wasted because there’s no way to reward it?

Doesn’t the enshittification tend to require as a prerequisite that a platform is successful at alienating service providers from service recipients (and from each other) like that?


Some research to the contrary [1] - the tl;dr is that they didn't find evidence that generative models really do zero-shot well at all yet: if you show one something it literally hasn't seen before, it isn't "generally intelligent" enough to do it well. This isn't an issue for a lot of use-cases, but it does seem to add some weight to the "giga-scale memorization" hypothesis.

[1] https://arxiv.org/html/2404.04125v2


If all humans could echo-locate except a subset who couldn't, I would say that group is at a small disadvantage, yes, because the set of things the main group can do is 'objectively' greater. I don't know how relevant it would be for us especially in the daytime, but hey.


I agree. However, whether all kids should receive gene therapy to develop the echo-location once the science allows for that is a more nuanced question.


If most buildings lacked light because the vast majority just echolocated, then those unable to would be disabled.

It is okay for there to be a normal human experience, and define inability to participate as a disability.


> It is okay for there to be a normal human experience, and define inability to participate as a disability.

That's what most people normally do, yes. Then many people out there define an "ability to enjoy hetero sex" as a "normal human experience" and therefore see gayness as a disability that needs a cure.

I'm not arguing about the conclusion here, but about the method and the basis for deriving this conclusion.

The initial comment in this thread declared deafness to be an "actual defect" while gayness "is just people being inherently gay". Such division is completely arbitrary and doesn't follow from any law of nature. Only from current societal views which change a lot with time.


> doesn't follow from any law of nature

Natural selection gave us hearing.


Right, and it also gave us a strong desire for the opposite sex. So if you draw the line based on this principle then gayness and deafness will fall on the same side of this line, whichever side it is.

Hence I write that the initial comment, making the distinction between gayness being obviously OK and deafness being obviously not OK, looks arbitrary to me. This division is cultural.


This feels a bit tenuous. What world do you envisage where it's a completely level playing field? Do we ban talking, music, sound in movies etc etc??


I think building a completely level playing field is a dangerous utopia. Essentially it's the same idea as fixing people to make them equal, just addressed from the opposite end.


The task is not to compress general data by a factor of 200, the task is to compress a very domain-specific kind of data by a factor of 200. Presumably the hope is this data has lower entropy than e.g. the Hutter prize data.

If I tell you to write an image compression algorithm, you aren't going to be able to do much with a bitmap of uniform randomly generated pixels. However if I tell you that in the domain I'm working in there are only two colors white and black, immediately I can reduce storing each pixel from 24 bits to 1bit, saving a factor of 24. If I tell you further that >99% of pixels are going to be black, more compression tricks become possible, etc.
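
A toy illustration of that last trick (a sketch, not a serious codec): once you know the pixels are 1-bit and overwhelmingly black, even naive run-length encoding collapses the data:

  # Run-length encode a flat list of 0 (black) / 1 (white) pixels.
  def rle_encode(pixels):
      runs = []
      current, count = pixels[0], 0
      for p in pixels:
          if p == current:
              count += 1
          else:
              runs.append((current, count))
              current, count = p, 1
      runs.append((current, count))
      return runs

  pixels = [0] * 990 + [1] * 10      # 99% black
  print(rle_encode(pixels))          # [(0, 990), (1, 10)] - two runs instead of 1000 values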

I don't have expertise in this particular problem, but a priori dismissing it by comparing to Hutter is not valid.


> Space in academic journals is too precious to waste on inessential content

Not the biggest issue in maths - the arxiv version usually won't match the journal version 1:1


and they publish on the internet; hard copy is not popular today.

