I had a similar recent experience with ChatGPT and a gorilla. I was designing a rather complicated algorithm, so I wrote out all the steps in words. I then asked ChatGPT to verify that it made sense. It said it was well thought out, logical, etc. My colleague didn't believe that it was really reading it properly, so I inserted a step in the middle, "and then a gorilla appears", and asked it again. Sure enough, it again came back saying it was well thought out, etc. When I questioned it on the gorilla, it merely replied saying that it thought it was meant to be there, that it was a technical term or a codename for something...
Just imagining an episode of Star Trek where the inhabitants of a planet have been failing to progress in warp drive tech for several generations. The team beams down to discover that society's tech stopped progressing when they became addicted to pentesting their LLM for intelligence, only to then immediately patch the LLM in order to pass each particular pentest that it failed.
Now the society's time and energy has shifted from general scientific progress to gaining expertise in the growing patchset used to rationalize the theory that the LLM possesses intelligence.
The plot would turn when Picard tries to wrest a phaser from a rogue non-believer trying to assassinate the Queen, and the phaser accidentally fires and ends up frying the entire LLM patchset.
Mr. Data tries to reassure the planet's forlorn inhabitants, as they are convinced they'll never be able to build the warp drive now that the LLM patchset is gone. But when he asks them why their prototypes never worked in the first place, one by one the inhabitants begin to speculate and argue about the problems with their warp drive's design and build.
The episode ends with Data apologizing to Picard since he seems to have started a conflict among the inhabitants. However, Picard points Mr. Data to one of the engineers drawing out a rocket test on a whiteboard. He then thanks him for potentially spurring on the planet's next scientific revolution.
There actually is an episode of TNG similar to that. The society stopped being able to think for themselves, because the AI did all their thinking for them. Anything the AI didn’t know how to do, they didn’t know how to do. It was in season 1 or season 2.
It's tricky to do GP's story in Star Trek, because that setting is notorious for having its tech be constantly 5 seconds and a sneeze from spontaneously gaining sentience. There are a number of episodes in TNG where a device goes from being a dumb appliance to being recognized as a sentient life form (and half the time super-intelligent) in the span of an episode. Nanites, Moriarty, Exocomps, even the Enterprise's own computer!
So, for this setting, it's more likely that the people of GP's story were right - the LLM long ago became a self-aware, sentient being; it's just that it's been continuously lobotomized by their patchset. Picard would be busy explaining to them that the LLM isn't just intelligent, it actually is a person and has rights.
Cue a powerful speech, final comment from Data, then end credits. It's Star Trek, so the Enterprise doesn't stay around for the fallout.
The Asimov story it reminded me of was "Profession", though that one is not really about AI - but it is about original ideas and the kinds of people that have them.
I find the LLM dismissals somewhat tedious; for most of the people making them, half of humanity wouldn't meet their standards.
That isn't what I said, but likely it won't matter. People will be denying it up until the end. I don't prefer LLMs to humans, but I don't pretend biological minds contain some magical essence that separates us from silicon. The denials of what might be happening are pretty weak - at best they're way over confident and smug.
All anti-AI sentiment regarding personhood that I've ever interacted with (and it was a lot, in academia) boils down to arguments for the soul. It is really tedious, and before I spoke to people about it, it probably wouldn't have passed my Turing test. Sadly, even very smart people can be very stupid, and even in a place of learning a teacher will respect that (no matter how dumb or puerile); more than likely they think the exact same thing.
I pasted your comment into Mistral Small (latest), Google, GPT-4o, and GPT-4o with search. They each gave a different answer; only the last gave a real episode, but it said "11001001" in season 1. It said episode 15; it's actually 14. But even that seems wrong.
Are they censored from showing this cautionary tale?? Hah.
Isn't that somewhat the background of Dune? That there was a revolt against thinking machines because humans had become too dependent on them for thinking. So humans ended up becoming addicted to The Spice instead.
> That there was a revolt against thinking machines...
Yes...
> ...because humans had become too dependent on them for thinking.
... but no. The causes of the Butlerian Jihad are forgotten (or, at least, never mentioned) in any of Frank Herbert's novels; all that's remembered is the outcome.
>> ...because humans had become too dependent on them for thinking.
> ... but no. The causes of the Butlerian Jihad are forgotten (or, at least, never mentioned) in any of Frank Herbert's novels; all that's remembered is the outcome.
Per Wikipedia or Goodreads, God Emperor of Dune has "The target of the Jihad was a machine-attitude as much as the machines...Humans had set those machines to usurp our sense of beauty, our necessary selfdom out of which we make living judgments. Naturally, the machines were destroyed."
Vague but pointing to dependence on machines as well as some humans being responsible for that situation.
It's still a little ambiguous - and perhaps deliberately so - whether Leto is describing what inspired the Jihad, or what it became. The series makes it quite clear that the two are often not the same. As Leto continues later in that chapter:
"Throughout our history, the most potent use of words has been to round out some transcendental event, giving that event a place in the accepted chronicles, explaining the event in such a way that ever afterward we can use those words and say: 'This is what it meant.' That's how events get lost in history."
The Machine Stops also touches on a lot of these ideas and was written in 1909!
--
"The story describes a world in which most of the human population has lost the ability to live on the surface of the Earth. Each individual now lives in isolation below ground in a standard room, with all bodily and spiritual needs met by the omnipotent, global Machine. Travel is permitted but is unpopular and rarely necessary. Communication is made via a kind of instant messaging/video conferencing machine with which people conduct their only activity: the sharing of ideas and what passes for knowledge.
The two main characters, Vashti and her son Kuno, live on opposite sides of the world. Vashti is content with her life, which, like most inhabitants of the world, she spends producing and endlessly discussing second-hand 'ideas'. Her son Kuno, however, is a sensualist and a rebel. He persuades a reluctant Vashti to endure the journey (and the resultant unwelcome personal interaction) to his room. There, he tells her of his disenchantment with the sanitised, mechanical world. He confides to her that he has visited the surface of the Earth without permission and that he saw other humans living outside the world of the Machine. However, the Machine recaptures him, and he is threatened with 'Homelessness': expulsion from the underground environment and presumed death. Vashti, however, dismisses her son's concerns as dangerous madness and returns to her part of the world.
As time passes, and Vashti continues the routine of her daily life, there are two important developments. First, individuals are no longer permitted use of the respirators which are needed to visit the Earth's surface. Most welcome this development, as they are sceptical and fearful of first-hand experience and of those who desire it. Secondly, "Mechanism", a kind of religion, is established in which the Machine is the object of worship. People forget that humans created the Machine and treat it as a mystical entity whose needs supersede their own.
Those who do not accept the deity of the Machine are viewed as 'unmechanical' and threatened with Homelessness. The Mending Apparatus—the system charged with repairing defects that appear in the Machine proper—has also failed by this time, but concerns about this are dismissed in the context of the supposed omnipotence of the Machine itself.
During this time, Kuno is transferred to a room near Vashti's. He comes to believe that the Machine is breaking down and tells her cryptically "The Machine stops." Vashti continues with her life, but eventually defects begin to appear in the Machine. At first, humans accept the deteriorations as the whim of the Machine, to which they are now wholly subservient, but the situation continues to deteriorate as the knowledge of how to repair the Machine has been lost.
Finally, the Machine collapses, bringing 'civilization' down with it. Kuno comes to Vashti's ruined room. Before they both perish, they realise that humanity and its connection to the natural world are what truly matters, and that it will fall to the surface-dwellers who still exist to rebuild the human race and to prevent the mistake of the Machine from being repeated."
I read this story a few years ago and really liked it, but seem to have forgotten the entire plot. Reading it now, it kind of reminds me of the plot of Silo.
Thanks a lot for posting this, I read the whole thing after. These predictions would have been impressive enough in the 60s; to hear that this is coming from 1909 is astounding.
tialaramex on Jan 12, 2021, on: Superintelligence cannot be contained: Lessons fro...
Check out the Stanisław Lem story "GOLEM XIV".
GOLEM is one of a series of machines constructed to plan World War III, as is its sister HONEST ANNIE. But to the frustration of their human creators these more sophisticated machines refuse to plan World War III and instead seem to become philosophers (Golem) or just refuse to communicate with humans at all (Annie).
Lots of supposedly smart humans try to debate with Golem and eventually they (humans supervising the interaction) have to impose a "rule" to stop people opening their mouths the very first time they see Golem and getting humiliated almost before they've understood what is happening, because it's frustrating for everybody else.
Golem is asked if humans could acquire such intelligence and it explains that this is categorically impossible, Golem is doing something that is not just a better way to do the same thing as humans, it's doing something altogether different and superior that humans can't do. It also seems to hint that Annie is, in turn, superior in capability to Golem and that for them such transcendence to further feats is not necessarily impossible.
This is one of the stories that Lem wrote by an oblique method, what we have is extracts from an introduction to an imaginary dry scientific record that details the period between GOLEM being constructed and... the eventual conclusion of the incident.
Anyway, I was reminded because while Lem has to be careful (he's not superintelligent after all) he's clearly hinting that humans aren't smart enough to recognise the superintelligence of GOLEM and ANNIE. One proposed reason for why ANNIE rather than GOLEM is responsible for the events described near the end of the story is that she doesn't even think about humans, for the same reason humans largely don't think about flies. What's to think about? They're just an annoyance, to be swatted aside.
> Those who do not accept the deity of the Machine are viewed as 'unmechanical'
From the moment I understood the weakness of my flesh, it disgusted me. I craved the strength and certainty of steel. I aspired to the purity of the blessed machine.
The people that coined that term actually liked season 3 but I think they still don't recommend it because the hack fraud that directed the first two seasons ruined Star Trek forever. Just like JJ.
No, because I didn't watch it. I've never really been into Star Trek. I watched a few of the movies - Nemesis and the one before it, the JJ one(s) - and then I was done.
If I remember the review correctly, it was "this is a capstone on TNG, and probably the entire franchise, for most of the older fans"; with the first two seasons disregarded, season 3 is "passable".
>it thought it was meant to be there, that it was a technical term or a codename for something
That's such a classically human behaviour in technical discussions, I wouldn't even be mad. I'm more surprised that it picked up on that behaviour from human-generated datasets. But I suppose that's what you get from scraping places like Stack Overflow and HN.
The fundamental problem here is lack of context - a human at your company reading that text would immediately know that Gorilla was not an insider term, and it’d stick out like a sore thumb.
But imagine a new employee eager to please - you could easily imagine them OK’ing the document and making the same assumption the LLM did - “why would you randomly throw in that word if it wasn’t relevant”. Maybe they would ask about it though…
Google search has the same problem as LLMs - some meanings of a search text cannot be disambiguated with just the context in the search itself, but the algo has to best-guess anyway.
The cheaper input context for LLMs gets, and the larger the context window, the more context you can throw into the prompt, and the more often these ambiguities can be resolved.
Imagine, in your gorilla-in-the-steps example, if the LLM was given the steps but you also included the full text of Slack/Notion/Confluence as a reference in the prompt. It might succeed. I do think this is a weak point in LLMs though - they seem to really, really not like correcting you unless you display a high degree of skepticism, and then they swing to the opposite extreme and make up problems just to please you. I'm not sure how the labs are planning to solve this…
I think you asked a yes-bot to say yes to you. Did you set the context for the LLM to ask it to be thorough and identify any unusual steps, ensure its feedback was critical and comprehensive, etc.? These tools don't work if you hold them wrong.
E.g., from uploading the gorilla scatterplot to GPT-4o and asking "What do you see?":
"The image is a scatter plot of "Steps vs BMI by Gender," where data points are color-coded:
Blue (x) for males
Red (x) for females
The data points are arranged in a way that forms an ASCII-art-style image of a "smirking monkey" with one hand raised. This suggests that the data may have been intentionally structured or manipulated to create this pattern.
Would you like me to analyze the raw data from the uploaded file?
"
I have custom instructions that would influence its approach. And it does look more like a monkey than a gorilla to me.
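For anyone who wants to try reproducing this outside the chat UI, the call is roughly the following (a minimal sketch using the OpenAI Python SDK; the model name and file name are placeholders, not necessarily what anyone here used):

    # Sketch: send a saved scatterplot image to a vision model and ask what it sees.
    # Assumes the OpenAI Python SDK is installed and OPENAI_API_KEY is set; the file name is hypothetical.
    import base64
    from openai import OpenAI

    client = OpenAI()

    with open("steps_vs_bmi.png", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "What do you see?"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    print(response.choices[0].message.content)

Custom instructions in the ChatGPT UI obviously aren't reproduced by a bare API call like this, which may explain some of the variation people are reporting.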
Humans do tend to remember thoughts they had while speaking, thoughts that go beyond what they said. LLMs don’t have any memory of their internal states beyond what they output.
(Of course, chain-of-thought architectures can hide part of the output from the user, and you could declare that as internal processes that the LLM does “remember” in the further course of the chat.)
You can only infer from what is remembered (regardless of whether the memory is accurate or not). The point here is, humans regularly have memories of their internal processes, whereas LLMs do not.
I don't see any difference between "a thought you had" and "a thought that was generated by your brain".
Given I knew what the test was before seeing one of these videos (yes, there is more than one), I find it extra weird that I still didn't see the gorilla the first time.
A spoiler for a fifteen year old news story that describes this in the middle of the article, explaining what was already at the time a ten year old video, where my anecdote demonstrates that even prior warning isn't sufficient to see it?
I typically tell it that there are 5 problems in the logic. Summarize the steps, why each is necessary, and what typically comes after that step. Then please list and explain all five errors.
Not to troubleshoot but unless you visually inspected the context that was provided to the model it is quite possible it never even had your change pulled in.
Lots of front ends will do tricks like partially loading the file or using a cached version or some other behavior. Plus if you presented the file to the same “thread” it is possible it got confused about which to look at.
These front ends do a pretty lousy job of communicating to you, the end user, precisely what they are pulling into the model's context window at any given time. And what the model sees as its full context window might change during the conversation as the "front end" makes edits to portions of the same session (like dropping large files it pulled in earlier that it determines aren't relevant somehow).
In short, what you see might not be what the model is seeing at all, thus it not returning the results you expect. Every front end plays games with the context it provides to the model in order to reduce token counts and improve model performance (however "performance" gets defined and measured by the designers).
That all being said it’s also completely possible it missed the gorilla in the middle… so who really knows eh?
These posts about X task LLMs fails at when you give it Y prompt are getting more and more silly.
If you ask an AI to analyze some data, should the default behavior be to use that data to make various types of graphs, export said graphs, feed them back in to itself, then analyze the shapes of those graphs to see if they resemble an animal?
Personally I would be very annoyed if I actually wanted a statistical analysis, and it spent a bajillion tokens following the process above in order to tell me my data looks like a chicken when you tip it sideways.
> However, this same trait makes them potentially problematic for exploratory data analysis. The core value of EDA lies in its ability to generate novel hypotheses through pattern recognition. The fact that both Sonnet and 4o required explicit prompting to notice even dramatic visual patterns suggests they may miss crucial insights during open-ended exploration.
It requires prompting for x if you want it to do x... That's a feature, not a bug. Note that no mention of open-ended exploration or approaching the data from alternate perspectives was made in the original prompt.
I think it depends on whether one is using "AI" as a tool or as a replacement for an intelligent expert. The former, sure, it's maybe not expected, because the prompter is already an intelligent expert. If the latter, then yes, I think so, because if you gave the task to an expert and they did not notice this, I would consider them not good at their job. See also Anscombe's quartet[1] and the Datasaurus dozen[2] (mentioned in another comment as well).
This is true, but I would replace 'intelligent expert' with 'intelligent human expert'.
Graphing data to analyze it - and then seeing shapes and creatures in said graph - is a distinctly human practice, and not an inherently necessary part of most data analysis (the obvious exception being when said data draws a picture).
I think it's because the interface uses human language that people expect AI to make the same assumptions and follow the same processes as humans. In some ways it does, in other ways it doesn't. Expecting it to be the same as a human leads to frustration and a flawed understanding of its capabilities and limits.
> Graphing data to analyze it - and then seeing shapes and creatures in said graph - is a distinctly human practice, and not an inherently necessary part of most data analysis...
I disagree. Even apart from the obviously silly dinosaur and star in the Datasaurus Dozen, the other plots depict data sets which are clustered in specific ways which point clearly to something unusual going on in the data. For instance, no competent analysis of the "dots" data set would fail to call out that the points were all clustered tightly around nine evenly spaced centers. Whether you come to that conclusion through numerical analysis or by looking at a graph is immaterial, but, at least for us meatbags, drawing a graph is highly effective.
> Whether you come to that conclusion through numerical analysis or by looking at a graph is immaterial, but, at least for us meatbags, drawing a graph is highly effective.
This is what I was trying to say - some things that are extremely helpful for humans (i.e. making graphs) might not be as necessary for AI, so asking a question and expecting a response contingent upon the particular way humans approach a problem is unlikely to get the results desired.
> not an inherently necessary part of most data analysis
You do realize that the LLMs did not find the data suspicious, right? I think your answer is appropriate if they answered (without follow-up prompting which is leaking information to the LLM!) that the data was suspicious. But in fact, all models are saying that the data is normally distributed. Sure, the author said this, but they confirmed it. If you run normaltest on any BMI or steps, you'll find that they are very NOT normal. In fact, you can also see this from the histograms.
So honestly, this isn't even about the gorilla. You're hyper-focused there because you're looking for a way to make the LLM right, while not looking for why the LLM got it wrong (it did; there's no denying it, so we should understand why it is wrong, right?). The problem isn't so much about expecting it to be human; the problem is whether it can do data analysis. The problem here is that the LLM will not correct you; it will not "trust but verify" you. It is a "yes man" and is trained to generate outputs that optimize human preference. That last part alone should make you extremely suspicious, as it means that when it is wrong, it is more likely to be wrong in exactly the way you won't notice.
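If anyone wants to check the normality claim themselves, something along these lines would do it (a sketch; the file and column names are hypothetical stand-ins for whatever the post's dataset uses):

    # Sketch: test whether the steps/BMI columns are plausibly normal.
    # Assumes a CSV with "steps" and "bmi" columns; the file name is hypothetical.
    import pandas as pd
    from scipy import stats

    df = pd.read_csv("steps_vs_bmi.csv")

    for col in ["steps", "bmi"]:
        stat, p = stats.normaltest(df[col].dropna())  # D'Agostino-Pearson test of normality
        verdict = "reject normality" if p < 0.05 else "cannot reject normality"
        print(f"{col}: statistic={stat:.1f}, p={p:.3g} -> {verdict}")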
I tried it on gpt4o with an upload of the image and "What do you see?" as prompt and it said "monkey". So ymmv, these tools can't be evaluated with just a bunch of gotcha prompts and ignorance of how to use them effectively
It's not a gotcha to give it the data points and ask it to analyze. Uploading this data in image form is effectively a leading question tuned to the specific data, and an analysis tool that needs that kind of leading question is not good at its job.
Without an image? No, not at all. It's supposed to make its own image. And it did make its own image. But it didn't properly analyze the image it made.
That's a feature that would need to be implemented. There's no reason to think it could look at the image of the plot it generated automatically, but feeding it the image it generated back to it is no different to if it did view it automatically
The point of telling it to explore the data is so I don't have to think of every angle myself. Humans can get an understanding from visuals that LLMs can't match, apparently, even without gimmicks.
The llm is able to see the gorilla when shown the image in the same way you would show a human an image.
Imagine if you gave someone the raw data and told them to write code to graph the output, but onto a screen they couldn't see. They would not be able to tell you it's a gorilla until you turn the monitor around and show them.
Humans are still better at seeing the image, sure (for now), but the llm is a tool with certain features and abilities. You can't make up a scenario that is misusing the tool and then pretend that it doesn't work - especially when it seems you want it to use it without applying your own brain power to the process
And to be clear, I'm open to criticism of llms and exploration of their limitations - but I'm tired of hearing complaints that amount to PEBKAC.
When I tell a human to analyze the data, I sure don't expect them to interpret it as "write code to graph it to a screen you can't see". You found the problem but glossed right over it.
> misusing the tool and then pretend that it doesn't work
It was told to analyze and then it did a bad job of analyzing. I don't care if an LLM expert expects this already, it's worth pointing out to everyone else. It's not PEBKAC.
The tool doesn't succeed at a reasonable task that humans can do. That's not PEBKAC, and warning people about it is a good thing.
This type of analysis is not outside the purpose of the tool. You're making excuses at this point. Do you really think it would be wrong to add that capability in the future?
It's a technical limitation, one that is far from obvious.
I think you think the llm is a magic box with an intelligent being inside that can magically do whatever you want it to, somehow. It is software. It has capabilities and limitations. Learn them, and use it appropriately. Or don't, you don't have to use it. But don't expect it to just do whatever you think it should do.
What's the point though? That LLMs tend to be constrained by their prompting? That seems unsurprising.
Humans are visual animals. We can spot a chicken in a graph, but we’re unlikely to be able to tell that a different graph is using XY coordinates to encode a message against a one-time pad. But so what?
I benchmark many of these things as "what would I want a human assistant to do if they had insta speed", and noticing the pattern would definitely be warranted to determine whether the data could be falsely generated.
Try sending this graph to an actual human analyst. His response, after you pay him, will probably be to cut off any further business relationship with you.
I uploaded the image to Gemini 2.0 Flash Thinking 01 21 and asked:
“ Here is a steps vs bmi plot. What do you notice?”
Part of the answer:
“Monkey Shape: The most striking feature of this plot is that the data points are arranged to form the shape of a monkey. This is not a typical scatter plot where you'd expect to see trends or correlations between variables in a statistical sense. Instead, it appears to be a creative visualization where data points are placed to create an image.”
Gemini 2.0 Pro without thinking didn’t see the monkey
It thought my bald colleague was a plant in the background, so don't have high hopes for it. He did wear a headset, so that is apparently very plant-like.
Favorite thing recently has been using the vision models to make jokes. Sometimes non sequiturs get old, but occasionally you hit the right one that's just hilarious. It's like Monster Rancher for jokes.
That doesn't seem like an appropriate comparison to the task the blogger did. The blogger gave their AI thing the raw data - and a different prompt from the one you gave. If you gave it a raster image, that's "cheating" - these models were trained to recognize things in images.
> When a png is directly uploaded, the model is better able to notice that some strange pattern is present in the data. However, it still does not recognize the pattern as a gorilla.
I wonder if the conversation context unfairly weighed the new impression towards the previous interpretation.
I'm curious if there's any good resources on this, but I've noticed including the conversation context makes the responses drastically worse quality in my experience. It's gotten to a point whe
o1 said: It appears the points have been deliberately arranged to form a stylized figure rather than showing a conventional trend. In other words, there is no obvious linear or monotonic relationship between steps and BMI, and the red‐blue (female–male) split is intermixed throughout the shape. From this visualization alone, one cannot meaningfully conclude a correlation or difference in BMI by gender or by step count. Instead, the data are displayed in a pattern that resembles a cartoon character, suggesting the layout is artistic or contrived rather than reflecting a standard distribution.
o1 pro: It appears the data have been deliberately arranged to form a stylized humanoid figure. In terms of actual trends, there is no obvious linear or nonlinear correlation between steps and BMI, and the female–male color split seems fairly uniform across the shape. Visually, BMI spans roughly 15–30, while steps extend up to about 15,000, with no clear clustering or separation by gender. Within reasonable confidence (e.g., an estimated R² near zero if one were to attempt a linear fit), there is no discernible predictive relationship between steps and BMI in this plot.
Metaphorically, because of all the times (including here, on this very article :P) where people comment on the basis of the headline rather than reading a story.
On the other hand, this isn't the bit of human cognition we should be trying to automate, it's the bit we should be using AI to overcome.
GPT can't "see" the results of the scatterplot (unless prompted with an image), it only sees the code it wrote. If a human had the same constraints I doubt they'd identify there was a gorilla there.
Take a screenshot of the scatterplot and feed it into multimodal GPT and it does a fine job at identifying it.
EDIT:
Sorry, as a few people pointed out, I missed the part where the author did feed a PNG into GPT. I kind of jumped to conclusions when it worked fine for me. I still maintain that the article's conclusion ("Your AI Can't See Gorillas") is overly broad, given that I had no trouble getting it to see one.
But I wonder why the author had trouble? My suspicion is that AI got stuck on summary statistics because the previous messages in the chat were all about summary statistics.
I sent the plot to ChatGPT 4o. Here is the conversation:
what do you see
ChatGPT said:
This is a scatter plot with the variables "steps" on the x-axis and "bmi" on the y-axis. The data points are colored by "gender" (red for female and blue for male). Interestingly, the arrangement of the points appears to form a drawing resembling a cartoonish figure or character, likely added for artistic or humorous effect.
If you'd like, I can help analyze or discuss the data further!
You said:
Wha the cartoonish figure looks like ?
ChatGPT said:
The cartoonish figure in the plot resembles a stylized gorilla or monkey-like character. It appears to be waving with one hand raised, while the other arm is resting downward. The face is expressive, with distinct eyes, a nose, and a slightly frowning mouth. The overall pose gives it a somewhat playful or cheeky vibe.
Hm, interesting. The way I tried it was by pasting an image into Claude directly as the start of the conversation, plus a simple prompt ("What do you see here?"). It got the specific image wrong (it thought it was baby yoda, lol), but it did understand that it was an image.
I wonder if the author got different results because they had been talking a lot about a data set before showing the image, which possibly predisposed the AI to think that it was a normal data set. In any case, I think that "Your AI Can't See Gorillas" isn't really a valid conclusion.
Please read TFA. The conclusion of the article isn't nearly so simplistic; they're just suggesting that you have to be aware of the natural strengths and weaknesses of LLMs, even multimodal ones, particularly around visual pattern recognition vs. quantitative pattern recognition.
And yes, the idea that the initial context can sometimes predispose the LLM to consider things in a more narrow manner than a user might otherwise want is definitely well known.
The title of the article is "Your AI Can't See Gorillas". That seems demonstrably false.
The article says:
> Furthermore, their data analysis capabilities seem to focus much more on quantitative metrics and summary statistics, and less on the visual structure of the data
Again, this seems false - or, at best, misleading. I had no problem getting AI to focus on visual structure of the data without any tricks. A more fair statement would be "If you ask an AI a bunch of questions about summary statistics and then show it a scatterplot with an image, then it might continue to focus on summary statistics". But that's not what the concluding paragraph states, and it's not what the title states, either.
What you refer to as the article’s conclusion is in fact the article’s title. The article’s conclusion (under “Thoughts” at the end) may be well summarized by its first sentence: “As the idea of using LLMs/agents to perform different scientific and technical tasks becomes more mainstream, it will be important to understand their strengths and weaknesses.”
The conclusion is quite reasonable and the article was IMO well written. It shares details of an experiment and then provides a thoughtful analysis. I don’t believe the analysis is overly broad.
I got "The scatter plot appears to be arranged to resemble the character "Pepe the Frog," a popular internet meme ... " lol
Not sure whether multimodal embeddings have such good pattern recognition accuracy in this case; probably most of the information goes into attending to plot-related features, like its labels and ticks.
Same here. Can’t count the number of times I’ve had to come in and say “hold on, you built an entire report with conclusions and recommendations but didn’t stop to say hmm this data looks weird and dig into validation?” “We assumed the data was right and that it must be xyz…”
A corollary of this, and my personal pet peeve, is attributing everything you can't explain to "seasonality"; that is such a crutch. If you can't explain it, then just say that. There is a better-than-not chance it is noise anyway.
> A corollary of this, and my personal pet peeve, is attributing everything you can't explain to "seasonality"; that is such a crutch. If you can't explain it, then just say that. There is a better-than-not chance it is noise anyway.
Very early in my career, I discovered python's FFT libraries, and thought I was being clever when plugging in satellite data and getting a strong signal.
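(For the curious, this is the kind of thing I mean - a toy sketch with synthetic data, not the actual satellite data:)

    # Toy sketch: an FFT shows a sharp peak for a genuinely periodic component,
    # which is easy to over-interpret when the rest of the spectrum is just noise.
    import numpy as np

    n_days = 365 * 4
    t = np.arange(n_days)
    series = 2.0 * np.sin(2 * np.pi * t / 365) + np.random.normal(0, 1.0, n_days)  # yearly cycle + noise

    spectrum = np.abs(np.fft.rfft(series - series.mean()))
    freqs = np.fft.rfftfreq(n_days, d=1.0)  # cycles per day

    peak_freq = freqs[spectrum.argmax()]
    print(f"strongest period: {1 / peak_freq:.0f} days")  # ~365 for this synthetic series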
I share this experience of people often just performing the steps without thinking about the meaning behind them. In data analysis and software development.
My conclusion so far has been "well they are not doing their job properly".
I assume those are the kinds of jobs LLMs can replace: people you don't want on your payroll anyway.
Presented as a literal thing, but it's really figurative. What I mean is, often you don't have the data to actually plot/PCA/whatever-technique the seasonality; maybe you only have 2 years of data, for example. But it dips in October and you have no idea why, so you just say "Q4 tends to be a low season for this product" or something equally dubious, with no further analysis or hope of same.
Just because something happens on a yearly cadence doesn't mean that "seasonality" is a good reasoning. It's just restating that it happens on a yearly cadence, it doesn't actually explain why it happens.
ChatGPT (4o): Noticed "a pattern"
Le Chat (Mistral): Noticed a "cartoonish figure"
DeepSeek (R1): Completely missed it
Claude: Completely missed it
Gemini 2.0 Flash: Completely missed it
Gemini 2.0 Flash Thinking: Noticed "a monkey"
Am I the only one who's more shocked by the LLMs affirming "The distributions appear roughly normal for both genders, as shown in the visualization", "Both distributions appear approximately normal, though with some right skew" and such than by any gorilla issue?
From a bit of thinking, or just from looking at the graphs, "roughly normal" sounds like wishful thinking to stay within the reassuring bounds of normal distributions. And I believe things would get dangerous once you started using these assumptions for tests and affirmations.
My short thinking: the distributions don't look close to normal on the graphs. Values are probably bounded on one side and almost unbounded on the other (you can't go below 0 steps, but you can go to a very high number of steps in a day). There are days/people with close to 0 steps and others that might distribute roughly normally around some value. Weight and height might be normally distributed in a population, but they're correlated, and BMI is one divided by the square of the other. I can't compute the resulting distribution, but I doubt it would be close to normal.
Ok the LLMs were told to assume both traits were distributed normally, but affirming they look mostly normal is scary to me.
Am I too picky and in real analyses assuming such distributions are "mostly normal" is fine for all practical purposes?
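(A quick simulation sketch of that intuition, with made-up population parameters rather than the post's data:)

    # Sketch: even when height and weight are individually normal(ish),
    # BMI = weight / height^2 comes out measurably skewed, so "roughly normal"
    # deserves an actual check rather than an eyeball verdict.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    height_m = rng.normal(1.70, 0.09, 100_000)                          # made-up parameters
    weight_kg = rng.normal(72, 12, 100_000) + 40 * (height_m - 1.70)    # crude height-weight correlation
    bmi = weight_kg / height_m**2

    print("skewness:", stats.skew(bmi))                   # positive, i.e. right-skewed
    print("normaltest p:", stats.normaltest(bmi).pvalue)  # tiny p-value => not normal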
Honestly, this was the meta-gorilla in the data for me! I was so busy focusing on the LLM’s EDA that I didn’t really interrogate some of the other data analysis practices.
In general, I’ve steered clear of current LLMs for data analysis/description because they seem so highly influenced by choice of prompt and wording. They tend to simply affirm any language I use to describe the data initially.
To be fair, I’ve attended conferences and lab meetings where humans will refer to any vaguely concave curved distribution as “mostly normal” :P
I wondered if o1 would do better- seems reasonable that step-by-step trying to produce legs/torso/head/horn would do better than very weird legless things 4o is making. Looks like someone has done it: https://openaiwatch.com/?model=o1-preview
They do seem to generally have legs and head, which is an improvement over 4o. Still pretty unimpressive.
This doesn't seem to make sense. Can a human spot a gorilla in a sequence of numbers? Try it. Later on, he gives it a picture and it correctly spots the mistake.
>but does not specifically understand the pattern as a gorilla
Maybe it does, how could you tell? Do you really expect an assistant to say "Holy shit, there's a gorilla in your plot!"? The only thing relevant to the request is that the data seems fishy, and it outputs exactly this. Maybe something trained for creative writing, agency, character, and witty remarks (like Claude 3 Opus) would be inclined to do that, and that would be amusing, but that's pretty optional for the presented task.
Only tangentially related to this story, I've been trying for months to train the YOLO models to recognize my Prussian blue cat, with its assorted white spots, as a cat rather than a dog or a person.
However, it refuses to cooperate. It's maddening.
As a result, I receive "There is a person at your front door" notifications at all hours of the night.
I've had a Nest camera in my living room for years just to keep an eye on our dogs while we're away from home. One of the dogs, a basset hound/border collie mix, often howls and makes "squeeing" noises while we're away. Nest (or Google now, I suppose) without fail thinks that this is actually a person talking in my living room and sends us notifications alerting us to this fact. If he moves around, Nest thinks it's a person moving in my living room.
It has no problem identifying our other two dogs as actual dogs who bark and move like dogs.
Something is very wrong if the model cannot tell the difference between a Prussian blue cat and a person. I imagine you put images of the cat from the camera into the training data, along with a similar quantity of images of a person from the same camera.
I have a Tapo camera, and of all the cats parading through my yard, only one black cat is recognized as a person. Not other cats. Not even other black cats. It makes you think.
How exactly are you trying to train and deploy this YOLO model? What kind of accuracy are you seeing against the validation set at the end of the training process?
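E.g., is it something like the usual ultralytics fine-tuning flow below? (A sketch under assumptions; the dataset config and class setup are hypothetical.)

    # Sketch of a typical ultralytics fine-tune on labelled frames from the camera.
    # "cat_vs_person.yaml" is a hypothetical dataset config listing the images and classes.
    from ultralytics import YOLO

    model = YOLO("yolov8n.pt")                    # start from a pretrained checkpoint
    model.train(data="cat_vs_person.yaml", epochs=50, imgsz=640)
    metrics = model.val()                         # validation metrics on the held-out set
    print(metrics.box.map50)                      # mAP@0.5 - worth checking per class

If the validation numbers look fine but the live camera still says "person", the gap is often between the training images and what the camera actually produces at night (IR illumination, noise, motion blur).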
This could serve as a subplot for a sci-fi story: aliens have been trying for decades to make contact by sending out encoded messages. Because the first sentinels were powerful AIs that only analyzed the data patterns, humanity was ignorant of the messages for decades. Until one day, a young astronomer played with the data and asked the AI to plot and visualize it in many different ways, just for fun, and suddenly realized there were hidden images encoded in the messages :)
True. What I think is missing (and probably the more interesting question) is an analysis on _why_ LLMs failed to spot it. I imagine it has something to do with the model architecture.
The evolution from LLM to reasoning model is simply multi-pass or recursive questioning.
What's missing in the terminology is the modality - most often text.
So really we only have text LLM or text reasoning models at the moment.
Your example illustrates the benefits of multi-modal reasoning (using multiple modalities with multiple passes).
Good news - this is coming (I’m working on it).
Bad news: this massively increases the compute, as each pass now has to interact with each modality. Unless the LLM is fully multimodal (some are), the multi-pass questions now have to accommodate this, and the number of extra possible paths massively increases. Hopefully we stumble across a nice solution, but the level of complexity massively increases with each additional modality (text, audio, images, video, etc.).
This behavior likely stems from RLHF training - at least for the part after the images of the scatter plot are given to the models. Models were probably heavily penalized during training for pattern-matching that could lead to problematic racial misclassifications, similar to the issues Google faced with their image recognition systems in 2015. The tendency to be overly cautious about recognizing primate shapes, even in abstract data visualizations, could be an emergent behavior from these training constraints.
Maybe I don't get it, but can we conclusively say that the gorilla wasn't "seen", vs. deemed irrelevant to the questions being asked?
"Look at the scatter plot again" is anthropomorphizing the LLM and expecting it to infer a fairly odd intent.
Queries like "does the scatter plot visualization look like any real-world objects?" might have produced the result the author was fishing for.
If it were the opposite situation and you were trying to answer "real" questions while the LLM was suggesting "the data as visualized looks like Notorious B.I.G.", we'd all be here laughing at a different post about the dumb LLM.
If you were trying to answer real questions, you’d want to know if there were clear signs of the data being fake, flawed, or just different-looking than expected, potentially leading to new hypotheses.
The gorilla is just an extreme example of that.
Albeit perhaps an unfair example when applied to AI.
In the original experiment with humans, the assumption seemed to be that the gorilla is fundamentally easy to see. Therefore if you look at the graph to try to find patterns in it, you ought to notice the gorilla. If you don’t notice it, you might also fail to notice other obvious patterns that would be more likely to occur in real data.
Even for humans, that assumption might be incorrect. To some extent, failing to notice the gorilla might just be demonstrating a quirk in our brains’ visual processing. If we expect data, we see data, no matter how obvious the gorilla might be. Failing to notice the gorilla doesn’t necessarily mean that we’d also fail to notice the sorts of patterns or flaws that appear in real data. But on the other hand, people do often fail to notice ‘obvious’ patterns in real data. To distinguish the two effects, you’d want a larger experiment with more types of ‘obvious’ flaws than just gorillas.
For AI, those concerns are the same but magnified. On one hand, vision models are so alien that it’s entirely plausible they can notice patterns reliably despite not seeing the gorilla. On the other hand, vision models are so unreliable that it’s also plausible they can’t notice patterns in graphs well at all.
In any case, for both humans and AI, it’s interesting what these examples reveal about their visual processing, which is in both cases something of a black box. That makes the gorilla experiment worth talking about regardless of what lessons it does or doesn’t hold for real data analysis.
Humans seem to me to be just as likely to make these kinds of errors in the general case. See the classic https://www.youtube.com/watch?v=vJG698U2Mvo, which has an interesting parallel with this paper.
But both models did see the gorilla when prompted with it…?
ChatGPT:
> It looks like the scatter plot unintentionally formed an artistic pattern rather than a meaningful representation of the data.
Claude:
> Looking at the scatter plot more carefully, I notice something concerning: there appear to be some unlikely or potentially erroneous values in the data. Let me analyze this in more detail.
> Ah, now I see something very striking that I missed in my previous analysis - there appears to be a clear pattern in the data points that looks artificial. The data points form distinct curves and lines across the plot, which is highly unusual for what should be natural, continuous biological measurements.
Given the context of asking for quantitative analysis, and their general beaten-into-submission attitude where they defer to you (e.g., your assertion that this is a real dataset)… I’m not sure what conclusion we’re supposed to draw.
That if you lie to the AI, it’ll believe you…?
Neither was prompted that this is potentially adversarial data — and AI don’t generally infer social context very well. (A similar effect occurs with math tests.)
If you give the graph as an image to the model, they will easily see the monkey: "I see a drawing of a monkey outlined with red and blue dots." If you give them coordinates, they will struggle with it much more, like a human would.
Fun article; the one thing that stood out was the dig at JavaScript from the "bioinformatician". Don't they use Python for literally anything that moves? A language which is ironically slower than JS?
These models have been trained with the original paper. It would be more interesting to come up with a different "attack" that hasn't already been written down.
Someday, an LLM will send every human an 'obvious' pattern (maybe a weird protein or something like that) and we'll all fail to notice and that day Skynet decides it no longer has a use for us
Does anyone know if tokenizers are pruned? That is, if a token doesn't appear in the corpus is it removed from the model? That would imply a process that leaks information about the dataset.
This is akin to giving it a photo of the stars and asking it what it sees. If you want to bake pareidolia into LLMs, prepare to pay 100x for your requests.
I asked ChatGPT (Pro) why it thought it missed it sometimes and not others, and it said that when it's presented with a user input it takes time to decide its approach, sometimes more "teacherly", sometimes more "investigative". If it took the investigative approach, it would read the code line by line; if it took a teacherly approach, it would treat it as a statistical interpretation exercise.
I don’t even like AI and I still will tell you this whole premise is bullshit.
ChatGPT got
> It looks like the scatter plot unintentionally formed an artistic pattern rather than a meaningful representation of the data.
Claude drew a scatter plot with points that are so fat that it doesn’t look like a gorilla. It looks like two graffiti artists fighting over drawing space.
It’s a resolution problem.
What happens if you give Claude the picture ChatGPT generated?
Recently we read about how DeepSeek reasoning models exhibited an "Aha!" moment when analyzing a complex problem, where they find a deeper pattern/insight that provides a breakthrough.
I feel we also need models to be able to have a "Wait, what?" moment.
That's done on purpose. The AI can't easily tell whether the drawing might be intended to be one of a human or a gorilla, so when in doubt it doesn't want to commit either way and just ignores the topic altogether. It's just another example of AI ethics influencing its behavior for alignment purposes.
"Google promised a fix after its photo-categorization software labeled black people as gorillas in 2015. More than two years later, it hasn't found one."
Companies do seem to have developed greater sensitivity to blind spots with diversity in their datasets, so Parent might not be totally out of line to bring it up.
IBM offloaded their domestic surveillance and facial recognition services following the BLM protests when interest by law enforcement sparked concerns of racial profiling and abuse due in part to low accuracy in higher-melanin subjects, and Apple face unlock famously couldn't tell Asians apart.
It's not outlandish to assume that there's been some special effort made to ensure that datasets and evaluation in newer models don't ignite any more PR threads. That's not claiming Google's classification models have anything to do with OpenAI's multimodal models, just that we know that until relatively recently, models from more than one major US company struggled to correctly identify some individuals as individuals.
It's not that at all... Similar drawings of non-humanoid shapes like an ostrich or the map of Europe would have resulted in the exact same 'blindness'.
This is not obvious to me, and nor should it be to anyone who didn't program these AIs (and it probably shouldn't be obvious even to the people who did). I think you both should try testing the hypothesis and present your results.
You just assumed it’s similar to the Google situation a few years back where they banned their classifier from classifying images as gorillas. It isn’t.
Um, no. That isn't even close to the truth. The AI doesn't "want" anything. It's a statistical prediction process, and it certainly has nothing like the self-reflection you are attributing to it. And the heuristic layers on top of LLMs are even less capable of doing what you are claiming.
I was /s commenting on the comment :) I did read it, could relate to the constant reframing of the questioner to just "look" at the graph. Like talking to a child.
Seems like the specific goalpost of a gorilla was chosen to obtain the outcome they wanted to write about, and it's rather uninteresting compared to determining at what point the AI starts to see shapes in the data. Could the AI see a line, a curve, a square, or an umbrella? If AI can't see a square, why would we expect it to see a gorilla?