GPT can't "see" the results of the scatterplot (unless prompted with an image); it only sees the code it wrote. If a human had the same constraints, I doubt they'd spot the gorilla either.
Take a screenshot of the scatterplot and feed it into multimodal GPT and it does a fine job of identifying it.
EDIT:
Sorry, as a few people pointed out, I missed the part where the author did feed a PNG into GPT. I kind of jumped to conclusions when it worked fine for me. I still maintain that the article's conclusion ("Your AI Can't See Gorillas") is overly broad, given that I had no trouble getting it to see one.
But I wonder why the author had trouble? My suspicion is that the AI got stuck on summary statistics because the previous messages in the chat were all about summary statistics.
I sent the plot to ChatGPT 4o. Here is the conversation:
what do you see
ChatGPT said:
This is a scatter plot with the variables "steps" on the x-axis and "bmi" on the y-axis. The data points are colored by "gender" (red for female and blue for male). Interestingly, the arrangement of the points appears to form a drawing resembling a cartoonish figure or character, likely added for artistic or humorous effect.
If you'd like, I can help analyze or discuss the data further!
You said:
What does the cartoonish figure look like?
ChatGPT said:
The cartoonish figure in the plot resembles a stylized gorilla or monkey-like character. It appears to be waving with one hand raised, while the other arm is resting downward. The face is expressive, with distinct eyes, a nose, and a slightly frowning mouth. The overall pose gives it a somewhat playful or cheeky vibe.
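For anyone who wants to reproduce this outside the ChatGPT UI (which is what I actually used), here's a minimal sketch of sending the plot to GPT-4o with the OpenAI Python SDK. The filename and prompt are just placeholders; it assumes OPENAI_API_KEY is set in the environment.

```python
# Minimal sketch: send a scatterplot PNG to GPT-4o and ask what it sees.
# "scatterplot.png" is a placeholder filename, not the author's actual file.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("scatterplot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "what do you see"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

Note that this starts a fresh conversation with the image as the very first message, so there's no earlier summary-statistics context to bias the answer.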
Hm, interesting. The way I tried it was by pasting an image into Claude directly as the start of the conversation, plus a simple prompt ("What do you see here?"). It got the specific figure wrong (it thought it was Baby Yoda, lol), but it did recognize that the scatterplot contained a drawn image.
I wonder if the author got different results because they had been talking a lot about a data set before showing the image, which possibly predisposed the AI to think that it was a normal data set. In any case, I think that "Your AI Can't See Gorillas" isn't really a valid conclusion.
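If anyone wants to try the same thing against Claude programmatically rather than by pasting into the UI, a rough sketch with the Anthropic Python SDK might look like this (filename and model alias are assumptions, and it assumes ANTHROPIC_API_KEY is set):

```python
# Rough sketch: send the same PNG to Claude as the first message of a fresh conversation.
# "scatterplot.png" and the model alias are placeholders.
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("scatterplot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=500,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_b64,
                    },
                },
                {"type": "text", "text": "What do you see here?"},
            ],
        }
    ],
)
print(message.content[0].text)
```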
Please read TFA. The conclusion of the article isn't nearly so simplistic; they're just suggesting that you have to be aware of the natural strengths and weaknesses of LLMs, even multimodal ones, particularly around visual pattern recognition vs. quantitative pattern recognition.
And yes, the idea that the initial context can sometimes predispose the LLM to consider things more narrowly than a user might otherwise want is definitely well known.
The title of the article is "Your AI Can't See Gorillas". That seems demonstrably false.
The article says:
> Furthermore, their data analysis capabilities seem to focus much more on quantitative metrics and summary statistics, and less on the visual structure of the data
Again, this seems false - or, at best, misleading. I had no problem getting the AI to focus on the visual structure of the data without any tricks. A fairer statement would be "If you ask an AI a bunch of questions about summary statistics and then show it a scatterplot containing an image, it might continue to focus on summary statistics." But that's not what the concluding paragraph states, and it's not what the title states, either.
What you refer to as the article’s conclusion is in fact the article’s title. The article’s conclusion (under “Thoughts” at the end) may be well summarized by its first sentence: “As the idea of using LLMs/agents to perform different scientific and technical tasks becomes more mainstream, it will be important to understand their strengths and weaknesses.”
The conclusion is quite reasonable and the article was IMO well written. It shares details of an experiment and then provides a thoughtful analysis. I don’t believe the analysis is overly broad.