I like the idea that these models are so good at some sort of specific and secret bit of visual processing that things like “counting shapes” and “beating a coin toss for accuracy” shouldn’t be considered when evaluating them.
LLMs are bad at counting things in general. It’s hard to say whether the failures here are vision-based or just an inherent weakness of the language model.