> I don't know, words have meanings.

That's quite true. Words mean exactly what people agree they mean. That agreement does not require everyone, or else slang wouldn't exist. Nor would dictionaries, which significantly lag usage. Regardless, I do not think this is even an unusual use of the word, though I agree the mention of myopia is. The usage makes sense if you consider that both "myopic" and "resolution" have more than a single meaning.

  Myopic:
  lacking in foresight or __discernment__ : narrow in perspective and without concern for broader implications

  Resolution:
  the process or capability of making distinguishable the individual parts of an object, closely adjacent optical images, or sources of light
I agree that there are far better ways to communicate. But my main gripe is the claim that this was "their hypothesis." Reading the abstract as a whole, I find that an odd conclusion to come to: the "blind guessing" reading doesn't pair with the words that follow (and I am not trying to defend the abstract; it is a bad abstract). And if you read the intro and look at the context of their landing page, I find it quite difficult to come to this conclusion. It is poorly written, but it is still not hard to decode the key concepts the authors are trying to convey.

I feel the need to reiterate that language has three key aspects: the concept being conveyed, the words that concept is lossily encoded into, and the lossy decoding performed by the person interpreting it. Communication doesn't work by reading/listening to words and looking each one up in a dictionary. Communication is a noisy-channel problem: you use words (plus context, body language, symbols, etc.) to reduce the noise so the receiver can reasonably decode your intended message. And since we live in a global world, many factors, such as culture, greatly affect how one encodes and/or decodes language. That only makes it more important to recognize the fuzziness of language. Being stricter and leaning into the database view of language only leads to more errors.
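A toy sketch of this encode/decode view (the codebooks below are invented purely for illustration, not taken from anywhere):

  import random

  # Invented codebooks. The sender maps a concept to several
  # possible words; each receiver maps words back to concepts
  # using their own (different) codebook.
  sender_codebook = {
      "cannot-distinguish-fine-details": ["myopic", "low-resolution"],
  }

  receiver_codebooks = {
      "literal reader": {
          "myopic": "nearsighted eyes",
          "low-resolution": "too few pixels",
      },
      "figurative reader": {
          "myopic": "narrow perspective",
          "low-resolution": "coarse discernment",
      },
  }

  concept = "cannot-distinguish-fine-details"
  word = random.choice(sender_codebook[concept])  # lossy encoding

  # The same word decodes differently per receiver, so the
  # recovered concept can differ from the intended one.
  for name, codebook in receiver_codebooks.items():
      print(f"{name} decodes {word!r} as {codebook[word]!r}")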

> But the low resolution is clearly not the issue here, and the authors don't actually talk about it in the paper.

Because they didn't claim that image size and sharpness were the issue. They claimed the VLM cannot resolve the images, "as if" they were blurry. Determining what a VLM actually "sees" is quite challenging. And I'll note that, arguably, they did test some factors that relate to blurriness, which is why I'm willing to overlook the poor analogy.

> I actually lazily tried two of authors' examples in a less performant VLM (CogVLM), and was surprised it passed those

I'm not. Depending on which examples you pulled, two random ones passing isn't unlikely given the reported results.
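For a rough sense of the arithmetic (the pass rate here is hypothetical, not from the paper): if the model gets, say, 40% of these examples right, the chance that two independently picked examples both pass is about 0.4 × 0.4 = 0.16, roughly one in six. Hardly surprising.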

Something I generally do not like about these types of papers is that they often do not consider augmentations, since these models tend to be quite sensitive to both the text (prompt) inputs and the image inputs. This is common in generative models in general. Even the way you load and scale an image can produce significant performance differences; I've seen simple choices like loading an image via numpy, PIL, tensorflow, or torch produce different results. But I have to hand it to these authors: they looked at some of this. In the appendix they go through confusion matrices and examine the factors that determine misses. They could have gone deeper and tried other things, but it is a more than reasonable amount of work for a paper.
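To illustrate the loading/scaling sensitivity, here is a minimal sketch (my own example, not from the paper; the file path is a placeholder, and it assumes numpy, PIL, torch, and torchvision are installed). The same image pushed through two common resize pipelines generally does not agree pixel-for-pixel, and a sensitive model can react to that gap:

  import numpy as np
  import torch
  import torchvision.transforms.functional as TF
  from PIL import Image

  img = Image.open("example.jpg").convert("RGB")  # placeholder path

  # Pipeline A: resize in PIL (bicubic), then convert to a float array.
  a = np.asarray(img.resize((224, 224), Image.BICUBIC), dtype=np.float32)

  # Pipeline B: convert to a tensor first, then resize with
  # torchvision (bilinear by default).
  t = torch.from_numpy(np.asarray(img)).permute(2, 0, 1).float()
  b = TF.resize(t, [224, 224]).permute(1, 2, 0).numpy()

  # The two pipelines generally differ in their pixel values.
  print("max abs difference:", np.abs(a - b).max())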



