
For what it's worth, I'm also very bad at plotting graphs with any kind of accuracy, which is why I use plotting software instead of doing it by hand.

I get the feeling that my visual system and the language I use are pretty bad at, respectively, processing and conveying precise information from a plot (beyond simple descriptors like "A is larger than B" or "f(x) has a maximum"). I guess I would find it mildly surprising if any Vision-Language model were able to perform those tasks very well, because the representations in question seem pretty poorly suited to them.

I get that popular diffusion models for image generation do a bad job of composing concepts in a scene and keeping relationships consistent across the image. Even if Stable Diffusion could write in human script, it's a bad bet that the contents of a legend would match a pie chart that it drew. But other Vision-Language models, designed for image captioning or visual question answering rather than for generating diverse, stylistic images, are pretty good at capturing that compositional information (up to, again, the "simple descriptors" level of granularity I mentioned before).



