The assessments of visual capability need to be more robust. They still rely on datasets like VQAv2, which, while providing some insight, have well-documented problems. Several newer datasets serve as more rigorous tests and are less prone to linguistic bias.
I'd like to see more head-to-head comparisons with community-created multi-modal LLMs, as done in these papers:
https://arxiv.org/abs/2408.05334
https://arxiv.org/abs/2408.03326
I look forward to reading the technical report once it's available; I couldn't find a link to one yet.