The assessments of visual capability really need to be more robust. They are still using datasets like VQAv2, which, while providing some insight, have many known issues. There are newer datasets that serve as much more robust tests and are less prone to linguistic bias.

I'd like to see more head-to-head comparisons with community-created multimodal LLMs, as done in these papers:

https://arxiv.org/abs/2408.05334

https://arxiv.org/abs/2408.03326

I look forward to reading the technical report once it's available. I couldn't find a link to one yet.
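
Even an ad-hoc head-to-head spot check is easy to run locally, e.g. with the Hugging Face transformers LLaVA classes. A rough sketch (the model IDs, prompt template, and test image below are illustrative placeholders, not anything from the papers above):

    # Sketch: ask two open vision-language models the same VQA-style
    # question and compare the answers side by side.
    import requests
    from PIL import Image
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    # Example image (COCO val2017) and question; swap in your own.
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    question = "How many cats are in the picture?"

    for model_id in ["llava-hf/llava-1.5-7b-hf", "llava-hf/llava-1.5-13b-hf"]:
        processor = AutoProcessor.from_pretrained(model_id)
        model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")
        prompt = f"USER: <image>\n{question} ASSISTANT:"
        inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
        output = model.generate(**inputs, max_new_tokens=20)
        answer = processor.decode(output[0], skip_special_tokens=True)
        print(model_id, "->", answer.split("ASSISTANT:")[-1].strip())

Not a substitute for a full benchmark suite, but it surfaces the kind of gross failure modes quickly.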


Looking at their benchmark results and my own experience with their 11B vision model, I think that, while not perfect, they represent the model well.

Meaning it does impressively badly compared to other models of similar size that I've tried (for vision).