If I had a nickel for every outrageous "matches/beats GPT-x" claim, I'd have more money than the capital these projects raise from VC.
This absolutely is not the first Llama 3 vision model. They even quote its performance against LLaVA. It's hard to take anything they say seriously when they make such obviously false claims.
Llama 3 outputs text and can only see text; this is a vision model.
>that would make it Llama-2-based.
It's based on Llama 3; Llama 2 has nothing to do with it. They took Llama 3 Instruct and CLIP-ViT-Large-patch14-336, trained the projection layer first, and then finetuned the Llama 3 checkpoint while training a LoRA for the ViT.
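For anyone unfamiliar with how these LLaVA-style models wire a frozen ViT into an LLM: the projector is just a small MLP mapping patch features into the LLM's embedding space. A minimal numpy sketch (dimensions are my assumptions from the standard CLIP-ViT-L/336 and Llama 3 8B configs, not from this post; real implementations use GELU and learned weights):

```python
import numpy as np

# Assumed dimensions, for illustration only:
VIT_DIM = 1024   # CLIP-ViT-Large hidden size
LLM_DIM = 4096   # Llama 3 8B hidden size

rng = np.random.default_rng(0)

def projector(vision_feats, w1, w2):
    """Two-layer MLP projecting ViT patch features into the LLM
    embedding space (LLaVA-1.5-style design). ReLU stands in for
    the GELU used in practice."""
    h = np.maximum(vision_feats @ w1, 0.0)
    return h @ w2

# A 336x336 image at patch size 14 yields (336/14)^2 = 576 patch tokens.
patches = rng.standard_normal((576, VIT_DIM))
w1 = rng.standard_normal((VIT_DIM, LLM_DIM)) * 0.01
w2 = rng.standard_normal((LLM_DIM, LLM_DIM)) * 0.01

tokens = projector(patches, w1, w2)
print(tokens.shape)  # (576, 4096): 576 "visual tokens" fed to the LLM
```

Training the projector first (with both backbones frozen), then unfreezing the LLM, is exactly the two-stage recipe described above.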