Something that stood out to me skimming the paper - that was somewhat buried - they finetune the model on each benchmark.
"Finally, for each individual task (benchmark), we fine-tune the PaLI-3 model with frozen ViT image encoder on the task’s training data as described in the cor- responding section. For most tasks, we fine-tune the 812×812 resolution checkpoint, but for two document understanding tasks, we go up to 1064×1064 resolution"
Are they just fine-tuning part of the model on the "unsupervised" portion of the training data? I think that's not entirely unfair, because it might be realistic: if you have a big corpus of data and a pre-existing model, you might well want to fine-tune the latter on the former. However, it's certainly a generous benchmark setup and doesn't reflect real-world "online" usage.
To fine-tune on each benchmark? I'd say it's not, in our modern era of in-context learning, though of course fine-tuning has its place as well, for making smaller models better in one domain than a larger generalist model.
"Finally, for each individual task (benchmark), we fine-tune the PaLI-3 model with frozen ViT image encoder on the task’s training data as described in the cor- responding section. For most tasks, we fine-tune the 812×812 resolution checkpoint, but for two document understanding tasks, we go up to 1064×1064 resolution"
[Struck through:] So this is comparing a smaller model fine-tuned per benchmark to larger models that I presume are not, though I have not read the PaLI-X paper.
Edit - No, I was wrong: PaLI-X is also fine-tuned before each task/set of tasks.
Impressive improvement!!!