PaLI-3 Vision Language Models (arxiv.org)
176 points by maccaw 11 months ago | 23 comments



Something that stood out to me skimming the paper, and that was somewhat buried: they fine-tune the model on each benchmark.

"Finally, for each individual task (benchmark), we fine-tune the PaLI-3 model with frozen ViT image encoder on the task’s training data as described in the cor- responding section. For most tasks, we fine-tune the 812×812 resolution checkpoint, but for two document understanding tasks, we go up to 1064×1064 resolution"

(struck through) So this is comparing a smaller model finetuned per benchmark to larger models that I presume are not, though I have not read the PaLI-X paper.

Edit: No, I was wrong; PaLI-X is also fine-tuned on each task/set of tasks.

Impressive improvement!!!
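
To make the setup concrete, here's a minimal toy sketch (my own, not the paper's code) of the "frozen ViT image encoder" fine-tuning pattern: the image encoder's weights stay fixed and only the rest of the model (stood in for here by a tiny linear head) gets updated on a benchmark's training split.

```python
# Toy sketch of per-task fine-tuning with a frozen image encoder.
# The linear "head" is a stand-in for PaLI-3's text model; everything below
# (model choice, lr, fake batch) is illustrative, not the paper's recipe.
import torch
from transformers import ViTModel

vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
head = torch.nn.Linear(vit.config.hidden_size, 10)   # stand-in for the language model

# Freeze the image encoder; only the head receives gradient updates.
for p in vit.parameters():
    p.requires_grad = False
vit.eval()

optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)

pixel_values = torch.rand(2, 3, 224, 224)            # stand-in for a task training batch
labels = torch.tensor([1, 3])

with torch.no_grad():                                # encoder stays fixed
    visual_tokens = vit(pixel_values).last_hidden_state

logits = head(visual_tokens[:, 0])                   # toy prediction from the CLS token
loss = torch.nn.functional.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
```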


Are they just fine-tuning part of the model on the "unsupervised" portion of the training data? I think that's not entirely unfair because it might be realistic. If you have a big corpus of data and a pre-existing model, you might want to fine-tune the latter using the former. However, it's certainly a generous benchmark and doesn't reflect real-world "online" usage.


That's normal for ML


To fine-tune on each benchmark? I'd say it's not, in our modern era of in-context learning, though of course fine-tuning has its place as well for making smaller models better in one domain than a generalist larger model.


Maybe someone more informed can help me understand why they didn't compare to LLaVA (https://llava-vl.github.io/)?


The purpose of this research is to compare large vision-language models where the vision component is pre-trained using different techniques, namely on image classification versus unsupervised contrastive pre-training (see OpenAI's CLIP). PaLI-3 also isn't an instruction-tuned model, so comparing it to Llava would be a little apples-to-oranges.
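
For anyone who wants the distinction made concrete: the contrastive objective being compared against classification pre-training is roughly the CLIP-style loss below (matching image/text pairs pulled together, everything else pushed apart). This is only an illustration with random tensors; PaLI-3's SigLIP encoder actually uses a sigmoid variant rather than this exact softmax form.

```python
# Rough sketch of a CLIP-style contrastive loss over a batch of paired
# image/text embeddings; purely illustrative, not PaLI-3 or SigLIP code.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalize embeddings and compute pairwise cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature

    # The matching pair for each row/column sits on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i = F.cross_entropy(logits, targets)      # image -> text
    loss_t = F.cross_entropy(logits.t(), targets)  # text -> image
    return (loss_i + loss_t) / 2

# Toy usage with random "embeddings" for a batch of 8 pairs.
print(clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)))
```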


Maybe they just didn’t know about llava while conducting their research. It can take days to train a model sometimes.


Weeks to months at larger scales even.


No comparison against GPT-4V? How embarrassing! Where are they going to submit this? A conference where no one knows about GPT-4V? Ridiculous.

It's getting really awkward seeing these papers from Google. "We're here too! We're totally not woefully behind everyone else in the field!". No model, no reasonable comparisons, just generic bragging.

I'm astounded as an ML researcher how Google can be doing so incredibly badly. They have access to unlimited compute, good people, and great infrastructure. Yet something about their internal culture means they are unable to compete with OpenAI, Facebook, and even the open source community. They constantly brag about how good their models are (even in private) and then every time they deploy anything its performance is pathetic (like Bard and Bard with vision).

You can tell why Google recently totally overhauled the leadership of Google Research/DeepMind and shut down Google Brain.


I just tried translating a camera image of a Japanese manga page with ChatGPT vs Bard, and Bard greatly outperformed ChatGPT in recognizing the Japanese kanji, for what it's worth.


Realistically you can’t expect one company to dominate like before. The days of buying up all the researchers and giving them unlimited compute are over. Now that there is more investment, it will mostly be about who is lucky to make discoveries, and who is willing to bend the law enough to get breakthroughs. Google’s status as a large multinational may end up as a disadvantage.


I imagine their AI model hasn't gotten off the ground because it hasn't been integrated with ads yet. They ruined search and YouTube for more ad impressions, and likely have the same strategy with AI.


Fundamentally the reason is that they can't make money with it, and AI eats into their search revenue. Search as a business is dead, and AI can't bring in the money.


Maybe it's a conscious choice not to be the most frontier model? Because it's hard to believe they're not capable of being the hare in this race.


I probably could have guessed that contrastive pre-training works better for downstream vision-language tasks than image classification pre-training, but it's nice to see this hypothesis thoroughly tested here.

Also, hacks to get LLMs to generate structured output seem to be mostly getting the job done. I'm less optimistic about this approach for traditional vision tasks where language is the interface, however. Are we going to get models to output a pixel-wise segmentation mask as text? I want to doubt it, but seeing how LLMs are able to output long sequences of structured text leaves my mind open.
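
To make the "mask as text" question concrete, here's a toy example (my own, not from the paper) of one way a pixel-wise binary mask could be serialized into a string a language model might emit: simple run-length encoding.

```python
# Toy illustration: run-length encode a binary mask into a text string.
# Nothing here is from PaLI-3; it's just one possible text serialization.
import numpy as np

def mask_to_text(mask: np.ndarray) -> str:
    flat = mask.astype(int).ravel()
    runs, value, count = [], flat[0], 0
    for v in flat:
        if v == value:
            count += 1
        else:
            runs.append(f"{value}x{count}")
            value, count = v, 1
    runs.append(f"{value}x{count}")
    return " ".join(runs)

mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
print(mask_to_text(mask))  # "0x5 1x2 0x2 1x2 0x5"
```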


I don't see why not. "Segment Anything" from Meta seems to handle labeled pixel-wise segmentation masks fairly well. You can also get rough masks today by looking at where the text part of the model attends to in the image part.


Can anyone explain how these visual tokens, which are concatenated with the tokenizer outputs for the encoder, are created?


I was a little confused about this too. The authors say in the paper:

"The outputs of the ViT image encoder before pooling form the visual tokens, which are linearly projected and prepended to the embedded input text tokens."

I took a look at the HuggingFace implementation of ViT [1]. After the ViT encoder blocks there's a layer norm and then a pooling layer (line 595), where the pooling layer involves taking the first token output from the layer norm and running it through a dense layer. So, it looks like in PaLI-3 the tokens are the hidden states output by the layer norm after the ViT encoder blocks.

[1] https://github.com/huggingface/transformers/blob/main/src/tr...
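
For anyone who wants to poke at this, a minimal sketch (my own, using a plain HuggingFace ViT rather than the SigLIP encoder the paper uses; the projection width and text embeddings are made-up placeholders) of taking the pre-pooling hidden states as visual tokens, projecting them linearly, and prepending them to text embeddings:

```python
# Sketch only: grab the pre-pooling ViT hidden states, project, prepend to text.
import torch
from transformers import ViTModel

vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

pixel_values = torch.rand(1, 3, 224, 224)            # stand-in for a preprocessed image
out = vit(pixel_values)

# last_hidden_state is the layer-normed output *before* the pooler:
# shape (batch, num_patches + 1, hidden), one token per image patch (plus CLS).
visual_tokens = out.last_hidden_state

text_hidden = 768                                    # assumed text-model width
project = torch.nn.Linear(vit.config.hidden_size, text_hidden)
projected = project(visual_tokens)

# Prepend to the embedded text tokens before feeding the text encoder.
text_embeddings = torch.rand(1, 16, text_hidden)     # stand-in for embedded prompt tokens
encoder_input = torch.cat([projected, text_embeddings], dim=1)
print(encoder_input.shape)                           # torch.Size([1, 213, 768])
```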


thank you!


No GitHub?


Does the vision-language-model process raw image data, or does it process OCR character output?


GPT-4V seems to be doing the former, at least in my experiments with it. It interprets plots and categorises images.


The copyright violation is coming from inside the house.

Even undigitized materials aren't safe any more.



