I probably could have guessed that contrastive pre-training works better for downstream vision-language tasks than image classification pre-training, but it's nice to see this hypothesis thoroughly tested here.
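For anyone unfamiliar, the contrastive objective being compared is CLIP-style: embed images and captions with two encoders and train them so matched pairs score higher than mismatched ones within a batch. A rough sketch of the symmetric loss in PyTorch (all names here are mine, not from the paper):

    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(image_emb: torch.Tensor,
                              text_emb: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
        """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

        image_emb, text_emb: (batch, dim) outputs of the two encoders,
        where row i of each tensor comes from the same image-caption pair.
        """
        # Cosine-similarity logits between every image and every caption.
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = image_emb @ text_emb.t() / temperature

        # The matching caption for image i sits at column i.
        targets = torch.arange(logits.size(0), device=logits.device)
        loss_i2t = F.cross_entropy(logits, targets)       # image -> text
        loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image
        return (loss_i2t + loss_t2i) / 2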

Also, hacks to get LLMs to generate structured output seem to mostly get the job done. I'm less optimistic about this approach for traditional vision tasks where language is the interface, however. Are we really going to get models to output a pixel-wise segmentation mask as text? I want to doubt it, but seeing how LLMs are able to output long sequences of structured text keeps my mind open.
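To make that concrete, here's one shape such text output could take: the model emits a polygon as JSON and the client rasterizes it into a mask, so the "segmentation as text" problem reduces to structured generation. A minimal sketch in Python; the JSON schema and the example output are made up for illustration:

    import json
    import numpy as np
    from PIL import Image, ImageDraw

    # Hypothetical structured output from an LLM asked to segment an object:
    # a polygon in normalized [0, 1] image coordinates, serialized as JSON.
    llm_output = (
        '{"label": "dog", '
        '"polygon": [[0.21, 0.35], [0.60, 0.30], [0.68, 0.82], [0.25, 0.80]]}'
    )

    def polygon_to_mask(llm_json: str, width: int, height: int) -> np.ndarray:
        """Decode the LLM's polygon into a binary pixel-wise mask."""
        data = json.loads(llm_json)
        points = [(x * width, y * height) for x, y in data["polygon"]]
        canvas = Image.new("L", (width, height), 0)
        ImageDraw.Draw(canvas).polygon(points, outline=1, fill=1)
        return np.array(canvas, dtype=bool)

    mask = polygon_to_mask(llm_output, width=640, height=480)
    print(mask.shape, mask.sum())  # (480, 640) and the foreground pixel count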

I don't see why not. "Segment Anything" from Meta seems to handle labeled pixel-wise segmentation masks fairly well. You can also get rough masks today by looking at where the text part of the model attends in the image part.
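For the Segment Anything route, a minimal sketch using Meta's segment-anything package, assuming a locally downloaded ViT-B checkpoint (the file paths below are placeholders):

    import numpy as np
    from PIL import Image
    from segment_anything import sam_model_registry, SamPredictor

    # Load a ViT-B SAM backbone from a locally downloaded weights file.
    sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_checkpoint.pth")
    predictor = SamPredictor(sam)

    # SamPredictor expects an HxWx3 uint8 RGB array.
    image = np.array(Image.open("photo.jpg").convert("RGB"))
    predictor.set_image(image)

    # Prompt with a single foreground point (x, y); label 1 marks foreground.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[320, 240]]),
        point_labels=np.array([1]),
        multimask_output=True,  # return several candidate masks
    )
    best_mask = masks[np.argmax(scores)]  # boolean HxW pixel-wise mask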
