I probably could have guessed that contrastive pre-training works better for downstream vision-language tasks than image classification pre-training, but it's nice to see this hypothesis thoroughly tested here.
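For anyone unfamiliar, the contrastive objective being compared is CLIP-style: embed images and captions with two encoders and train them so matched pairs score higher than mismatched ones within a batch. A rough sketch of the symmetric loss in PyTorch (all names here are mine, not from the paper):

    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(image_emb: torch.Tensor,
                              text_emb: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
        """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

        image_emb, text_emb: (batch, dim) outputs of the two encoders,
        where row i of each tensor comes from the same image-caption pair.
        """
        # Cosine-similarity logits between every image and every caption.
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = image_emb @ text_emb.t() / temperature

        # The matching caption for image i sits at column i.
        targets = torch.arange(logits.size(0), device=logits.device)
        loss_i2t = F.cross_entropy(logits, targets)       # image -> text
        loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image
        return (loss_i2t + loss_t2i) / 2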

Also, hacks to get LLMs to generate structured output seem to mostly get the job done. I'm less optimistic about this approach for traditional vision tasks where language is the interface, however. Are we really going to get models to output a pixel-wise segmentation mask as text? I want to doubt it, but seeing how LLMs are able to output long sequences of structured text keeps my mind open.
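To make that concrete, here's one shape such text output could take: the model emits a polygon as JSON and the client rasterizes it into a mask, so the "segmentation as text" problem reduces to structured generation. A minimal sketch in Python; the JSON schema and the example output are made up for illustration:

    import json
    import numpy as np
    from PIL import Image, ImageDraw

    # Hypothetical structured output from an LLM asked to segment an object:
    # a polygon in normalized [0, 1] image coordinates, serialized as JSON.
    llm_output = (
        '{"label": "dog", '
        '"polygon": [[0.21, 0.35], [0.60, 0.30], [0.68, 0.82], [0.25, 0.80]]}'
    )

    def polygon_to_mask(llm_json: str, width: int, height: int) -> np.ndarray:
        """Decode the LLM's polygon into a binary pixel-wise mask."""
        data = json.loads(llm_json)
        points = [(x * width, y * height) for x, y in data["polygon"]]
        canvas = Image.new("L", (width, height), 0)
        ImageDraw.Draw(canvas).polygon(points, outline=1, fill=1)
        return np.array(canvas, dtype=bool)

    mask = polygon_to_mask(llm_output, width=640, height=480)
    print(mask.shape, mask.sum())  # (480, 640) and the foreground pixel count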

I don't see why not. "Segment Anything" from Meta seems to handle labeled pixel-wise segmentation masks fairly well. You can also get rough masks today by looking at where the text part of the model attends in the image part.
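For the Segment Anything route, a minimal sketch using Meta's segment-anything package, assuming a locally downloaded ViT-B checkpoint (the file paths below are placeholders):

    import numpy as np
    from PIL import Image
    from segment_anything import sam_model_registry, SamPredictor

    # Load a ViT-B SAM backbone from a locally downloaded weights file.
    sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_checkpoint.pth")
    predictor = SamPredictor(sam)

    # SamPredictor expects an HxWx3 uint8 RGB array.
    image = np.array(Image.open("photo.jpg").convert("RGB"))
    predictor.set_image(image)

    # Prompt with a single foreground point (x, y); label 1 marks foreground.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[320, 240]]),
        point_labels=np.array([1]),
        multimask_output=True,  # return several candidate masks
    )
    best_mask = masks[np.argmax(scores)]  # boolean HxW pixel-wise mask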
