Hacker News new | past | comments | ask | show | jobs | submit login

Yes, it wouldn't know what to do with the picture unless you fine-tune the model (which is why they are permissively releasing it).

The embeddings form the vocabulary of the model. The vocabulary "namespace" has 70k empty slots so you could introduce your own tokens and train on top of that, where token = some patch of multimodal data.




Gotcha. Thanks for explaining




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: