Yes, it wouldn't know what to do with the picture unless you fine-tune the model...

sthatipamala on Sept 7, 2023 | parent | context | favorite | on: Persimmon-8B

Yes, it wouldn't know what to do with the picture unless you fine-tune the model (which is why they are permissively releasing it).

The embeddings form the vocabulary of the model. The vocabulary "namespace" has 70k empty slots so you could introduce your own tokens and train on top of that, where token = some patch of multimodal data.

Havoc on Sept 7, 2023 [–]

Gotcha. Thanks for explaining