I was a little confused about this too. The authors say in the paper:
"The outputs of the ViT image encoder before pooling form the visual tokens, which are linearly projected and prepended to the embedded input text tokens."
I took a look at the HuggingFace implementation of ViT [1]. After the ViT encoder blocks there's a layer norm and then a pooling layer (line 595), and the pooler just takes the first token output by the layer norm and runs it through a dense layer. So it looks like in PaLI-3 the visual tokens are the hidden states output by the layer norm after the ViT encoder blocks.
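Here's a minimal sketch of that distinction using the HuggingFace ViTModel API (the checkpoint, projection size, and the PaLI-3-style wiring at the end are just illustrative assumptions, not the paper's actual implementation):

    # Sketch, assuming the HuggingFace ViTModel API; projection size is hypothetical.
    import torch
    from transformers import ViTModel

    vit = ViTModel.from_pretrained("google/vit-base-patch16-224")
    pixel_values = torch.randn(1, 3, 224, 224)  # dummy image batch
    outputs = vit(pixel_values=pixel_values)

    # Hidden states after the final layer norm, before the pooler:
    # shape (batch, num_patches + 1, hidden_size) -- the "outputs before pooling".
    visual_tokens = outputs.last_hidden_state

    # The pooler instead keeps only the first token and runs it through a dense layer:
    pooled = outputs.pooler_output  # shape (batch, hidden_size)

    # PaLI-3-style use of the pre-pooling tokens (sketch): linearly project them to the
    # text model's embedding size, then prepend to the embedded text tokens.
    text_hidden_size = 2048  # hypothetical
    proj = torch.nn.Linear(vit.config.hidden_size, text_hidden_size)
    projected_visual_tokens = proj(visual_tokens)
    # multimodal_input = torch.cat([projected_visual_tokens, text_embeddings], dim=1)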
"The outputs of the ViT image encoder before pooling form the visual tokens, which are linearly projected and prepended to the embedded input text tokens."
I took a look at the HuggingFace implementation of ViT [1]. After the ViT encoder blocks there's a layer norm and then a pooling layer (line 595), where the pooling layer involves taking the first token output from the layer norm and running it through a dense layer. So, it looks like in PaLI-3 the tokens are the hidden states output by the layer norm after the ViT encoder blocks.
[1] https://github.com/huggingface/transformers/blob/main/src/tr...