
I was a little confused about this too. The authors say in the paper:

"The outputs of the ViT image encoder before pooling form the visual tokens, which are linearly projected and prepended to the embedded input text tokens."

I took a look at the HuggingFace implementation of ViT [1]. After the ViT encoder blocks there's a layer norm and then a pooling layer (line 595), where the pooling layer takes the first token output from the layer norm and runs it through a dense layer. So it looks like in PaLI-3 the visual tokens are the hidden states output by the layer norm after the ViT encoder blocks, i.e. before that pooling step (a quick sketch of pulling those out is below).

[1] https://github.com/huggingface/transformers/blob/main/src/tr...
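Here's a rough sketch of what that looks like with the HF ViTModel, if it helps — the checkpoint name and the projection size are placeholders I picked for the example, not values from the paper:

    import torch
    from transformers import ViTModel

    # Placeholder checkpoint for illustration (not the ViT PaLI-3 actually uses).
    vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

    pixel_values = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image
    with torch.no_grad():
        out = vit(pixel_values=pixel_values)

    # Post-layer-norm, pre-pooling outputs of the encoder blocks —
    # these are what the paper's "outputs before pooling" map to here:
    visual_tokens = out.last_hidden_state    # shape (1, num_patches + 1, hidden_size)

    # The pooled path (first token -> dense -> tanh), which is what gets skipped:
    pooled = out.pooler_output               # shape (1, hidden_size)

    # Linear projection into the text model's embedding space, analogous to the
    # projection described in the paper (2048 is a made-up dimension):
    proj = torch.nn.Linear(vit.config.hidden_size, 2048)
    prefix_tokens = proj(visual_tokens)      # prepended to the text token embeddings

So pooler_output is the first-token-through-dense path mentioned above, and last_hidden_state is the pre-pooling tensor the paper seems to be describing.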




thank you!



