According to the paper, the "registers" are additional learnable tokens that are appended to the input sequence of a Vision Transformer model during training.
They are added after the patch embedding layer with a learnable value, similar to the [CLS] token. At the end of the Vision Transformer, the register tokens are discarded, and only the [CLS] token and patch tokens are used as the image representation.
The register tokens provide a place for the model to store, process and retrieve global information during the forward pass, without repurposing patch tokens for this role.
Adding register tokens removes the artifacts and high-norm "outlier" tokens that otherwise appear in the feature maps of trained Vision Transformer models.
Using register tokens leads to smoother feature maps, improved performance on dense prediction tasks, and better unsupervised object discovery compared to the same models trained without the additional register tokens.
This is a neat result. For just a 2% increase in inference cost, you can significantly improve ViT model performance. Close to a free lunch.
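A minimal sketch of what the registers described above could look like in code, assuming a PyTorch-style ViT; the backbone, token ordering, and register count here are illustrative placeholders, not the paper's exact implementation:

```python
# Minimal sketch (illustrative, not the paper's code): register tokens are
# appended after patch embedding and discarded at the output.
import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    def __init__(self, backbone: nn.Module, dim: int = 768, num_registers: int = 4):
        super().__init__()
        self.backbone = backbone                                    # stack of transformer blocks
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))       # learnable [CLS] token
        self.registers = nn.Parameter(torch.zeros(1, num_registers, dim))  # learnable registers
        self.num_registers = num_registers

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, dim), output of the patch embedding layer
        b = patch_tokens.shape[0]
        cls = self.cls_token.expand(b, -1, -1)
        reg = self.registers.expand(b, -1, -1)
        x = torch.cat([cls, patch_tokens, reg], dim=1)   # append registers to the sequence
        x = self.backbone(x)
        return x[:, :-self.num_registers]                # drop registers; keep [CLS] + patches
```

With `backbone = nn.Identity()` this runs end to end; in a real ViT the backbone would be the attention blocks, and only the surviving [CLS]/patch tokens would feed downstream heads.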
This whole token business is very shady, and the whole probability theory too. You add a token here and there and magic happens. Discrete math people cannot take this lightly. Stochastic regexes are one thing, but this is on a completely different level of mathematical debauchery.
I think it's important to point out, for people who might be interested in this comment, that a few things are wrong.
1. Standard JPEG compression uses the Discrete Cosine Transform, not the Fourier Transform.
2. It is easy to be dismissive of any technology by saying that it is 'just' X with Y, Z, etc. on top.
3. Vision transformers allow for much longer-range context - the magic comes in part from the ability to relate patches to one another, as well as from the learned features, which JPEG does not do (see the sketch below).
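To make point 3 concrete, here is a small illustrative sketch (not from the paper) of single-head scaled dot-product attention over patch tokens, where every patch can attend to every other patch regardless of distance:

```python
# Illustrative sketch: each patch token attends to every other patch token,
# which is where the "long range context" above comes from.
import torch

def patch_attention(patches: torch.Tensor, wq, wk, wv) -> torch.Tensor:
    # patches: (num_patches, dim); wq/wk/wv: (dim, dim) learned projections
    q, k, v = patches @ wq, patches @ wk, patches @ wv
    scores = (q @ k.T) / (k.shape[-1] ** 0.5)   # pairwise patch-to-patch similarities
    weights = scores.softmax(dim=-1)            # each patch weighs all other patches
    return weights @ v                          # mix features across the whole image

dim = 64
patches = torch.randn(196, dim)                 # e.g. a 14x14 grid of 16x16 patches
wq, wk, wv = (torch.randn(dim, dim) * 0.02 for _ in range(3))
out = patch_attention(patches, wq, wk, wv)      # (196, dim)
```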
Indeed. Kernels mashing features. Knowing JPEG helped my understanding of embeddings a lot. It's why I tell friends - talking to GPT is like talking to .ZIP files…
The JPEG algorithm is:
1. Divide up the image into 8x8 patches
2. Take the DCT (a variant of the Fourier transform) of each patch to extract key features
3. Quantize the outputs
4. Use arithmetic encoding to compress
The ViT algorithm is:
1. Divide up the image into 16x16 patches
2. Use query/key/value attention matrices to extract key features
3. Minimize cross-entropy loss between predicted and actual next tokens. (This is equivalent to trying to minimize encoding length.)
ViTs don't have quantization baked into the algorithm, but NNs are being moved towards quantization in general. Another user correctly pointed out that vision transformers are not necessarily autoregressive (i.e. they may use future patches to calculate values for previous patches), while arithmetic encoding usually is (as it is in JPEG). So the algorithms have a few differences, but nothing major.
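A toy sketch of the two pipelines above, side by side; the quantization table and projection weights are made up, and real JPEG/ViT code does far more than this:

```python
# Toy comparison: JPEG-style block transform + quantization vs. ViT-style
# patch embedding. Purely illustrative.
import numpy as np
from scipy.fft import dctn

img = np.random.rand(256, 256).astype(np.float32)   # stand-in grayscale image

# JPEG-style: 8x8 patch -> 2D DCT -> quantize (entropy coding omitted)
block = img[:8, :8]
coeffs = dctn(block, norm="ortho")                   # step 2: DCT of the patch
q_table = np.full((8, 8), 0.1)                       # toy quantization table
quantized = np.round(coeffs / q_table)               # step 3: quantize

# ViT-style: 16x16 patch -> learned linear projection ("patch embedding")
P, D = 16, 64                                        # patch size, embedding dim
W = np.random.randn(P * P, D) * 0.02                 # learned in training, random here
patch_token = img[:P, :P].reshape(-1) @ W            # one patch token of dimension D
```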
-----
I think it's pretty interesting how closely related generation and compression are. ClosedAI's Sora[^1] model uses a denoising vision transformer for their state-of-the-art video generator, while JPEG has been leading image compression for the past several decades.
I find it wild that the training process can do such things as forcing it to repurpose background areas to begin with. The authors just observed and optimized what the model was already doing by itself.
But then… alchemy never produced gold, right? So how do we expect this thing to ever produce gold value? I'm sure the alchemist OpenAI of the 12th century must've also had a very high valuation.
For these tokens you first need to unembed the result of the final layer, then re-embed the resulting token on the next pass. Has anyone investigated passing the raw output of one pass to the input of the next?
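A rough sketch of the two loops being contrasted (hypothetical; `model`, `embed`, and `unembed` are placeholder callables, not any specific library's API): the standard step collapses the hidden state to a discrete token and re-embeds it, while the second variant skips that discretization and feeds the hidden state straight back.

```python
# Hypothetical sketch of the two feedback options discussed above.
import torch

def standard_step(model, embed, unembed, x_embedded: torch.Tensor) -> torch.Tensor:
    h = model(x_embedded)                     # (batch, seq, dim) final-layer states
    logits = unembed(h[:, -1])                # project the last position to the vocabulary
    next_token = logits.argmax(dim=-1)        # collapse to a discrete token
    return embed(next_token).unsqueeze(1)     # re-embed it for the next pass

def raw_feedback_step(model, x_embedded: torch.Tensor) -> torch.Tensor:
    h = model(x_embedded)
    return h[:, -1:]                          # feed the continuous hidden state back directly
```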