
According to the paper, the "registers" are additional learnable tokens that are appended to the input sequence of a Vision Transformer model during training.

They are added after the patch embedding layer, with a learnable value, similar to the [CLS] token. At the end of the Vision Transformer, the register tokens are discarded, and only the [CLS] token and patch tokens are used as the image representation.

The register tokens provide a place for the model to store, process and retrieve global information during the forward pass, without repurposing patch tokens for this role.

Adding register tokens removes the artifacts and high-norm "outlier" tokens that otherwise appear in the feature maps of trained Vision Transformer models.

Using register tokens leads to smoother feature maps, improved performance on dense prediction tasks, and enables better unsupervised object discovery compared to the same models trained without the additional register tokens.
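For intuition, here is a minimal sketch of how registers could be wired into a ViT forward pass (PyTorch; the class name, layer choices and hyperparameters are illustrative, not the paper's actual code):

    import torch
    import torch.nn as nn

    class ViTWithRegisters(nn.Module):
        # Illustrative skeleton: [CLS] + patch tokens + learnable register tokens.
        def __init__(self, dim=768, num_patches=196, num_registers=4, depth=12, heads=12):
            super().__init__()
            self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)    # 16x16 patches
            self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
            self.registers = nn.Parameter(torch.zeros(1, num_registers, dim))  # learnable, like [CLS]
            self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
            layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            self.blocks = nn.TransformerEncoder(layer, depth)
            self.num_registers = num_registers

        def forward(self, x):                                                  # x: (B, 3, 224, 224)
            B = x.shape[0]
            patches = self.patch_embed(x).flatten(2).transpose(1, 2)           # (B, 196, dim)
            tokens = torch.cat([self.cls_token.expand(B, -1, -1), patches], dim=1) + self.pos_embed
            # Append the registers; they get no positional embedding of their own.
            tokens = torch.cat([tokens, self.registers.expand(B, -1, -1)], dim=1)
            out = self.blocks(tokens)
            # Discard the registers at the output; keep [CLS] + patch tokens as the representation.
            return out[:, : -self.num_registers]

The only change relative to a plain ViT is the extra learnable parameter and the slice at the end; the attention blocks themselves are untouched.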

This is a neat result. For just a 2% increase in inference cost, you can significantly improve ViT model performance. Close to a free lunch.
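(Rough arithmetic behind a figure like that, assuming a ViT-B/16 at 224x224 with 4 registers: the sequence grows from 196 patch tokens + 1 [CLS] = 197 to 201 tokens, about 2% longer, and since compute at this scale is roughly linear in sequence length, the forward pass costs about 2% more.)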




This whole token business is very shady, and so is the whole probability theory. You add a token here and there and magic happens. Discrete math people cannot take this lightly. Stochastic regexes are one thing, but this is on a completely different level of mathematical debauchery.

Absolutely amazing this works.


Vision transformers are essentially just JPEG but with learned features rather than the Fourier transform.


I think it's important to point out, for people who might be interested in this comment, that a few things in it are wrong.

1. Standard JPEG compression uses the Discrete Cosine Transform, not the Fourier Transform.

2. It is easy to be dismissive of any technology by saying that it is 'just' X with Y, Z, etc. on top.

3. Vision transformers allow for much longer-range context - the magic comes in part from the ability to relate patches to each other, as well as from the learned features, neither of which JPEG does.


The discrete cosine transform is the real part of a Fourier transform.


Indeed. Kernels mashing features. Knowing JPEG helped my understanding of embeddings a lot. It’s why I tell friends - talking to GPT is like talking to .ZIP files…


Rockets are essentially just fire that burns real fast.


Interesting! Can you elaborate?


The JPEG algorithm is:

1. Divide up the image into 8x8 patches

2. Take the DCT (a variant of the Fourier transform) of each patch to extract key features

3. Quantize the outputs

4. Compress with entropy coding (typically Huffman; arithmetic coding is also in the standard)

The ViT algorithm is:

1. Divide up the image into 16x16 patches

2. Use query/key/value attention matrices to extract key features

3. Minimize cross-entropy loss between predicted and actual next tokens. (This is equivalent to trying to minimize encoding length.)

ViTs don't have quantization baked into the algorithm, but NNs are increasingly being quantized in general. Another user correctly pointed out that vision transformers are not necessarily autoregressive (i.e. they may use future patches to calculate values for previous patches), while arithmetic encoding usually is (so JPEG is), so the algorithms have a few differences but nothing major.
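To make the analogy concrete, here's a toy version of the two "split into patches, then transform" steps (numpy/scipy; the ViT projection matrix is a random stand-in for a learned one, and the attention stage is omitted):

    import numpy as np
    from scipy.fft import dctn  # type-II DCT, the transform JPEG uses

    img = np.random.rand(224, 224).astype(np.float32)   # toy grayscale image

    # JPEG-style: 8x8 patches, fixed DCT basis applied to each patch
    jpeg_patches = img.reshape(28, 8, 28, 8).transpose(0, 2, 1, 3).reshape(-1, 8, 8)
    jpeg_coeffs = np.stack([dctn(p, norm="ortho") for p in jpeg_patches])   # (784, 8, 8)

    # ViT-style: 16x16 patches, learned linear projection (random stand-in here),
    # followed in a real model by attention layers relating all patches to each other
    vit_patches = img.reshape(14, 16, 14, 16).transpose(0, 2, 1, 3).reshape(-1, 256)
    W = np.random.randn(256, 768).astype(np.float32)     # learned in a real ViT
    vit_tokens = vit_patches @ W                          # (196, 768)

    print(jpeg_coeffs.shape, vit_tokens.shape)

The part the analogy leaves out is everything after this step: JPEG stops at fixed per-patch coefficients, while a ViT's tokens then attend to every other token, so its features are both learned and global.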

-----

I think it's pretty interesting how closely related generation and compression are. ClosedAI's Sora[^1] model uses a denoising vision transformer for its state-of-the-art video generator, while JPEG has been leading image compression for the past several decades.

[^1]: https://openai.com/index/sora/?video=big-sur


I find it wild that the training process can do things like forcing the model to repurpose background areas to begin with. The authors just observed and optimized what the model was already doing by itself.


I agree, the most interesting thing about the paper is the default behavior of the network as it tries to compress the data.


The modern Alchemy.


But then… alchemy never produced gold, right? So how do we expect this thing to ever produce gold-level value? I’m sure the alchemist equivalent of OpenAI in the 12th century must’ve also had a very high valuation.


There was an attempt to add several CLS tokens to BERT, with less spectacular results: https://arxiv.org/pdf/2210.05043


Are there lessons here for regular (non-vision) transformers? Sounds close to attention sinks / pause tokens?


For those tokens you first need to unembed the result of the final layer, then re-embed the resulting token on the next pass. Has anyone investigated passing the raw output of one pass to the input of the next?


So is that what all the visual cues are in real life, things like fashion accessories, uniforms etc.?


Interesting. One other potential benefit is an easier quantization of the activations.



