According to the paper, the "registers" are additional learnable tokens that are appended to the input sequence of a Vision Transformer model during training.
They are added after the patch embedding layer with a learnable value, similar to the [CLS] token. At the end of the Vision Transformer, the register tokens are discarded, and only the [CLS] token and patch tokens are used as the image representation.
The register tokens provide a place for the model to store, process and retrieve global information during the forward pass, without repurposing patch tokens for this role.
Adding register tokens removes the artifacts and high-norm "outlier" tokens that otherwise appear in the feature maps of trained Vision Transformer models.
Using register tokens leads to smoother feature maps, improved performance on dense prediction tasks, and better unsupervised object discovery compared to the same models trained without the additional register tokens.
This is a neat result. For just a 2% increase in inference cost, you can significantly improve ViT model performance. Close to a free lunch.
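A minimal sketch of what the registers described above could look like in code, assuming a PyTorch-style ViT; the backbone, token ordering, and register count here are illustrative placeholders, not the paper's exact implementation:

```python
# Minimal sketch (illustrative, not the paper's code): register tokens are
# appended after patch embedding and discarded at the output.
import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    def __init__(self, backbone: nn.Module, dim: int = 768, num_registers: int = 4):
        super().__init__()
        self.backbone = backbone                                    # stack of transformer blocks
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))       # learnable [CLS] token
        self.registers = nn.Parameter(torch.zeros(1, num_registers, dim))  # learnable registers
        self.num_registers = num_registers

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, dim), output of the patch embedding layer
        b = patch_tokens.shape[0]
        cls = self.cls_token.expand(b, -1, -1)
        reg = self.registers.expand(b, -1, -1)
        x = torch.cat([cls, patch_tokens, reg], dim=1)   # append registers to the sequence
        x = self.backbone(x)
        return x[:, :-self.num_registers]                # drop registers; keep [CLS] + patches
```

With `backbone = nn.Identity()` this runs end to end; in a real ViT the backbone would be the attention blocks, and only the surviving [CLS]/patch tokens would feed downstream heads.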
This whole token business is very shady, and the whole probability theory too. You add a token here and there and magic happens. Discrete math people cannot take this lightly. Stochastic regexes are one thing, but this is on a completely different level of mathematical debauchery.
I think it's important to point out, for people who might be interested in this comment, that a few things are wrong.
1. Standard JPEG compression uses the Discrete Cosine Transform, not the Fourier Transform.
2. It is easy to be dismissive of any technology by saying that it is 'just' X with Y, Z, etc. on top.
3. Vision transformers allow for much longer-range context - the magic comes in part from the ability to relate patches to one another, as well as from the learned features, which JPEG does not do (see the sketch below).
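To make point 3 concrete, here is a small illustrative sketch (not from the paper) of single-head scaled dot-product attention over patch tokens, where every patch can attend to every other patch regardless of distance:

```python
# Illustrative sketch: each patch token attends to every other patch token,
# which is where the "long range context" above comes from.
import torch

def patch_attention(patches: torch.Tensor, wq, wk, wv) -> torch.Tensor:
    # patches: (num_patches, dim); wq/wk/wv: (dim, dim) learned projections
    q, k, v = patches @ wq, patches @ wk, patches @ wv
    scores = (q @ k.T) / (k.shape[-1] ** 0.5)   # pairwise patch-to-patch similarities
    weights = scores.softmax(dim=-1)            # each patch weighs all other patches
    return weights @ v                          # mix features across the whole image

dim = 64
patches = torch.randn(196, dim)                 # e.g. a 14x14 grid of 16x16 patches
wq, wk, wv = (torch.randn(dim, dim) * 0.02 for _ in range(3))
out = patch_attention(patches, wq, wk, wv)      # (196, dim)
```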
Indeed. Kernels mashing features. Knowing JPEG helped my understanding of embeddings a lot. It's why I tell friends - talking to GPT is like talking to .ZIP files…
The JPEG algorithm is:
1. Divide up the image into 8x8 patches
2. Take the DCT (a variant of the Fourier transform) of each patch to extract key features
3. Quantize the outputs
4. Use arithmetic encoding to compress
The ViT algorithm is:
1. Divide up the image into 16x16 patches
2. Use query/key/value attention matrices to extract key features
3. Minimize cross-entropy loss between predicted and actual next tokens. (This is equivalent to trying to minimize encoding length.)
ViTs don't have quantization baked into the algorithm, but NNs are being moved towards quantization in general. Another user correctly pointed out that vision transformers are not necessarily autoregressive (i.e. they may use future patches to calculate values for previous patches), while arithmetic encoding usually is (as it is in JPEG). So the algorithms have a few differences, but nothing major.
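A toy sketch of the two pipelines above, side by side; the quantization table and projection weights are made up, and real JPEG/ViT code does far more than this:

```python
# Toy comparison: JPEG-style block transform + quantization vs. ViT-style
# patch embedding. Purely illustrative.
import numpy as np
from scipy.fft import dctn

img = np.random.rand(256, 256).astype(np.float32)   # stand-in grayscale image

# JPEG-style: 8x8 patch -> 2D DCT -> quantize (entropy coding omitted)
block = img[:8, :8]
coeffs = dctn(block, norm="ortho")                   # step 2: DCT of the patch
q_table = np.full((8, 8), 0.1)                       # toy quantization table
quantized = np.round(coeffs / q_table)               # step 3: quantize

# ViT-style: 16x16 patch -> learned linear projection ("patch embedding")
P, D = 16, 64                                        # patch size, embedding dim
W = np.random.randn(P * P, D) * 0.02                 # learned in training, random here
patch_token = img[:P, :P].reshape(-1) @ W            # one patch token of dimension D
```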
-----
I think it's pretty interesting how closely related generation and compression are. ClosedAI's Sora[^1] model uses a denoising vision transformer for their state-of-the-art video generator, while JPEG has been leading image compression for the past several decades.
I find it wild that the training process can do such things as forcing it to repurpose background areas to begin with. The authors just observed and optimized what the model was already doing by itself.
But then… alchemy never produced gold, right? So how do we expect this thing to ever produce gold value? I'm sure the alchemist OpenAI of the 12th century must've also had a very high valuation.
For these tokens you first need to unembed the result of the final layer, then re-embed the resulting token on the next pass. Has anyone investigated passing the raw output of one pass to the input of the next?
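A rough sketch of the two loops being contrasted (hypothetical; `model`, `embed`, and `unembed` are placeholder callables, not any specific library's API): the standard step collapses the hidden state to a discrete token and re-embeds it, while the second variant skips that discretization and feeds the hidden state straight back.

```python
# Hypothetical sketch of the two feedback options discussed above.
import torch

def standard_step(model, embed, unembed, x_embedded: torch.Tensor) -> torch.Tensor:
    h = model(x_embedded)                     # (batch, seq, dim) final-layer states
    logits = unembed(h[:, -1])                # project the last position to the vocabulary
    next_token = logits.argmax(dim=-1)        # collapse to a discrete token
    return embed(next_token).unsqueeze(1)     # re-embed it for the next pass

def raw_feedback_step(model, x_embedded: torch.Tensor) -> torch.Tensor:
    h = model(x_embedded)
    return h[:, -1:]                          # feed the continuous hidden state back directly
```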