This shouldn't be too surprising. Similar results were seen with GPT-3 a while back, which is more or less able to produce audio or images encoded as streams of tokens when trained on that task, despite not being designed for it.
A very interesting property was noted a few years ago by multiple researchers; I'm not sure who discovered it first. Transfer learning is unreasonably effective. If you're training an image generator network, you get a significant reduction in training time by taking an already-trained model and fine-tuning it, compared to starting from a model with truly random weights.
This isn't surprising when we're talking about moving from photos of ambulances to photos of trucks. But it holds true when you pre-train on ... well, anything structured, really. A GPT-style transformer trained on online comments, or on audio samples of music encoded as token streams, then switched to images of cars encoded as token streams, learns that new task much more quickly than if it had started from fully random weights.
I don't see how to escape the conclusion that these models learn some sort of general properties (something about arithmetic and mathematical relationships, maybe?). There's some sort of abstraction or internal model being learned that is applicable across very different tasks.
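To make the fine-tuning comparison concrete, here's a rough sketch in PyTorch, assuming a recent torchvision. The ImageNet-pretrained ResNet-18, the 10-class vehicle task, and the dataloader are placeholders standing in for "a model already trained" and "the new task":

```python
import torch
import torch.nn as nn
from torchvision import models

# Reuse an ImageNet-pretrained ResNet-18 instead of starting from
# truly random weights (placeholder choice of architecture).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Replace only the final classification layer for the new task
# (say, 10 classes of vehicles).
model.fc = nn.Linear(model.fc.in_features, 10)

# Optionally freeze the pretrained backbone so only the new head trains.
for name, param in model.named_parameters():
    if not name.startswith("fc."):
        param.requires_grad = False

optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)
loss_fn = nn.CrossEntropyLoss()

# Fine-tuning loop over the new dataset (dataloader is a placeholder):
# for images, labels in dataloader:
#     optimizer.zero_grad()
#     loss = loss_fn(model(images), labels)
#     loss.backward()
#     optimizer.step()
```

In practice this converges in a fraction of the steps needed when every weight starts out random, which is the whole point of the comparison above.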
There's something a bit more mind-blowing than that. Language models and vision models learn representations so similar that you can connect them with just a linear projection between image embedding and text embedding space (no training of the image encoder or LLM required).
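Roughly, that "just a linear projection" amounts to fitting one matrix between the two frozen embedding spaces. A minimal sketch with numpy; the paired embeddings and their dimensions are made up for illustration:

```python
import numpy as np

# Hypothetical paired data: N images with N captions, embedded by two
# frozen, independently trained models (neither one is fine-tuned).
N, d_img, d_txt = 10_000, 768, 4096
img_emb = np.random.randn(N, d_img)  # placeholder image-encoder outputs
txt_emb = np.random.randn(N, d_txt)  # placeholder LLM embeddings

# Fit a single linear map W (d_img -> d_txt) by least squares:
# find W minimizing ||img_emb @ W - txt_emb||^2.
W, *_ = np.linalg.lstsq(img_emb, txt_emb, rcond=None)

def nearest_caption(image_vec):
    """Project a new image embedding into text space and return the
    index of the closest caption embedding by cosine similarity."""
    projected = image_vec @ W
    sims = txt_emb @ projected / (
        np.linalg.norm(txt_emb, axis=1) * np.linalg.norm(projected) + 1e-9
    )
    return int(np.argmax(sims))
```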
We've known that embeddings have this property since the GloVe paper at least. Linear substructure in low-dimensional representations of ultra-high-dimensional data is shockingly common.
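The classic illustration of that linear substructure is word-vector arithmetic. A minimal numpy sketch, assuming you've downloaded the published GloVe vectors (the file path is a placeholder):

```python
import numpy as np

def load_glove(path):
    """Load GloVe vectors from a text file into a dict: word -> vector."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.split()
            vectors[word] = np.array(values, dtype=np.float32)
    return vectors

def analogy(vectors, a, b, c, topn=1):
    """Solve a : b :: c : ?  via the linear relation b - a + c."""
    target = vectors[b] - vectors[a] + vectors[c]
    scored = []
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        sim = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target))
        scored.append((sim, word))
    return [w for _, w in sorted(scored, reverse=True)[:topn]]

# vectors = load_glove("glove.6B.300d.txt")        # path is a placeholder
# analogy(vectors, "man", "king", "woman")          # typically ["queen"]
```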
No, we don't know how they represent that knowledge. But performing experiments to probe how similar the representations are is a lot easier than knowing all that.
They're called black boxes because we can't explain what the weights are learning during training, or which weights are responsible for shifting or producing the outputs the model does.
It's like, biologists know how neurons communicate signals with each other. But is that knowledge enough to explain human behavior? Not even close.
> This makes me wonder if these models are perfect universal translators once they “grasp” a concept.
I don't have a paper reference, but earlier this month I saw claims that, in training GPT-4, additional training on a specific task in a single language (e.g. English) was observed to improve performance on that task across many languages, strongly suggesting the model is actually learning concepts, not just words.
If that's the case, then I think we have indeed accidentally made a universal translator (limited to humans, though).
Ergo, if we train on, say, a bird or primate, or perhaps dolphins, it might be able to grasp the concepts those animals use? Say, lots of video footage with context?
My layman understanding is that, at a high level, the transformer model is performing mathematical operations on the data, based on a complex series of formulas (the "model") derived from the weights set by training.
Then it's able to take in new data, perform the math, and output what it thinks comes next. Is it a big stretch of the imagination to think that maybe such "models" (mathematical formulas) exist in our brains too, and maybe we have unlocked one of them?
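That's roughly it. As a toy illustration, here's the kind of arithmetic a single self-attention layer performs, with random stand-in weights and made-up dimensions (no training involved):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy dimensions: a sequence of 4 tokens, each an 8-dim vector.
seq_len, d_model = 4, 8
x = np.random.randn(seq_len, d_model)   # token representations

# The "weights set by training" -- random matrices here, for illustration.
W_q = np.random.randn(d_model, d_model)
W_k = np.random.randn(d_model, d_model)
W_v = np.random.randn(d_model, d_model)

# Self-attention: each token mixes information from the others,
# weighted by how strongly their queries and keys match.
Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / np.sqrt(d_model)     # (seq_len, seq_len) affinities
attn = softmax(scores, axis=-1)         # each row sums to 1
output = attn @ V                       # new representation per token
```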
My layman take on LLMs is that they map tokens to points in an absurdly high-dimensional vector space (on the order of tens to hundreds of thousands of dimensions). The training process shifts those points around to make related tokens closer, which eventually ends up encoding pretty much any kind of relationship you could think of between the tokens, semantic or otherwise, as proximity in one or more dimensions. The latent space has enough dimensions to accommodate all those relationships, which is how even tasks that require complex understanding of abstract concepts still boil down to adjacency search in that space.
In other words: the LLM isn't learning algorithms, it's building a high-dimensional point cloud, where things related to each other are closer together.
Now, IIRC the visual model mentioned above works with a sub-1000-dimensional latent space, which to me feels like not enough... space to fit generalized concepts in. But then, the prompts to txt2img and img2img models I've seen seem more like additive modifiers, with individual tokens mostly independent of each other, so maybe that explanation still fits.
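The "adjacency search" framing looks roughly like this in practice: relatedness reduces to nearest-neighbor lookup by cosine similarity over the point cloud. Everything in this sketch (the vocabulary and the vectors) is made up:

```python
import numpy as np

# Made-up embedding matrix: each row is one token's point in latent space.
vocab = ["ambulance", "truck", "banana", "siren", "fruit"]
emb = np.random.randn(len(vocab), 512)   # placeholder vectors

def nearest(word, k=2):
    """Return the k tokens whose points lie closest (by cosine) to `word`."""
    q = emb[vocab.index(word)]
    sims = emb @ q / (np.linalg.norm(emb, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)
    return [vocab[i] for i in order if vocab[i] != word][:k]

# With real trained embeddings, nearest("ambulance") would tend to return
# semantically related tokens such as "siren" or "truck".
```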
As far as I understand, this is true. I like to think about it like this: there is some magic formula f(x)=? that perfectly maps our inputs to our outputs (e.g. image captions to images, or input texts to their longer continuations), but we don't know how to find it. So we build a space with incredibly many dimensions, and we learn some mapping in this space, which is hopefully very close to the magic formula.
Our brains fundamentally work in a similar way, in that there are mappings from inputs to outputs through our senses and our nervous system, and we can literally determine neural circuits in mammalian brains through topological analysis of this magical function![0]
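As a toy version of that picture: pretend the magic formula is a one-dimensional function we can sample, then fit a tiny network to it by plain gradient descent. Everything below is made up for illustration:

```python
import numpy as np

# The unknown "magic formula" we pretend not to know how to write down.
def magic_f(x):
    return np.sin(3 * x) + 0.5 * x

# Training data: sampled inputs and the outputs the magic formula produces.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(256, 1))
Y = magic_f(X)

# A tiny one-hidden-layer network as our learnable mapping.
W1, b1 = rng.normal(0, 0.5, (1, 64)), np.zeros(64)
W2, b2 = rng.normal(0, 0.5, (64, 1)), np.zeros(1)
lr = 0.05

for step in range(5000):
    # Forward pass
    h = np.tanh(X @ W1 + b1)
    pred = h @ W2 + b2
    err = pred - Y                      # distance from the magic formula

    # Backward pass: plain gradient descent on mean squared error
    grad_pred = 2 * err / len(X)
    grad_W2 = h.T @ grad_pred
    grad_b2 = grad_pred.sum(axis=0)
    grad_h = grad_pred @ W2.T * (1 - h ** 2)
    grad_W1 = X.T @ grad_h
    grad_b1 = grad_h.sum(axis=0)

    W1 -= lr * grad_W1; b1 -= lr * grad_b1
    W2 -= lr * grad_W2; b2 -= lr * grad_b2

# The learned mapping now approximates magic_f on the training range,
# without anyone ever writing the formula into the network.
```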
You're right about how machine learning is learning to approximate a function - most machine learning systems are trained with stochastic gradient descent, an optimization method which can, in theory, approximate such a function.
The surprise was that people (me, at least!) thought the computation and amount of data required to learn a function like "translate English to French" would be completely impractical to ever realize.
I think it's an open question whether humans work like that, though we probably do.
This is really fascinating. Assuming it is true, it could imply that everything we "learn" is essentially a training process in our brain to store a new model/function. As humans, we've figured out how to transfer these models between our brains through communication. Maybe it is possible to "upload" a model to the brain like Neo learning kung-fu...
We will find that language and visual perception are related. Geometry is the underlying structure in language and mathematics, and most of our logical concepts stem from geometric relations and constraints.