This shouldn't be too surprising. Similar results were seen with GPT-3 a while back, which is more or less able to produce audio or images encoded as streams of tokens when trained on that task, despite not being designed for it.
A very interesting property was noted a few years ago by multiple researchers; I'm not sure who discovered it first. Transfer learning is unreasonably effective. If you're training an image generator network, you get a significant reduction in training time by taking an already-trained model and fine-tuning it, compared to starting from a model with truly random weights.
This isn't surprising when we're talking about moving from photos of ambulances to photos of trucks. But it holds true when you pre-train on ... well, anything structured, really. A GPT-style transformer trained on online comments, or on audio samples of music encoded as token streams, then switched to images of cars encoded as token streams, learns that new task much more quickly than if it had started from fully random weights.
I don't see how to escape the conclusion that these models learn some sort of general properties (something about arithmetic and mathematical relationships, maybe?). There's some sort of abstraction or internal model being learned that is applicable across very different tasks.
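To make the fine-tuning comparison concrete, here's a rough sketch in PyTorch, assuming a recent torchvision. The ImageNet-pretrained ResNet-18, the 10-class vehicle task, and the dataloader are placeholders standing in for "a model already trained" and "the new task":

```python
import torch
import torch.nn as nn
from torchvision import models

# Reuse an ImageNet-pretrained ResNet-18 instead of starting from
# truly random weights (placeholder choice of architecture).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Replace only the final classification layer for the new task
# (say, 10 classes of vehicles).
model.fc = nn.Linear(model.fc.in_features, 10)

# Optionally freeze the pretrained backbone so only the new head trains.
for name, param in model.named_parameters():
    if not name.startswith("fc."):
        param.requires_grad = False

optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)
loss_fn = nn.CrossEntropyLoss()

# Fine-tuning loop over the new dataset (dataloader is a placeholder):
# for images, labels in dataloader:
#     optimizer.zero_grad()
#     loss = loss_fn(model(images), labels)
#     loss.backward()
#     optimizer.step()
```

In practice this converges in a fraction of the steps needed when every weight starts out random, which is the whole point of the comparison above.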
There's something a bit more mind-blowing than that. Language models and vision models learn representations so similar that you can connect them with just a linear projection between image embedding and text embedding space (no training of the image encoder or LLM required).
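Roughly, that "just a linear projection" amounts to fitting one matrix between the two frozen embedding spaces. A minimal sketch with numpy; the paired embeddings and their dimensions are made up for illustration:

```python
import numpy as np

# Hypothetical paired data: N images with N captions, embedded by two
# frozen, independently trained models (neither one is fine-tuned).
N, d_img, d_txt = 10_000, 768, 4096
img_emb = np.random.randn(N, d_img)  # placeholder image-encoder outputs
txt_emb = np.random.randn(N, d_txt)  # placeholder LLM embeddings

# Fit a single linear map W (d_img -> d_txt) by least squares:
# find W minimizing ||img_emb @ W - txt_emb||^2.
W, *_ = np.linalg.lstsq(img_emb, txt_emb, rcond=None)

def nearest_caption(image_vec):
    """Project a new image embedding into text space and return the
    index of the closest caption embedding by cosine similarity."""
    projected = image_vec @ W
    sims = txt_emb @ projected / (
        np.linalg.norm(txt_emb, axis=1) * np.linalg.norm(projected) + 1e-9
    )
    return int(np.argmax(sims))
```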
We've known that embeddings have this property since the GloVe paper at least. Linear substructure in low-dimensional representations of ultra-high-dimensional data is shockingly common.
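The classic illustration of that linear substructure is word-vector arithmetic. A minimal numpy sketch, assuming you've downloaded the published GloVe vectors (the file path is a placeholder):

```python
import numpy as np

def load_glove(path):
    """Load GloVe vectors from a text file into a dict: word -> vector."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.split()
            vectors[word] = np.array(values, dtype=np.float32)
    return vectors

def analogy(vectors, a, b, c, topn=1):
    """Solve a : b :: c : ?  via the linear relation b - a + c."""
    target = vectors[b] - vectors[a] + vectors[c]
    scored = []
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        sim = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target))
        scored.append((sim, word))
    return [w for _, w in sorted(scored, reverse=True)[:topn]]

# vectors = load_glove("glove.6B.300d.txt")        # path is a placeholder
# analogy(vectors, "man", "king", "woman")          # typically ["queen"]
```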
No, we don't know how they represent that knowledge. But performing experiments to probe how similar the representations are is a lot easier than knowing all that.
They're called black boxes because we can't explain what the weights are learning during training, or which weights are responsible for shifting or producing the outputs the model does.
It's like, biologists know how neurons communicate signals with each other. But is that knowledge enough to explain human behavior? Not even close.
> This makes me wonder if these models are perfect universal translators once they “grasp” a concept.
I don't have a paper reference, but earlier this month I saw claims that, in training GPT-4, additional training on a specific task in a single language (e.g. English) was observed to improve performance on that task across many languages, strongly suggesting the model is actually learning concepts, not just words.
If that's the case, then I think we have indeed accidentally made a universal translator (limited to humans, though).
Ergo, if we train on, say, a bird or primate, or perhaps dolphins, it might be able to grasp the concepts those animals use? Say, lots of video footage with context?
My layman understanding is that, at a high level, the transformer model is performing mathematical operations on the data, based on a complex series of formulas (the "model") derived from the weights set by training.
Then it's able to take in new data, perform the math, and output what it thinks comes next. Is it a big stretch of the imagination to think that maybe such "models" (mathematical formulas) exist in our brains too, and maybe we have unlocked one of them?
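That's roughly it. As a toy illustration, here's the kind of arithmetic a single self-attention layer performs, with random stand-in weights and made-up dimensions (no training involved):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy dimensions: a sequence of 4 tokens, each an 8-dim vector.
seq_len, d_model = 4, 8
x = np.random.randn(seq_len, d_model)   # token representations

# The "weights set by training" -- random matrices here, for illustration.
W_q = np.random.randn(d_model, d_model)
W_k = np.random.randn(d_model, d_model)
W_v = np.random.randn(d_model, d_model)

# Self-attention: each token mixes information from the others,
# weighted by how strongly their queries and keys match.
Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / np.sqrt(d_model)     # (seq_len, seq_len) affinities
attn = softmax(scores, axis=-1)         # each row sums to 1
output = attn @ V                       # new representation per token
```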
My layman take on LLMs is that they map tokens to points in an absurdly high-dimensional vector space (on the order of tens to hundreds of thousands of dimensions). The training process shifts those points around to make related tokens closer, which eventually ends up encoding pretty much any kind of relationship you could think of between the tokens, semantic or otherwise, as proximity in one or more dimensions. The latent space has enough dimensions to accommodate all those relationships, which is how even tasks that require complex understanding of abstract concepts still boil down to adjacency search in that space.
In other words: the LLM isn't learning algorithms, it's building a high-dimensional point cloud, where things related to each other are closer together.
Now, IIRC the visual model mentioned above works with a sub-1000-dimensional latent space, which to me feels like not enough... space to fit generalized concepts in. But then, the prompts to txt2img and img2img models I've seen seem more like additive modifiers, with individual tokens mostly independent of each other, so maybe that explanation still fits.
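The "adjacency search" framing looks roughly like this in practice: relatedness reduces to nearest-neighbor lookup by cosine similarity over the point cloud. Everything in this sketch (the vocabulary and the vectors) is made up:

```python
import numpy as np

# Made-up embedding matrix: each row is one token's point in latent space.
vocab = ["ambulance", "truck", "banana", "siren", "fruit"]
emb = np.random.randn(len(vocab), 512)   # placeholder vectors

def nearest(word, k=2):
    """Return the k tokens whose points lie closest (by cosine) to `word`."""
    q = emb[vocab.index(word)]
    sims = emb @ q / (np.linalg.norm(emb, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)
    return [vocab[i] for i in order if vocab[i] != word][:k]

# With real trained embeddings, nearest("ambulance") would tend to return
# semantically related tokens such as "siren" or "truck".
```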
As far as I understand, this is true. I like to think about it like this: there is some magic formula f(x)=? that perfectly maps our inputs to our outputs (e.g. image captions to images, or input texts to their longer continuations), but we don't know how to find it. So we build a space with incredibly many dimensions, and we learn some mapping in this space, which is hopefully very close to the magic formula.
Our brains fundamentally work in a similar way, in that there are mappings from inputs to outputs through our senses and our nervous system, and we can literally determine neural circuits in mammalian brains through topological analysis of this magical function![0]
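As a toy version of that picture: pretend the magic formula is a one-dimensional function we can sample, then fit a tiny network to it by plain gradient descent. Everything below is made up for illustration:

```python
import numpy as np

# The unknown "magic formula" we pretend not to know how to write down.
def magic_f(x):
    return np.sin(3 * x) + 0.5 * x

# Training data: sampled inputs and the outputs the magic formula produces.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(256, 1))
Y = magic_f(X)

# A tiny one-hidden-layer network as our learnable mapping.
W1, b1 = rng.normal(0, 0.5, (1, 64)), np.zeros(64)
W2, b2 = rng.normal(0, 0.5, (64, 1)), np.zeros(1)
lr = 0.05

for step in range(5000):
    # Forward pass
    h = np.tanh(X @ W1 + b1)
    pred = h @ W2 + b2
    err = pred - Y                      # distance from the magic formula

    # Backward pass: plain gradient descent on mean squared error
    grad_pred = 2 * err / len(X)
    grad_W2 = h.T @ grad_pred
    grad_b2 = grad_pred.sum(axis=0)
    grad_h = grad_pred @ W2.T * (1 - h ** 2)
    grad_W1 = X.T @ grad_h
    grad_b1 = grad_h.sum(axis=0)

    W1 -= lr * grad_W1; b1 -= lr * grad_b1
    W2 -= lr * grad_W2; b2 -= lr * grad_b2

# The learned mapping now approximates magic_f on the training range,
# without anyone ever writing the formula into the network.
```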
You're right about how machine learning is learning to approximate a function - most machine learning systems are trained with stochastic gradient descent, an optimization method which can, in theory, approximate such a function.
The surprise was that people (me, at least!) thought the computation and amount of data required to learn a function like "translate English to French" would be completely impractical to ever realize.
I think it's an open question whether humans work like that, though we probably do.
This is really fascinating. Assuming it is true, it could imply that everything we "learn" is essentially a training process in our brain to store a new model/function. As humans, we've figured out how to transfer these models between our brains through communication. Maybe it is possible to "upload" a model to the brain like Neo learning kung-fu...
We will find that language and visual perception are related. Geometry is the underlying structure in language and mathematics, and most of our logical concepts stem from geometric relations and constraints.