> all they can do is, in fact, sample from a compression of historical texts using a weighted probability metric.
I don't think that's all they can do.
I think they know more than what is explicitly stated in their training sets.
They can generalize knowledge and generalize relationships between the concepts that are in the training sets.
They're currently mediocre at it, but the results we observe from SOTA generative models are not explainable without accepting that they build an internal model of the world, i.e., that they are more than just a decompression algorithm.
I'm going to step away from LLMs for a moment, but: how are video generation models capable of creating videos with accurate shadows and lighting that are consistent across the entire frame and between frames?
You can't do that simply by taking a weighted average of the sections of videos you've seen in your training set.
You need to create an internal 3D model of the objects in the scene and their relative positions in space across the length of the video. And no one told the model explicitly how to do that; it learned to do it "on its own".
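To make "internal 3D model" concrete, here's a toy sketch (my own illustration, with made-up names like `LIGHT` and `shadow_of`; it says nothing about how these models actually work internally). Once you commit to a single 3D scene and a single light, the shadow in every frame falls out of the geometry for free, which is exactly the cross-frame consistency that per-frame pixel averaging can't guarantee:

```python
# Toy sketch: shadows as a deterministic function of shared geometry.
# Light direction (pointing down and to the right), ground plane z = 0.
LIGHT = (1.0, 0.0, -2.0)

def shadow_of(point):
    """Project a 3D point along the light direction onto the plane z = 0."""
    x, y, z = point
    lx, ly, lz = LIGHT
    t = -z / lz                      # steps along the light ray to reach z = 0
    return (x + t * lx, y + t * ly)

# A ball moving through three "frames": its shadow stays consistent with
# its position in every frame because both come from the same geometry.
for frame, ball in enumerate([(0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (2.0, 0.0, 1.5)]):
    print(f"frame {frame}: ball at {ball}, shadow at {shadow_of(ball)}")
```

Nothing in that loop "remembers" previous frames, yet the shadow tracks the ball perfectly, because the consistency lives in the scene representation, not in the pixels.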
>You need to create an internal 3D model of the objects in the scene and their relative positions in space across the length of the video. And no one told the model explicitly how to do that; it learned to do it "on its own".
Compression is understanding. If you have a model that explains shadows, you can compress your video data much better, since you "understand" how shadows work.
I think the same principle applies to LLMs.
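To put a number on the compression point, here's a minimal sketch (again my own toy example, with invented helpers like `shadowed` and `bits`, assuming ideal arithmetic coding where a symbol costs -log2(p) bits under the coder's model). A coder that "understands" the scene's shadow geometry only pays for the noise, while an ignorant coder pays a full bit per pixel:

```python
# Toy illustration (not any real codec): a model that predicts shadows
# lets us encode a frame in fewer bits, because per Shannon the cost of
# a symbol is -log2(p) under the model's distribution.
import math
import random

random.seed(0)

WIDTH = 64

def shadowed(x, occluder_x=13):
    # Pixels "behind" the hypothetical occluder (relative to the light)
    # are dark.
    return x > occluder_x

# Synthetic frame: 1 = lit, 0 = shadow, with a little noise.
frame = [0 if shadowed(x) else 1 for x in range(WIDTH)]
for i in random.sample(range(WIDTH), 4):   # flip 4 noisy pixels
    frame[i] ^= 1

def bits(frame, p_lit):
    """Total code length if pixel x is coded with probability p_lit(x)
    of being lit (ideal arithmetic coding)."""
    total = 0.0
    for x, v in enumerate(frame):
        p = p_lit(x) if v == 1 else 1 - p_lit(x)
        total += -math.log2(max(p, 1e-9))
    return total

# Ignorant model: every pixel is a coin flip -> 1 bit each.
naive = bits(frame, lambda x: 0.5)

# "Understanding" model: knows the geometry, predicts the shadow,
# and only pays a real price for the noisy pixels.
smart = bits(frame, lambda x: 0.05 if shadowed(x) else 0.95)

print(f"no model:     {naive:.1f} bits")   # 64.0 bits
print(f"shadow model: {smart:.1f} bits")   # roughly 22 bits
```

That's just Shannon's point: better prediction is better compression, so any model trained on a compression-like objective is rewarded for learning how shadows actually work.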