
If this explanation is true, it suggests a very important experiment: develop AI models that train on video plus (much less than 500 billion words of) text, then evaluate them on the same test problems GPT-3 is evaluated on.



Indeed.

It's not entirely obvious how to do this, though. There needs to be a common training representation, and that's pretty hard.

In video, the sequence of frames is important, but each frame is an image. In text, the sequence of characters is important.

Maybe something that can accept sequences (of some length) of bytes, where the bytes may be an image or may be a single character, would work. But a unified representation of images and words for training is probably the first step toward that.
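
As a rough sketch of what that shared byte space might look like (the names and token layout here are hypothetical, just one way to do it): text bytes and image pixel bytes get distinct ranges in a single vocabulary, so one sequence model can consume either modality in one stream.

    import numpy as np

    # Hypothetical unified token space: text bytes and image bytes
    # occupy disjoint id ranges in one vocabulary.
    TEXT_OFFSET = 0      # text bytes -> token ids 0..255
    IMAGE_OFFSET = 256   # pixel bytes -> token ids 256..511
    SEP_TOKEN = 512      # separator between frames / modalities

    def encode_text(s: str) -> list[int]:
        """UTF-8 bytes become tokens 0..255."""
        return [TEXT_OFFSET + b for b in s.encode("utf-8")]

    def encode_frame(frame: np.ndarray) -> list[int]:
        """Quantize a grayscale frame to bytes, flatten in raster
        order; each pixel byte becomes a token 256..511."""
        pixels = frame.astype(np.uint8).flatten()
        return [IMAGE_OFFSET + int(p) for p in pixels]

    def encode_clip(frames: list[np.ndarray], caption: str) -> list[int]:
        """One training sequence: frames in temporal order, then the
        caption, with separators marking modality boundaries."""
        tokens: list[int] = []
        for frame in frames:
            tokens.extend(encode_frame(frame))
            tokens.append(SEP_TOKEN)
        tokens.extend(encode_text(caption))
        return tokens

    # Example: two tiny 4x4 frames plus a caption form one sequence.
    frames = [np.zeros((4, 4)), np.ones((4, 4)) * 255]
    print(encode_clip(frames, "a light turns on")[:12])

The separator token lets the model tell modality and frame boundaries apart without needing a second input channel.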

Based on the current state of the art and the research problems that still need to be solved, it's probably three years before we're in a position to contemplate the kind of training suggested here.


Another very important aspect is that humans interact with the world they are observing. It isn’t just passive processing of data. Training a model on video may help over text alone, but the interactivity is still missing.
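
To make the distinction concrete, here's a toy sketch (Environment, choose_action, and learn are all hypothetical placeholders, not an existing API): in passive training the data stream is fixed in advance, while in interactive training the learner's own actions determine what it observes next.

    import random

    class Environment:
        """A toy world whose next observation depends on the action."""
        def __init__(self) -> None:
            self.state = 0

        def step(self, action: int) -> tuple[int, float]:
            self.state += action
            reward = 1.0 if self.state % 2 == 0 else 0.0
            return self.state, reward

    def choose_action(observation: int) -> int:
        # Placeholder policy; a real agent would learn this mapping.
        return random.choice([-1, 1])

    def learn(observation: int, reward: float = 0.0) -> None:
        pass  # placeholder for a model update

    def passive_training(dataset: list[int]) -> None:
        # The data stream is fixed in advance; the learner
        # cannot intervene in what it sees.
        for observation in dataset:
            learn(observation)

    def interactive_training(env: Environment, steps: int) -> None:
        # The learner's actions shape the data it sees next.
        observation = 0
        for _ in range(steps):
            action = choose_action(observation)
            observation, reward = env.step(action)
            learn(observation, reward)

    passive_training([1, 2, 3])
    interactive_training(Environment(), steps=10)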



