I think it's Ylecun who stated a few times that video was the better way to train large models as it's more information dense.
The results really are impressive. Being able to generate such high quality videos, to extend videos in the past and in the future shows how much the model "understands" the real world, objects interaction, 3D composition, etc...
Although image generation already requires the model to know a lot about the world, i think there's really a huge gap with video generation where the model needs to "know" 3D, objects movements and interactions.
The results really are impressive. Being able to generate such high quality videos, to extend videos in the past and in the future shows how much the model "understands" the real world, objects interaction, 3D composition, etc...
Although image generation already requires the model to know a lot about the world, i think there's really a huge gap with video generation where the model needs to "know" 3D, objects movements and interactions.