You're not supposed to. Image generation and modeling scene dynamics are hard tasks, and thumbnail scale is where we are at the moment. Nevertheless, those and other papers do demonstrate that NNs are perfectly capable of learning 'objectness' from video footage (to which we can add GAN papers showing that GANs learn 3D movement from static 2D images). More directly connected to image codecs, there's work on NN prediction of optical flow and inpainting.