Some comments: this is Google, so we'll probably never get to use it directly.
That said, the idea is very interesting -- train the model to generate a small, coarse representation that spans the full duration of the video, then upscale in both time and pixels.
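To make the shape of that concrete, here's a rough sketch of what such a coarse-to-fine cascade could look like. The function names and stub implementations are mine (numpy interpolation standing in for the actual generation and super-resolution stages), not anything taken from the paper:

    # Hypothetical sketch of the coarse-to-fine pipeline described above.
    # The "models" are stubbed with numpy ops just to show the data flow.
    import numpy as np

    def generate_keyframes(prompt, n_frames=16, h=64, w=64):
        # Stand-in for the base model: a low-fps, low-res clip that spans
        # the *entire* duration of the video (the "full-time" part).
        rng = np.random.default_rng(0)  # prompt ignored in this stub
        return rng.random((n_frames, h, w, 3))  # (T, H, W, C)

    def temporal_upsample(video, factor=4):
        # Stand-in for temporal super-resolution: fill in intermediate frames.
        t_fine = np.linspace(0, video.shape[0] - 1, video.shape[0] * factor)
        idx = np.clip(np.round(t_fine).astype(int), 0, video.shape[0] - 1)
        return video[idx]  # a real model would synthesize motion, not repeat frames

    def spatial_upsample(video, factor=4):
        # Stand-in for spatial super-resolution: upscale pixels per frame.
        return video.repeat(factor, axis=1).repeat(factor, axis=2)

    coarse = generate_keyframes("a dog chasing a ball")  # 16 x 64 x 64, whole clip
    video = spatial_upsample(temporal_upsample(coarse))  # 64 x 256 x 256
    print(video.shape)

The point is only the ordering: the whole-clip structure gets decided first at low cost, and the expensive detail (frames and pixels) gets filled in afterwards.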
Essentially, we have seen models adding depth maps. This one adds a 'time map' as another dimension.
Coherence is pretty good, to my eye. The jankiness seems to be more about the model deciding what something should 'do' over time, whereas a lot of other models struggle with keeping coherence frame to frame. The big insight from the Googlers is that you can condition / train / generate on temporal coherence as its own thing, then fill in the frames.
I think this is likely copyable by any number of the model providers out there; nothing jumps out as not implementable by Stability, for instance.