Some comments: this is Google, so we'll probably never get to use it directly.
That said, the idea is very interesting -- train the model to generate a small, coarse representation that spans the full duration of the video, then upscale in both time and pixels.
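To make the shape of that concrete, here's a rough sketch of what such a coarse-to-fine cascade could look like. The function names and stub implementations are mine (numpy interpolation standing in for the actual generation and super-resolution stages), not anything taken from the paper:

    # Hypothetical sketch of the coarse-to-fine pipeline described above.
    # The "models" are stubbed with numpy ops just to show the data flow.
    import numpy as np

    def generate_keyframes(prompt, n_frames=16, h=64, w=64):
        # Stand-in for the base model: a low-fps, low-res clip that spans
        # the *entire* duration of the video (the "full-time" part).
        rng = np.random.default_rng(0)  # prompt ignored in this stub
        return rng.random((n_frames, h, w, 3))  # (T, H, W, C)

    def temporal_upsample(video, factor=4):
        # Stand-in for temporal super-resolution: fill in intermediate frames.
        t_fine = np.linspace(0, video.shape[0] - 1, video.shape[0] * factor)
        idx = np.clip(np.round(t_fine).astype(int), 0, video.shape[0] - 1)
        return video[idx]  # a real model would synthesize motion, not repeat frames

    def spatial_upsample(video, factor=4):
        # Stand-in for spatial super-resolution: upscale pixels per frame.
        return video.repeat(factor, axis=1).repeat(factor, axis=2)

    coarse = generate_keyframes("a dog chasing a ball")  # 16 x 64 x 64, whole clip
    video = spatial_upsample(temporal_upsample(coarse))  # 64 x 256 x 256
    print(video.shape)

The point is only the ordering: the whole-clip structure gets decided first at low cost, and the expensive detail (frames and pixels) gets filled in afterwards.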
Essentially, we have seen models adding depth maps. This one adds a 'time map' as another dimension.
Coherence is pretty good, to my eye. The jankiness seems to be more about the model deciding what something should 'do' over time, whereas a lot of other models struggle with keeping coherence frame to frame. The big insight from the Googlers is that you can condition / train / generate on temporal coherence as its own thing, then fill in the frames.
I think this is likely copyable by any number of the model providers out there; nothing jumps out as not implementable by Stability, for instance.