It appears that what they're doing here is simply extracting keyframes from the video, using them to compose a photosynth, then converting the autoplay of the synth to a video. If you load a photosynth and press "c", you can even see the same point clouds and scene reconstruction seen on the research page.

Source: I worked on photosynth.
To me it seems like they are just taking frames subject to two constraints and an objective: on average one keyframe every 10 frames, a maximum gap of, say, 80 frames, and the aggregate frame-to-frame distance minimized. In other words, minimize that metric subject to those two constraints. That's all. It's a constrained nonlinear minimization problem.
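For concreteness, here's a rough sketch of what I mean as a dynamic program (my own toy code, not anything from their paper): `features` is a stand-in for per-frame descriptors and `frame_distance` is a placeholder for whatever real frame-to-frame metric you'd actually use; the 10 and 80 are just the numbers from above.

    # Toy sketch of the naive subsampling idea, not the authors' method.
    # Pick ~one frame in ten, never skip more than 80 frames, and among all
    # such selections take the one with minimal aggregate distance.

    def frame_distance(features, i, j):
        # Placeholder: in practice this would measure how hard it is to jump
        # from frame i straight to frame j (camera motion, appearance change, ...).
        return abs(features[j] - features[i])

    def select_keyframes(features, avg_stride=10, max_gap=80):
        n = len(features)
        k = max(2, n // avg_stride)          # budget: ~one keyframe per 10 frames
        INF = float("inf")

        # cost[m][j]: minimal aggregate distance of a selection of m keyframes
        # that starts at frame 0 and ends at frame j. back[m][j]: predecessor.
        cost = [[INF] * n for _ in range(k + 1)]
        back = [[-1] * n for _ in range(k + 1)]
        cost[1][0] = 0.0

        for m in range(2, k + 1):
            for j in range(1, n):
                for i in range(max(0, j - max_gap), j):   # enforce the max-gap constraint
                    if cost[m - 1][i] == INF:
                        continue
                    c = cost[m - 1][i] + frame_distance(features, i, j)
                    if c < cost[m][j]:
                        cost[m][j] = c
                        back[m][j] = i

        if cost[k][n - 1] == INF:
            raise ValueError("constraints are infeasible for this clip length")

        # Walk the back-pointers from the last frame to recover the keyframe indices.
        path, j = [], n - 1
        for m in range(k, 1, -1):
            path.append(j)
            j = back[m][j]
        path.append(0)
        return path[::-1]

    if __name__ == "__main__":
        import random
        features = [random.random() for _ in range(300)]  # stand-in for 300 frames
        print(select_keyframes(features))                 # ~30 keyframe indices

In practice the distance would presumably come from estimated camera motion or reconstruction quality, which is where the photosynth machinery would earn its keep.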
EDIT: After reading their description, I agree they are going the photosynth route. Why not? They have the technology that you worked on. And they say that the naive subsampling I described above doesn't work...