Running a fast network on sparse data and then calling one optimized for a specific task on a subset seems like a good optimization, and for video we are probably going to need it.
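A minimal sketch of that idea, assuming a cheap full-frame "proposal" pass followed by an expensive pass on only the flagged subset. Both networks here are hypothetical stand-ins, not real models:

```python
# Hypothetical two-stage cascade: a cheap pass over the whole frame,
# then an expensive specialist pass only on the regions it flagged.

def cheap_proposal_net(frame):
    """Fast, low-accuracy pass: return regions that might matter."""
    return [region for region in frame if region["score"] > 0.3]

def expensive_task_net(region):
    """Slow, accurate pass, run only on the subset the cheap net flagged."""
    return {"label": region["hint"], "score": min(1.0, region["score"] * 2)}

def process_frame(frame):
    candidates = cheap_proposal_net(frame)               # sparse, full frame
    return [expensive_task_net(r) for r in candidates]   # dense, subset only

frame = [
    {"hint": "car",  "score": 0.8},
    {"hint": "sky",  "score": 0.1},   # skipped by the cheap pass
    {"hint": "sign", "score": 0.5},
]
results = process_frame(frame)
print([r["label"] for r in results])  # ['car', 'sign']
```

The point is just the shape of the pipeline: the expensive model never sees most of the frame.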
After you parse what an object is, tracking it doesn't take anywhere near the effort of original segmentation. No need to re-evaluate until something changes.
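The "no need to re-evaluate until something changes" part could look like this. The trigger here (tracker confidence decaying until it drops below a threshold) is an assumption, and `segment`/`track` are toy stand-ins:

```python
# Sketch of "segment once, then track cheaply": full segmentation runs
# only when the cheap tracker's confidence drops below a threshold.

def segment(frame):
    """Expensive: full segmentation of the frame."""
    return {"mask": frame, "confidence": 1.0}

def track(state, frame):
    """Cheap: nudge the existing mask; confidence decays as the scene drifts."""
    drift = abs(frame - state["mask"])
    return {"mask": frame, "confidence": max(0.0, state["confidence"] - 0.1 * drift)}

def run(frames, threshold=0.5):
    state, expensive_calls = None, 0
    for frame in frames:
        if state is None or state["confidence"] < threshold:
            state = segment(frame)       # re-evaluate: something changed
            expensive_calls += 1
        else:
            state = track(state, frame)  # cheap per-frame update
    return expensive_calls

# 10 slowly drifting frames: segmentation runs only 2 times, not 10.
calls = run(list(range(10)))
print(calls)  # 2
```

Real systems would key the re-trigger on occlusion, appearance change, or tracker loss rather than a fixed decay, but the cost structure is the same.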
Maybe even use activations to turn networks on and off: "Oh, text, better load OCR into memory."
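That could be as simple as lazy loading keyed on a trigger signal. Everything here is hypothetical (the loader names, the signal strings); it just shows the pattern of keeping specialists out of memory until an activation fires:

```python
# Hypothetical lazy loader: specialist networks (OCR, faces, ...) stay
# out of memory until a trigger activation demands them.

LOADERS = {
    "text": lambda: "ocr-model",    # stand-in for loading real OCR weights
    "face": lambda: "face-model",
}

class SpecialistPool:
    def __init__(self):
        self.loaded = {}

    def on_activation(self, signal):
        """On a 'text' activation, load OCR into memory on first demand."""
        if signal in LOADERS and signal not in self.loaded:
            self.loaded[signal] = LOADERS[signal]()
        return self.loaded.get(signal)

pool = SpecialistPool()
pool.on_activation("text")
print(sorted(pool.loaded))  # ['text'] -- only the triggered specialist is resident
```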
And it does inform a lot of our built world
It's strange to think that when you're watching a movie, only about 10% of it is in focus at any moment.
Eye movement does provide a lot of information to other people, and I think the physical movement also produces feedback useful for estimating velocities. Mimicking biology is often a good bet.