Neat. A while ago I was trying to figure out how to run TensorFlow on video from the camera I have in front of my house.
I had worked out how to pre-process the video using OpenCV to mask out whatever was moving and export that to static images, then train TF on those images.
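The masking step was essentially background subtraction. Here's a minimal sketch of that idea, assuming OpenCV's MOG2 subtractor; the file name, motion threshold, and output path are placeholders, not my original setup:

    import cv2

    cap = cv2.VideoCapture("driveway.mp4")  # hypothetical input file
    subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = subtractor.apply(frame)
        # MOG2 marks shadows as 127; keep only confident foreground pixels.
        _, mask = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)
        # Black out everything that isn't moving, then export for training.
        moving_only = cv2.bitwise_and(frame, frame, mask=mask)
        if cv2.countNonZero(mask) > 500:  # skip frames with no real motion
            cv2.imwrite("frames/frame_%06d.png" % frame_idx, moving_only)
        frame_idx += 1
    cap.release()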
The resulting system worked well enough to say that a USPS truck was outside, but it couldn't tell the difference between the truck driving by and the truck stopping to deliver something to the mailbox.
I had an idea to use OpenCV to track the objects, then export a graph of the X,Y coordinates of the moving object over time, and train TF on the graphs, but I never got around to testing it.
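For what it's worth, a hedged sketch of that untested idea: track the largest moving blob per frame and record its (x, y) centroid over time. Everything here is illustrative, not code I actually wrote:

    import cv2

    cap = cv2.VideoCapture("driveway.mp4")  # hypothetical input
    subtractor = cv2.createBackgroundSubtractorMOG2()

    trajectory = []  # (frame_index, x, y) of the largest moving blob
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = subtractor.apply(frame)
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        if contours:
            largest = max(contours, key=cv2.contourArea)
            m = cv2.moments(largest)
            if m["m00"] > 0:  # avoid division by zero on empty blobs
                trajectory.append((frame_idx,
                                   m["m10"] / m["m00"],   # centroid x
                                   m["m01"] / m["m00"]))  # centroid y
        frame_idx += 1
    cap.release()

A truck stopping at the mailbox should show up as a flat stretch in the trajectory, while a drive-by is a roughly monotonic sweep across x.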
I wonder if this project would do a better job, or still have issues because all the videos are almost identical.
One way to solve this problem is to give your neural network a stack of successive frames, instead of feeding them in one by one. If the framerate is too high, you may need to skip some of the frames to get an idea of speed/acceleration.
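A minimal sketch of the stacking idea (the shapes and stride value are assumptions, not from the comment above):

    import numpy as np

    def stack_frames(frames, num_frames=8, stride=4):
        """Take every `stride`-th frame so the stack spans enough wall-clock
        time to capture speed/acceleration, then stack along a new axis."""
        picked = frames[::stride][:num_frames]
        # Result shape: (num_frames, H, W, C). Feed this to a 3D conv net,
        # or concatenate along the channel axis for a 2D net instead.
        return np.stack(picked, axis=0)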
I'll also link another Google paper (sorry, no open access). It's only tangentially related, but it covers meta-learning of knowledge graphs for recommending the next video to watch on YouTube. It may possibly be extended to robots learning by watching humans perform complex tasks ;)
Recommending What Video to Watch Next: A Multitask Ranking System
The interesting thing about TinyVideoNets is that they classify at the end, after running identification at each layer. Depending on the model's parameter budget, is it more efficient to classify at each layer? Does it give a better ranking of the results? I wonder why they didn't explain that in the paper.
The first performance above 34% on Moments-in-Time.
That MiT test looks like it begins to approximate general intelligence. Can they get the other 65% (or whatever it actually is) with the existing paradigm?
I looked for a bit at the MiT dataset paper (https://arxiv.org/pdf/1801.03150.pdf) and I'm honestly not sure what human-level / general-intelligence performance would even be. There's only a single ground-truth category per video, but the categories overlap somewhat and multiple categories can apply. And for some videos the ground-truth category is only 75-85% agreed upon by humans. The dataset was not constructed so that humans could score 100% on it.
I guess to evaluate accuracy fairly you'd have to run another Mechanical Turk project: run a classifier over the data, take its top 5 predictions, replace the bottom one with the ground truth if it's not already there, and then quiz the workers on the best category. But it's a million videos.
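The top-5 substitution step itself would be simple enough; a purely illustrative sketch:

    def build_worker_choices(scores, ground_truth, k=5):
        """scores: dict mapping category name -> classifier score."""
        top_k = sorted(scores, key=scores.get, reverse=True)[:k]
        if ground_truth not in top_k:
            top_k[-1] = ground_truth  # swap in for the lowest-scoring slot
        return top_k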