This is an interesting and well-written post, but in my experience, for practical object tracking problems, a fine-tuned object detection model combined with a centroid tracking algorithm works well enough. It's much more computationally efficient, and you only have to annotate object examples rather than annotate object tracks frame by frame.
I would have loved to see a benchmark comparing the two approaches to motivate the much more complex and labour-intensive alternative :)
Occlusions often don't matter in practice for dense object tracking, e.g. counting people entering a bus or measuring gym equipment utilization, where you don't actually care if the person IDs switch.
Centroid tracking works well in practice at 2 FPS for the bus use case and several others.
More complex approaches have their place, but an important limiting factor (e.g. in the bus project I worked on) is that the hardware required to run them at a higher frame rate is often too expensive to be feasible.
There are definitely limitations to centroid tracking, and use cases where it doesn't work, like tracking specific faces moving through a crowd, but it's a useful tool that has its place in practice.
Your blog post is awesome, I’m just pointing out that a simpler method often works well in practice and is much less intimidating for beginners :)
1. Detect objects (e.g. using TensorFlow) in each video frame
2. Initialize an incrementing ID for each object at the first frame
3. Compute the centroid of each object
4. For each object:
4.1. Compute the Euclidean distance to every object centroid in the previous frame. The nearest object from the previous frame is the candidate assignment.
4.2.a) If the distance is below a hand-set threshold, carry the previous frame's object ID over to the object in the current frame
4.2.b) If the distance is above the threshold, assign a new ID
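In case it helps, here's a rough Python sketch of steps 1-4. Everything in it is illustrative rather than canonical: I'm assuming (x1, y1, x2, y2) boxes, a hand-set max_dist threshold, and that you plug in your own detector for step 1.

    import math
    from itertools import count

    _next_id = count(1)  # step 2: incrementing ID source

    def centroid(box):
        # step 3: centroid of an (x1, y1, x2, y2) bounding box
        x1, y1, x2, y2 = box
        return ((x1 + x2) / 2, (y1 + y2) / 2)

    def track_frame(prev_tracks, boxes, max_dist=50.0):
        # prev_tracks: {object_id: centroid} from the previous frame
        # boxes: this frame's detections from your detector (step 1)
        tracks = {}
        for box in boxes:
            c = centroid(box)
            # step 4.1: nearest previous centroid is the candidate assignment
            best_id, best_dist = None, float("inf")
            for obj_id, prev_c in prev_tracks.items():
                d = math.dist(c, prev_c)
                if d < best_dist:
                    best_id, best_dist = obj_id, d
            if best_id is not None and best_dist < max_dist and best_id not in tracks:
                tracks[best_id] = c          # step 4.2.a: keep the old ID
            else:
                tracks[next(_next_id)] = c   # step 4.2.b: assign a new ID
        return tracks

One caveat: greedy nearest-neighbour matching like this can swap IDs when objects cross paths. If that matters for your use case, the usual upgrade is optimal assignment over the full distance matrix (e.g. scipy.optimize.linear_sum_assignment), which is still cheap at 2 FPS.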
I explained it during my PyConZA keynote a few weeks ago. The vid isn't out yet but the slides are available here and it's much easier to explain visually: http://bit.ly/pycon_keynote_slides
Feel free to email me if you get stuck or need some help implementing it :)
I think the idea of making Kalman filter parameters learnable, so a deep learning algorithm can judge the appearance of the object and not just its location, is super neat. Watching the algorithm in action makes me uncomfortable though. Some creepy things are in store for our future.
In some sense this is what a particle filter already does, but this allows greater learning capacity (and hence also a huge risk of overfitting to the surface statistics of a given dataset).
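To make that concrete, here's a minimal NumPy sketch of the classic, hand-tuned Kalman filter for a 2D centroid under a constant-velocity model. In the learned variant being discussed, the noise covariances Q and R (or the gain itself) would be produced by a network conditioned on appearance features instead of being set by hand; all values below are illustrative, not from the post.

    import numpy as np

    dt = 1.0  # time step between frames
    # Constant-velocity model: state = [x, y, vx, vy]
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1,  0],
                  [0, 0, 0,  1]], dtype=float)  # state transition
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)   # we only observe position
    Q = np.eye(4) * 1e-2  # process noise -- hand-set here; a candidate for learning
    R = np.eye(2) * 1e-1  # measurement noise -- likewise could come from a network

    def kalman_step(x, P, z):
        # Predict
        x = F @ x
        P = F @ P @ F.T + Q
        # Update against measurement z = [x, y]
        S = H @ P @ H.T + R              # innovation covariance
        K = P @ H.T @ np.linalg.inv(S)   # Kalman gain
        x = x + K @ (z - H @ x)
        P = (np.eye(4) - K @ H) @ P
        return x, P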