Does that work? A NN might detect half a face, but then how do you switch over to the "find two dots" technique when there's only one eye? This seems susceptible to a lot of flapping.
I'd expect that the non-NN part would be more of a "track movement of this arbitrary blob" thing rather than a "track movement of this face" thing.
Suppose the NN is only 25% of the speed you need to support the frame rate you want. Then every time you get a new face blob list from the NN, the non-NN tracker would have to track the blobs for 3 frames. My guess is that in most common photography situations where you need face detection, faces won't move very far, or change orientation or lighting very much, in 3 frames.
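Roughly what I have in mind, as a sketch only. `detect` and `track` are placeholders I made up for the NN and the cheap blob tracker, not real library calls, and I'm assuming for simplicity that the NN result is available on the same frame it was run on (a real pipeline would have to deal with the latency):

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical blob representation: (x, y, width, height).
Box = Tuple[int, int, int, int]


def run_pipeline(
    frames: Sequence,
    detect: Callable[[object], List[Box]],
    track: Callable[[object, List[Box]], List[Box]],
    nn_every: int = 4,
) -> List[Tuple[str, List[Box]]]:
    """Run the slow detector on every nn_every-th frame and the cheap
    tracker on the frames in between.

    With nn_every=4 (NN at 25% of the target frame rate), the tracker
    has to carry each blob list across 3 frames.
    """
    results = []
    blobs: List[Box] = []
    for i, frame in enumerate(frames):
        if i % nn_every == 0:
            # Fresh blob list from the NN; this replaces whatever the
            # tracker had drifted to.
            blobs = detect(frame)
            source = "nn"
        else:
            # The tracker only follows arbitrary blobs; it knows nothing
            # about faces.
            blobs = track(frame, blobs)
            source = "tracker"
        results.append((source, blobs))
    return results
```

The point is that the tracker never has to decide whether something is a face. It just follows whatever boxes the NN last handed it, so there is no switching between techniques mid-face.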