I like all of your suggestions. I've been thinking about using TinyYOLOv3 as well. Provided the training set is considerably bigger than my own (I created about 550 samples and fine-tuned the model on them), you could end up with a very capable detection system that uses very few resources.

Object tracking is also a very good idea, and I will consider it. Anchor-box tuning is another good one.
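For anchor-box tuning, the usual approach for YOLO-style detectors is to cluster the widths and heights of the labelled boxes with k-means, using 1 - IoU as the distance. Below is a minimal sketch of that idea in plain Python. The function names (`iou_wh`, `kmeans_anchors`) and the `(width, height)` input format are my own assumptions for illustration; this is not code from the project.

```python
import random


def iou_wh(box, anchor):
    """IoU of two boxes given as (width, height), both anchored at the origin."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (width, height) pairs into k anchors, assigning by highest IoU.

    Hypothetical helper: `boxes` is a list of (w, h) tuples taken from the
    training labels, in whatever units the detector config expects.
    """
    rng = random.Random(seed)
    anchors = rng.sample(boxes, k)  # start from k of the labelled boxes
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Highest IoU is the same as smallest (1 - IoU) distance.
            best = max(range(k), key=lambda i: iou_wh(box, anchors[i]))
            clusters[best].append(box)
        new_anchors = []
        for i, members in enumerate(clusters):
            if not members:
                new_anchors.append(anchors[i])  # keep an empty cluster's anchor
                continue
            w = sum(b[0] for b in members) / len(members)
            h = sum(b[1] for b in members) / len(members)
            new_anchors.append((w, h))
        if new_anchors == anchors:
            break  # assignments stopped changing
        anchors = new_anchors
    # Return anchors ordered by area, smallest first.
    return sorted(anchors, key=lambda a: a[0] * a[1])
```

You would call `kmeans_anchors(boxes, k)` with `k` equal to the number of anchors the model config expects, then paste the resulting pairs into the config. Results depend on the random start, so it is worth trying a few seeds and keeping the set with the best average IoU.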

Also, IMHO the CRAFT text detector that I'm using should be removed. Instead, just use a very well-trained text recognizer (like the CRNN I'm using). The text detector is computationally expensive since it's based on the VGG-16 model.

Then convert the models to use mixed precision.
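To make the idea concrete: mixed precision means storing weights and activations in 16-bit floats while accumulating sums in a wider type. The sketch below only illustrates the numerics with the standard library; the helper names are hypothetical, and Python's float accumulator is 64-bit rather than the 32-bit one a GPU would use. The actual conversion would be done with the deep learning framework's own mixed-precision tools, not with code like this.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    # The "e" format packs a value into 2 bytes (half precision).
    return struct.unpack("<e", struct.pack("<e", x))[0]


def mixed_precision_dot(weights, inputs):
    """Dot product with FP16-rounded operands and a wider accumulator.

    Illustration only: the accumulator here is a Python float (64-bit),
    standing in for the FP32 accumulator used on real hardware.
    """
    w16 = [to_fp16(w) for w in weights]
    x16 = [to_fp16(x) for x in inputs]
    acc = 0.0
    for w, x in zip(w16, x16):
        acc += w * x  # products are summed at full precision
    return acc
```

Each value takes 2 bytes instead of 4, which halves the memory traffic, at the cost of a small rounding error per value (for example, `to_fp16(0.1)` is close to but not exactly `0.1`). Values above roughly 65504 do not fit in half precision, which is why some layers are usually kept in 32-bit.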

All in all, I think the performance improvements could be anywhere between 1 and 2 orders of magnitude (roughly 10x to 100x).