Pretty sure that once the models used for this are pre-trained on lots of video and get larger (2-3 orders of magnitude larger than this), things will quickly improve. This may already exist in prototype form behind closed doors. Think how much LLMs have improved since GPT-2 and GPT-3. Though I imagine doing it in real time and cost-efficiently may be a challenge.