You could isolate the text by comparing neighboring frames and keeping the pixels that don't differ. The background is moving, the text is not.
This would also let you extract the subs as images without having to OCR them into characters. You could just erase all static artifacts (including subs, but also things like watermarks).
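A rough sketch of the "pixels that don't differ" idea, in case it helps. Everything here is an assumption for illustration: `static_mask` and `tolerance` are made-up names, frames are tiny grayscale grids stored as plain Python lists, and a real version would use decoded video frames and probably NumPy/OpenCV.

```python
def static_mask(frames, tolerance=8):
    """Mark pixels whose value stays within `tolerance` across all frames.

    `frames` is a list of same-sized grayscale images, each a list of rows
    of 0-255 ints. `tolerance` (hypothetical default) absorbs small
    compression noise so that static text still counts as unchanged.
    """
    height, width = len(frames[0]), len(frames[0][0])
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            values = [frame[y][x] for frame in frames]
            # Static pixel: brightest and darkest samples are nearly equal.
            row.append(max(values) - min(values) <= tolerance)
        mask.append(row)
    return mask


# Three 2x4 frames: the background changes a lot, the "text" stays near 255.
frames = [
    [[10, 255, 255, 40], [20, 30, 255, 50]],
    [[90, 255, 254, 140], [120, 130, 255, 150]],
    [[200, 253, 255, 220], [5, 230, 255, 90]],
]
mask = static_mask(frames)
for row in mask:
    print("".join("#" if is_static else "." for is_static in row))
```

The catch is that a still shot makes the whole frame look static, so you'd want to pick neighbor frames with enough motion, and the same mask will also pick up watermarks.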
An approach I really want to try is taking a stream of the video without subs (can easily be found online) and subtracting the two. You'd have to deal with differences in resolution and compression between the two, and also handle cases where the background is either white or black, but in theory it should work very well. I haven't had time to dig into this.
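For what it's worth, the subtraction step itself is simple once the two frames line up. A minimal sketch, assuming the frames are already scaled to the same resolution and time-aligned (which is the hard part), with `subtitle_mask` and `threshold` being hypothetical names and values:

```python
def subtitle_mask(subbed, clean, threshold=40):
    """Mark pixels where the subbed frame differs strongly from the clean one.

    Both frames are same-sized grayscale grids (lists of rows of 0-255 ints).
    `threshold` is an assumed value meant to ignore compression differences
    between the two sources.
    """
    return [
        [abs(s - c) > threshold for s, c in zip(subbed_row, clean_row)]
        for subbed_row, clean_row in zip(subbed, clean)
    ]


# Clean frame, and the same frame with white (255) text burned in.
clean = [[100, 102, 98, 250], [100, 100, 100, 100]]
subbed = [[104, 255, 255, 255], [97, 100, 255, 103]]

sub_pixels = subtitle_mask(subbed, clean)
for row in sub_pixels:
    print("".join("#" if is_sub else "." for is_sub in row))
```

The last pixel of the first row shows the white-background problem: the text is there (255), but the clean frame is already 250, so the difference falls under the threshold and the pixel is missed.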
To get access to the vocabulary words. From the article:
> I wanted to get a transcript of the episode’s dialog so I could study the unfamiliar vocabulary. Unfortunately, the video files I have only have hard subtitles