This is a great writeup and super easy to follow. The figures are really nice!
One observation: training a neural net to classify segmented characters is probably overkill. The author observed that the font never changed, but never ended up exploiting this fact. After the very effective preprocessing, thresholding, etc., the characters are almost identical to the 'average' representations the author generated!
I bet it would be enough simply to classify an unknown character by the letter that it shows highest correlation with.
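To make that concrete, here is a minimal sketch of correlation-based classification, not the author's code. It assumes each segmented character and each per-letter 'average' is a binary image of the same size, stored as a list of rows; the names `correlation`, `classify`, and `TEMPLATES` and the tiny 3x3 glyphs are made up for illustration.

```python
from math import sqrt

def correlation(a, b):
    """Pearson correlation between two equally sized images (lists of rows)."""
    xs = [p for row in a for p in row]
    ys = [p for row in b for p in row]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a blank or solid image carries no shape information
    return cov / sqrt(vx * vy)

def classify(glyph, templates):
    """Return the label whose reference image correlates best with the glyph."""
    return max(templates, key=lambda label: correlation(glyph, templates[label]))

# Hypothetical 3x3 reference glyphs standing in for the per-letter averages.
TEMPLATES = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
    "T": [[1, 1, 1], [0, 1, 0], [0, 1, 0]],
}

# An "L" with one pixel missing, as if lost during thresholding.
noisy_L = [[1, 0, 0], [1, 0, 0], [1, 1, 0]]
print(classify(noisy_L, TEMPLATES))  # L
```

With a fixed font there is nothing to train: the only "model" is one reference image per letter.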
Also, given the successful character extraction, and the knowledge that (1) the font doesn't change, and (2) the characters are just translated and rotated, I think undoing those operations on the individual characters could've yielded a near-perfect success rate. Simply try a whole bunch of shifts and rotations on a given character until it matches a reference almost exactly.
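A rough sketch of that brute-force search, under my own assumptions: glyphs are small binary images (lists of rows), the candidate shifts and angles are arbitrary, and the score is just the count of mismatching pixels. `transform`, `mismatches`, and `best_match` are hypothetical helpers, not anything from the writeup.

```python
from math import cos, sin, radians

def transform(img, dx, dy, angle_deg):
    """Rotate img about its centre by angle_deg, then shift it by (dx, dy).

    Uses inverse mapping with nearest-neighbour sampling; pixels that would
    come from outside the source image are set to 0.
    """
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    c, s = cos(radians(angle_deg)), sin(radians(angle_deg))
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Undo the shift, then undo the rotation, to find the source pixel.
            u, v = x - dx - cx, y - dy - cy
            sx = round(c * u + s * v + cx)
            sy = round(-s * u + c * v + cy)
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = img[sy][sx]
    return out

def mismatches(a, b):
    """Number of pixels that differ between two equally sized images."""
    return sum(p != q for ra, rb in zip(a, b) for p, q in zip(ra, rb))

def best_match(glyph, templates, shifts=range(-2, 3), angles=(-10, -5, 0, 5, 10)):
    """Try every shift/angle on the glyph; return (label, mismatches, (dx, dy, angle))."""
    best = None
    for label, ref in templates.items():
        for angle in angles:
            for dx in shifts:
                for dy in shifts:
                    score = mismatches(transform(glyph, dx, dy, angle), ref)
                    if best is None or score < best[1]:
                        best = (label, score, (dx, dy, angle))
    return best

# Hypothetical 5x5 reference glyphs with a one-pixel margin.
TEMPLATES = {
    "T": [[0, 0, 0, 0, 0], [0, 1, 1, 1, 0], [0, 0, 1, 0, 0], [0, 0, 1, 0, 0], [0, 0, 0, 0, 0]],
    "L": [[0, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 1, 0, 0, 0], [0, 1, 1, 1, 0], [0, 0, 0, 0, 0]],
}

# The "L" reference moved one pixel to the right.
shifted_L = [[0, 0, 0, 0, 0], [0, 0, 1, 0, 0], [0, 0, 1, 0, 0], [0, 0, 1, 1, 1], [0, 0, 0, 0, 0]]

label, score, params = best_match(shifted_L, TEMPLATES)
print(label, score)
```

The search space is tiny (here 2 letters x 5 angles x 25 shifts), so exhaustive search is cheap; on real glyphs you would want finer angle steps and probably interpolation instead of nearest-neighbour sampling.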
Same thinking here: IIRC there are some pretty good open source OCR programs that would make short work of the individual characters, with no need to re-assemble and/or neural networkify. However, the boilerplate code he gives is generally applicable to less braindead captchas, which makes it a great resource for others.
FYI: Captchas are generally considered "broken" at between 1% and 10% rates of success with automated approaches, because attackers can run hundreds of thousands of requests, generally "for free" at the margin. There is no practical difference in the amount of abuse suffered by a site with a 90% captcha and a 9% captcha -- the first one just requires 10X as many HTTP requests to abuse.
This is one of the unfortunate "math favors the bad guy" consequences in a lot of anti-abuse filtering tasks. (Anti-spam research has similar problems, which is why the main innovation wasn't making filters better but radically increasing the cost of getting caught, via burning the reputation of the offending IP. IP addresses are a lot more expensive to acquire in quantity than packets.)
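The arithmetic behind the 90% vs. 9% comparison, as a quick sketch. I'm reading those numbers as the fraction of attempts an automated solver gets through, and treating attempts as independent, so the expected number of requests per success is 1/p. `expected_requests` is just an illustrative helper.

```python
def expected_requests(success_rate, wanted_successes=1):
    """Expected HTTP requests needed for a number of successful solves,
    assuming independent attempts that each succeed with success_rate."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return wanted_successes / success_rate

for rate in (0.90, 0.09, 0.01):
    print(f"{rate:.0%} solver: ~{expected_requests(rate, 1000):,.0f} requests per 1,000 successes")

# The 9% case costs about ten times as many requests as the 90% case.
ratio = expected_requests(0.09) / expected_requests(0.90)
```

Even at 1%, that is only about 100 requests per success, which is noise for anyone who can send hundreds of thousands of requests.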
I've created many similar programs to defeat captchas. I would classify this as a medium-severity bug: you would still need to brute-force the passwords over a terribly slow and intermittent connection.
Why wouldn't you just pay a captcha-breaking service to get a near-100% success rate? It's less noticeable for botting, and $10 will buy you around 10k captchas on antigate or deathbycaptcha. You don't really need to log in and out that much, so that'd probably be plenty.