OCRopus: high-quality open source OCR sponsored by Google and used by reCAPTCHA (code.google.com)
68 points by henning on April 18, 2010 | 18 comments



I've used http://code.google.com/p/tesseract-ocr/ by Google too. It appears OCRopus uses that as one of the plugins for OCR.


Is there a webservice that allows you to send in files (PDF with images, png, jpeg etc.) through an API and have the service send you a .txt, .doc or a pdf of the same file with the text embedded?


Docsplit comes to mind, but it's not in web-API form: http://documentcloud.github.com/docsplit/



Official Google version

http://googlecodesamples.com/docs/php/ocr.php

It's already integrated into Google Docs.


Posterous is pretty good. I believe your docs get pushed onto Scribd, and you can probably download them as PDF from there.


Is this the same one as featured here:

http://recaptcha.net/digitizing.html

If so, I'm not impressed.


What exactly do you expect? Some of those old documents they're trying to digitize are in such bad shape that you practically need an electron microscope to decipher them. Document recognition in its full generality is still an open problem. The examples shown on that page constitute highly adversarial challenges. For simpler examples of the kind that would prevail with recently printed material, much better results can be achieved.


> Some of those old documents they're trying to digitize are in such bad shape that you practically need an electron microscope to decipher them.

Red herring. I'm talking about the examples pictured.

Do you know _for a fact_ that this is the same software package?

I don't want to waste my time arguing about why it doesn't live up to my expectations, if it's not.


It's not obvious what you're asking but the answer's probably "no".

That page features the output of reCAPTCHA and compares it against an unnamed standard OCR.

The standard OCR does poorly, but it's on tricky documents selected to show the benefits of reCAPTCHA. The page doesn't say it's this OCR code, and even if it were, the page wouldn't really tell you anything about how good it is.

The other thing you might be saying is that you think the reCAPTCHA output isn't very impressive either. As well as the human element, reCAPTCHA claims to use several standard OCRs to process their documents and combine the output in some way. It's possible that the Google code is one of those that they use, but if so it's only part of the process.
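
To make "combine the output" concrete: I don't know how reCAPTCHA actually does it, but one simple way to merge several OCR engines is a word-level majority vote. The sketch below is only that, a sketch. The function name `combine_by_vote` is made up for illustration, and it assumes every engine returned the same number of whitespace-separated words (real systems would need to align the outputs first). Words without a strict majority come back as `None`, which is roughly where a human would have to step in.

```python
from collections import Counter


def combine_by_vote(outputs):
    """Word-level majority vote over several OCR outputs (hypothetical helper).

    Assumes all outputs have the same number of whitespace-separated words.
    Returns a list of words; a position with no strict majority is None.
    """
    if not outputs:
        raise ValueError("need at least one OCR output")

    token_lists = [text.split() for text in outputs]
    expected = len(token_lists[0])
    if any(len(tokens) != expected for tokens in token_lists):
        raise ValueError("outputs must have the same number of words")

    combined = []
    for candidates in zip(*token_lists):
        word, votes = Counter(candidates).most_common(1)[0]
        # Accept the word only if more than half of the engines agree on it.
        combined.append(word if votes * 2 > len(candidates) else None)
    return combined


if __name__ == "__main__":
    engines = [
        "the quick brown fox",
        "the qu1ck brown fox",
        "the quick brovvn fox",
    ]
    print(combine_by_vote(engines))
```

Each engine here gets one word wrong, but never the same word, so the vote recovers the full sentence. When all engines disagree on a word, nothing wins and the slot stays unresolved.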


What evidence do you have that this is used for reCAPTCHA?


A quick search revealed that this title ("...and used by reCAPTCHA") has been reused many times and is likely just a rumor at this point.

I would definitely be interested in seeing evidence of it though.


Is there something similar for handwritten text?


The link says: "The OCRopus engine is based on two research projects: a high-performance handwriting recognizer developed in the mid-90's and deployed by the US Census bureau, and novel high-performance layout analysis methods." So it appears so.


Uh, why is Google releasing this? Wouldn't this code give hackers a good head start to create an OCR system capable of trivially defeating CAPTCHA everywhere?

Or maybe they've realized any human computer test based on text recognition is flawed, and so what better way to force the web to upgrade than to make OCR trivial? I rather like this shotgun approach to AI.


I think the benefits of having high quality free OCR tools available to developers outweigh the CAPTCHA abuse risk. Information organization is a huge problem/area of opportunity, and being able to extract text/content/context out of scans/photos and the like is key.


1. OCRs don't work well with text distorted in typical-captcha ways. They fail with colours especially.

2. Captchas have limited output sets and special characteristics, which make using OCR for them both costly and ineffective in comparison to dedicated solutions. Specifically, you can generate as many perfect sample outputs from a captcha system as you want - and then analyse it in ways beyond the standard character recognition.
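
A toy illustration of point 2, under heavy assumptions: when the set of possible characters is small and you can obtain perfect samples of each, a dedicated matcher can be as simple as "pick the nearest known template". The 3x3 bitmaps and the names `TEMPLATES`, `hamming` and `classify` below are invented for this example and don't correspond to any real captcha or OCR package.

```python
# Hypothetical 3x3 glyph bitmaps (row-major, "1" = ink) standing in for the
# small, fully known character set of a captcha-like generator.
TEMPLATES = {
    "I": "010010010",
    "L": "100100111",
    "T": "111010010",
    "O": "111101111",
}


def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def classify(glyph):
    """Return the template character closest to the given 9-bit glyph."""
    return min(TEMPLATES, key=lambda ch: hamming(TEMPLATES[ch], glyph))


if __name__ == "__main__":
    noisy_i = "010010011"  # an "I" with one extra pixel in the corner
    print(classify(noisy_i))
```

A general-purpose OCR engine has to consider every glyph in every font. This matcher only has to tell four known shapes apart, which is the asymmetry the comment is pointing at.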


As far as I know, no. reCAPTCHA specifically focuses on challenges that are likely to be incorrectly processed by existing document recognition systems.



