OCRopus: high-quality open source OCR sponsored by Google and used by reCAPTCHA

chime · on April 18, 2010

I've used http://code.google.com/p/tesseract-ocr/ by Google too. It appears OCRopus uses that as one of the plugins for OCR.

dlib · on April 18, 2010

Is there a webservice that allows you to send in files (PDF with images, png, jpeg etc.) through an API and have the service send you a .txt, .doc or a pdf of the same file with the text embedded?

perplexes · on April 18, 2010

Docsplit comes to mind, but it's not in web-API form: http://documentcloud.github.com/docsplit/

_pius · on April 18, 2010

http://www.programmableweb.com/api/wisetrend-ocr

est · on April 19, 2010

Official Google version

http://googlecodesamples.com/docs/php/ocr.php

Now integrated into G Docs already.

samratjp · on April 18, 2010

Posterous is pretty good. I believe your docs get pushed onto Scribd. Then again, you can probably download them as PDF from there.

nearestneighbor · on April 18, 2010

Is this the same one as featured here:

http://recaptcha.net/digitizing.html

If so, I'm not impressed.

henning · on April 18, 2010

What exactly do you expect? Some of those old documents they're trying to digitize are in such bad shape that you practically need an electron microscope to decipher them. Document recognition in its full generality is still an open problem. The examples shown on that page constitute highly adversarial challenges. For simpler examples of the kind that would prevail with recently printed material, much better results can be achieved.

nearestneighbor · on April 18, 2010

> Some of those old documents they're trying to digitize are in such bad shape that you practically need an electron microscope to decipher them.

Red herring. I'm talking about the examples pictured.

Do you know _for a fact_ that this is the same software package?

I don't want to waste my time arguing about why it doesn't live up to my expectations, if it's not.

ZeroGravitas · on April 19, 2010

It's not obvious what you're asking but the answer's probably "no".

That page features the output of reCAPTCHA and compares it against an unnamed standard OCR.

The standard OCR does poorly, but it's on tricky documents selected to show the benefits of reCAPTCHA. The page doesn't say it's this OCR code, and even if it were, the page wouldn't really tell you anything about how good it is.

The other thing you might be saying is that you think the reCAPTCHA output isn't very impressive either. As well as the human element, reCAPTCHA claims to use several standard OCRs to process their documents and combine the output in some way. It's possible that the Google code is one of those that they use, but if so it's only part of the process.
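To make "combine the output in some way" concrete, here is a minimal sketch of one possible scheme: a per-word majority vote across engines, with low-agreement words set aside for humans. This is an assumption for illustration only, not reCAPTCHA's documented method; the function names, the sample readings, and the 0.5 threshold are all hypothetical, and the readings are assumed to be already aligned word by word.

```python
from collections import Counter

def combine_ocr_outputs(readings):
    """Pick, for each word position, the reading most engines agree on.

    `readings` holds one word list per (hypothetical) OCR engine, assumed
    to be aligned to the same word positions already.
    Returns (word, agreement) pairs, where agreement is the fraction of
    engines that produced the winning word.
    """
    combined = []
    for candidates in zip(*readings):
        word, votes = Counter(candidates).most_common(1)[0]
        combined.append((word, votes / len(candidates)))
    return combined

def words_needing_humans(combined, threshold=0.5):
    """Words whose agreement is at or below `threshold` (an assumed cutoff)."""
    return [word for word, agreement in combined if agreement <= threshold]

# Made-up readings from three imaginary engines.
engine_a = ["the", "quick", "brovvn", "fox"]
engine_b = ["the", "quick", "brown", "fox"]
engine_c = ["tbe", "quick", "br0wn", "fox"]

combined = combine_ocr_outputs([engine_a, engine_b, engine_c])
uncertain = words_needing_humans(combined)
```

Here the three engines disagree completely on the third word, so it is the only one flagged; words like that are the natural candidates for showing to people.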

cowsandmilk · on April 18, 2010

What evidence do you have that this is used for reCAPTCHA?

gometro33 · on April 18, 2010

A quick search revealed that this title ("...and used by reCAPTCHA") has been reused many times and is likely just a rumor at this point.

I would definitely be interested in seeing evidence of it though.

aschobel · on April 18, 2010

Is there something similar for handwritten text?

ZeroGravitas · on April 19, 2010

The link says: "The OCRopus engine is based on two research projects: a high-performance handwriting recognizer developed in the mid-90's and deployed by the US Census bureau, and novel high-performance layout analysis methods.", so it appears so.

korch · on April 18, 2010

Uh, why is Google releasing this? Wouldn't this code give hackers a good head start to create an OCR system capable of trivially defeating CAPTCHA everywhere?

Or maybe they've realized any human computer test based on text recognition is flawed, and so what better way to force the web to upgrade than to make OCR trivial? I rather like this shotgun approach to AI.

modoc · on April 18, 2010

I think the benefits of having high quality free OCR tools available to developers outweigh the CAPTCHA abuse risk. Information organization is a huge problem/area of opportunity, and being able to extract text/content/context out of scans/photos and the like is key.

viraptor · on April 19, 2010

1. OCRs don't work well with text distorted in typical-captcha ways. They fail with colours especially.

2. Captchas have limited output sets and special characteristics, which make using OCR for them both costly and ineffective in comparison to dedicated solutions. Specifically, you can generate as many perfect sample outputs from a captcha system as you want - and then analyse them in ways beyond the standard character recognition.
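A toy sketch of point 2, under heavy assumptions: the "captcha" below is a made-up generator with a three-letter alphabet of 3x3 bitmaps and random pixel flips, nothing like a real system. It only illustrates why a small output set plus unlimited labelled samples favours a purpose-built model over general OCR. All names, glyphs, and parameters are hypothetical.

```python
import random

# Hypothetical 3x3 glyphs for a toy generator with a three-letter alphabet.
GLYPHS = {
    "I": (0, 1, 0, 0, 1, 0, 0, 1, 0),
    "L": (1, 0, 0, 1, 0, 0, 1, 1, 1),
    "O": (1, 1, 1, 1, 0, 1, 1, 1, 1),
}

def generate_sample(rng, flip_prob=0.1):
    """Return (label, pixels): a known answer plus a noisy rendering of it."""
    label = rng.choice(sorted(GLYPHS))
    pixels = tuple(p ^ int(rng.random() < flip_prob) for p in GLYPHS[label])
    return label, pixels

def learn_profiles(samples):
    """Average the pixels seen for each label across the labelled samples."""
    sums, counts = {}, {}
    for label, pixels in samples:
        counts[label] = counts.get(label, 0) + 1
        acc = sums.setdefault(label, [0] * len(pixels))
        for i, p in enumerate(pixels):
            acc[i] += p
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def classify(profiles, pixels):
    """Choose the label whose average profile is closest to `pixels`."""
    def distance(profile):
        return sum((a - b) ** 2 for a, b in zip(profile, pixels))
    return min(profiles, key=lambda label: distance(profiles[label]))

def accuracy(profiles, samples):
    """Fraction of labelled samples that `classify` gets right."""
    hits = sum(classify(profiles, pixels) == label for label, pixels in samples)
    return hits / len(samples)

rng = random.Random(0)  # fixed seed so the run is repeatable
profiles = learn_profiles([generate_sample(rng) for _ in range(500)])
score = accuracy(profiles, [generate_sample(rng) for _ in range(200)])
```

The model never does general character recognition: it only has to tell apart the few symbols the generator can emit, using statistics learned from samples whose answers were known in advance.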

henning · on April 18, 2010

As far as I know, no. reCAPTCHA specifically focuses on challenges that are likely to be incorrectly processed by existing document recognition systems.