I also assumed that it was some kind of Python wrapper or implementation of Tesseract OCR when I saw that name.
One would think so, given that Tesseract is (one of?) the best-performing OCR programs out there.
Thanks for pointing this out. I've been working on a text extractor in Go at work and tried for a long time to get UnRTF working with RTF files containing Japanese characters, to no avail. This lib lists catdoc as the extractor it uses for RTF, so I'm going to give that a try.
Edit: Looks like catdoc doesn't work with RTF files containing Japanese characters either. Might end up having to use libreoffice or something like that.
This looks nice. What I'd really like to see, along these lines, is a Python library for automated document metadata extraction with confidence assessment.
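Roughly, I'm imagining an API along these lines. To be clear, this is a hypothetical sketch, not an existing library: `guess_metadata` and its confidence heuristics are made up here purely to illustrate the shape of the interface.

```python
import re

def guess_metadata(text):
    """Guess document metadata, returning {field: (value, confidence)}.

    Purely illustrative heuristics: the first non-empty line is
    probably the title, and a nearby "by Someone" line is probably
    the author. A real library would combine many such signals.
    """
    metadata = {}
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if lines:
        # Short first lines look more title-like than long ones.
        title = lines[0]
        confidence = 0.9 if len(title) < 80 else 0.4
        metadata["title"] = (title, confidence)
    for ln in lines[1:5]:
        match = re.match(r"(?i)by\s+(.+)", ln)
        if match:
            metadata["author"] = (match.group(1), 0.7)
            break
    return metadata

doc = """Textract: extract text from any document

by Dean Malmgren

Lorem ipsum...
"""
print(guess_metadata(doc))
```

The point is the return shape: every extracted field comes paired with a confidence score, so downstream code can decide whether to trust it or fall back to manual review.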
I thought about the metadata thing but decided to exclude it for the earliest versions of textract to keep things simple. If you'd like to see it in there and have a good example of how you'd like to use metadata, please feel free to throw an issue on the issue tracker https://github.com/deanmalmgren/textract/issues/
As far as I have been able to tell, the public state of the art in academic paper metadata parsing is Grobid: https://github.com/kermitt2/grobid
Not quite as simple a commandline interface as you suggest, but not too hard to set up, and pretty impressive. Now if only Google Scholar would open-source whatever they use...
I realise that it's nice that it'll give you a single function to dump whatever file format into (while actually running it through a shell command in the backend), but it's not that hard to:
import pyPdf

out = ""
pdf = pyPdf.PdfFileReader(stream)  # stream: any file-like object opened in binary mode
try:
    if pdf.getIsEncrypted():
        pdf.decrypt('')  # try the empty owner password
    for num in range(pdf.getNumPages()):
        out += pdf.getPage(num).extractText()
except NotImplementedError:
    # Yeah, this ain't happening (unsupported encryption scheme)
    pass
When I first read the headline, I thought there was a new Python API or SDK for the already existing Textract OCR solution from Structurise. We've used Structurise's product, Textract, for years at work, so it was definitely around first. I'm not sure whether the creators of this new project were aware of the prior product, but reusing the same name for a product that solves a similar problem seems like it would be an issue... or at the very least confusing.
"It's very similar to Apache Tika (which I didn't know about until yesterday), but I think it is different in at least two important ways.
"1. The intention of textract is to provide many possible ways to extract text from any document, provided words appear in the correct order in the text output. By being method agnostic, it's possible to use different parsing techniques in different situations. Here's more on that philosophy http://textract.readthedocs.or... and, to be fair, I'm not sure that Tika's philosophy differs in any meaningful way on this.
"2. Another subtle difference is that textract is written in python, which is a language that is used by nearly all data people that I know. Since the intent is to be a preprocessing framework for natural language processing, I wanted it to be as maintainable by the community as possible."
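To make the "method agnostic" idea in point 1 concrete, here is a minimal sketch of extension-based dispatch to pluggable parsers. This is an illustration of the design idea, not textract's actual internals; the `PARSERS` registry and parser functions are invented for the example.

```python
import os

def parse_txt(path):
    with open(path) as handle:
        return handle.read()

def parse_csv(path):
    # Naive fallback: a CSV is still readable as plain text.
    return parse_txt(path)

# Illustrative registry: each extension maps to a parser function, so
# new formats or alternative parsing techniques plug in independently.
PARSERS = {".txt": parse_txt, ".csv": parse_csv}

def process(path):
    ext = os.path.splitext(path)[1].lower()
    try:
        parser = PARSERS[ext]
    except KeyError:
        raise ValueError("no parser registered for %r" % ext)
    return parser(path)
```

Under this design, supporting a new format (or swapping in a better parser for an existing one) is just a new entry in the registry, which is what makes the framework easy for a community to extend.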
> Ok, ok, ok. You can’t extract text from any document at the moment, but textract integrates support for many common formats and we designed it to be as easy as possible to add other document formats.
There go my hopes of seeing a painless OCR library for Python…