Textract, a Python package for extracting text from any document (datascopeanalytics.com)
176 points by ColinWright on Aug 3, 2014 | 27 comments



The node module by the same name (https://github.com/dbashford/textract) also supports image OCR (via tesseract), excel files, RTF and other formats.


I also assumed that it was some kind of Python wrapper or implementation of Tesseract OCR when I saw that name. One would think so, given that Tesseract is (one of?) the best-performing OCR programs out there.


Thanks for pointing this out. I've been working on a text extractor in Go at work and tried for a long time to get UnRTF working with RTF files containing Japanese characters to no avail. This lib lists catdoc as the extractor they use for RTF, so I'm going to give that a try.

Edit: Looks like catdoc doesn't work with RTF files containing Japanese characters either. Might end up having to use libreoffice or something like that.


For what it's worth, textract (Python) also has ambitions of including OCR through the tesseract-ocr project: https://github.com/deanmalmgren/textract/issues/16


This looks nice. What I'd really like to see, along these lines, is a python library for automated document metadata extraction with confidence assessment, like this:

./autometa.py --author --verbose academic-paper.pdf

Author: "Edward Witten" Confidence: High (matches template "amslatex")


I thought about the metadata thing but decided to exclude it for the earliest versions of textract to keep things simple. If you'd like to see it in there and have a good example of how you'd like to use metadata, please feel free to throw an issue on the issue tracker https://github.com/deanmalmgren/textract/issues/


As far as I have been able to tell, the public state of the art in academic paper metadata parsing is Grobid: https://github.com/kermitt2/grobid

Not quite as simple a commandline interface as you suggest, but not too hard to set up, and pretty impressive. Now if only Google Scholar would open-source whatever they use...


For video files, guessit does something similar using only the file name:

http://guessit.readthedocs.org/


I realise that it's nice that it'll give you a single function to dump whatever file format into (while actually running it through a shell command in the backend), but it's not that hard to:

  import pyPdf

  out = ""
  # stream: a PDF file object opened in binary mode
  pdf = pyPdf.PdfFileReader(stream)
  try:
      if pdf.getIsEncrypted():
          pdf.decrypt('')  # try the empty password
      for page in pdf.pages:
          out += page.extractText()
  except NotImplementedError:
      # Yeah, this ain't happening
      pass


When I first read the headline, I thought there was a new Python API or SDK for the existing Textract OCR solution from Structurise. We've used Structurise's Textract product for years at work, so it was definitely around first. I'm not sure whether the creators of this new project were aware of the earlier product, but using the same name for something that solves a similar problem seems like it would be an issue... or at the very least confusing.

Here's a link to StructuRise's Textract product page: http://www.structurise.com/textract/


I have a little shell script which tries to do basically this:

https://gist.github.com/djudd/1402751e2928cb8ac788

It tries either abiword or OpenOffice/LibreOffice for filetypes other than pdf, ps, and txt, which works pretty decently for doc, docx, ppt, etc.

One file type here that textract folks might want to add is Postscript.
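
The "try one converter, fall back to another" idea can be sketched in Python as well. This is only an illustration of the pattern, not a port of the gist: `CONVERTERS` and `pick_converter` are made-up names, the command-line flags are assumptions that should be checked against the installed versions, and `shutil.which` needs Python 3.3 or later.

```python
import shutil

# Candidate converters in order of preference (assumed flags, verify locally).
CONVERTERS = [
    ("abiword", ["abiword", "--to=txt"]),
    ("soffice", ["soffice", "--headless", "--convert-to", "txt:Text"]),
]


def pick_converter(which=shutil.which):
    """Return the argv prefix of the first converter found on PATH, or None."""
    for name, argv in CONVERTERS:
        if which(name):
            return argv
    return None
```

The caller would append the input path to the returned list and hand it to `subprocess`; a `None` result means neither tool is installed.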


Thanks for the suggestion. I wasn't familiar with ps2ascii and I just created an issue here https://github.com/deanmalmgren/textract/issues/25


Apache Tika has existed for years and seems to have the same goal: http://tika.apache.org/

I'm wondering why the authors wrote something from scratch?

Edit: this is answered by one of the authors in the second Disqus comment on the linked page.


Here's the comment:

"It's very similar to Apache Tika (which I didn't know about until yesterday), but I think it is different in at least two important ways.

"1. The intention of textract is to provide many possible ways to extract text from any document, provided words appear in the correct order in the text output. By being method agnostic, it's possible to use different parsing techniques in different situations. Here's more on that philosophy http://textract.readthedocs.or... and, to be fair, I'm not sure that Tika's philosophy differs in any meaningful way on this.

"2. Another subtle difference is that textract is written in python, which is a language that is used by nearly all data people that I know. Since the intent is to be a preprocessing framework for natural language processing, I wanted it to be as maintainable by the community as possible."


> wrote something from scratch ?

Looking at the source, they didn't lie about “no muss, no fuss”. It just runs antiword on .doc files, cat on .txt files, etc.
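
A dispatcher of that kind fits in a few lines of Python. This is a sketch of the pattern only, not textract's actual source: the `COMMANDS` table and the `build_command` and `extract_text` names are made up for illustration, and it assumes `antiword` and `cat` are on the PATH.

```python
import os
import subprocess

# Hypothetical mapping from file extension to the external tool that handles it.
COMMANDS = {
    ".doc": ["antiword"],
    ".txt": ["cat"],
}


def build_command(path):
    """Return the argv list for the tool that handles this file's extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in COMMANDS:
        raise ValueError("unsupported extension: %r" % ext)
    return COMMANDS[ext] + [path]


def extract_text(path):
    """Run the external tool and return whatever it wrote to stdout (bytes)."""
    return subprocess.check_output(build_command(path))
```

Supporting another format then means adding one entry to the table, which is roughly the "easy to add other formats" claim from the article.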


Which Python versions are supported? PyPI doesn't list them.

On the same note, your pypi page is borked: https://pypi.python.org/pypi/textract

(look at Build status & co, there's a formatting error)


Currently 2.7 but there's no reason python 3 can't be supported too. Thanks for the heads up on the borking of the pypi page. Noted.


> Ok, ok, ok. You can’t extract text from any document at the moment, but textract integrates support for many common formats and we designed it to be as easy as possible to add other document formats.

There go my hopes of seeing a painless OCR library for Python…


Hopefully it will be? There's a great suggestion to use tesseract-ocr to make this happen. https://github.com/deanmalmgren/textract/issues/16

If you have any other (better?) ways of doing this, feel free to add some comments on the issue tracker.


The list of common formats is still pretty robust.

http://textract.readthedocs.org/en/latest/#currently-support...


Great tool! BTW, how does this compare to Apache Tika for text extraction from HTML pages?


I'm using this for my git repos now. (I version-control my Word docs and PDFs.) I even wrote a post about it: http://www.aphex.cx/2014/08/using-git-for-pdf-and-word-doc-f...


Nice. Does it do any encoding conversion, e.g. latin1 to utf-8? Does Tika do that?
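
For reference, the conversion being asked about is short in plain Python, whether or not textract or Tika does it for you. A minimal sketch, assuming the input bytes really are Latin-1; the `latin1_to_utf8` name is made up here.

```python
def latin1_to_utf8(raw):
    """Re-encode Latin-1 bytes as UTF-8 bytes."""
    return raw.decode("latin-1").encode("utf-8")
```

The hard part in practice is knowing the source encoding in the first place, which this sketch does not attempt to detect.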


I've always thought the datascope team was awesome. textract makes them even awesomer.


This is awesome





