Hacker News
Demystifying Text Data with the Unstructured Python Library (saeedesmaili.com)
141 points by saeedesmaili on July 6, 2023 | 22 comments



If you don't mind running Java, Apache Tika is extremely good at parsing content (https://tika.apache.org/)


Tika can be used as a library in Python: https://pypi.org/project/tika/
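For reference, the tika package on PyPI wraps a local Tika server (it needs Java on the PATH and downloads the server jar on first use). A minimal sketch; "report.docx" is a hypothetical file name:

```python
def extract_text(path):
    """Extract plain text from any format Tika understands (docx, pdf, xlsx, jpeg, ...)."""
    from tika import parser  # deferred import so tika stays an optional dependency

    parsed = parser.from_file(path)  # returns a dict with 'metadata' and 'content' keys
    return parsed.get("content") or ""

# usage (hypothetical file):
# text = extract_text("report.docx")
```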


As is customary for all of Apache, I have no clue what I'm looking at after trying to read through the links on that page for ten minutes. Like who is this tool for? When should I use this vs any competing tools? No clue. I suppose it can read documents of any type and return the content as a dictionary? Why would I use this vs pandas?


I can't speak to the Apache documentation, but I once had the task of extracting plain text from many different document formats: Word, spreadsheets, PDFs, the EXIF information in JPEGs, and so on for a long list. I had written code with calls to extractor libraries for several of these formats, when I came across Tika. Out went my if..then..elif..elif..elif.. code, to be replaced with a single (Python) call to Tika.

I can't answer your question about pandas, though.


I second this: there is absolutely no easily discoverable entry point to the documentation. In the end, if you want to get a feel for what this is, you search for "tika tutorial" and get a rough idea via (in my case) some Medium article.


There's a book called "Tika in action" which I found useful.


I second this suggestion. I tested numerous Python tools to extract text - nothing matches Tika for general extraction of just about any data format.

However, if you can expect a certain format beforehand, a format-specific Python library is better, since you can extract higher-quality data (tables, lists) with the appropriate tool.


Do you have any suggestions for Python libraries (other than what's mentioned in the post)?


I've had good luck with python-docx for reading Word documents (typically specifications). Tables are supported, but it's not obvious where a table sits in the document, and I had to come up with a hacky way to read image captions.

PDF has been hit or miss, but pypdf has improved in the last couple of years. Depending on the document you'll sometimes get random spaces or nospacesatall.


I tried python-docx with a bunch of docx files (downloaded from Google Docs). It returns empty strings for hyperlinks and I couldn't manage to fix this. So if there is a sentence like "This is an important link to another doc or url." and the "link" is a hyperlink, python-docx returns "This is an important to another doc or url."
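If patching python-docx isn't an option, one workaround is that a .docx file is just a zip of XML: reading word/document.xml directly and collecting every w:t node in paragraph order picks up runs nested inside w:hyperlink elements too. A stdlib-only sketch; the XML below is a stripped-down stand-in for a real document:

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def paragraph_texts(docx_bytes):
    """Return the full text of each paragraph, including runs inside hyperlinks."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as z:
        root = ET.fromstring(z.read("word/document.xml"))
    # p.iter() walks the subtree, so w:t nodes nested in w:hyperlink are included
    return ["".join(t.text or "" for t in p.iter(f"{W}t"))
            for p in root.iter(f"{W}p")]

# build a minimal in-memory .docx for demonstration
doc_xml = (
    '<w:document xmlns:w='
    '"http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body><w:p>"
    "<w:r><w:t>This is an important </w:t></w:r>"
    "<w:hyperlink><w:r><w:t>link</w:t></w:r></w:hyperlink>"
    "<w:r><w:t> to another doc.</w:t></w:r>"
    "</w:p></w:body></w:document>"
)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("word/document.xml", doc_xml)

print(paragraph_texts(buf.getvalue()))  # → ['This is an important link to another doc.']
```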


Heh, I got a bit into hacking on python-docx last year (the original author seems to be focusing on other things than python-docx now) - I have a fork/branch where I tried to more properly implement external hyperlink functionality (https://github.com/icegreentea/python-docx/pull/7)

I realize now, staring at this, that I might have broken the API a little. You can't do "text = paragraph.text" anymore, but you can do "text = ''.join([run.text for run in paragraph.runs])" instead.

If you're curious at all why it breaks, it's because in the OOXML spec paragraphs are made up of an ordered list of runs or hyperlinks (and hyperlinks can then contain additional runs). The master branch just implements paragraphs as an ordered list of runs (and ignores all hyperlinks).


This sounds amazing! Thanks for sharing it; I will try it to see if I can use it in place of the main python-docx. For my use case it suffices to have the full text of each paragraph (even if it includes a hyperlink) and each heading, but also to be able to get each of them separately when needed.


Actually, I just realized that I had provided a 'one-off' hack to a similarish situation here: https://github.com/python-openxml/python-docx/issues/1123#is...

Replace the `qn("w:ins")` in the example with `qn("w:hyperlink")` and that should hopefully work?


Hey, that's fantastic. I'll definitely check that out.


I found myself today trying to parse a TSV and substituting a few fields with a different value, then writing the new file out.

Something Perl would excel at, although I used Python, because Perl isn't as maintainable as Python.
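For what it's worth, the stdlib csv module keeps this short; a sketch with hypothetical column indices and substitution values:

```python
import csv
import io

def substitute_fields(tsv_text, column, mapping):
    """Rewrite one column of a TSV and return the new file contents."""
    out = io.StringIO()
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in reader:
        if row and row[column] in mapping:
            row[column] = mapping[row[column]]
        writer.writerow(row)
    return out.getvalue()

tsv = "id\tstatus\n1\tOPEN\n2\tCLOSED\n"
print(substitute_fields(tsv, 1, {"OPEN": "active", "CLOSED": "done"}))
# id	status
# 1	active
# 2	done
```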

I was intrigued by this comment. A JVM solution would also be viable in my tech stack. Would Tika be easier than line-by-line processing with compiled regexes in Python? I tried looking at the usage examples but it wasn't clear.


I have been using it extensively during the last few weeks. I'm very thankful for such a clean and practical API, and I think it will become the central solution for ingesting heterogeneous text in the Python ecosystem.

However, I'm afraid it is not there yet. Other libraries like PDFMiner give higher-quality output, and specialized libraries like Camelot are still needed to extract tables as reasonably well-formatted text. It also needs a lot of extra tooling for web scraping. Sure, it can read plain HTML from a URL, but it cannot run JavaScript or control things like the User-Agent. It could be argued that such features are not within scope, but it is rather bothersome for a library that presents a magic `partition` function for most standard text sources.

I'm sure it will get there soon though. It shouldn't be hard to integrate with state-of-the-art parsers and tooling, and the simple API undoubtedly brings a lot of peace of mind.


It would help this article’s quality if the author had included an output example from each code snippet. As a reader I’m left to imagine what the output looks like.


Author here. That's a good point. I'll add output examples.


I hadn't seen the unstructured python library before. Seems handy to parse personal text, like the author is doing.


API looks very clean :) I’ve also been avoiding LangChain since it just seems too big for my tastes. Will give this a shot


Another way to parse Markdown, HTML, or docx files would be pandoc [1]:

  pandoc --to json file.docx
or in Python:

  import json
  from sh import pandoc
  doc = json.loads(pandoc("file.docx", to="json").stdout)
Example output (reformatted slightly to reduce the number of lines):

  {'pandoc-api-version': [1, 22, 2],
   'meta': {'title': {'t': 'MetaInlines',
     'c': [{'t': 'Str', 'c': 'The'}, {'t': 'Space'}, {'t': 'Str', 'c': 'Title'}]}},
   'blocks': [{'t': 'Header',
     'c': [1,
      ['first-chapter', [], []],
      [{'t': 'Str', 'c': 'First'}, {'t': 'Space'}, {'t': 'Str', 'c': 'Chapter'}]]},
    {'t': 'Para',
     'c': [{'t': 'Str', 'c': 'I'}, {'t': 'Space'}, {'t': 'Str', 'c': 'like'}, {'t': 'Space'},
      {'t': 'Emph', 'c': [{'t': 'Str', 'c': 'cursive'}]}, {'t': 'Space'}, {'t': 'Str', 'c': 'or'}, 
      {'t': 'Space'}, {'t': 'Strong', 'c': [{'t': 'Str', 'c': 'bold'}]}, {'t': 'Space'},
      {'t': 'Str', 'c': 'text.'}]},
    {'t': 'Para',
     'c': [{'t': 'Str', 'c': 'Here'}, {'t': 'Space'}, {'t': 'Str', 'c': 'is'}, {'t': 'Space'},
      {'t': 'Str', 'c': 'a'}, {'t': 'Space'}, {'t': 'Link',
       'c': [['', [], []], [{'t': 'Str', 'c': 'link'}], ['https://ix.de/', '']]},
      {'t': 'Str', 'c': '.'}]},
    {'t': 'BulletList',
     'c': [[{'t': 'Para', 'c': [{'t': 'Str', 'c': 'Item'}, {'t': 'Space'}, {'t': 'Str', 'c': '1'}]}],
      [{'t': 'Para', 'c': [{'t': 'Str', 'c': 'Item'}, {'t': 'Space'}, {'t': 'Str', 'c': '2'}]}]]}]}
[1] https://pandoc.org/
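Once you have the AST, flattening it back to plain text is a short recursive walk. This sketch handles only the node types that appear in the output above (Str, Space, Emph, Strong, Link, Para, Header); a real pandoc document has many more constructors:

```python
def ast_to_text(node):
    """Recursively flatten a fragment of pandoc's JSON AST into plain text."""
    if isinstance(node, list):
        return "".join(ast_to_text(n) for n in node)
    if isinstance(node, dict):
        t = node.get("t")
        if t == "Str":
            return node["c"]
        if t == "Space":
            return " "
        if t in ("Emph", "Strong"):
            return ast_to_text(node["c"])
        if t == "Link":
            return ast_to_text(node["c"][1])  # the link's visible inline text
        if t in ("Para", "Plain"):
            return ast_to_text(node["c"]) + "\n"
        if t == "Header":
            return ast_to_text(node["c"][2]) + "\n"  # [level, attrs, inlines]
        return ""  # unhandled node types contribute nothing
    return ""

blocks = [{"t": "Para",
           "c": [{"t": "Str", "c": "Here"}, {"t": "Space"}, {"t": "Str", "c": "is"},
                 {"t": "Space"}, {"t": "Str", "c": "a"}, {"t": "Space"},
                 {"t": "Link", "c": [["", [], []], [{"t": "Str", "c": "link"}],
                                     ["https://ix.de/", ""]]},
                 {"t": "Str", "c": "."}]}]
print(ast_to_text(blocks))  # → "Here is a link.\n"
```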


Awesome!



