Introducing Tabula, a human-friendly PDF-to-CSV data extractor (mozillaopennews.org)
137 points by mtigas on April 3, 2013 | 18 comments



Love it: a wonderful gift to millions of students, analysts, journalists, researchers, and others who for many years have had to extract data from PDFs via throwaway scripts, copy-and-paste, or (yikes) read-and-retype.


If they automate table detection, then many low-end "analysts" will be made redundant. PDFs are one of the worst obstacles to data-feed automation.


Yeah, I did this for a living for a little while--I was an analyst whose job was mostly to read industry quarterly reports in PDF form and condense them into much smaller reports to give to upper management.

Of course, the data itself is usually also for sale. But a manager would rather make an analyst scrape it from a PDF report than pay the reporting company extra for a data subscription, even though that means eating the opportunity cost of the analyst not working on something more important and productive.

As an analyst, I can't count how many times I asked my former employer to shell out a couple hundred dollars a month for market intelligence data subscriptions and was blown off because they didn't want to allocate a budget for it.


Just imagine how many "analysts" work for Reuters et al.


A staggering number of people in any large organization are basically working as a sort of "information filter" to simply condense information and report it up the organizational food chain. A sufficiently clever combination of OCR, NLP, and ML could automate a lot of those jobs. In other words, the executive set needs a Summly for industry intelligence. (Startup idea that I'm sure someone with VC connections has thought of already)

The trouble with PDFs is they're designed to be consumed by human eyes only. Any attempt to automatically extract information from them is fundamentally a hacky scrape-job.
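To make that concrete: a PDF hands an extractor little more than a bag of positioned glyphs. A quick sketch with pdfminer.six (just for illustration; "report.pdf" is a placeholder) shows what a "table" looks like from the inside:

    # pip install pdfminer.six
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer

    for page in extract_pages("report.pdf"):
        for element in page:
            if isinstance(element, LTTextContainer):
                for line in element:
                    for obj in line:
                        # Each "cell" is just characters with x/y coordinates;
                        # rows and columns must be reverse-engineered.
                        if isinstance(obj, LTChar):
                            print(repr(obj.get_text()), obj.bbox)

Everything past that point -- deciding which coordinate clusters were meant to be rows and columns -- is guesswork.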


In fact, we’re working on an auto-detection feature at this very moment! :D


Great work! The integration (as shown in the demo) and the UX are really well done. A couple of questions:

1) Why use Python for OpenCV when Ruby has a decent wrapper that can do Hough (https://github.com/ruby-opencv/ruby-opencv)? Or was the Ruby version just too buggy still?

2) Is there a command-line version planned? I guess it'd be most relevant once auto-detection is figured out.


1) We’re not actually using Python for OpenCV, just ruby-opencv and possibly some bindings in Java/JRuby. (I think Python’s in the build instructions due to a numpy dependency in OpenCV. Though that might be specific to using Homebrew on OS X. Definitely looking into it soon.) There's a rough sketch of the Hough step below, for the curious.

2) No plans at the moment, though that's an awesome idea.
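As promised above, here's roughly what the Hough step looks like in the Python OpenCV bindings -- not our actual code (which is Ruby/Java), just the general shape of finding a table's ruling lines:

    # pip install opencv-python numpy  ("page.png" is a placeholder)
    import cv2
    import numpy as np

    gray = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)
    edges = cv2.Canny(gray, 50, 150)

    # Probabilistic Hough transform: returns segments as (x1, y1, x2, y2)
    lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=80,
                            minLineLength=100, maxLineGap=5)

    # Keep only near-horizontal/near-vertical segments: candidate table
    # rulings, whose intersections suggest cell boundaries.
    if lines is not None:
        for x1, y1, x2, y2 in lines[:, 0]:
            if abs(y2 - y1) < 2 or abs(x2 - x1) < 2:
                print((x1, y1), "->", (x2, y2))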


Wow, nice work! I'm the author of Trapeze, a once-shareware (now freeware and open source) PDF-to-Word/RTF/HTML/PlainText application for OS X. My approach was similar: squashing characters into words via a logical grid to determine whitespace. My #1 request from customers was to extract tables, and I never had the guts to attempt it. :-)

(For those interested, you can grab Trapeze from mesadynamics.com -- requires OS X 10.4; source code is a mixture of C++ and Objective-C).
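(The character-squashing step, in toy form -- a Python sketch rather than Trapeze's actual C++/Objective-C; "chars" is a list of (x0, x1, text) tuples for one line, already sorted left to right:)

    # A gap wider than ~half the average glyph width counts as whitespace.
    def squash_into_words(chars, gap_factor=0.5):
        avg_width = sum(x1 - x0 for x0, x1, _ in chars) / len(chars)
        words, current = [], [chars[0]]
        for prev, cur in zip(chars, chars[1:]):
            if cur[0] - prev[1] > avg_width * gap_factor:
                words.append("".join(c[2] for c in current))
                current = []
            current.append(cur)
        words.append("".join(c[2] for c in current))
        return words

    print(squash_into_words([(0, 5, "H"), (5, 9, "i"), (15, 20, "y"),
                             (20, 24, "o"), (24, 28, "u")]))
    # -> ['Hi', 'you']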


I probably could have used this recently, when a project required a close encounter with PDF data extraction. Fortunately the PDFs were generated as reports by a VB6 application (!), so they had a fairly regular format once I figured out the quirks of PDF, as the authors describe here.

I did learn a few neat tricks by doing it myself though. The library I used to extract the text was none other than Mozilla's own PDF.js, so in the final version my users could just drag and drop the PDF onto the browser window, and my little algorithm parsed the tables into arrays, with AngularJS rendering them as HTML tables.

Obviously computer-vision assisted, general purpose reconstruction of tabular data is the secret sauce in this project, but if you have the right use case you can do some cool things in the client. You do have to dig into the PDF.js internals a bit to figure out how to use it, but I'm sure it will improve in that respect.
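The row-grouping step itself is short. Here's a rough Python analogue of what my JS did (all names illustrative; "items" is the list of (x, y, text) positions pulled out of PDF.js):

    # Bucket text items whose baselines sit within "tol" points of each
    # other into the same row; crude, but fine for regular reports.
    def group_rows(items, tol=2.0):
        rows = {}
        for x, y, text in items:
            key = round(y / tol)
            rows.setdefault(key, []).append((x, text))
        # Rows top-to-bottom (PDF y grows upward), cells left-to-right.
        return [[t for _, t in sorted(cells)]
                for _, cells in sorted(rows.items(), reverse=True)]

A real version has to merge adjacent buckets when a row straddles a boundary, but that's the whole trick.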


I wish I'd read this an hour ago, before I wrote a series of terrible awk, perl, and bash scripts to process several thousand inconsistently formatted pdfs.

edit: Never mind, it wouldn't have helped. I missed the part where automation isn't yet supported. Either way, this looks like a great tool.


This is fantastic; it would have saved me dozens of hours as an econ undergraduate.

Semi-related: I used to have a ton of scanned journal articles that I wanted to read on a Kindle without having to scroll across every page, and came across k2pdfopt. It's a C program that finds word and line breaks in image-based PDFs and rearranges the text so it fits on smaller screens. It's got a ton of flags you can set, and it's pretty good at ignoring/cropping out headers and footers and dealing with pages scanned at an angle. http://www.willus.com/k2pdfopt/help/k2menu.shtml No affiliation with Willus.
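(The core trick, as I understand it, is a projection profile: sum the dark pixels across each row of the scanned image, and the near-empty valleys are your line breaks. A hedged numpy sketch, not k2pdfopt's actual C:)

    # pip install numpy pillow  ("scan.png" is a placeholder)
    import numpy as np
    from PIL import Image

    img = np.array(Image.open("scan.png").convert("L"))
    ink = (img < 128).sum(axis=1)  # dark pixels per image row

    # Runs of rows with (almost) no ink are gaps between text lines.
    blank = ink < ink.max() * 0.02
    breaks = [i for i in range(1, len(blank)) if blank[i] and not blank[i - 1]]
    print("line breaks at rows:", breaks)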


I'm starting a personal project to convert my university schedules from PDF to an ICS calendar, so I'm glad I heard about Tabula. But as others have said, a command-line version would be wonderful.
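For what it's worth, the ICS side should be easy to hand-roll once Tabula gets the table out. A rough Python sketch of what I have in mind (the (course, start, end) column layout is invented for illustration; start/end are datetimes already parsed from the CSV):

    # Minimal RFC 5545 output; most calendar apps accept this.
    def to_ics(rows):
        fmt = "%Y%m%dT%H%M%S"
        out = ["BEGIN:VCALENDAR", "VERSION:2.0"]
        for i, (course, start, end) in enumerate(rows):
            out += ["BEGIN:VEVENT",
                    "UID:%d@schedule.example" % i,
                    "DTSTAMP:" + start.strftime(fmt),
                    "SUMMARY:" + course,
                    "DTSTART:" + start.strftime(fmt),
                    "DTEND:" + end.strftime(fmt),
                    "END:VEVENT"]
        out.append("END:VCALENDAR")
        return "\r\n".join(out) + "\r\n"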


This is very cool!

Has this kind of thing been done for PDF map data?

I was talking with a friend of mine a month ago about the dismal state of official crime incidence websites. They're usually just lists of PDFs, probably because whoever is responsible for the data uses whatever MS Word PDF output is available to the office and posts an existing monthly report as a PDF. This makes online crime data a huge pain in the #ss to decipher.

I'm sure there's a lot of geographic data this could apply to.


this is neat. i'm also doing pdf rasterization and pretty extensive document analysis in html5 <canvas>, not just tables. unfortunately it's for an internal tool which will likely form the core of our business but the base library i wrote and use for it is open sourced at https://github.com/leeoniya/pXY.js

tute and demos are here: http://o-0.me/pXY/ , some recent commits like radial scanning aren't documented very well yet but i'll devote some time to it if anyone needs those. they're mostly useful for interactive analysis.

with some creative algorithms, typed arrays and web workers the speed is pretty amazing (for something built in js at least). a 1550x2006 pixel document page analyzes in 1.1s in chrome.


This is just awesome! Well done!


Tabula is also the name of a programmable logic company doing FPGA-like "3PLDs," where the implemented design varies over time to increase the effective size of the logic fabric. (tabula.com)


Awesome - have needed this so often.



