
If they automate table detection, then many low-end "analysts" will be made redundant. PDFs are one of the worst bits for data feed automation.



Yeah, I did this for a living for a little while--I was an analyst whose job was mostly to read industry quarterly reports in PDF form and condense them into much smaller reports to give to upper management.

Of course, the data itself is usually also for sale. But a manager would rather make an analyst scrape it from the PDF report than pay the reporting company extra for a data subscription, because they prefer to bear the opportunity cost of not having the analyst work on something more important and productive.

As an analyst, I can't count how many times I asked my former employer to shell out a couple hundred dollars a month for market intelligence data subscriptions and was blown off because they didn't want to allocate a budget for it.


Just imagine how many "analysts" work for Reuters et al.


A staggering number of people in any large organization are basically working as a sort of "information filter" to simply condense information and report it up the organizational food chain. A sufficiently clever combination of OCR, NLP, and ML could automate a lot of those jobs. In other words, the executive set needs a Summly for industry intelligence. (Startup idea that I'm sure someone with VC connections has thought of already)

The trouble with PDFs is they're designed to be consumed by human eyes only. Any attempt to automatically extract information from them is fundamentally a hacky scrape-job.
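To make the "hacky scrape-job" point concrete, here is a rough sketch of the usual heuristic, assuming some extractor has already given you each word with its page coordinates. The `(x0, x1, y, text)` tuple shape, the function name `rows_from_words`, and the two thresholds are all made up for illustration and do not come from any particular library. The PDF has no notion of a table, so rows and columns have to be guessed from whitespace:

```python
def rows_from_words(words, y_tol=2.0, gap=10.0):
    """Guess table rows and cells from positioned words.

    words: list of (x0, x1, y, text) tuples (hypothetical extractor output).
    y_tol: max vertical distance for two words to count as the same row.
    gap:   min horizontal gap that is treated as a column boundary.
    """
    # Walk the words top-to-bottom and bucket them into rows by y position.
    ordered = sorted(words, key=lambda w: (w[2], w[0]))
    rows = []
    for w in ordered:
        if rows and abs(w[2] - rows[-1][0]) <= y_tol:
            rows[-1][1].append(w)
        else:
            rows.append((w[2], [w]))

    table = []
    for _, row_words in rows:
        # Within a row, read left-to-right and split on wide gaps.
        row_words.sort(key=lambda w: w[0])
        cells = []
        current = [row_words[0]]
        for w in row_words[1:]:
            if w[0] - current[-1][1] > gap:
                cells.append(" ".join(t[3] for t in current))
                current = [w]
            else:
                current.append(w)
        cells.append(" ".join(t[3] for t in current))
        table.append(cells)
    return table


words = [
    (0, 30, 100.0, "Region"), (80, 120, 100.5, "Q1"), (125, 160, 100.0, "Sales"),
    (0, 25, 115.0, "EMEA"), (80, 110, 115.0, "1,200"),
]
table = rows_from_words(words)
```

Both thresholds are guesses that have to be retuned per document, and merged cells, wrapped text, or multi-line headers all break it, which is pretty much why this kind of extraction stays fragile.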


In fact, we’re working on an auto-detection feature at this very moment! :D



