Accuracy. Every solution I've seen that relies on automatic crawling will eventually hit a parsing error when someone changes the sentence structure of a press release.
It's not so obvious when you're looking at the breaking releases for a few stocks or companies, but historical records have at least 1 error per stock per year.
The practical approach is a pipeline:
1. Validate that the data matches expectations (you do have a definition of correct, right?)
2. Log failures for manual review -> manual inserts or corrections get placed back into the queue for (1)
3. Monitor (2). When manual inserts start trending up, it may be time to update your processing logic.
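A minimal sketch of that loop in Python, assuming a hypothetical parsed-record format and field names (`ticker`, `eps`, `revenue`, `period`) purely for illustration:

```python
import logging
from collections import deque

logger = logging.getLogger("press_release_pipeline")

# Hypothetical validation rule: a parsed release must carry these fields
# with sane values. Your own "definition of correct" goes here.
REQUIRED_FIELDS = {"ticker", "eps", "revenue", "period"}

review_queue = deque()      # records awaiting manual review
manual_fixes_this_week = 0  # trend this counter to spot format drift


def matches_expectations(record: dict) -> bool:
    """Step 1: check the parsed record against your definition of correct."""
    return REQUIRED_FIELDS <= record.keys() and record["revenue"] >= 0


def ingest(record: dict) -> None:
    """Step 2: accept good records; route failures to manual review."""
    if matches_expectations(record):
        store(record)
    else:
        logger.warning("failed validation, queued for review: %r", record)
        review_queue.append(record)


def apply_manual_fix(corrected: dict) -> None:
    """Corrected records re-enter the pipeline through the same check (1)."""
    global manual_fixes_this_week
    manual_fixes_this_week += 1  # step 3: monitor this; a rising trend
                                 # means the source format has changed
    ingest(corrected)


def store(record: dict) -> None:
    print("stored:", record)  # stand-in for a real database insert
```

The point of routing corrections back through `ingest` is that your validation rules, not a human's memory, stay the single definition of correct.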
I pitched a similar idea at a company several years ago, where we had a team of people doing data entry from faxed documents. I wanted to build something that would do as much OCR as it could and then display the results for users to verify, which should have been a 10x efficiency increase, not to mention gains in speed and accuracy.
The idea was rejected; they wanted either a perfect solution or nothing. I don't know why, but for some reason the idea of computers replacing humans was acceptable to management, while computers augmenting humans wasn't.