Extracting tabular data from U.S. Senators' scanned-in personal finance reports (github.com/dannguyen)
70 points by danso on March 21, 2016 | 6 comments



Very neat project. Tesseract seems to be the default go-to for OCR in a lot of projects now. I had not heard of FineReader until now.

My question is: why is it still acceptable for them to submit on paper? Who determines the submission requirements? I assume it is the Houses themselves, but which body within each house determines the requirements?

Edit: It seems the Secretary of the Senate maintains and enforces the requirements. I wonder whether that office also sets them, or whether it quite literally takes "an act of Congress" to alter them. https://efdsearch.senate.gov/search/home/


Tesseract is popular because it is open source, in my opinion. I've had a great deal of frustration using it, but to be honest I think OCR falls into the same valley as trying to implement an office suite or reimplement the Windows API: high expectations and only so many spare cycles to work on it.

I also have FineReader for Mac, but I've had contact with some of ABBYY's more expensive stuff and it's really incredible. I wouldn't recommend any other system, if asked.


On the House side (and I imagine it's much the same with the Senate), the Clerk administers the system and the Ethics Committee acts on (or at least talks about) complaints and violations:

https://fd.house.gov/FAQ.aspx

> Q. Can I still file on paper or is it mandatory to file the FD and PTR forms electronically?

> The electronic filing system is voluntary, but strongly recommended by the Committee on Ethics. You are still required to file, you just have the option of choosing to file on paper or via the online reporting application. The paper forms can be found here: Disclosure Forms.

Given that just enforcing the deadline is hard enough, I imagine the Clerk is just happy when things get turned in, regardless of the format (and let's face it, it's not hard to imagine Congressmembers complaining about having technical problems with the web form):

http://www.rollcall.com/218/financial-disclosure-deadline-co...

If the process of filing seems annoying to you...you probably don't want to know how good they are at filing accurate reports :)

http://fortune.com/2015/12/14/senator-corker-financial-discl...


In-person paper submissions are always preferable for these sorts of things because an error, misstatement, or omission could constitute mail or wire fraud in the hands of a zealous prosecutor.


Using non-free software, i.e. ABBYY FineReader, is a privacy mistake if this is to be used on personal data. On publicly available data, using commercial non-free software isn't even close to being innovative; in other words, I don't see any added value in using this program as opposed to using the batch scanning features of the software directly. What is the point of this project?


Hmmm...not sure where the difference in understanding is here. You're asking if this could be a privacy violation? Is it not clear that the U.S. Congress is required to post these forms on us.gov websites, making them accessible to all? Are you unaware of the difference between a digital image and digital text, such that you don't understand how one is profoundly different from the other? Help me out here.



