Several years ago I came across a pile of interesting-to-me US government data: scanned PDFs of paper forms (some filled out by hand, some with a typewriter, and more recently with a computer). The layout and level of detail of the forms change a few times over several decades, and there are hundreds of sub-schema variations within each time period; a particular variant might have a single record from 1983, a dozen scattered over time, or thousands across several decades. Some variations come down to a simple category misspelling; others are more substantial.
Some of the scanned forms have marginal notes (probably okay to ignore these), but I've also noticed cases where prefilled data on the form was crossed out and replaced with handwritten corrections.
Please let me know if a system exists that can deal with this idiosyncratic mess and be coaxed into producing structured data of some sort, without basically using MTurk behind the scenes. I would be interested in re-releasing the data into the public domain.
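For scale, here's roughly the naive baseline I have in mind as a starting point (a minimal sketch, assuming Python with pdf2image and pytesseract; "form_scan.pdf" is a placeholder). It gets raw text off the typed pages, but it does nothing for the handwriting, the layout variation, or the cross-outs, which is exactly why I'm asking about a more complete system:

```python
# Minimal baseline: rasterize each page of a scanned PDF and run
# Tesseract OCR over it. Assumes the poppler and tesseract binaries
# are installed. Handles typed/printed text only, not handwriting.
from pdf2image import convert_from_path
import pytesseract

def ocr_pdf(path: str, dpi: int = 300) -> list[str]:
    """Return raw OCR text for each page of a scanned PDF."""
    pages = convert_from_path(path, dpi=dpi)  # one PIL image per page
    return [pytesseract.image_to_string(page) for page in pages]

if __name__ == "__main__":
    # "form_scan.pdf" is a hypothetical example file.
    for i, text in enumerate(ocr_pdf("form_scan.pdf"), start=1):
        print(f"--- page {i} ---")
        print(text)
```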