There are projects on this - one of the best-known ones (to my knowledge) is the S...

sunchild · on June 29, 2011

Yes, litigation and compliance tend to lead the way when it comes to extracting meaning from legal data pools. In my opinion, the single biggest obstacle to getting legal knowledge to play nice with software is the fact that it is all "siloed" due to: (1) being in MS Word format, (2) being confidential information, and (3) the lack of conventions/standards in legal documents.

The good news, though, is that legal documents tend to follow a fairly narrow range of variation within a particular practice area (e.g., leases, sales of goods, service agreements, motions, etc.).

I've always wanted to run a huge number of documents through Bayesian filters or something similar to develop some interesting classification rules, but it's damn hard to get a pool of representative documents that isn't strictly confidential.
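For anyone curious what the Bayesian-filter idea might look like, here is a minimal sketch of a naive Bayes classifier with add-one smoothing in plain Python. The names `SAMPLE_DOCS`, `tokenize`, `train`, and `classify` are hypothetical, and the four "documents" are invented one-liners standing in for a real corpus, so treat it as a toy rather than a working pipeline.

```python
import math
from collections import Counter, defaultdict

# Invented toy snippets (assumption): stand-ins for real, labeled legal documents.
SAMPLE_DOCS = [
    ("lease", "the tenant shall pay rent to the landlord for the premises"),
    ("lease", "landlord may terminate the lease if the tenant fails to pay rent"),
    ("sale", "the seller shall deliver the goods to the buyer upon payment"),
    ("sale", "buyer shall inspect the goods and pay the purchase price to the seller"),
]


def tokenize(text):
    """Lowercase, split on whitespace, keep purely alphabetic tokens."""
    return [w for w in text.lower().split() if w.isalpha()]


def train(labeled_docs):
    """Count words per label and documents per label."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for label, text in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    """Return the label with the highest log-probability under naive Bayes."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    tokens = tokenize(text)
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        # Log prior: share of training documents carrying this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for w in tokens:
            # Add-one (Laplace) smoothing so unseen words do not zero out the score.
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(SAMPLE_DOCS)
print(classify("tenant owes rent on the premises", *model))
```

With a real pool of documents you would want far more training data per practice area and probably a library implementation, but the counting-and-smoothing core stays the same.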

VanL · on June 29, 2011

I am doing this with patents. Google "Killing patents with Python" if you want to see the presentation.

sunchild · on June 29, 2011

Shoot me an email. I didn't find your presentation. I did find a Yahoo Answers question on "Can a 4 foot long ball python kill a 4 lb. kitten?"

bchjam · on June 29, 2011

you need to keep the quotes in the search. doing that I found this link

http://topsy.com/pycon.blip.tv/file/4879824/?allow_lang=en

which then links to a video of the presentation (watching it now)

http://blip.tv/pycon-us-videos-2009-2010-2011/pycon-2011-how...

VanL · on June 29, 2011

Rusty or Sol?

Zumzoa · on June 29, 2011

The only word-processed files I can think of that might be non-confidential are contracts made in the past couple of decades between companies that have both declared bankruptcy. Either that or public EULAs.