Speaking as someone who contributed a few scrapers to the inspectors general project (https://github.com/unitedstates/inspectors-general), I think this is a great and worthwhile effort. It's actually not that hard to contribute a scraper if you know a little Python (and maybe a way to learn a little Python if you don't).
One thing that my friend who works in Open Data has told me is that it's important for websites like this to exist, to be able to point non-technical people at them and say "SEE. THIS is why you can't just publish everything as a PDF".
The main source of their power is every single court case, and more importantly, tracking which ones interact with each other and how, such as being overturned.
This needs not only access to PACER, but also a good algorithm and a huge staff to catch up to Westlaw/Lexis.
That's right. I once wrote the code for the now-defunct Courtbot.com, which crawled, stored, and indexed the majority of available U.S. court decisions as they were published. This was years before Google Scholar started indexing opinions.
But even a completely functional Courtbot-style site is really only a competitor to something like Findlaw.com, not Westlaw or LexisNexis. That's because those companies have features that would be difficult to replicate:
- Very complex and comprehensive search options and parameters. This is possible to do but time-consuming and tricky.
- LexisNexis has 15,000 employees, and I suspect a significant number are involved in reviewing cases, summarizing, noting authorities and conflicts, etc. It's not yet possible to replace a trained lawyer reviewing an opinion with a regular expression. :)
- A subscription to LN/WL typically gets you, depending on your package and how much you're paying, far more than just court opinions. You get news articles, journal articles, congressional transcripts, and a slew of databases that can be used to look up info on people, locate assets, etc. A lot of this means licensing deals, and LN/WL effectively gives you a one-stop shop for a wealth of data. Some of this is coming online and is becoming searchable, but not enough to make a real dent.
The one thing that challengers have in their favor is that Lexis and Westlaw are expensive. I've had free accounts because of faculty affiliations or a newsroom subscription, which is grand, but it's cost-prohibitive for many people and businesses. The ABA has published a list of alternatives; note the majority are actually still owned by Lexis and Westlaw:
http://www.americanbar.org/groups/departments_offices/legal_...
Although many folks from the Sunlight Foundation support the project, it has relatively decentralized control:
> This is an unusual, and occasionally chaotic, model for an open data project. the /unitedstates project is a neutral space; GitHub's permissions system allows many of us to share the keys, so no one person or institution controls it. What this means is that while we all benefit from each other's work, no one is dependent or "downstream" from anyone else. It's a shared commons in the public domain.
Awesome list of resources! I'm currently working on a text-based Twilio app that sends people simple updates on how their Senator/Representative votes on major legislation. Further down the line I'd like to tie in direct communication with Senators/Reps where they give a statement on why they voted the way they did, updates on when they're in their local offices, etc.
May I suggest you include committee votes where you can?
I've done a bunch of technology voters guides for Wired and CNET by crawling House/Senate records (what a pain) and that's one thing I always thought would be useful. Not enough attention is paid to them, and many bills don't get to the floor. There were plenty of SOPA committee votes on amendments, but the legislation never made it to the floor.
This is awesome - though I'm amused that a site that clearly represents US data is hosted on an overseas domain... (.io = British Indian Ocean Territory).
Edit: All snark aside though, this really is awesome. I can imagine all kinds of useful things that come out of this sort of structured data, including just interesting information (like demographic patterns of various politicians, etc).
This is a fascinating and shockingly current story. Thank you for the pointer.
"The depopulation of Chagossians from the Chagos Archipelago, that is, the compelled expulsion of the indigenous inhabitants of the island of Diego Garcia and the other islands of the British Indian Ocean Territory (BIOT) by the United Kingdom, at the request of the United States of America, began in 1968 and concluded on 27 April 1973 with the evacuation of Peros Banhos atoll.
...
On April 1, 2010, the British Cabinet announced the creation of the world’s largest Marine Protected Area (MPA) which consists of most of the Chagos Archipelago, homeland of the Chagossians. The MPA will prohibit extractive industry of all kinds, including commercial fishing and oil and gas exploration. Some Chagossians have claimed that this MPA was created to prevent the islanders from returning to the islands.
On December 1, 2010, a leaked US Embassy London diplomatic cable exposed British and US communications in creating the marine nature reserve. The cable relays exchanges between US Political Counselor Richard Mills and British Director of the Foreign and Commonwealth Office Colin Roberts, in which Roberts 'asserted that establishing a marine park would, in effect, put paid to resettlement claims of the archipelago’s former residents'. The cable (reference ID '09LONDON1156')[citation needed] was classified as confidential and 'no foreigners', and leaked as part of the Cablegate cache."
Not exactly; Google just recategorised it in their indexing system as a generic ccTLD (a technically-country-specific TLD treated as though it weren't).
You mean something that would allow "access [to] selected statistics about your Congressional district"? With an informative name like "My Congressional District"? Easily found by navigating to the tools and data section of the government entity that collects statistics about the population?
That tool helps with the parent's particular complaint, but I think the broader point is accurate. It is definitely too hard to find useful raw data, and it is even harder to find raw data that is already in a useful format. Specifically talking about the census data, their format is custom and complex [1]. They do have an API [2] which makes it easier, but I still have to write code to download a version of the census data that is in a useful format. Why can't I just have a download link to a SQL script, JSON file, or a tarball with a bunch of CSV files?
I have the same question for the United States project. Why YAML for congress-legislators? It is certainly better than creating their own custom format, but I still have to do work if I want to import the data into a database or Excel.
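To give a sense of how much "work" that is, here is a minimal sketch of flattening nested records into CSV using only the Python standard library. It assumes the YAML has already been parsed into a list of dicts (for example with PyYAML's `yaml.safe_load`, which is not shown here). The field names and values in `sample` are made up for illustration and are not the real congress-legislators schema.

```python
import csv
import io


def flatten(record, prefix=""):
    """Flatten nested dicts into one level, joining keys with dots."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            # Lists and scalars are kept as-is; csv will stringify them.
            flat[name] = value
    return flat


def records_to_csv(records):
    """Return CSV text with one row per record, one column per flattened key."""
    rows = [flatten(r) for r in records]
    # Collect columns in first-seen order so the output is stable.
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical records, shaped like what a YAML parser might return.
sample = [
    {"id": {"example": "X000001"}, "name": {"first": "Ada", "last": "Example"}},
    {"id": {"example": "X000002"}, "name": {"first": "Bo", "last": "Sample"}, "state": "ZZ"},
]
csv_text = records_to_csv(sample)
print(csv_text)
```

The resulting text can be written to a file and opened in Excel or bulk-loaded into a database. Repeated nested structures (such as a list of terms per person) would need their own table or a different flattening rule, which is where most of the real effort goes.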
I do not understand your comment. OP lamented the lack of easy access to demographic data, commenting that "it would be really awesome to just click a few times to see make up of a politician's district" and I gave a link to exactly what OP was longing for. The only remaining laziness circumvention is an application that reads your mind. You think bothering to look before complaining is too much to ask of individuals who comment on HN posts?
My point is this: how many people are there as motivated as OP? Motivated enough to comment on an obscure website in hopes of being pointed in the right direction?
Perhaps the disconnect is that I -- and maybe OP as well -- am thinking in the context of people as a whole, and you are thinking in the context of people as members of HN.
Let's talk about "people as a whole" who are interested in "a few clicks access to congressional district demographics." You think it is too much to ask to have them type "congressional district demographics" into a search box? If you put this search into Google, "My Congressional District" is the first result. I don't know how you make that any easier to find short of creating a mind reading application.
>You think it is too much to ask to have them type "congressional district demographics"..
For the majority of the (US at least) population? Absolutely yes. Hence the term "circumvention". No way this fight is won, at least at this point, over a battle of logic.
In this day and age, instigating change needs to be as simple as possible.
Maybe what I'm saying doesn't make sense. If so, apologies.
I apologize that I did not see your reply earlier. I have a project that I am working on that will really benefit from the citation extraction. I am tired of waiting for GPO/CRS to release the Annotated Constitution in XML format. I have been slowly working on getting it into Markdown format so that it can be made into epub/html/etc. I have been planning to get in touch with you but I do not have enough completed yet.
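For anyone wondering what citation extraction looks like at its simplest, here is a rough sketch that pulls U.S. Reports citations of the form "410 U.S. 113" out of plain text with a regular expression. This is only an illustration under my own assumptions, not the extractor discussed above: the pattern and the `find_citations` name are hypothetical, and a real extractor has to handle many more reporters, statutes, and edge cases.

```python
import re

# Matches "<volume> U.S. <page>", e.g. "410 U.S. 113".
# Deliberately narrow: it will not match U.S.C. statute references.
US_REPORTS = re.compile(r"\b(\d{1,3})\s+U\.\s?S\.\s+(\d{1,4})\b")


def find_citations(text):
    """Return a list of dicts with the volume, page, and matched text."""
    return [
        {
            "volume": int(m.group(1)),
            "page": int(m.group(2)),
            "match": m.group(0),
        }
        for m in US_REPORTS.finditer(text)
    ]


found = find_citations("See 410 U.S. 113 and 347 U.S. 483.")
for citation in found:
    print(citation["match"])
```

Each hit could then be rewritten as a Markdown link when generating the epub/html output.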
Both have to do with open data, but otherwise, there are significant differences.
The GitHub @unitedstates Project is an open, relatively decentralized directory to find tools and data related to the United States. Based on the organizations involved in its birth, I'd say its ethos is, broadly, about civic-minded issues. The tools mentioned vary and have different user experiences.
Enigma is a login-required, commercial offering (with a free option, at least for the time being) providing a web application interface to public data, worldwide. It is, at its core, a search engine that lets you drill down into data rows from a common user interface. Its ethos seems to be "find the data you are looking for, whatever your purpose: academic research, business analysis, civics, etc."
Thanks Craig, but I was barely involved: all the credit should go to Sunlight Foundation and their partners (Govtrack, NY Times) who started the project and did the painstaking work to build the datasets over the course of 2 years.
I helped with a tiny tiny piece (the contact-congress repo), and even that was worked on for months before by the folks at Sunlight (in particular Dan Drinkard and Eric Mill).