Lehigh research team to investigate a “Google for research data” (lehigh.edu)
86 points by sprague on Sept 6, 2018 | 17 comments



Yikes, talk about poor timing [0]!

Of course, the proposal [1] was submitted at the beginning of August before this was available, but it must still be a bit of a gut check for the research team.

Hopefully they can figure out a way to inject additional novelty into the project.

[0]: https://news.ycombinator.com/item?id=17919297

[1]: https://www.nsf.gov/awardsearch/showAward?AWD_ID=1816325


Not poor timing at all! A non-Google alternative is necessary for when Google (inevitably) decides to shelve their version.


I've been curious about how to pay people for their datasets; research into that seems really interesting. It would be great if you could pay known, trustworthy sources for vetted data, and those sources would crowdsource it by literally pounding the pavement or parsing through difficult-to-parse datasets. Entire businesses are built on exactly that, so I could see how this gets complicated quickly when somebody wants access to a dataset but doesn't want others to have it. Anyway, I wonder whether anyone has studied the "how to pay people for datasets" problem: how do you get good-quality data that isn't gamed purely to maximize profit, while still giving people an incentive to collect it?


There are already several established organizations that hold and manage billing for data sets within specific domains. An example is the LDC (https://www.ldc.upenn.edu/) which hosts huge amounts of natural language + voice data in many languages, submitted by universities around the world.

Personally, I think there is a big downside to attaching data to billing. Obtaining it becomes quite difficult, both financially and logistically, especially if you are outside a university. Even within a university, procuring data can take months of bureaucracy, and for an independent student or developer the data becomes largely inaccessible.


I'd like a dataset which maps (zip code, age, single) to a distribution of incomes. E.g. (60642, 18, single) is likely <$20k income. Ideally it would return a big list of (age, zip code, income, year, single) entries.

Is it possible to find this data using Google's dataset search? If not, making an easier frontend for it might be one way to add novelty.

It's also hard to figure out whether the data exists or whether your search terms are poor.


The data you're looking for (as well as dozens of other interesting data fields) is available from the Census Bureau:

https://www.census.gov/programs-surveys/acs/guidance/subject...

It's updated annually and you can get data all the way down to the census tract level (or maybe even individual blocks, I don't remember). It's a really great resource.

You can grab the data as CSV here:

https://www2.census.gov/programs-surveys/acs/summary_file/20...

They have a web UI too.
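Beyond the CSV dumps, the same figures can be pulled programmatically. Here is a minimal sketch using the Census Bureau's data API; the dataset year and the variable code `B19013_001E` (median household income) are assumptions from memory, so verify them against the Census API documentation before relying on them:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: 5-year ACS estimates. Verify year/dataset on api.census.gov.
BASE = "https://api.census.gov/data/2022/acs/acs5"

def median_income_url(zcta: str) -> str:
    """Build a query URL for median household income (B19013_001E) in one ZCTA."""
    params = {
        "get": "NAME,B19013_001E",
        "for": f"zip code tabulation area:{zcta}",
    }
    return f"{BASE}?{urlencode(params)}"

def fetch_median_income(zcta: str) -> int:
    """Fetch and parse the JSON response (a header row followed by data rows)."""
    with urlopen(median_income_url(zcta)) as resp:
        header, row = json.load(resp)
    return int(row[header.index("B19013_001E")])

print(median_income_url("60642"))
```

Note that ZIP codes and Census ZCTAs are close but not identical, which matters if you join this against other ZIP-keyed data.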


I would also agree that in general, there is certainly room to grow in dataset discovery. However, that will require a drastic improvement in the metadata associated with many (most?) datasets, which I think is probably a large contributor to discovery difficulty.

Regarding your data, at an individual level, it will be very difficult to find that information publicly available, for the simple reason that the organizations that have that information (US Census, IRS) are under extremely strict privacy-preservation requirements whose goal is to prevent exactly these types of linkages.

You could instead try an analysis on aggregated data, i.e. correlations between (zip, average age, average household size) and (zip, average income) tuples. That data is available: [0, 1]

[0]: https://toolbox.google.com/datasetsearch/search?query=us%20i...

[1]: https://toolbox.google.com/datasetsearch/search?query=us%20c...
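The aggregate-level analysis suggested above amounts to a join on zip code followed by a correlation. A small sketch with made-up numbers (real values would come from datasets like those linked above):

```python
# Hypothetical zip-level aggregates, for illustration only.
avg_age = {"60601": 30.0, "60602": 35.0, "60603": 40.0, "60604": 45.0, "60605": 50.0}
avg_income = {"60601": 40_000, "60602": 50_000, "60603": 55_000, "60604": 70_000, "60605": 80_000}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Join the two tables on zip code, then correlate the joined columns.
zips = sorted(avg_age.keys() & avg_income.keys())
r = pearson([avg_age[z] for z in zips], [avg_income[z] for z in zips])
print(f"age/income correlation across zips: r = {r:.3f}")
```

The join-on-zip step is where most of the real work hides: the two source datasets rarely share identical zip coverage, so the intersection of keys can silently shrink your sample.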


This already exists; look at Claritas: https://claritas360.claritas.com/mybestsegments/#zipLookup


This is magical. How'd you find out about this?


The dataset you are looking for is a pretty common marketing-research dataset, which may help your Google queries.


Google may not be the best model to aid researchers, nor the most useful or profitable one. An AWS-meets-Coursera model backed by a helpful tech and engineering consulting/support shop seems like a better, full-service way to help bio people accomplish their work while having the support of a top-notch IT/engineering organization: professional services without the delay, cost, or extractive tendencies, more like kick-ass support that gets things done right now.


Surprised no one has mentioned Google Scholar; the solution already exists. Not to mention Google itself is already a pretty great research tool if you know what you're looking for. You can even search by file type, e.g. "cure for cancer filetype:pdf".

https://scholar.google.com/


This isn't for finding papers. It's for finding the data the papers were written about.


There are already quite a few sites like this, including the one I work on for Canadian research data: https://www.frdr.ca

We have a search engine that covers every Canadian data set we can find (both academic and governmental) and the ability to upload your own data set directly to the site.


Incorrect title, or at least misleading. The NSF awarded $500K for a "Lehigh research team to investigate a 'Google for research data'". You're likely not going to build a "Google for X" on $500K.


I'm thinking the better question is: What would it take for Google to be "Google for research data"? I have to presume the answer is: not much.


I don't know what it took, but they did it recently:

https://news.ycombinator.com/item?id=17919297



