Majority of published scientific data not recoverable 20 years later (upi.com)
87 points by bane on Dec 20, 2013 | 34 comments



The problem with getting scientists to publish their data is that the incentives are not aligned properly. Clearly it is better for science in general, but the benefit to the scientist publishing the data is less clear. For what it's worth, I am working on reshaping the academic article so that the raw data and analysis can be stored alongside the text and figures, and so that the reader can play with the data and analysis directly inside the article. The idea is that this makes for a more interesting article from the reader's perspective, and thus results in more visibility and citations for the author.

This doesn't solve the problem for research where the datasets are in the many terabytes, but then again there are many papers where the datasets are well under a gigabyte.
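
As a rough illustration of the sort of thing I mean (the data values and function names below are invented, not from any real paper), a figure in such an article could carry its own data and analysis, and the reader could tweak a parameter and re-run it in place:

    # Hypothetical sketch (invented numbers and names): a figure in an article as a
    # self-contained, re-runnable unit; the data travels inline with the analysis.
    import statistics

    # Raw data shipped with the article, e.g. (dose, response) pairs.
    DATA = [(0.1, 2.3), (0.5, 3.1), (1.0, 4.8), (2.0, 7.9)]

    def analyze(data, threshold=1.0):
        """The analysis behind a hypothetical 'Figure 2': mean response below and
        at-or-above a dose threshold. A reader could change `threshold` and re-run
        this directly inside the article."""
        low = [resp for dose, resp in data if dose < threshold]
        high = [resp for dose, resp in data if dose >= threshold]
        return {"mean_low": statistics.mean(low), "mean_high": statistics.mean(high)}

    if __name__ == "__main__":
        print(analyze(DATA, threshold=1.0))

For sub-gigabyte datasets, inlining the data like this is feasible; larger datasets would have to live in an external repository the article references.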


Incentives really are the problem - there's absolutely no reason for me, the researcher, to put much work into maintaining my data. If I do, it's because of some personal standing, or the good of the field or what have you.

No one got tenure for a well curated data set.


There may be a contrived way out of this conundrum, but I really think the most sensible way is simply regulation.

We've had a ton of scientists sacrificing their own reputations and relationships with publishers (and those of their groups) in the name of something everyone agrees with but few really stand up for, because the personal gains are almost exclusively negative. That's a textbook use case for regulation.

I don't know specifically what should be done, but perhaps publishers could be required to meet certain obligations (e.g. responsibility for maintaining papers over the long term and making them public afterwards); or maybe simply a universal obligation to open publications after, say, 5 years.

People fear this will compromise the quality or sustainability of publishers. But the community needs publishers; they're a mark of credibility. So if publishers run into trouble (and they're really needed), they'll find a way, e.g. by demanding payment for publication from the wealthiest labs.


The NIH, the Howard Hughes Medical Institute, and a number of other major funders of science in both the U.S. and the U.K. already require papers funded with their money to be open access after 12 months.

The problem is that, whether or not a paper is Open Access, the data "available on request from the author" may be an undocumented bit of spaghetti code, may be on a Zip disk that's around here somewhere, I'm sure, or may just be lost.

Regulating "You must make your data accessible, and maintain it well" is much harder to implement, and much harder to check. Some grants now have sections describing what will happen to the data etc., but right now there really is very little reason beyond their own personal desire for researchers to maintain good quality software and data repositories.


Precisely. Publications require data. Data costs time/money/brains. Grants/promotions/tenure/survival cost publications. It's not hard to see that data serves a role as currency, and there is little incentive to give it away as long as publications are what puts bread on the table.

Additionally, the discussion over shared data is usually a cry to improve reproducibility. In this publication-centric world, one can't publish a paper that is just "we reproduced an existing paper", so outsiders look to "poach" new applications or findings from the data.


Who's going to maintain it, even at the small size?

The data from my dissertation is on a Zip disk. I didn't mean for it to get lost to the world, but, you know...

I'm on a project now where provision of the raw data to the granting agency for public availability is a requirement. But the metadata, database structuring, and answering questions from people who take my data and have a question about it has added noticeably to my workload.


"..answering questions from people who take my data and have a question about it has added noticeably to my workload."

I know this is an imposition under current incentive structures, but it seems to me that, on a macro level, this is exactly what we the public would want to see: people actively reading other people's research and data, and unrelated scientists asking questions of each other and reviewing each other's findings.


As someone who's made sure to keep the data from my PhD thesis around for almost a decade now, I know that most of it is totally obsolete and will never be looked at by anyone. It is a laudable goal to keep data available, but it'll be a cache with very low hit rate, I think.

However, this was simulated data; it could be recreated by checking out some old version of the code and rerunning it. These days, it wouldn't even take that long. The situation is different where someone has made measurements of the real world, since those are truly irreplaceable.


There are lots of people working on this problem. The NSF Office of Cyberinfrastructure has funded quite a few projects working on long-term data storage and discoverability. The largest of those are probably DataOne (http://www.dataone.org/) and Data Conservancy (http://dataconservancy.org/). The hard part is convincing scientists to use the tools that are available. The NSF already requires that all new proposals include a data management plan. I imagine it won't be long before they start requiring projects to deposit their data in a public or eventually-public repository.


All scientific research should be verifiable and reproducible. It can be hard to reproduce physical experiments, but we should at least be able to verify the authors' work.

A study without the original raw data or source code is just the author's opinion, not science!


I'm happy to submit my raw data. Where I get concerned is: how do we define the raw data?

For my thesis, I measured the Patterson function for a series of colloids. I can imagine other scientists finding this useful and I'd be happy to submit it. However, it's not the raw data. What I actually measured is the polarization of a neutron beam, which I then mathematically converted into the Patterson function. So I should probably submit the neutron polarization I measured, so that other scientists can check my transformation. Except that I can't directly measure the polarization: all I really measure are neutron counts versus wavelength for two different spin states, so that must be my raw data. But those counts versus wavelengths are really a histogram of time-coded neutron events. And those time-coded neutron events are really just voltage spikes out of a signal amplifier and a high-speed clock.

If a colleague sent me her voltage spikes, I'd assume she was an idiot and never talk to her again. Yet I've also seen experiments fail because of problems at each of these abstraction layers. The discriminator windows were set improperly, so the voltage spikes didn't correspond to real neutron events. The detector's position had changed, so the time-coded neutron events didn't correspond to the neutron wavelengths in the histogram. A magnetic field was pointed in the wrong direction, so the neutron histograms didn't give the real polarization. There was a flaw in the polarization analyzer, so the neutron polarization didn't give the true Patterson function. And all of this is assuming that my samples were prepared properly.
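
Just to make the layering concrete, here's a toy sketch of that chain (invented numbers and function names, nothing from the actual instrument), with the calibration assumption that can silently corrupt each layer noted alongside:

    # Toy sketch of the raw-data chain; each stage is garbage-in if its
    # calibration assumption is wrong, and nothing downstream will tell you.
    from collections import Counter

    def spikes_to_events(spikes, lo=0.5, hi=5.0):
        """Voltage spikes (time, amplitude) -> time-coded neutron events.
        Wrong if the discriminator window (lo, hi) is set improperly."""
        return [t for t, amp in spikes if lo < amp < hi]

    def events_to_histogram(events, time_to_wavelength=lambda t: round(0.1 * t, 1)):
        """Time-coded events -> counts vs. wavelength.
        Wrong if the detector position (hence the time-of-flight calibration) changed."""
        return Counter(time_to_wavelength(t) for t in events)

    def histograms_to_polarization(hist_up, hist_down):
        """Spin-up / spin-down histograms -> beam polarization per wavelength.
        Wrong if a magnetic field pointed in the wrong direction."""
        return {w: (hist_up[w] - hist_down[w]) / (hist_up[w] + hist_down[w])
                for w in hist_up.keys() & hist_down.keys()}

    # The final step, polarization -> Patterson function, is instrument-specific
    # and is wrong if the polarization analyzer itself is flawed.

    spikes_up = [(10, 1.2), (11, 0.1), (12, 3.3), (31, 2.0)]   # made-up spikes
    spikes_down = [(10, 1.1), (30, 2.5), (32, 6.7)]
    up = events_to_histogram(spikes_to_events(spikes_up))
    down = events_to_histogram(spikes_to_events(spikes_down))
    print(histograms_to_polarization(up, down))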

I've seen all of these problems occur and worked my way around them. However, I could only work my way around the problem because I had enough context to know what was going wrong. The deeper you head down the raw data chain, the more context you lose and the easier it becomes to make the wrong assumptions. I know that I have one data set that provides pretty damn clear evidence that we violated the conservation of energy. Obviously we didn't, but looking at the data won't tell you that unless you have information on the capacitance of the electrical interconnects in our power supplies on that particular day.

Research should be verifiable and reproducible. However, an order-of-magnitude increase in verifiability isn't as useful as an incremental increase in reproducibility. I'd be happy to let every person on earth examine every layer of my data procedure to see if I've made any mistakes, but even I won't fully trust my results until someone repeats the experiment.


One concern of mine is also that being able to "Click Run and Get The Same Answer" seems to assuage people and convince them that all is well, when what really needs to happen is to have the experiment repeated independently.


Many times this is completely impossible. I still have an email from the sysadmin of (at the time) a top-500 supercomputer asking if the people I was working with and I could please delete some of the 210TB we generated for a specific project that eventually resulted in 4 or 5 papers.


Your data is still digital at least. What about the E. coli evolution experiment (http://en.wikipedia.org/wiki/Escherichia_coli_long-term_evol...) that has to store its data in analog form in a freezer? You can't send that to other people with interest in the results by e-mail or FTP.


That sounds expensive, not impossible in principle.

The past and current community standard is that it is not absolutely necessary to share data, so people run projects for which it is prohibitively expensive to copy the data.

If we chose to weight sharing more highly, we could budget for sharing and/or scale project data to make it more sharable. We might decide that if the data are too expensive to share in practice, we don't fund the project. We choose. It's not impossible.


The argument is that you should be able to regenerate the data from scratch, assuming it is a simulation of some kind. This just requires source code and config files, which are still generally not shared.
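
As a rough sketch of what that could look like in practice (the manifest fields and helper below are hypothetical, not any standard tool), even a tiny run manifest pinning the code version, config, and seed would go a long way toward making a simulation re-runnable:

    # Hypothetical sketch: record what's needed to regenerate a simulation run,
    # and publish this small file instead of the terabytes of output.
    import json
    import subprocess

    def write_run_manifest(config_path, seed, out_path="run_manifest.json"):
        """Pin the exact code version (git commit), the config, and the RNG seed."""
        commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
        with open(config_path) as f:
            config = json.load(f)   # assumes a JSON config; adapt for other formats
        manifest = {"git_commit": commit, "config": config, "seed": seed}
        with open(out_path, "w") as f:
            json.dump(manifest, f, indent=2)
        return manifest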


One of the problems with that was that some of the input data we were using was commercial (and cost almost as much as a year of a postdoc's salary).

The best we could come up with was to publish as many details about the simulations as possible and let people run a limited number of simulations on our own servers (see www.gleamviz.org).

Edit: Even so, you can't actually redo the original simulations, as there have been several updates since, including a complete rewrite of the simulator, updated input databases, etc.


Maybe researchers could negotiate a license for commercial data that would allow others access to the data for review or reproduction purposes at a reasonable charge, but not for use as the basis of new claims.

Has anyone seen a deal like this before?


It would never happen; people would steal/share the data. Remember, researchers (specifically in academia) are typically short on cash and will cut corners whenever possible. The good ones (a majority, in my experience) won't cut corners when it comes to the integrity of their data, though.


But you can't cite stolen data, and you can't base claims on things you don't cite. You would pay at least when it came to submission time.


Much of science is the author's opinions, that is what makes you a scientist, you have to interpret the data you have collected. Publishing piles of data would be completely useless to anyone without detailed and careful interpretation.

The current system has its faults but some fields are at least making progress ensuring at least some of the data remains available. For certain projects there is currently no feasible way to openly distribute terabytes of data.


No, as a scientist one should give others the chance to interpret the data one has collected. One can interpret one's own data, but that should be a separate step and a separate publication.

Plus, how can we be sure there is no basic mistake in the interpretation? And how can a study even pass peer review if most of its sources are hidden?

Technical difficulty is not really an excuse. Many studies are based on a few megabytes or even kilobytes of data, and astronomy and physics have no problem distributing terabytes of data.

To give an example: Tycho Brahe made a lot of measurements of the position of Mars. However, he was not a very skilled mathematician, so his own interpretation would only have made the orbital parameters more precise. Luckily Kepler, who was a brilliant mathematician, had access to his data and could derive Kepler's laws of planetary motion.


Let's assume we live in the ideal world where source code, raw data, analysis pipelines, and processed data are accessible with a paper. In practice, do you think that graduate students and professors will have time to review this information and reinterpret the data? When I read or peer review a paper from a lab that I know produces good quality research, I wouldn't dare waste my time reviewing such information. I know in principle, sharing everything sounds nice, but it would drastically slow down the progress of science.


Someone may review it in 1 or 10 years, or maybe even in 1,000 years. In the future the review could even be automated; even I could write a robot that looks for mistakes in source code.

An incorrect but widely accepted study can cause a lot of damage.

I think share-everything is the only sustainable way in the long term.


I think your view is too simplistic. You are wholly correct that it's good for scientists to publish data, and often progress is made when scientists see each other's data. However, often it's not practical to publish all data or knowledge about an experimental set up. Many of my experiments run on custom-built machines with years of maintenance and history. Detailing every step in a machine's history is far too onerous a requirement, even if it does occasionally contain a nugget of value. As rprospero says above, what counts as data? An important part of scientific publishing is having the judgment to know what to include and what not to include.


And when that raw data includes your medical records?

I'm all for open data, but "All X must be Y" is often a flawed argument.


I made a great diagram which illustrates this here: "Research Data and Metadata at Risk: Degradation over Time" (http://zzzoot.blogspot.ca/2010/12/research-data-and-metadata...) for a paper I co-authored: https://www.jstage.jst.go.jp/article/dsj/11/0/11_11-DS3/_pdf

The diagram is based on one from an earlier paper: 'Nongeospatial Metadata for the Ecological Sciences', 1997, Michener et al. http://dx.doi.org/10.1890/1051-0761(1997)007%5B0330:NMFTES%5...


This would be a worthwhile use for the NSA datacentres, a raw scientific data repository.


Github for science provided by Amazon might be more realistic.


Isn't AWS Glacier built for this type of use?


I've written papers where the raw data was small enough that it could be put into tables and included within the online supplements. This is pretty much ideal. Unfortunately, a lot of experiments generate enough data that no publisher will store it for you. Given what a racket the scientific journal business is and that scientists pay thousands for each publication, it really should be expected that journals will store and curate any data pertinent to a paper they are paid to publish.


Jelte Wicherts and his co-authors put forward a set of general suggestions for more open data in scientific research in an article in Frontiers in Computational Neuroscience (an open-access journal).[1]

"With the emergence of online publishing, opportunities to maximize transparency of scientific research have grown considerably. However, these possibilities are still only marginally used. We argue for the implementation of (1) peer-reviewed peer review, (2) transparent editorial hierarchies, and (3) online data publication. First, peer-reviewed peer review entails a community-wide review system in which reviews are published online and rated by peers. This ensures accountability of reviewers, thereby increasing academic quality of reviews. Second, reviewers who write many highly regarded reviews may move to higher editorial positions. Third, online publication of data ensures the possibility of independent verification of inferential claims in published papers. This counters statistical errors and overly positive reporting of statistical results. We illustrate the benefits of these strategies by discussing an example in which the classical publication system has gone awry, namely controversial IQ research. We argue that this case would have likely been avoided using more transparent publication practices. We argue that the proposed system leads to better reviews, meritocratic editorial hierarchies, and a higher degree of replicability of statistical analyses."

Wicherts has published another article, "Publish (Your Data) or (Let the Data) Perish! Why Not Publish Your Data Too?"[2] on how important it is to make data available to other researchers. Wicherts does a lot of research on this issue to try to reduce the number of dubious publications in his main discipline, the psychology of human intelligence. When I see a new publication of primary research in that discipline, I don't take it seriously at all as a description of the facts of the world until I have read that independent researchers have examined the first author's data and found that they check out. Often the data are unavailable, or were misanalyzed in the first place.

[1] Jelte M. Wicherts, Rogier A. Kievit, Marjan Bakker and Denny Borsboom. Letting the daylight in: reviewing the reviewers and other ways to maximize transparency in science. Front. Comput. Neurosci., 03 April 2012 doi: 10.3389/fncom.2012.00020

http://www.frontiersin.org/Computational_Neuroscience/10.338...

[2] Wicherts, J.M. & Bakker, M. (2012). Publish (your data) or (let the data) perish! Why not publish your data too? Intelligence, 40, 73-76.

http://wicherts.socsci.uva.nl/Wichertsbakker2012.pdf


It's an amazing topic. My wife is a doctor, and I'm amazed at how difficult it is to tell whether an article is sound or important, whether it has been checked properly, or whether the tree of citations that gives it its importance still holds up after new studies. I've been thinking about a visual standard that lets you know the status of a given study: how many reviews it has, whether it cites studies that are solid, with solid peer reviews, and so on. Ideally it should be possible to trace a new discovery down to its scientific roots and easily know how good it is. But to achieve this you need, as you said, to change the whole scientific publication method: create some kind of standard publication API for the data, the citations, the peer reviews, and the number of times a result has been replicated. Some kind of debate forum around these publications would also help, where people could coordinate replications, talk about methods, etc. But it is all a big mess, and certainly ready for disruption.
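
To make the "publication API" part concrete, here is a hypothetical sketch of the kind of record such a standard could expose, and how a status indicator might roll citation health up the tree (all field names and the scoring are invented for illustration):

    # Hypothetical sketch of a record a standard "publication API" could expose.
    from dataclasses import dataclass, field

    @dataclass
    class Review:
        reviewer: str
        verdict: str              # e.g. "sound", "major concerns"
        public: bool = True

    @dataclass
    class Study:
        doi: str
        title: str
        data_url: str = ""                                   # deposited raw data, if any
        cites: list = field(default_factory=list)            # DOIs of cited studies
        reviews: list = field(default_factory=list)          # Review objects
        replications: list = field(default_factory=list)     # DOIs of replication attempts

    def credibility_score(study, registry):
        """Toy status indicator: direct evidence for this study plus a fraction of the
        average score of what it cites, so a shaky root study drags down everything
        built on it. Assumes the citation graph has no cycles."""
        own = len(study.replications) + 0.5 * sum(r.verdict == "sound" for r in study.reviews)
        if study.data_url:
            own += 1
        cited = [credibility_score(registry[d], registry) for d in study.cites if d in registry]
        return own + 0.25 * (sum(cited) / len(cited) if cited else 0)

A visual indicator would then just be a rendering of something like that score, with the ability to drill down into the weak citations.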

One clear use case: recently it was published in the New England (a tier-one medical publication) that all the studies on the relationship between high blood pressure and salt have their origin in an old animal study with rabbits (if I recall correctly). They were fed the equivalent, for humans, of hundreds of grams of salt, and as their blood pressure rose it was deduced that salt causes high blood pressure. There was also a meta-study on the relationship between high blood pressure and salt, and it didn't find a clear correlation. Maybe this is correct, maybe it's not, but the fact that a medical truth as established as salt = high blood pressure cannot be properly traced and known just lets you see how broken it all is.

edit: typos and spelling


This highlights the fallacy of the old "once it's on the internet it's there forever" meme.



