
All scientific research should be verifiable and reproducible. It can be hard to reproduce physical experiments, but we should at least be able to verify the authors' work.

A study without the original raw data or source code is just the author's opinion, not science!




I'm happy to submit my raw data. Where I get concerned is how we define the raw data.

For my thesis, I measured the Patterson function for a series of colloids. I can imagine other scientists finding this useful and I'd be happy to submit it. However, it's not the raw data. What I actually measured is the polarization of a neutron beam, which I then mathematically converted into the Patterson function. So I should probably submit the neutron polarization I measured, so that other scientists can check my transformation. Except that I can't directly measure the polarization - all I really measure are neutron counts versus wavelength for two different spin states, so that must be my raw data. But those counts versus wavelengths are really a histogram of time-coded neutron events. And those time-coded neutron events are really just voltage spikes out of a signal amplifier and a high-speed clock.
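To make that concrete, here is a minimal sketch (Python, with invented counts; the real correction chain is much longer) of just the step from the two spin-state histograms to a beam polarization:

    import numpy as np

    # Invented example: neutron counts per wavelength bin for the two spin
    # states, already histogrammed from the time-coded events (which were
    # themselves just voltage spikes).
    counts_up = np.array([1200.0, 1150.0, 980.0, 760.0])
    counts_down = np.array([300.0, 310.0, 340.0, 380.0])

    # Polarization per bin: P = (N_up - N_down) / (N_up + N_down).
    # Every instrument correction (discriminator windows, detector position,
    # flipper efficiency, ...) is silently assumed to be perfect here.
    polarization = (counts_up - counts_down) / (counts_up + counts_down)
    print(polarization)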

If a colleague sent me her voltage spikes, I'd assume she was an idiot and never talk to her again. Yet I've also seen experiments fail because of problems at each of these abstraction layers. The discriminator windows were set improperly, so the voltage spikes didn't correspond to real neutron events. The detector's position had changed, so the time-coded neutron events didn't correspond to the neutron wavelengths in the histogram. A magnetic field was pointed in the wrong direction, so the neutron histograms didn't give the real polarization. There was a flaw in the polarization analyzer, so the neutron polarization didn't give the true Patterson function. And all of this is assuming that my samples were prepared properly.

I've seen all of these problems occur and worked my way around them. However, I could only work my way around each problem because I had enough context to know what was going wrong. The deeper you head down the raw data chain, the more context you lose and the easier it becomes to make the wrong assumptions. I know that I have one data set that provides pretty damn clear evidence that we violated the conservation of energy. Obviously we didn't, but looking at the data won't tell you that unless you have information on the capacitance of the electrical interconnects in our power supplies on that particular day.

Research should be verifiable and reproducible. However, an order-of-magnitude increase in verifiability isn't as useful as an incremental increase in reproducibility. I'd be happy to let every person on earth examine every layer of my data procedure to see if I've made any mistakes, but even I won't fully trust my results until someone repeats the experiment.


One concern of mine is also that being able to "Click Run and Get The Same Answer" seems to assuage people and convince them that all is well, when what really needs to happen is to have the experiment repeated independently.


Many times this is completely impossible. I still have an email from the sysadmin of (at the time) a top-500 supercomputer asking if the people I was working with and I could please delete some of the 210 TB we generated for a specific project that eventually resulted in 4 or 5 papers.


Your data is still digital at least. What about the E. coli evolution experiment (http://en.wikipedia.org/wiki/Escherichia_coli_long-term_evol...) that has to store its data in analog form in a freezer? You can't send that to other people with interest in the results by e-mail or FTP.


That sounds expensive, not impossible in principle.

The past and current community standard is that it is not absolutely necessary to share data, so people run projects for which it is prohibitively expensive to copy the data.

If we chose to weight sharing more highly, we could budget for sharing and/or scale project data to make it more sharable. We might decide that if the data are too expensive to share in practice, we don't fund the project. We choose. It's not impossible.


The argument is that you should be able to regenerate the data from scratch, assuming it is a simulation of some kind. This just requires source code and config files, which are still generally not shared.
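In the simplest case that is little more than a fixed seed plus a config file. A toy sketch (hypothetical config values, not any particular simulator):

    import json
    import random

    # Hypothetical config shipped alongside the paper; the point is that
    # anyone with the source and this file should regenerate the same output.
    config = json.loads('{"seed": 42, "steps": 1000, "infection_rate": 0.03}')

    rng = random.Random(config["seed"])
    infected = 1
    for _ in range(config["steps"]):
        # Toy stochastic update standing in for the real simulation dynamics.
        if rng.random() < config["infection_rate"]:
            infected += 1

    # Same seed + same code + same config -> same number, every time.
    print(infected)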


One of the problems with that was that some of the input data we were using was commercial (and almost as expensive as a postdoc's annual salary).

The best we could come up with was to publish as many details about the simulations as possible and let people run a limited number of simulations on our own servers (see www.gleamviz.org).

Edit: Even so, you can't actually redo the original simulations, as there have been several updates since, including a complete rewrite of the simulator, updated input databases, etc...


Maybe researchers could negotiate a license for commercial data that would allow others access to the data for review or reproduction purposes at a reasonable charge, but not for use as the basis of new claims.

Has anyone seen a deal like this before?


It would never happen; people would steal/share the data. Remember, researchers (specifically in academia) are typically short on cash and will cut corners whenever possible. The good ones (a majority in my experience) won't cut corners when it comes to the integrity of their data, though.


But you can't cite stolen data, and you can't base claims on things you don't cite. You would pay at least when it came to submission time.


Much of science is the author's opinion; that is what makes you a scientist: you have to interpret the data you have collected. Publishing piles of data would be completely useless to anyone without detailed and careful interpretation.

The current system has its faults, but some fields are at least making progress in ensuring that some of the data remains available. For certain projects there is currently no feasible way to openly distribute terabytes of data.


No, as a scientist one should give others a chance to interpret the data one has collected. One could interpret one's own data, but that should be a separate step and a separate publication.

Plus, how can we be sure there is no basic mistake in the interpretation? And how can a study even pass peer review if most of its sources are hidden?

Technical difficulty is not really an excuse. Many studies are based on a few megabytes or even kilobytes of data. Astronomy and physics have no problem distributing terabytes of data.

To give an example: Tycho Brahe made a great many measurements of the position of Mars. However, he was not a very skilled mathematician, so his own interpretation would only have made the orbital parameters more precise. Luckily Kepler, who was a brilliant mathematician, had access to his data and could derive Kepler's laws of planetary motion.


Let's assume we live in the ideal world where source code, raw data, analysis pipelines, and processed data are accessible with a paper. In practice, do you think that graduate students and professors will have time to review this information and reinterpret the data? When I read or peer review a paper from a lab that I know produces good quality research, I wouldn't dare waste my time reviewing such information. I know in principle, sharing everything sounds nice, but it would drastically slow down the progress of science.


Someone may review it in 1 or 10 years, or maybe even in 1000 years. In the future the review could even be automated; even I could write a robot that looks for mistakes in source code.
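Even something trivial would catch real problems. A toy sketch of such a robot (my own invented check: flag analysis scripts that use a random number generator but never seed it):

    import re
    import sys

    SEED_RE = re.compile(r"(np\.random\.seed|random\.seed)\s*\(")
    RNG_RE = re.compile(r"(np\.random\.|random\.)")

    # Usage: python check_seeds.py analysis.py other_script.py ...
    for path in sys.argv[1:]:
        with open(path) as f:
            source = f.read()
        if RNG_RE.search(source) and not SEED_RE.search(source):
            print(f"{path}: uses a random number generator but never seeds it")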

An incorrect but widely accepted study can cause a lot of damage.

I think share-everything is the only long-term sustainable approach.


I think your view is too simplistic. You are wholly correct that it's good for scientists to publish data, and often progress is made when scientists see each other's data. However, often it's not practical to publish all data or knowledge about an experimental set up. Many of my experiments run on custom-built machines with years of maintenance and history. Detailing every step in a machine's history is far too onerous a requirement, even if it does occasionally contain a nugget of value. As rprospero says above, what counts as data? An important part of scientific publishing is having the judgment to know what to include and what not to include.


And when that raw data includes your medical records?

I'm all for open data, but "All X must be Y" is often a flawed argument.



