
I'm happy to submit my raw data. Where I get concerned is: how do we define the raw data?

For my thesis, I measured the Patterson function for a series of colloids. I can imagine other scientists finding this useful and I'd be happy to submit it. However, it's not the raw data. What I actually measured is the polarization of a neutron beam, which I then mathematically converted into the Patterson function. So I should probably submit the neutron polarization I measured, so that other scientists can check my transformation. Except that I can't directly measure the polarization - all I really measure are neutron counts versus wavelength for two different spin states, so that must be my raw data. But those counts versus wavelengths are really a histogram of time-coded neutron events. And those time-coded neutron events are really just voltage spikes out of a signal amplifier and a high-speed clock.
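To make one of those layers concrete, here's a minimal sketch (not my actual analysis code, and ignoring real-world corrections like flipper and analyzer efficiency) of just the histogram-to-polarization step, using the standard spin asymmetry P = (N_up - N_down) / (N_up + N_down):

    import numpy as np

    def polarization(counts_up, counts_down):
        # Per-wavelength-bin polarization from spin-up / spin-down histograms.
        up = np.asarray(counts_up, dtype=float)
        down = np.asarray(counts_down, dtype=float)
        total = up + down
        # Empty bins carry no information; mark them NaN rather than divide by zero.
        with np.errstate(divide="ignore", invalid="ignore"):
            return np.where(total > 0, (up - down) / total, np.nan)

    # Toy numbers, five wavelength bins:
    n_up   = [120, 340, 560, 410,  90]
    n_down = [ 30,  60, 110,  95,  20]
    print(polarization(n_up, n_down))

Every layer below this step (event timestamps, voltage discrimination) and every layer above it (the polarization corrections, the transform to the Patterson function) has its own equivalent of that "which bins do I trust" judgment call, and that's exactly where the context lives.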

If a colleague sent me her voltage spikes, I'd assume she was an idiot and never talk to her again. Yet I've also seen experiments fail because of problems at each of these abstraction layers. The discriminator windows were set improperly, so the voltage spikes didn't correspond to real neutron events. The detector's position had changed, so the time-coded neutron events didn't correspond to the neutron wavelengths in the histogram. A magnetic field was pointed in the wrong direction, so the neutron histograms didn't give the real polarization. There was a flaw in the polarization analyzer, so the neutron polarization didn't give the true Patterson function. And all of this is assuming that my samples were prepared properly.

I've seen all of these problems occur and worked my way around them. However, I could only work around them because I had enough context to know what was going wrong. The deeper you head down the raw data chain, the more context you lose and the easier it becomes to make the wrong assumptions. I have one data set that provides pretty damn clear evidence that we violated the conservation of energy. Obviously we didn't, but looking at the data won't tell you that unless you have information on the capacitance of the electrical interconnects in our power supplies on that particular day.

Research should be verifiable and reproducible. However, an order-of-magnitude increase in verifiability isn't as useful as an incremental increase in reproducibility. I'd be happy to let every person on earth examine every layer of my data procedure to see if I've made any mistakes, but even I won't fully trust my results until someone repeats the experiment.




One concern of mine is also that being able to "Click Run and Get The Same Answer" seems to reassure people that all is well, when what really needs to happen is for the experiment to be repeated independently.




