Hacker News

Reproducibility for the computer scientist means including any code written and data collected or relied on in the scientific publication itself. In practice, getting there from here isn't literally zero work, since some actual human action is needed to bundle the code and data, but that effort ought to be negligible overall, especially if we make it a standard part of the scientific process.



Trust me, it's currently far from zero work to submit code with a research paper. I was recently the corresponding author on a software paper sent to a journal that at least verifies the code compiles, runs, and produces the expected output. Since the poor person testing the submitted software is permanently in the ninth circle of dependency hell, across all platforms and libraries imaginable, it took about fifteen emails back and forth plus an OS reinstall before everything checked out. And they said that wasn't anything extraordinary.


How about a platform centered on Linux containers (or perhaps one of several OS-container or VM-image formats) as the repository format?

I'm not saying the work is zero now, but maybe we can get there. If a researcher is developing on a platform where their repository is expressed as a container-like image, then they should be able to publish it for anyone to run exactly as-is. The container repo includes the data, the operating system, and any languages and libraries, with an init system that optionally builds the results.
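To make that concrete, here is a minimal sketch of what such a repository image might look like as a Dockerfile. All names here (the directories, the `make results` target, the pinned packages) are made up for illustration; the point is only that code, data, OS, and toolchain get frozen together in one publishable artifact.

```dockerfile
# Hypothetical repository image: everything needed to rebuild the
# paper's results is pinned inside one image.
FROM ubuntu:22.04

# Pin the toolchain and libraries the analysis depends on
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-numpy make \
    && rm -rf /var/lib/apt/lists/*

# Bundle the code and the data directly in the image
COPY analysis/ /work/analysis/
COPY data/     /work/data/
WORKDIR /work

# The "init system that optionally builds the results"
CMD ["make", "results"]
```

Anyone could then `docker run` the published image and get the same environment the authors had, without touching their own machine's libraries.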


Yes, I think we need to go in this direction. The problem is that the container system is yet another tool for researchers to learn. The first step is to get everyone using VCS and nightly testing. Many are still at the point of clumsily written, old Fortran code that gets emailed around and exists in N different variants. (Not that there is anything wrong with Fortran.) Many are at the point where if you email them a link to a git repo to clone, they're clueless about what to do.
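For the "first step" above, the barrier really is small: putting an existing code directory under Git is a handful of commands. A minimal sketch (the directory, file, and identity here are all made up for illustration):

```shell
# Put an existing analysis directory under version control,
# instead of emailing around N variants of the code.
mkdir -p my-analysis && cd my-analysis
echo 'print *, "hello"' > main.f90     # stand-in for the real code
git init -q
git config user.email "author@example.com"   # needed once on a fresh machine
git config user.name  "Author"
git add .
git commit -q -m "Import existing analysis code"
git log --oneline
```

From there, every change is one `git add`/`git commit` away from being recorded, and the whole history can be shared as a single repository.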


It would help if Git didn't have such an awful learning curve (and I say this as a git user who already went through it).

I know researchers who used Subversion when it was on the rise, but they just abandoned version control altogether when Git became the generally preferred option.


There's a difference, though, between "just published" and "reviewed, verified, and published". We'd gain a lot if everyone simply included the code they were using with their papers. It doesn't matter whether it runs or how generic it is. If it's important, it can be fixed by the next user. If it's not, then whether it was verified never mattered in the first place.

Of course verification on submission is also a great idea, but we can make it the next step.


Couldn't we all just assume that the reproduction procedure starts from a fresh Ubuntu VM?



