Publish your computer code: it is good enough (nature.com)
96 points by pama on Oct 13, 2010 | 56 comments



This is such a fundamentally important issue in the sciences. Without the source code, you cannot truly do peer review, and you may not be able to replicate someone else's work. With no peer review or replication, you are no longer doing science.

I've seen source code for software used to produce results in major publications like Nature that was of such poor quality it's surprising it even compiled. You don't generally get tenure for producing well-written software, so there needs to be another incentive for scientists to spend the time to write well-thought-out and well-documented code. I think sunlight helps tremendously.

Warren DeLano, the author of PyMol, the very popular molecular visualization software, was an early voice for the importance of open source software in the sciences. Unfortunately, he is no longer with us, but it is heartening to see others now taking up the cause.

The only way to publish software in a scientifically robust manner is to share source code, and that means publishing via the internet in an open-access/open-source fashion. - Warren DeLano (2005)


I'm not sure sharing code is such a great idea. If you use the other guy's code, you might be replicating the other guy's mistakes. And having the code available encourages that.

It would be better to describe your methods, so someone else can implement them in their preferred tools. If they still replicate your results, that strikes me as being much stronger than just rerunning the same, possibly buggy, code.

I mean, if you're working with monkeys, other scientists can't reuse your monkey. They have to follow the procedures you describe on their own animals.

Another issue is that the code might not be very useful to other labs as-is. The code might be for unique, custom-made hardware, or an unusual configuration of equipment, or in an unusual language.


Sharing code doesn't prevent someone else from reimplementing an idea, and indeed they should if they rely on results from another publication. If everyone open-sources the code they wrote during their research, then it ought to be easy to tell whether someone reimplemented an idea correctly (their results will differ if not). If they copied the original implementation, this should be evident too.


But people are people and when there's an easy way...


I agree, but having implementations in the wild should at least bring more attention to an idea, and more scrutiny of the code, than having none at all?


If the same results are obtained using different code (or subjects/animals/chemical batches) does it matter as long as the code is doing the same things?

I get the impression that software types are inclined to put way too much significance on the source code. I'm not surprised the author of the linked item in Nature is a software engineer.

I wonder, did scientists in the 1930s publish their scratchpads full of calculations?


There is a massive difference between a scratchpad of calculations, and a simulation involving thousands or millions of data points.


There might not be tenure in it, but there's certainly a lot of citations around for well written tools. The license to use them often requires citation in related work.


> With no peer review or replication, you are no longer doing science.

Science can be done in isolation. Robinson Crusoe could do science on his small island in the tropics. Without peer review or replication, you're just not doing the particular kind of science that's most beneficial to society: the kind that can be built upon.


I don't agree. The accumulation of a corpus of knowledge is fundamental to the scientific endeavor. Even if you are systematically acquiring knowledge through observation and experimentation, if your work cannot be built upon, it's not science.


No, the scientific method is fundamental to scientific endeavor. Everything else is just gravy.

A man working alone is doing science, he's just not benefiting society as well as he might. Science isn't "pushing the bounds of human knowledge." Science is forming hypotheses and testing them empirically. When kids drop two objects of different weights to see which falls faster, they're doing science just as much as Galileo was, despite the fact that their findings will never get published.

What a researcher does doesn't magically become "science" when she publishes. It's science all along. If she does it in a cave for years before she publishes, it's still science for all those years she was in her cave.

Children who perform experiments that have been done a million times before are still being scientists.

People who perform experiments but never get published are still scientists.

Science is about the process, it's not about the result.


Science existed before the formulation of the scientific method. The accumulation of knowledge has always been fundamental to science going back to Aristotle.


Yeah, but a lot of that 'knowledge' was noise, just-so stories, myths, coincidences, best guesses, fanciful explanations, etc.


There's a reason Aristotle is known primarily as a philosopher and only secondarily as a scientist.


I debate this constantly with my office-mate, who--unfortunately--represents the conservative status quo in science. His concerns are that he might be scooped, and that it's bad science to release things before they are finished products.

I am a proponent of open notebook science, and strongly believe that the benefits outweigh the negatives. Besides the feel-good arguments that this advances science and reproducibility, I'll point out a selfish motivation for releasing code: It makes it more likely that other researchers actually try your methodology and cite you.

I used to get mired down in trying to do formal releases. Now I realize that releases are a hindrance, and by default all my research code lives in a github repo from day one:

http://github.com/turian

For example, here is the page I published on my word representation research, with links to my github code: http://metaoptimize.com/projects/wordreprs/


Whether or not he's an open science fan, that's no excuse for him to refuse to publish code when the paper is out.

You should point out that refusing to publish code is like being intentionally hazy about your experimental protocol.


I am not sure where the idea that you release the source before publication comes from, but it does seem oddly prevalent given the solution is so easy.


I have seen the advantage you point out in action in my former life in computational quantum field theory.

When I started, code was jealously guarded as a "secret weapon" in the global competition for publishable results. Then the MILC group (http://physics.indiana.edu/~sg/milc.html) started releasing both their data and their code. The result was that they were widely cited and gained much respect in the scientific community, becoming a de facto standard and facilitating research at institutions without the resources necessary for such computational preliminaries.

I'm pleased to say that this approach has become more the norm in this field at least, encouraged by SciDAC ( http://www.scidac.gov/) for example, with raw data (e.g. http://qcd.nersc.gov/, http://www.gridpp.ac.uk/qcdgrid/) and code (e.g. http://usqcd.jlab.org/usqcd-docs/chroma/, http://fermiqcd.net) being routinely made available.

The result is more science all round, which can only be a good thing. The "scooping" thing we all worried about turns out not to be such an issue after all.


I once worked on a power grid simulation whose authors wanted to release the code but couldn't, because it used algorithms from «Numerical Recipes: The Art of Scientific Computing», which has a pretty asinine copyright policy with respect to openness: http://www.nr.com/com/info-permissions.html

Even worse are the researchers who keep their models and data sets secret due to paranoia that some colleague will publish first, which admittedly does happen.

Cultivating a spirit of genuine cooperation and sharing in academia would do wonders for the progress of the sciences, but there are so many hurdles that need to be removed. It's not just a matter of knowing that it's good to release source code or feeling confident in including it as the article suggests.


I've come across the same problem. I was working on some ancient code purportedly released under a BSD license. However, it contained code from Numerical Recipes!

Here's a good page encouraging everyone to boycott them: http://mingus.as.arizona.edu/~bjw/software/boycottnr.html


IANAL, but as far as I understand copyright law, it is impossible to copyright "an algorithm." What one can copyright is source code, but as long as you do not copy the source verbatim, you are not infringing that copyright. I would suggest you translate the algorithms from the book into mathematical expressions and then implement your own code from there. This should be legal to publish.


In this case it was a derivative work based upon source code from the book, which is permitted under the terms but precludes open dissemination. Further, the rest of the codebase was very tightly coupled to the algorithms, since the algorithms updated state in-place instead of returning values, so dropping in an open version of these algorithms was very non-trivial.


I'm currently an engineering masters student, and the code I've written is available on github, if not properly licensed (I'm seriously considering slapping the CRAPL onto it now that I'm aware of it). I don't really care too much about polish, since, I mean, it does what I want, I have the excuse of working around other peoples' software, and at worst someone finds out that I did something wrong, and we all benefit. Plus, having it on github makes it really easy to work on it from different locations, and version control helps me document just where all my friggin' time went.

The biggest concern I've heard from other researchers (i.e., professors) has to do with being beaten to publication. I think that as long as your paper itself isn't easy to find pre-submission, then you're okay, especially if nobody will really understand what you're doing anyway. So, I'm not too nervous about the prospect. It's just code for now.

Unfortunately, even if scientists release their own code, it's often just a small part of the big picture. In engineering, at least, MATLAB, Mathematica and commercial Finite Element packages are everywhere. My own project uses MATLAB and COMSOL in tandem, meaning that what I wrote only really serves to glue a bunch of completely closed algorithms together. Personally, I'd love to see a completely open, usable and documented FEA stack (which would ideally include an adaptive mesher, some FEM algorithms, a post-process visualizer, and both a gui and a not-shitty scripting API).


> I'm currently an engineering masters student, and the code I've written is available on github

Did you have to clear this with your university/advisor? AFAIK, my university has copyright on code I produce on paid time or using university facilities/equipment, which covers my research and even a good chunk of my homework.

> meaning that what I wrote only really serves to glue a bunch of completely closed algorithms together

This is consistent with what I've run across. It seems a lot of research revolves around modifications to an existing system, but in CS, it seems the established code base is more likely to be open source (e.g. Jikes RVM) or be proprietary but still have source code readily available (e.g. SimpleScalar).


> Did you have to clear this with your university/advisor?

I don't know if I had to, but I did discuss it with my advisor, mostly due to the "someone could steal my thunder" issue, and we were basically in agreement. I actually have a really cool advisor--I lucked out there.


Eh... good enough. I'd do it if I had my advisor's approval (I really doubt he'd give an OK that would get me in trouble).


> Did you have to clear this with your university/advisor? AFAIK, my university has copyright on code I produce on paid time or using university facilities/equipment, which covers my research and even a good chunk of my homework.

They're very unlikely to stop you if you just go and do it. It only really becomes an issue if you try to make money off of it.


While I strongly believe that scientists should publish their code along with their papers, I do think it has one potential disadvantage. Suppose a scientist writes a big piece of code to do some complicated calculations, but makes a subtle mistake somewhere in it (perhaps some parentheses are nested incorrectly). If the code is not published and another scientist comes along to extend the results from the first paper, the first thing he will do is try to replicate the original results. Without the code, the second scientist will have to write it from scratch, and in doing so will likely catch the original error. But if the code is published, the second scientist will just use the original code and probably won't catch the error. Consequently, the mistake will take much longer to catch, if it's ever caught at all.
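To make the kind of error being described concrete, here is a made-up illustration (not taken from any real paper): a mis-nested pair of parentheses that still runs without complaint and silently changes the result.

    # Hypothetical example of a subtle parenthesization bug.
    # Intended formula: (a + b) / (2 * n)
    def mean_of_pair_correct(a, b, n):
        return (a + b) / (2 * n)

    def mean_of_pair_buggy(a, b, n):
        return a + b / (2 * n)   # parentheses dropped; runs fine, silently wrong

    print(mean_of_pair_correct(3.0, 5.0, 1))  # 4.0
    print(mean_of_pair_buggy(3.0, 5.0, 1))    # 5.5

Both versions run on any data set without raising errors, which is exactly why this class of mistake survives.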

The advantages still outweigh the disadvantages, but it's important to remember that there are always trade-offs.


I think that, to the contrary, it would allow people to determine by analyzing the code whether the original results were due to flawed software. Subsequent efforts would remain free to create their own implementations from scratch if desired.


Further, open code means that other researchers can come along and write unit tests. It amazes me how much code in the sciences is still verified manually, if at all.
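To give a sense of how little effort this can take, here is a minimal sketch (the function and expected values are hypothetical, purely for illustration) of a unit test for a small piece of analysis code; it can be run with pytest or as a plain script.

    # Minimal sketch of a unit test for a small piece of analysis code.
    # The function and the expected values are hypothetical.
    import numpy as np

    def normalize(signal):
        """Scale a signal to zero mean and unit standard deviation."""
        signal = np.asarray(signal, dtype=float)
        return (signal - signal.mean()) / signal.std()

    def test_normalize():
        out = normalize([1.0, 2.0, 3.0, 4.0])
        assert abs(out.mean()) < 1e-12         # mean should be ~0
        assert abs(out.std() - 1.0) < 1e-12    # std should be ~1

    if __name__ == "__main__":
        test_normalize()
        print("ok")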


A lot of code in the sciences is used once or only a few times. You crunch the data, write the paper and move on to the next research project. You don't get tenure for writing unit tests.


And you don't make friends with salad, i.e. you certainly won't make tenure when your paper is demonstrated as flawed due to errors in crunching the data that would have turned up with testing.


Ah,

and that brings us back to why many scientists don't publish their code at all now...

As long as not publishing code is the standard, there's an incentive to keep it that way, since publishing might open you to career-withering criticism...


Consider this though:

I have tried to explain (with limited success) to my colleagues in bioinformatics that not unit-testing your code is like using an instrument that hasn't been calibrated, and blindly trusting the results to be right.


They may not unit test it, in the software development sense, but they probably hand-check a new bit of calculation code on a few data points before setting it loose on a whole data set.


Yes, or sometimes even less rigour than this: run some data and see if the results "look right".

This is the standard procedure for when you've developed a complex algorithm to do something novel, and actually working through it by hand would take hours.

I should know; this is how I used to do it before I knew better, and as a result one of my published papers has (very slight) numerical errors in it.
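One middle ground, sketched below under assumptions of my own (the functions are illustrative, not from the paper in question), is to check the fast, complicated implementation against a slow but obviously correct one on inputs small enough to reason about.

    # Sketch: verify an optimized routine against a naive, obviously
    # correct reference on small inputs. Names are illustrative only.
    import numpy as np

    def pairwise_distances_fast(points):
        """Vectorized Euclidean distance matrix."""
        diff = points[:, None, :] - points[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1))

    def pairwise_distances_naive(points):
        """Loop-based reference; slow, but easy to convince yourself of."""
        n = len(points)
        out = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                out[i, j] = np.sqrt(((points[i] - points[j]) ** 2).sum())
        return out

    rng = np.random.default_rng(0)
    pts = rng.normal(size=(10, 3))
    assert np.allclose(pairwise_distances_fast(pts), pairwise_distances_naive(pts))
    print("fast implementation agrees with the naive reference")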


True, and that's the definite advantage to publishing code. My point is just that if there's a lot of code and the error is obscure, it will take a long time to identify the error, and the mistake will be propagated in further research. Most scientists using the code will just skim it, see if it gets the results published in the original paper, and then move on. This has been my experience with my own research anyway.


I disagree, because your mental model of a large code base that only has one error is absurd. Scientific code is not magically easier to write than real code. Get two random code hacks to write two significantly-sized codebases, say, about the size of a decent web framework, and you won't get one perfect code base and one with a subtle bug that has ramifications down the line. You'll have two code bases so shot through with bugs that they will never reconcile, ever.

The quality issues aren't all that different than a web framework, either. Release one or a small number of code bases, and have everybody pound on and improve them, and you might get somewhere. Have everybody write their own code bases from scratch every time and you'll get yourself the scientific equivalent of http://osvdb.org/search?search[vuln_title]=php&search[te... .

To be honest at this point when I see a news article that says anything about a "computer model" I almost immediately tune out. The exception is that I read for some sign that the model has been verified against the real world; for instance, protein folding models don't bother me for that reason. But this is the exception, not the rule. When it became acceptable "science" to build immense computer models with no particular need to check them against reality before running off and announcing immense world-shattering results I'm not exactly sure, but it was a great loss to humanity.


It seems more likely that the scientist re-writing the code would blame their own implementation or some part of the input data.

A lot of time might be wasted going down that path too.


Except in the cases where someone tries to port the code to another language, or decides to use a different formula/algorithm, and notices the difference in results.

In the real world, most scientists will keep modifying their program until it gets an expected result (either because the program is correct, or because multiple mistakes cancel each other out, or because the standard theory is wrong but they feel obliged to keep debugging until they can prove the program correct). Sure, there are times when a computational model proves something interesting (unexpected), but then the researchers may have to open source it just to prove they didn't screw it up.


It's just as likely that the second researcher would make a subtle mistake in the process of replicating the results of a paper that didn't include code. And unlike in the scenario in which the first scientist publishes a mistake, it would be impossible for a third party to find or correct it.


If a scientist says their results have a certain p-value, they don't need to release the source code they used to obtain that p-value for another scientist to be able to tell, given the data, whether the p-value is incorrect.

A lot of code is going to be like that: applying known functions that can be identified by name. If a paper says it used an FFT on some data, there are plenty of well-tested FFT implementations that can be used by another researcher to try to replicate the original result. The original researcher's FFT code, if any, isn't really necessary as long as the function, the inputs, and the outputs are well-documented.
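For instance, a replicator can lean on independent, well-tested implementations and simply check that they agree. A minimal sketch (with synthetic data, and assuming NumPy and SciPy as stand-ins for whatever the original authors used):

    # Sketch: reproducing an FFT-based step with two independent,
    # well-tested library implementations instead of the original code.
    # The input signal here is synthetic, purely for illustration.
    import numpy as np
    import scipy.fft

    t = np.linspace(0.0, 1.0, 256, endpoint=False)
    signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)

    spectrum_numpy = np.fft.fft(signal)
    spectrum_scipy = scipy.fft.fft(signal)

    # Independent implementations should agree to numerical precision.
    assert np.allclose(spectrum_numpy, spectrum_scipy)
    print("numpy and scipy FFTs agree")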


I must admit that I have mixed feelings about this. I usually post on my Python blog some code that might not be perfect or good enough, but it does the job for me. I post to show other people with no CS background, like me, that they can achieve things with the language. The programming crowd that visits my blog helps a lot; the science (biology, bioinformatics) crowd calls me dumb (not directly). I don't know if I would publish my code. I know my limitations and what I lack in knowledge, but in a vain environment like academia/science, I prefer to hide my shortcomings.


Why don't you put your blog url in your profile?


just did, if you want to check

python.genedrift.org


As a scientific programmer/data scientist at a university, I wish I could upvote this twice. I'll be forwarding this one around my lab.

Another issue, besides code quality, transparency, reproducibility, etc., is simply reuse. There's a lot of wasted effort in academia, with people constantly reimplementing from scratch simple things that do basically the same thing as their peers' code.

Okay, this is true in other fields too, but in our field it's public money getting wasted.


In addition to the common excuses that Nick Barnes mentions in his article in Nature, one reason why scientists did not freely provide their code was the fear of potential misuse of their software (since erratic publications using their software could harm their reputation). The common solution to this problem was to impose a barrier to entry by charging a fee. Many of the early examples of complicated scientific software used this policy:

http://yuri.harvard.edu/

http://ambermd.org/#obtain

http://cms.mpi.univie.ac.at/vasp/vasp/How_obtain_VASP_packag...

Due to advances in computer literacy and the creation of competing projects, the landscape has been changing in recent years:

http://en.wikipedia.org/wiki/List_of_software_for_molecular_...

http://en.wikipedia.org/wiki/Quantum_chemistry_computer_prog...


Those packages are rather different from a given scientist's code for a given paper. These are big, mature projects, probably multi-lab or department-wide, long-term efforts to develop a shared foundation for work in a given field.

The code actually corresponding to a given paper is likely to be a small amount that builds on packages such as those above, or products like Matlab, whatever. There could be code to implement an experiment, and Matlab code to analyze the data.


Agreed. Sharing these big, mature, foundational frameworks is essential for the reproducibility of published research. The additional convenience scripts that accompany a paper are often described or provided in the methods and supplement. As a reviewer, I block papers that lack unambiguous information allowing their reproduction, but I don't penalize the lack of explicit convenience scripts, since these can be reproduced or provided by the authors upon request.


I have heard this 'erratic publications could harm their reputation' idea, but I don't buy it. Erratic publications can only harm their own authors.


Thank you for the inspiring article and for your comment.

I agree that nowadays the fear of harming one's reputation through abuse of their code sounds unreasonable. Leaders of major scientific codes told me, however, that this was their overriding concern in the 80s and 90s. As an alternative to imposing barriers to entry through fees, some groups demanded collaboration on the first project that used their code (a method that doesn't scale well, but doesn't require writing thorough documentation, another thing that many scientists dislike). Hopefully, the times they are a-changing.


I go even further and am of the opinion that data provenance is the overarching issue. Any series of results should be able to be regenerated quickly (as measured in scientist-time, not computer-time) based solely on meta-data provided, and this is the key point, as part of the results themselves. A few simple guiding principles could go a long way toward achieving this goal.

   1. In the absence of a well-defined standard, it’s the individual scientist’s/consortium’s responsibility to define and actively use an organized meta-data standard (a rough sketch of such a record follows this list).
   2. If it’s not open source it’s not science.
   3. A snapshot of the source code used to generate results should be given/pointed to when the results are presented.
   4. Minimizing reproduction time is an integral part of science.
   5. Principles 1-4 should be demanded by funding agencies, program heads, and research advisors.
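As a hedged illustration of what such a per-result record could contain (all field names below are hypothetical; this is not an existing standard), in Python it might look something like this:

    # Rough sketch of per-result provenance meta-data (hypothetical fields,
    # not an existing standard): enough to point back at the exact code,
    # inputs, and parameters that produced a result.
    import hashlib
    import json
    import subprocess
    from datetime import datetime, timezone

    def sha256_of(path):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    def provenance_record(input_files, parameters):
        return {
            "generated_at": datetime.now(timezone.utc).isoformat(),
            "code_commit": subprocess.check_output(
                ["git", "rev-parse", "HEAD"], text=True
            ).strip(),
            "inputs": {path: sha256_of(path) for path in input_files},
            "parameters": parameters,
        }

    # Example usage (file names are made up):
    # record = provenance_record(["raw_data.csv"], {"threshold": 0.05})
    # with open("results.provenance.json", "w") as f:
    #     json.dump(record, f, indent=2)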



The pronounced acronym is appropriate for academia. It contains both words that come to mind.


Scientists sometimes do publish their software. An example: in Operations Research, the COIN-OR initiative (http://www.coin-or.org) is a growing set of (Open Source) tools for solving OR problems.

Also, in certain disciplines, not only papers, but also the software used to obtain the published result is peer reviewed. For instance, the journal Mathematical Programming Computation (http://www2.isye.gatech.edu/~wcook/mpc/index.html) accepts papers accompanied by the software used by the authors, which is tested and reviewed by technical editors.




