Here's a practical solution I have proposed in my community (autonomous robots):
1. package code and data into a tarball or VCS repo;
2. place the package on a long-lived website;
3. compute a SHA1 hash or similar from the package (if git is used, this is the revision ID, conveniently);
4. publish the URI and hash in any paper that makes claims based on that code or data;
5. as a reviewer, prefer papers that follow this method, all else being equal;
6. as an editor, suggest that submissions use this method.
(Edit 2: In case it's not obvious, the purpose of the hash is to allow users to be pretty confident that the code they downloaded is indeed exactly the code used in the paper. By putting the hash in the paper, I make this promise. If I want to make an improved version available, I just put it up at the same site, but I must make the exact original available and identifiable as such. This simple method of ensuring identifiability is our contribution.)
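For concreteness, here is a minimal sketch of the hash step in Python (the filename and the published hash value are placeholders; sha1sum or the git revision ID work just as well):
    # Minimal sketch: compute the SHA1 of a released package and compare it to
    # the hash published in the paper. Filename and hash are placeholders.
    import hashlib

    def sha1_of_file(path, chunk_size=1 << 20):
        """Return the hex SHA1 digest of a file, read in chunks."""
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    published_hash = "0123456789abcdef0123456789abcdef01234567"  # from the paper
    actual_hash = sha1_of_file("experiment-1.0.tar.gz")
    print("match" if actual_hash == published_hash else "MISMATCH")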
My group does this with every paper. I have a paper describing this method, coauthored with a student, under review now at a good journal, and I'm looking forward to seeing the response. (Edit 1: see link in comment below.) I'd also appreciate feedback from HN.
"A few months ago, a graduate student in another country called me (Vaughan) to ask for the source code of one
of my multi-robot simulation experiments. The student had an idea for a modification that she thought would improve the system’s performance ... we were able to offer the requesting student some code that may or may not be that used in the paper. This was better than nothing, but not good enough, and we suspect this is quite typical in our community."
Fwiw I've had a similar experience.
A few years ago I was reading Daphne Koller's "Using learning for approximation in stochastic processes". It was a very elegant and powerful idea, but it didn't seem to me at the time that there was enough detail in the paper to implement the algorithm, so I wrote to Dr Koller and asked if she had source code available. She replied that her student had departed a decade earlier and the code was effectively lost.
Dr Koller was very helpful and clarified some doubts I had, and in the end I managed to re-implement the algorithm, so a happy ending after all. But if the code had been archived and made public for every paper, it would have really helped.
Awesome that you guys are adopting this methodology.
i had this problem many years ago (15?). at the time i was working as a postdoc, calculating the evolution of the ionizing background with redshift from the inverse effect (lyman alpha clouds near quasars get fried by the quasar; the extent of this gives an indirect way to measure the ionizing background at that redshift).
i had a bunch of perl scripts (ah, those were the days) that mangled various files before feeding them into fortran least-squares stats code that took a day or so to run.
by the end, it was pretty much chaos. i was a self-taught programmer, these were probably the second or third "significant" programs i had ever written. nothing was documented, everything took so long that i couldn't check much... i had bugs, of course.
in the end i published. maybe 6 months later i got an email from someone in the states. they were trying to reproduce my results. in the end, they did (as far as i know).
so the system worked.
incidentally (perhaps the only useful point here) they must have used different data. that's something worth explaining in more detail - "my" data came from years of painstaking work by a bunch of people working for my thesis supervisor. yet 6 months later the results could be duplicated from a week or so of data from a much more powerful telescope (the keck). so data in research often aren't as critical as you might think. things progress at such a rate that even if you don't share data, it's trivial to reproduce just a short time later... (and i am pretty sure that this is true in gene sequencing, for example)
But this is a problem that can be tackled, and those who take it seriously already do so. For example, in our pipelines we use an infrastructure that always adds every command executed on the file, with every exact parameter, to the metadata of the file, starting from one canonical archived file - and hence one can indeed reproduce the result of the pipeline manually, given sufficient time and dedication. [Edit: we also write the git SHA-1]
The same way that, say, biology labs have processes they engage in to convince us that their samples are not contaminated, we can have processes that lead to the reproducibility of data.
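A minimal sketch of that kind of provenance recording, assuming a wrapper script and a JSON sidecar file per output (the layout and field names are my own illustration, not the actual infrastructure described above):
    # Sketch: run each pipeline step through a wrapper that appends the exact
    # command, its parameters, and the current git SHA-1 to a sidecar file next
    # to the output. Paths and field names are illustrative assumptions.
    import datetime, json, os, subprocess

    def run_step(cmd, output_file):
        """Run one pipeline command and log it to <output_file>.provenance.json."""
        subprocess.run(cmd, check=True)
        git_sha = subprocess.run(["git", "rev-parse", "HEAD"],
                                 capture_output=True, text=True).stdout.strip()
        record = {"command": cmd,
                  "git_sha1": git_sha,
                  "timestamp": datetime.datetime.utcnow().isoformat() + "Z"}
        sidecar = output_file + ".provenance.json"
        history = json.load(open(sidecar)) if os.path.exists(sidecar) else []
        history.append(record)
        with open(sidecar, "w") as f:
            json.dump(history, f, indent=2)

    # Hypothetical usage:
    # run_step(["python", "filter.py", "--min-quality", "30", "raw.dat"], "filtered.dat")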
i'm not sure what problem you're talking about - someone reproduced the results with separate code and data, so what's to worry about? (i don't mean that because it was confirmed it was ok, but rather that if it had been wrong, we would have known in the end... after all, people make mistakes all the time - science is a collective enterprise that relies on many overlapping, interlocking pieces)
[Yeah, didn't I share an office with you 15 years ago? :-) ]
But data can't always be reproduced - Shoemaker-Levy won't hit Jupiter again any time soon, and while big results will cause a rush for verification, the little steps are usually believed as is. Moreover astronomy is a special case where - let's face it - if we go down the wrong path for a decade or so, nobody's bleeding.
Take on the other hand the processing of climate data - if crap engineering (not malice) causes garbage to come out of the data, this is a problem for everybody. So I think the OP is right in that proper auditing of computer-processed science data should be possible, I was just pointing out that it's not as hard (technically) as people think to achieve.
Speaking of reproducibility, my friend who is getting a PhD in finance told me he was writing a paper using the data from some brokerage. I asked if he would publish the data and he told me it's confidential.
I talked myself blue in the face trying to explain how science doesn't work if you don't give people enough information to reproduce your research! I couldn't get him to understand though. Arggh so frustrating.
It depends what type of science you're doing, and what stage the science is in. When we are at the data-gathering, hypothesis-building stage, observations and case studies are important and are often based on confidential data.
Assuming the data was some sort of financial time series, that means lots of people already have the data (or something close enough to it). So even though he can't publish the data, he can say what the data is and when and how it was collected, and that should be enough for most people in the field to reproduce the results.
I am often told that I should keep my nose out of other science domains' business because they know more than I do. However, I think when they start building their science on top of computers, I start getting a say again. Here's what concerns me about this increasing use of computers:
It seems like the vast bulk of these simulations are iterative, and therefore subject to mathematical chaos. How many of these researchers have any clue what door they are walking through? How many of them know what a strange attractor is? I'm sure the answer is non-zero; I'm equally sure the answer is nowhere near 100%.
Small errors cascade even if you consider a non-chaotic classical model. (That is, not that there is such a thing as an iterative model that is not potentially subject to chaos, but rather that even if you don't understand chaos you can see that small errors can cascade. Chaos just makes it worse, and weirder.) A simulation will have bugs like any other large problem. A non-programmer approaches bugs by banging on the program until it seems to generate expected results. (About 50-80% of programmers do that too.) Therefore, many of these simulations are simply reflections of the simulator's expected result, due to the effect of the researcher's selection mechanism running on the results of the simulations they run. How do we verify that this is not the primary factor in the result of the simulation? This need not be conscious. It need not be ideological, either; I can easily envision a simulation that "should" return a boring or trivial result being monkeyed with until it produces something "interesting", because the simulators think the boring result should not obtain.
A lot of algorithms you can use in these simulations are fundamentally unstable when used iteratively; some exacerbated by floating point errors, some mathematically unstable even with perfect real numbers. How many of these simulations use something unstable without even realizing it, given that it could take a professional mathematician to work out whether that's the case? Even algorithms thought to be stable and reliable can fall apart under pathological situations, and one of the odd things about mathematics is just how often you end up hitting those pathological situations when programming; far more often than it seems like should be the case.
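To make the "small errors cascade" point concrete, here is a toy sketch using the logistic map (a stand-in, not any particular research code): two runs that differ by one part in 10^12 at the start disagree completely within about a hundred iterations.
    # Toy demonstration of sensitive dependence on initial conditions:
    # iterate the logistic map x -> r*x*(1-x) from two nearly identical states.
    r = 3.9                      # parameter value in the chaotic regime
    x, y = 0.5, 0.5 + 1e-12      # initial difference far below measurement error

    for i in range(1, 101):
        x = r * x * (1 - x)
        y = r * y * (1 - y)
        if i % 25 == 0:
            print(f"step {i:3d}: x={x:.6f}  y={y:.6f}  |x-y|={abs(x - y):.2e}")
    # By step 100 the two trajectories bear no resemblance to each other.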
In information theory terms, a simulation cannot contain more information than the sum total of the input data and the content of the simulation algorithm. How many simulators understand the full implications of that statement? I sure don't understand the full implications of that, but what I do understand makes me pause a bit. Very simple simulations with rules that can be verified and initial data that is very solid I can deal with; for instance, I like the cell-automata based social theories that show the spread of information or political views or something, especially when it is clear the researchers understand that it's only an approximation. But as the initial data starts getting sketchy or the simulation grows enormous, I start getting nervous about the actual information content of the output. Just because the output appears to be information doesn't prove that it is. It is vitally necessary to be able to check the simulation against real data. For instance, physical simulations of, say, cars crashing can be verified. How many simulations can actually be verified, though? Frequently the reason computers were reached for in the first place is the inability to do the real experiment. Any simulation that can't be verified should be presumed worthless by default. How often does that happen? (It's 20-f'ing-10 and "the computer said it, it must be right" still runs rampant through our culture....)
And of course there's the whole reproducibility issue, where the absolute bare minimum for science would be to publish the full simulation program, all data, the necessary invocation and compile instructions to bring the two together, and all necessary information to understand the input and the output. Clearly, this is not something that fits in a journal paper, but how often does this happen at all?
No, I am not referring to any specific discipline here and in particular I'm not actually referring to climate science. I'm nervous about the whole movement towards simulations in general.
Note that I'm not reflexively against the idea. Meet these bars and I'm happy; give me enough data for reproducibility and verify that your simulation is in fact simulating something real and corresponds to reality and I am happy. (Many physical simulations fit in here.) But as more disciplines jump in I am concerned that these bars are not well understood, and I'm seeing ever more press releases about simulations that can't possibly meet these bars.
> I am often told that I should keep my nose out of other science domains' business because they know more than I do.
I've only heard such statements coming out of a few fields: math education, labor economics, climate science and psychometrics of race/gender. You should ignore such statements; they are nothing more than an attempt to bully you into accepting received wisdom from activists with a PhD.
As an actual scientist (rather than a political activist with a PhD), I strongly encourage you to stick your nose into any or all of my fields (quantum mechanics, PDEs, medical imaging, complex analysis, prediction markets). If you come up with dumb ideas, I'll even explain why they are dumb, rather than just demanding that you leave things to the experts.
Such statements also come out of biologists discussing evolution. This is not, however, evidence that they aren't really doing science. Instead it is evidence that they've been burned out explaining basics over and over again to Creationists and want to get on with their lives.
However, some do put in the energy for those explanations. One of the results of that energy is http://www.talkorigins.org/.
Hopefully some day someone will take the energy to do the same with climate science. Because as much as there is a lot of politics there, there is some real science there as well.
> Such statements also come out of biologists discussing evolution.
A lot of my friends are scientists, including a couple of biochemists and some other people who do more or less serious research into topics like that. In my experience, you're off on how they deal with stupid people and stupid arguments - instead of "get out of biology" and moving on with their lives, they tend to address and correct errors, debate if necessary, or at least point the person towards a relevant piece that explains things, and they take new criticisms and arguments and address them if necessary.
The scientists I know have always been open to me saying stupid things (and occasionally not stupid things) and correcting me if I'm wrong, or exploring together if I might not be wrong. Good biologists don't say, "Get out of biology".
> Hopefully some day someone will take the energy to do the same with climate science.
Oh, hah, I honestly responded quickly and missed the climate science analogy originally. If you wanted to make an apologia for climate science, then I do understand why you'd want to draw a biology:evolution:creationism to climate science:global warming:deniers parallel.
I was responding to the "want to get on with their lives" part as being wrong based on my experience, as scientists are usually rather encouraging and tolerant of dissent. Climate science is not so much interested in people and data which disagree with them, which is a pretty big problem.
I'm married to a biology PhD, and at one point spent a couple of years watching people, including biologists, deal with a constant stream of Creationists in places like talk.origins.
My experience is that if you're a personal friend, you get more serious conversation. If you're someone they know but not so closely, they'll have the argument if pushed but don't feel the need to actively educate. And if you're a random person spouting on the Internet, it isn't worth their time to get involved. If pushed to be involved, they don't feel the need to be pleasant about it.
After spending time myself explaining the same thing over and over again, I've come to feel the same way. I've also come to realize that there are plenty of smart people who do not wish to be educated. Including in my direct personal experience, at least one PhD in mathematics and another in molecular biology.
Based on this experience I have some sympathy for the position of climate scientists. You spend your life climbing around glaciers in Greenland, and you don't really feel like spending the rest of it convincing people who don't want to bother learning the basics about climate.
> Such statements also come out of biologists discussing evolution
This is only the case when the opposing person is pretending to be a scientist but is really a creationist. Scientists tend to make arguments based upon facts and experiments. They start to get really mad when people don't follow the rules and start bringing non-scientific "evidence" into the argument.
There are plenty of things up for debate about how evolution works, but these are the details of the theory. Due to overwhelming evidence, the debate on whether is over; however, the debate on how is very much alive.
Education is a lot more complex than most people understand it to be. Even within the field, few people actually have any understanding of how the brain and learning interact. Everything from diet, to time of instruction, to setting, to materials, to methods, to testing is important. A classic example: how do you build a useful test? Well, if you want to know how much learning took place, you need to test people both before and after instruction. But you also need to test people twice without instruction to see if the tests give information away. You need a mix of problem difficulty to find out whether someone understands the basics and, at the same time, whether they understood the minor details, etc. You then need to look at the test not just in terms of overall score but also in terms of how well they did on each type of problem.
PS: Education might seem like a flaky field, and much of the discussion is devoid of good science. At the same time, there is a lot of great research that has been done which has real and important implications.
On the contrary, I think math ed types want to keep the actual mathematicians out precisely because the mathematicians recognize how complex the field is. Every time I've been told "stay out, you don't know our field", it was because I pointed out complexity or plausible alternative explanations.
For example, a puzzle: why do SAT scores underpredict female performance in Calc 1, and overpredict grades in higher level courses? I suggested girls are more conscientious than boys, and that a study being proposed did not control for such factors. I even suggested a test of this hypothesis: compare grades in classes which are conscientiousness-weighted (25% homework, 10% attendance, test questions are practice problems with numbers changed) to ability-weighted coursework (50% midterm, 50% final, test questions all new).
I'm not sure my idea was a good one. But rather than explain why either a) this had been tried, and didn't work, or b) why my test wouldn't work, I was simply told to leave the field to experts.
I've come up with plenty of dumb ideas in other areas, e.g., global existence of wave equations. The closest I've ever gotten to "stay out, you aren't an expert" was "Terry tried that and failed, ask him why before spending lots of time on it."
[edit: really confused why you are being downmodded.]
Honestly, SAT scores vs. performance is the type of (mostly bad) research I am talking about. The SAT is not designed to test math ability. Its focus is on how likely a student is to finish their freshman year of college, and it does that fairly well. You can do a lot of useless research in this area and it tells you next to nothing. If you want to predict a high school student's ability in advanced college math classes, you could design a test that did that. But people have easy access to SAT data, so that's what they look at.
If you want to see real and useful research, look into how long the optimal study period is. There are significant and useful studies which suggest 2 hours of nonstop instruction is less useful than two one-hour periods with a moderate break in between. Yet how many college lectures follow this approach?
"There are significant and useful study’s which suggest 2 hours of nonstop instruction is less useful than two one hour periods with a moderate break in-between."
The professor of the last class I took was very conscientious about taking a break at the 1 hour mark of an 80 minute lecture.
What makes SAT vs performance bad research? The SAT may be worse than some specialized test, but so what? Breast self exams suck in comparison to mammograms, but that doesn't make studies into breast self exams bad research.
The methodology is what makes research good or bad.
The fewer unknowns the better the data and the more accurate the experiment.
In terms of predicting how likely someone is to finish their freshman year of college, having a test where you can increase your score significantly with moderate levels of preparation is not a bad thing. However, the fact that you can easily game the test means it is a less accurate indicator of a student's innate capability. The fact that you can retake the test creates yet another sort of bias. etc.
More generally, it's easy to focus on defects in the test, which are irrelevant in a larger context and subject to change.
> PS: Education might seem like a flaky field, and much of the discussion is devoid of good science. At the same time, there is a lot of great research that has been done which has real and important implications.
My wife and I were discussing this recently. She is in an unusual program where she should graduate with both a master's in history and an education certificate, with hopes to teach history at the high school level after that.
This is both secondhand and anecdotal, but from what she tells me the education courses she takes are less rigorous and generally easier than the history courses. But the reason for this is that the education courses she is taking are vocational. They are trying to teach her how to get teaching done in a modern classroom, not the science behind it.
It is akin to the difference between training to be a mechanic and training to be a mechanical engineer. Both are very hard and important jobs worthy of respect, but they have very different focuses. The education courses are focused on learning the technique of teaching effectively; it looks like most of the research on the theory of effective education is done in departments like psychology, sociology, etc.
As another scientist I have to strongly second this. Some things might take real work on your part to understand though. Even within different fields of mathematics there are difficulties in understanding (complaint about this: http://www.math.rutgers.edu/~zeilberg/Opinion104.html).
I agree with the sentiment you express and thank you for pointing it out.
I do think that thinking outsiders should stay out is uncommon in the field of math though. I am currently a math grad student, and while I am only a grad student what I have seen of the mathematical community makes it seem like it is very open to cross fertilization from other sciences and from amateurs.
Admittedly he is historical, but Fermat (a professional judge when he did most of his mathematical work) is revered, and there are some more recent amateurs that have made genuine contributions. Much of mathematics is inspired by other fields such as physics.
I admit I know some professional mathematicians were tired of finding flaws in amateurs' proofs of Fermat's Last Theorem and rebutting supposed refutations of Cantor's diagonal proof that the reals are uncountable, but it is very different to be tired of dealing with a constant stream like that than to be hostile towards outsiders in general.
From my limited perspective, I think math is one of the fields most open to outsiders and amateurs coming in and making real contributions, both culturally and in the fact that we rarely need expensive experimental setups, so the barrier to entry (in terms of money and equipment) is low.
Your concerns are certainly legitimate, and I'm sure there are some situations where they apply. However, in many physical sciences (I would venture to say the vast majority), simulations are not chaotic. They are based on stochastic models that have converging behavior. So, in general, it really is fair for most people to rely on the stability of the simulations.
As for your other points, the issue of simulation accuracy is already taken very seriously in the areas where I've seen it used.
Let me give an example from my field (high energy physics). The pre-eminent event generator of choice is called PYTHIA. It incorporates as much known physics as possible, and is continually updated. There are groups that regularly convene to compare the distributions it produces to those seen in real experiments and adjust the parameters to improve the results. The code is open source, so you can readily make improvements and submit them to the maintainers for inclusion in official releases.
Now the primary uses of these simulations are to either tune your analysis to separate signal from background or calculate corrections for different kinds of detector acceptance issues. No reviewer would ever accept a publication that used simulation if the paper did not include clear evidence that the simulation was valid, usually in the form of a data/simulation comparison.
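For readers outside the field, a heavily simplified sketch of what a data/simulation comparison can look like (the distributions and the binned chi-square here are generic placeholders, not PYTHIA's actual validation machinery):
    # Sketch: compare a simulated distribution against observed data with a
    # simple binned chi-square. Real validation is far more involved; this only
    # illustrates the kind of evidence reviewers expect to see.
    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.exponential(scale=1.0, size=10_000)         # stand-in for measurements
    simulation = rng.exponential(scale=1.05, size=10_000)  # stand-in for generated events

    bins = np.linspace(0.0, 5.0, 26)
    obs, _ = np.histogram(data, bins=bins)
    exp, _ = np.histogram(simulation, bins=bins)
    exp = exp * obs.sum() / exp.sum()          # normalise simulation to the data

    mask = exp > 0
    chi2 = ((obs[mask] - exp[mask]) ** 2 / exp[mask]).sum()
    print(f"chi2/ndf = {chi2 / mask.sum():.2f} (values near 1 indicate agreement)")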
Regarding reproducibility, I think you're a little off the mark. The idea is not that you should simply repeat my analysis to get the same answer (this important verification step should be, and usually is, present in all scientific groups). Ideally, you should collect your own data, make your own simulations, and do your own analysis. Then we should see if we got the same answer.
"As for your other points, the issue of simulation accuracy is already taken very seriously in the areas where I've seen it used.... my field (high energy physics)"
Bad example, inasmuch as it is too good. Particle physics has petabytes of data (exabytes yet?) to test against and is very connected to the real world. With that check you can't stray very far.
"Ideally, you should collect your own data, make your own simulations, and do your own analysis. Then we should see if we got the same answer."
I would say a precondition of that step is first that I can reproduce your work. If I can't even reproduce your computation it's not even worth trying to reproduce the eventual results, as your eventual results are too questionable to begin with, as far as I am concerned. If I can't reproduce the computation you might as well have just pulled them from your bum.
Weather simulations have even more data; they are less accurate for other reasons.
Also, running someone else's exact simulation again is useless. You need to start from scratch (or from some very well accepted baseline libraries) for it to be useful.
"Weather simulations have even more data; they are less accurate for other reasons."
I doubt anyone can beat particle physics for sheer information quantity. I seriously doubt that we have petabytes of real weather information to feed our simulations. I can find some references online to petabyte stores for weather simulation results online, but that's ultimately just cache, not information (in the information theoretic sense).
This is part of what I mean when I talk about the information theory, and how you can't get more information in the information-theoretic sense than the sum total of the simulation and the original data. Weather simulations may chew through terabytes or petabytes of RAM in the simulation phase, but they are not fed that much data. If the people involved mistake it for real information, then this is also part of what I mean when I say that once you get into using computers in a big way my training does indeed start giving me standing to complain again by even the most rigid "stay out of my science" standards.
Secondly, your assumption that running the simulation again is useless only holds in a world where you can casually assume that you have all the data and the simulation, and can have the contempt bred of familiarity for the whole process. In the real world, if you can't replicate the results, how can you criticize the model? I think there's still some "the computer said it, it must be right" underlying your answer; you can't assume the computer model is worth anything, it must be proved and debated and peer reviewed, which is not possible if you can't even get it to run and get the same results. The model is still subject to scientific inquiry, it can't be given a free pass.
Independent replication of results is also desirable, but you need both. In a world where nobody can replicate the results and it's hard to verify the simulation against the real world, it's too damned easy to end up with the Feynman electron mass situation where the selection effects from the researchers dominate the putative results of the simulations. The researcher summaries of the results of some runs of some models you can't see or execute and some data you can't get at happen to align... what does it mean? Frankly, who knows?
Several satellites take high-resolution, real-time pictures of global weather; they run 24x7 for years, and that's RAW data. There are several of these, plus radar stations, etc. NOAA uses a subset of that information to make weather forecasts (they normally toss out old data and data from the other side of the planet because it's not useful, even if it might make a slightly better forecast). We can make highly accurate block-by-block forecasts an hour ahead over major cities and just about any point in the US. However, there is little point in reading that level of detail from a forecast for such a short period of time. Sometimes when the forecast is a 50% chance of rain over the next 8-12 hours, they know where and when it's going to rain; they just don't know where you are.
edit: One of the world's largest scientific data systems, NASA's Earth Observing System Data and Information System (EOSDIS), has stored over three petabytes of earth science data in a geographically distributed mass storage system. That's just NASA, and a lot of their data does not make it into EOSDIS.
The goal of science is to understand the world. Running the same simulation on the same data and getting the same result only tells you that the machine running the simulation is not broken. What you want is to run a different program with different assumptions on different data and come to the same conclusion. This is actually used to make 7 day forecasts. They run a few different models with different assumptions and pick the average result. Over time each model is updated independently to maintain its independence.
PS: You don't validate E=mc^2 by doing the exact same experiment 10,000 times. You do every type of experiment that you can think of which relates to E=mc^2, looking for anything which does not work out the way you think it should.
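As a toy illustration of the ensemble point above (the "models" here are placeholder probabilities, not real forecast systems):
    # Sketch: combine forecasts from independently developed models and report
    # the mean and spread. Model outputs are invented numbers for illustration.
    forecasts = {"model_a": 0.62, "model_b": 0.55, "model_c": 0.70}  # P(rain)

    values = list(forecasts.values())
    mean = sum(values) / len(values)
    spread = max(values) - min(values)

    print(f"ensemble P(rain) = {mean:.2f}, spread = {spread:.2f}")
    # A large spread between independent models is itself useful information:
    # it flags a forecast (or a simulation result) that deserves less trust.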
I think this ongoing discussion is fascinating, and I don't disagree with you.
But (and you knew that was coming), re-running someone else's code, or reproducing someone else's experiment is verification. It doesn't mean that what they did is valid, but it verifies that they did do what they said they did.
I agree with many of your sentiments. Here is a view from "inside" (I do PDE solvers from a mathematics and computational physics perspective, have written open-source models, and contribute to a popular open-source solver library).
1. Very little of numerical analysis is about propagation of rounding errors. I think you underestimate the reliability of floating point numbers in this context. The challenge of numerical analysis is not in representing the continuous one-dimensional object (real numbers), but in representing an infinite dimensional space in finite dimensions. You might appreciate Trefethen's "The definition of numerical analysis" (http://www.comlab.ox.ac.uk/nick.trefethen/publication/PDF/19...).
2. We can prove convergence of iterative methods for some problems of practical interest. More frequently, we can prove that if the iterative method converges, then it converges to a meaningful solution. Such results are unavailable for some systems, and this is where we enter the domain of Verification (the code converges to exact continuum solutions at the appropriate rate) and Validation (the continuum equations approximate reality). V&V is a rapidly growing field which offers a number of mature tools, but is still under-appreciated in many disciplines. The fact that V&V is rarely taught at university, even to applied mathematics graduate students, let alone physicists, doesn't help. (A toy sketch of such a convergence check appears after this list.)
3. It takes effort to turn a one-off code that only runs in your special environment into portable, distributable software. Funding agencies and institutions place little emphasis on this process, so it is regularly neglected. It is exacerbated by the fact that few physicists have any background or interest in software engineering, thus making it more challenging to release non-broken software.
4. It is often not worth the effort to turn a one-off code into distributable software. There is a good chance that only a small number of people care about the PDE you are trying to solve.
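Picking up on the Verification idea in point 2: a toy sketch of a convergence-rate check, using a centred finite difference as a stand-in for a real solver (a real V&V study would run the full code against an exact or manufactured solution):
    # Sketch: verify that a centred-difference approximation of f'(x) converges
    # at its theoretical rate (second order) as the step size is refined.
    import math

    def centred_diff(f, x, h):
        return (f(x + h) - f(x - h)) / (2.0 * h)

    f, x0 = math.sin, 1.0
    exact = math.cos(x0)

    prev_err = None
    for h in [0.1, 0.05, 0.025, 0.0125]:
        err = abs(centred_diff(f, x0, h) - exact)
        order = math.log(prev_err / err, 2) if prev_err else float("nan")
        print(f"h={h:<7} error={err:.3e}  observed order={order:.2f}")
        prev_err = err
    # The observed order should approach 2; a wrong rate signals a coding or
    # discretisation error even when individual answers "look reasonable".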
I contemplated making a distributable solver for the Schrodinger equation, but there was little interest.
A quote I've heard attributed to Cleve Moler (chief scientist at Matlab): "PDEs are a niche market."
I think that quote holds less and less true over the years. It's not so much that there are huge numbers of people who need their PDEs solved, rather that those who do (aerospace, engines, circuits, oil, climate, defense, energy) really need them solved and are willing to pay for it.
One thing I enjoy about PDE solvers is that algorithmic improvements always pay off because every user is running the largest simulation they can on the most expensive hardware they can afford.
> In information theory terms, a simulation cannot contain more information than the sum total of the input data and the content of the simulation algorithm.
This is false outside of the most trivial definition: when you simulate evolutionary systems, you can create systems far more complex than the algorithm used to generate them, the problem domain, or any other factor prior to running the simulation. That information is created from your random input stream, which does not contain information in the classical definition.
PS: As to the larger context of the discussion: from a simulation standpoint, sending other scientists the code and having them run it is a bad idea. Reproducibility does not involve the lab that made the discovery sending the devices used to make the discovery to another lab. The goal is for someone to be able to see the same effect with the most independent setup possible. If scientist A says "building a model with these assumptions > this result," then another lab creates their own software, runs it, and finds either the same or a different result. Or, they can look at the assumptions or inputs and say this is bad data, try again. Or, they can look at the code and say, here is a bug, fix it and try again. But those are three independent steps best carried out by different groups.
"when you simulate evolutionary systems you can create systems far more complex than the algorithm used to generate them, the problem domain, or any other factor prior to running the simulation."
Possible. I have to admit I'd have to think more carefully about this, but I'm not sure it's a knockdown win either. In the real world, the extra information comes from the environment in its capacity as a selection mechanism. It is a large source of such information. Simulations tend to have a radically simplified selection mechanism, what with not containing reality and all. Just because something looks complicated and can't be effectively gzip'ed doesn't mean it's actually high in information; it's tricky stuff. I'd be much more inclined to buy this if evolutionary programming were much more impressive than it actually is and routinely pulled off programs that were clearly high in information value.
(I've actually spent a bit of time with evolutionary computation. It has one of the largest hype/reality ratios in computer science.)
"Reproducibility does not involve the lab that made the discovery sending the devices used to make the discovery to another lab."
This is an artifact of the fact that it is impossible to do this with physical objects, so we've built the mechanics of science around not being able to do that. Don't mistake historical accidents for scientific imperative. While it is still absolutely desirable that simulations be as independently constructed as possible and give as similar a result as possible, if there were an equivalent in the physical world of being able to rigorously examine someone else's experiment, we would be doing that. How long would Cold Fusion have lasted if that was available?
Science is inevitably a product of its environment. We don't demand particle-physics precision from the sociologists not because it is undesirable, but because it is impossible. We don't pass around experimental apparatus to each other for examination not because it is undesirable, but because it is impossible. It isn't impossible for simulations. We're allowed to tweak our procedure in response. Science is not a religion.
In general, science hasn't been as reproducible as one would like it to be: there's so much tacit knowledge necessary for experiments, so many unknown variables, and so many demonstrations so complicated that hardly anyone even goes through the motions. In general there are many more people writing in detail than there are people who read and understand, and many more people focusing on some narrow branch of their discipline than there are people with a genuinely strong interest in and understanding of the fundamentals.
My point, primarily, is that the danger of people using tools that they don't understand is an old one; possibly worsened by the narrow focus of modern science, and how easy computing tools have made simulations. But it's been here all along.
As far as sticking your nose into sciences, your nose is invited into any domains of knowledge of mine, but do keep your voice down: there are too many people making too many bold pronouncements without real knowledge; persistent, diligent, and quiet questioning seems to be much more effective at getting to the heart of the matter.
I certainly agree with you that there is a lot of publication of results generated with computers that is non-reproducible. However, I'm not sure publishing the complete codes that were used is necessary, as long as there is a detailed description of the algorithm used. In essence, I think it's essential that multiple people write their own codes and reproduce the result. The danger in publishing your code is that it is easier for someone to "reproduce" your study by just taking your code and running it, and there is a danger in having the community rely on a small number of codes.
The PYTHIA example someone posted is a good one: It's my understanding that basically everyone in high-energy physics uses it, and while that means it gets a lot of testing I'm not sure that many actually bother to look through the code in detail, because "it's what everybody uses, so if it had a bug it would have been found already".
The situation is similar, but not quite as bad, in astrophysical hydrodynamics simulations. Basically the entire field uses 6 different codes, but they do work on different principles, and the teams do conduct controlled tests of the agreement between them on specific problems.
About the chaos problem, I suspect that the problem is worse. My impression is that often there is not even a basic test that the results are numerically converged, regardless of esoteric strange attractors.
I'm curious: can you mention any specific cases that you are worried about wrt the chaotic behavior?
> The danger in publishing your code is that it is easier for someone to "reproduce" your study by just taking your code and running it, and there is a danger in having the community rely on a small number of codes.
Having a small number of codes seems like an excellent situation. What you should have is a large number of tests and much test data to test the small number of codes.
Testing code in a thorough manner is an important addition to looking at it.
“It is vitally necessary to be able to check the simulation against real data.”
Actually, that’s not enough either. In spam training, you have to have a completely separate set of data that you use for the final test but never, ever for training. Why? Because otherwise your spam filter will learn how to properly classify every message you train it with and nothing else.
For simulated science, this means it’s not safe to say, “Well, we trained the simulation until it could reproduce 2000-2010 when we gave it 1990-2000 as input data.” If you do that, your predictions for 2010+ are probably going to be worthless, since your simulation is like a cheating student who only knows the right answers on old tests he stole, not how to pass any test in general.
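A minimal sketch of the hold-out discipline described above (the data and the one-parameter "model" are placeholders; the point is only that the final-test rows never influence the fit):
    # Sketch: keep a final hold-out set that is never used during fitting.
    import random

    random.seed(0)
    data = [(x, 2.0 * x + random.gauss(0, 5.0)) for x in range(100)]
    random.shuffle(data)
    train, holdout = data[:80], data[80:]   # hold-out is touched exactly once, at the end

    # "Fit" a one-parameter model (least-squares slope) on the training set only.
    slope = sum(x * y for x, y in train) / sum(x * x for x, y in train)

    def mse(rows):
        return sum((y - slope * x) ** 2 for x, y in rows) / len(rows)

    print(f"training error: {mse(train):.2f}   hold-out error: {mse(holdout):.2f}")
    # If the hold-out error is much worse than the training error, the model has
    # memorised its training data rather than learned anything general.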
Even worse, having reviewed the code that leaked during Climategate, it's pretty clear that the researchers are not competent programmers. The revelation that was most off-putting was the "commented out" code that injected arbitrary values and was "just used for testing". Testing is great, but it has no place in the production code. If the researchers' method of testing is to write test code in the production code, run it once to make sure things "work", then comment it out, we certainly shouldn't be making any decisions based on predictions made by that code. Their testing framework is apparently comments in the code.
For a project I'm currently working on, I have a 5:1 test to code ratio, largely because I'll be processing credit cards. While my project will probably only ever deal with a few hundred thousand dollars in transactions, I need to be sure that those transactions will be properly handled. In comparison, these researchers are encouraging a trillion dollar piece of legislation, and their testing methodology is intuition and spaghetti code.
Further troubling is the assertion that this is science. As far as I can tell it isn't testable or repeatable. As I recall, that's a pretty big part of what science is.
To echo the meme around here, there appears to be an opportunity here for a startup :)
If you can get buy-in from the research community, even in one niche field, I could see great utility in a central escrow house of sorts, where people can post research software, datasets, etc. and others can download the stuff and validate the results only after agreeing in some reasonably binding way not to use that information to scoop or copy the original researcher.
Buy-in in this case means that everybody should have to do it to publish in a certain venue. You might even do something fancy like providing people access to a VPS which will run the server without allowing validators to scp everything back to their own machines. Of course this is impossible to do perfectly; once you allow people to view the data in a text editor or hex dump, it is all over in a strict sense (scripts + screen-scraping), but it is still a step up from just handing over the files, and keeps honest people honest.
You might also provide some anonymization/randomization service for datasets (for instance, by applying an unknown linear transform to absolute numbers in the case that the results don't change, similar to what was done here: http://googleresearch.blogspot.com/2010/01/google-cluster-da... ).
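A hedged sketch of that anonymization idea (the secret scale factor and the example values are my own illustration, not what was actually done with the Google cluster data):
    # Sketch: hide absolute magnitudes by applying an undisclosed linear
    # transform to a sensitive numeric column before release.
    import random

    random.seed()                          # the secret factor is never published
    scale = random.uniform(0.5, 2.0)

    raw_values = [12.0, 340.5, 87.25, 5.0]   # hypothetical sensitive numbers
    released = [scale * v for v in raw_values]

    print("released values:", [round(v, 2) for v in released])
    # Ratios and ordering are preserved, so analyses that depend only on
    # relative behaviour can still be checked against the released data.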
I think there are 2 reasons why researchers can be reluctant to share:
1) They are afraid of getting scooped by somebody else leveraging all the effort they put into developing the tools and curating the data and leapfrogging them.
2) Their results are suspect.
Such an escrow house would ameliorate the concern behind 1) and help suss out 2) :)
In biomedicine, the National Library of Medicine has been taking steps in this direction. There are repositories for sequence, protein structure, and microarray data, for example, and many journals require that relevant data be deposited in these repositories as a condition of publication.
Efforts such as the Science Commons are also thinking about/working towards these sorts of solutions.
Actually, the NIH has already done this with many recent grants. Everyone funded through certain grants has to make their data available in dbGAP (genotype and phenotype), sometimes even pre-publication (which is ruffling some feathers).
Point 1 will always be true when people share methods and/or data pre-publication. If point 2 cannot be discerned with currently available methods (and I would argue that it can), then we are in a world of trouble.
i'd strongly like to encourage you to offer your opinions on things on which you are a nonexpert. having only experts allowed to opine, with experts also retaining a monopoly on expertise, leads to echo chambers.
however, you seem to be reacting to a different article entirely. nowhere did the article mention simulations. if you look at Stodden's CV (the woman whose talk motivated the article) she's a statistician. The need for code-sharing is in fields like climate and bioinformatics where people's analyses lead to changes in politics, policy, pharma/health, etc; not for people deciding whether lorenz goes chaotic at precisely the right bifurcation parameter value.
more and more and more science is becoming statistical and data-driven. chaotic dynamics, though very important, is not the issue here. it's analysis of complex data coming from real-world systems.
there are many interesting and wise comments in this thread but they are almost all about an article on simulations -- ie. not the article the OP points to.
Can you provide references that would allow a researcher to educate himself? I'm genuinely interested, and there seem to be lots of fields that use software in experimentation and simulation but where there isn't much emphasis or education on doing it properly.
A similar argument actually goes for experiments that are somehow affected by computer networks.
If scientists use grid-computing, cloud-computing, or just the plain regular Internet, there is no way to accurately reproduce results for distributed applications.
Luckily, some researchers are aware of this and there are now some projects starting to make testbeds and infrastructures to make environments where experiments can be reliably reproduced.
PlanetLab and the clouds are great ways to do non-reproducible research, because the other users in the system (where sometimes "the system" is the public Internet) create interference. Reproducible networking research either has to be simulated or has to be run on an isolated testbed such as Emulab.
The results from PlanetLab et al. may not be deterministic, but you should be able to reach the same conclusions based on repeated experiments. Otherwise, your results may be a little too fragile to form the basis of sweeping conclusions.
This may not be ideal, but it is no worse than any branch of science which is not purely digital.
NICTA and WinLab have developed a management framework called OMF for specifying and running experiments on testbed networks. One of the main goals of the project is to improve reproducibility in networking research.
My (biotech) employer largely solves this issue by keeping a copy of the formal research specs outside of software altogether. All validation documents and research data are kept in paper form (in addition to digital form) in such a way that future researchers or inspectors could take those documents and data and reconstruct the research.
It wasn't always this way, unfortunately. I've been involved with trying to glean some formulaic/methodological insights from spreadsheets and code and it's not always possible to reverse engineer the essential methodology or be sure that mistakes were avoided.
Proper scientific research practices are similar to proper data backup practices: the documentation (backup files) are important, but they don't matter if you can't have successful reproducibility (restoration).
In the life sciences, provenance is as important as reproducibility, and you can use software systems to manage provenance. The paper requirement will go away over time. Even the FDA, which required paper in the past, is moving to an electronic model. But you do need to document methods somewhere and not just in code: a reference document which could be shared, even if the specific implementation is different.
I have a strong (probably unusual) standard when it comes to computer modeled science: if explanation requires software or data that I don't have access to, I completely disregard it. That might sound kind of crazy given the state of science and my chosen profession, but (maybe because of my chosen profession) I know that a complex system can tell you whatever you want it to tell you. It won't be the truth, but it will be "convincing" to most people. If the research really mattered, someone would reproduce it with independent software and data anyway.
I think this is a good solution to the problem. If most people only believed in reproducible science, there would be pressure on authors to use software and data that can be shared, or no publisher would carry their article. Seems easy to me.
As a working programmer in biological research science, I would love to see a requirement that papers involving software be published in the "literate programming" paradigm. At the very least, peer-reviewed publications should require that all software be open source (at least in a loose sense). It is depressing how often a published result depends on custom closed-source software.