A Code Glitch May Have Caused Errors in More Than a Hundred Published Studies (vice.com)
361 points by daddylonglegs on Oct 12, 2019 | 171 comments



I've implemented algorithms I dug out of original research papers, which often included sample code. (That's why I first learned to read Fortran!) I've almost never gotten results that match the authors' exactly. Sometimes the bugs are obvious, and sometimes they're subtle. I'd estimate that 100% of sample implementations in published research papers have bugs. Researchers, even in computer science, are usually not skilled programmers. The product for them is the paper, not a program.

It's the same category of problem as "enterprise software". Whenever the customer is not the user, the user gets screwed. With research, the customer is the journal.


Academia is suffering from a disease of uselessness where papers are written from the standpoint of getting credit. It's not about other people understanding, reusing, or learning how to use what they've found. Not to mention the lack of reproducibility. Sadly this limits the usefulness of research for most. Yet for some reason we all still hold academia up on a pedestal.


People in academia always have a pressing career goal. Especially the best of them do not get bogged down by minor issues. I'm not talking about actual fraud here, just a bit of sloppiness that doesn't invalidate the results, but religiously adhering to meticulous standards is not profitable for researchers.

The primary goal is to publish the paper, go to conferences, meet others and network.

While it's not a dichotomy, who will be seen as "better"? Someone who spent tons of time to write reproducible clean code with tests, tracked down all minor details and pondered about things that may change the results by 0.1 percent, or someone who was a bit sloppy but got 3 publications in the meantime and got to know famous professors at conferences?

Attitudes like this determine your success. If nobody values or indeed even sees or knows about your efforts, those efforts are practically wasted. "I'm a really detail oriented person" doesn't have the same ring to it as another few papers on your CV.

I may sound cynical, but the other extreme, idealism, is not useful either.


> While it's not a dichotomy, who will be seen as "better"? Someone who spent tons of time to write reproducible clean code with tests, tracked down all minor details and pondered about things that may change the results by 0.1 percent, or someone who was a bit sloppy but got 3 publications in the meantime and got to know famous professors at conferences?

This is a common narrative, but I'm not certain if the example you give is typical. I've written about this before: https://news.ycombinator.com/item?id=18743531

Since 2018 I have tried to publish at least one paper debunking something every year. So far, if anything, being more careful has led me to publish more, not less. Admittedly, I don't think everyone should be as careful as I am as people would quickly run into diminishing returns, but the idea that being careful necessarily means publishing less has not been true in my experience.

Also, I don't think the typical case is something changing the results by 0.1%. In my experience when I catch an error it's usually larger than that. The first debunking paper I published was about something that was one or two orders of magnitude off the correct value in the typical case, but still received 300 citations...

Ultimately I think it would be best if an academic field adopts uniform standards for publication. That way, a sloppy researcher can't pump out 3 bad papers. Many academic communities have nominal standards that are not enforced. Some examples are discussed here: https://www.osti.gov/biblio/1141709


It often feels like a quixotic fight. Strong claims sell papers, but they are hard to stand behind for a researcher with high standards who is very critical of overall practices in a field. Their "best" recourse in the idealistic sense would be to publish an overarching meta-analysis of the field and point out problematic practices, gaining enemies and being labeled a crusader.

In my field, computer vision, a major problem is incomparable setups, far-reaching claims, etc. A common pattern is introducing some fancy model and showing that it gets slightly better results that could well be due to noise or some tuning on the test set (not even secretly, just trying many different things, putting the bold formatting on your best number, and claiming it as SOTA). Whether it was really due to your fancy model or not is hard to prove. But my experience is that mundane, pedestrian changes in a model introduce way bigger performance changes than whether or not you use the fancy module or architectural tweak that is proposed in mediocre published works. Which also means you absolutely cannot compare models created by different people, especially when the performance gap is small. Tiny details can screw up the whole ordering, so trying to interpret the order and draw conclusions from it is equivalent to reading tea leaves.

Now surely this applies mostly to mediocre research. But that is the majority. Top of the line research is more solid, but if you're a small unknown researcher, your best strategy is to start getting out there, publish and network.


This is not about code, but look at figure 4 here

[1] https://www.pnas.org/content/115/50/E11790

and tell me why I should ever trust something from someone who swaps the location of towns on a map, while the point of said paper is about the spreading of the plague along trade routes.

Instantly invalidated...(forever!)


Because people will make mistakes. If you expect papers to be 100% accurate, you'll always be disappointed. You have to account for possible mistakes. Whether they're accidental or misleading on purpose is to a degree irrelevant at first - we should always watch out for them.

Even in situations where mistakes cost lives, we put multiple failsafes and reviews and clear instructions in place - and still say they reduce the likelihood, rather than prevent accidents.


You know...

I mailed them, with CC to all involved. No reaction at all. Hamburg and Lübeck are still swapped. What should I make of that?


Given the paper already appears to have a correction for something minor, one more would seem appropriate, yes - it’s a pretty jarring error.

But it’s just disingenuous to say it affects (let alone “invalidates”) the actual study. And shouldn’t a proper pedant spot that the top pin isn’t even where Lübeck is - that’s Rostock.


Good spot.


Does the swap in the position of the two cities compromise the findings in any way? If not, it's an irrelevant mistake, like misspelling a word. What do you expect them to do? Retract the paper because they misspelled a word?


It could, because in the timeframe the paper is about, they were geographically near, but had no real direct connection to speak of. Traveling over land along branches of the 'old salt routes' took at least two days, and that was fast. They were transport hubs of different branches of the early Hanseatic League.

[1] https://en.wikipedia.org/wiki/Hanseatic_League

This is not the equivalent of a 'spelling error'.


The brokenness of academia starts with the exploitation of graduate students by educational institutions, which makes it a very lucrative enterprise to over-hire professors, each of whom is like an incubator for the school, which acts as an investor because they get a cut (upper five figures per year per student) no matter what. It's similar to college sports in a way. It's all driven by money.

Most research is useless. Most professors are unneeded given the size of the problem space. Students see this and in the end, besides the very few true geniuses, it's not the most purely motivated that become professors, but the ones who most aggressively play the game in trumping up their results.


> Most research is useless.

I agree.

> Most professors are unneeded given the size of the problem space.

I disagree.

First, most subjects are too deep for a professor to be knowledgeable about more than just their specialty. (Do you expect a single Computer Science teacher to have complete and up-to-date knowledge of formal methods, programming languages, and operating system virtualization?) Even then, they will either be out of date or have spent a lot of time keeping up to date.

The problem is that in many cases, the size of the problem space is much too big for a single group of researchers to have any effect on it. Many 'simple' studies have thousands of factors that can affect the outcome, and almost all of them have too few people, with not enough time and energy to devote to isolating all of those factors. For that reason (and many others), most studies are not replicable, and dubious at best.

This is the main reason why psychological, sociological, medical, and most other fields of research that aren't mathematical are considered dubious. Not because their methods are inherently bad, or their fields inherently invalid, but because most studies do not have the manpower available to do a completely formally correct, ideal study, so they have to make do with what they have and trust that eventually we will have enough mediocre-evidence studies, together accounting for enough varying factors, that we can iron out the individual flaws through statistical methods.

If research were _truly_ a priority for humanity, and we really dumped all of our effort into scientific research as a society (i.e. governments and companies both prioritized R&D and gave the scientists enough resources to actually do the jobs properly), then we might see these fields as "hard science" rather than "soft science".

But that's like saying, if Jeff Bezos got up and actually objectively used his money properly, he would have billions left over and there would not be starvation or poverty in the modern world. It's an idealistic scenario that is extremely unlikely to happen.


>>> Most professors are unneeded given the size of the problem space.

>most subjects are too deep for the professor to be knowledgable about more than just their specialty

>in many cases, the size of the problem space is much too big for a single group of researchers to have any effect on it.

>most studies do not have the manpower available to do a completely formally-correct ideal study

I agree that the size of the problem space requires a lot of researchers. This is probably why there's so much attention focused on machine learning and big data, since they seem to have the potential to address the problems you've listed here. Of course, there are many technical and ethical issues to be confronted in developing and deploying them.

>If research were _truly_ a priority for humanity, and we really dumped all of our effort into scientific research as a society (i.e. governments and companies both prioritized R&D and gave the scientists enough resources to actually do the jobs properly)

It's not, because the greatest problem for humanity is still subsistence, which requires solving a massive resource distribution problem. There are some governments and some companies that do prioritise R&D and provide enough resources, but they are too few and far between.

>psychological, sociological, medical, and most other fields of research that aren't mathematical, are considered dubious.

>then we might see these fields as "hard science" rather than "soft science".

It seems to me that these fields are "soft" in part because the ethical issues surrounding the surveillance that's needed to collect the data to do a formally correct study are quite formidable.


You can view an undergraduate education (and especially a graduate education) as a significantly negatively paid (or at best unpaid) internship for the job of academia.

This has all the advantages of internships that employers normally enjoy. The internship enables you to hire significantly better employees than you'd otherwise be able to get in the open market, because you can engage in more vetting and because of the power of defaults.

Many smart people go into academia who would have been much happier and more productive outside of it simply because going to school itself made pursuing a job in academia something much more of a default than it would have been for many people.


You should be able to contribute to these software projects, and receive credit akin to publishing a new result. Authors and contributors (and bug-fixers) should receive more credit for the subsequent results derived from their software. This aligns incentives more properly: The researchers struggling to gain insight using tools can trust the tools, and they are incentivized to make their new tools useful to the community, rather than publish-and-forget.


If you are a scientist and a programmer, you could make software useful for others, publish an article about it, and provide it free of charge under the condition that when people use it for research they have to cite the article in their paper. This is how you could get lots of citations.


Many of the best-used robotics packages are handled in this way. The ROS paper itself has thousands of citations.

But it might be tough to get tenure doing that, and few jobs will support those efforts outside robotics startups or research labs, and often proprietary is the law of the land in those domains.

And outside tech R&D, where software still rules but isn't a core competency, you'll likely never have success with this model, sadly.


Pop science coverage is always way overrated as a valuable exercise as well. So easy to cherry pick studies or fundamentally misunderstand the concepts or context.


Something I don't get is why universities don't do a better job of having professional statisticians and programmers on staff for the explicit purpose of providing support to researchers.

I guess grad students are cheaper and "good enough."


From experience:

- Research is an iterative, creative process. The creation of the code is an organic process of discovery, not a separate stage from writing its specification.

- The jobs are too small to split up. This is the same in industry. Try to split up a small task (1 person, 2 weeks) over a small team of specialists (Analyst, programmer, DB specialist, dev-ops, tester and now you need a project manager, ...) and suddenly it becomes way, way larger than it should be.

- Most production programmers really dislike working in research environments. The objectives and nature of research code are very different from what turns on and drives professional programmers, so you wouldn't have 'the best professional programmers' working there anyway.

Being able to code well enough to do your own experiments is part of the researcher's skill-set. Grad students should become 'good enough' at it.

Let's not kid ourselves. Independent academic research is a constant struggle for funds. Nobody is going to pay for an order-of-magnitude increase in the costs. If you are looking to up the quality, best look into how to get academic research out of the ridiculous quantified-output-metrics conundrum.


The University of Texas at Austin has statisticians faculty, staff, and students can consult with for free: https://stat.utexas.edu/consulting/free-consulting

I've used the service before and can recommend it.

Don't know anything comparable for programming at UT, though I can say that UT's computational science (different from computer science) program has some good classes on software engineering practice aimed at practicing research scientists and engineers.


As I understand it, most large universities have something like this. At smaller schools it might be less formal. Where I went to college, the students were expected to have their methodology approved by the "stats guy" in the department.

As for programmers, it's not just a matter of cost. Most scientists don't understand software engineering, but most software engineers don't understand science or math. Also, programmers who belong to their own organizational structure can't keep up with requirements that could change from one hour to the next. Despite "agile," realistically most software development is comfortable with timelines of months or years.

Don't get me wrong, I'm an R&D scientist at an industrial business that makes commercial software. The programmers are brilliant, and the software they make is amazing. But I have no choice but to be self sufficient for my own software needs in R&D. It's two different worlds.


If given a good spec, a clearly written formula, I don't think it's too unreasonable to expect a seasoned programmer to implement it despite not understanding the science that prompted the math. Maybe I'm wrong for some kinds of exotic math, but I've certainly implemented a few mathematical formulas that looked like Greek to me, at least when I started.

Moreover what I propose is that every researcher have a programmer available to consult with, who would perhaps write small amounts of code (perhaps make sure the code builds before it's published?) I don't suggest that researchers not write code, I think they should write code. But I think an expert in writing code should be available to help keep the standard of code high.


In my view, it would help a lot to provide scientists with resources for learning how to write quality software. Most of us don't even know how to write a specification, except by writing the code, and don't have an idea of what we want until we see something begin to work. Today, a lot of the math we use is written directly in code.

This stuff evolves over multiple iterations. Many of my programs never run more than once. Testing them requires being connected to the equipment if any kind of closed-loop control is involved.

I work in an R&D setting in industry. If something I make threatens to become a product, the thing I hand over to the software team is a proof of concept, that could include a hardware design and working code.


I've had these same ideas when working on physics-related code. They should have a few actual software experts around to consult and / or do the development.


Where's the money going to come from, and who's going to do it?

I wasn't making much above minimum wage working as a programmer at a university. People with computer science degrees don't tend to take such jobs, when they've got half the west coast waving 6-figure job offers at them.

Again, it comes down to the user and the customer not being the same person. The person reading the research paper would sure like it to have been written (or at least debugged) by an expert programmer, but it would be some department who would have to have paid for such a programmer. Departments don't have budget for that, and their currency is publication, not correctness proofs.

(I assume departments would have to pay, because we paid for all other IT services. We even had to pay the university for network access. It's not like the library, which is always free to everyone on campus.)


> Something I don't get is why universities don't to a better job of having professional statisticians and programmers on staff for the explicit purpose of providing support to researchers.

Complete lack of incentives to do so.

Academics are, by design, inbred. They care only about what other academics think. And few of them care about code quality. The attitude is slowly, slowly changing, but I think even now in most disciplines it is not the norm to share your code when you publish papers, and it is not surprising for an author to refuse to provide you the code when you email them.

Until journals require academics to submit any code used for simulations/analysis, things won't change.

In any case: To answer your question - there are always some programmers on staff for large scale research projects. I doubt they're judged by code quality, though.


I think ideally, universities should have enough professional statisticians and programmers on staff to provide a mandatory but nonbinding review for every paper any researcher at the university decides to publish. They shouldn't act as gatekeepers to publishing, but would provide feedback to the researchers for the benefit of the researchers.

Compelling researchers to release their source code would be the next step, but they'll fight you tooth and nail on that one. Many researchers are in very competitive fields and think that releasing their source code might give some other team a boost. I think this attitude is contrary to the interests of scientific progress, but because it's a matter people have such strong personal stakes in, it could prove a hard fight.

Chemistry departments often have glassblowing technicians. Perhaps this would be comparable.


> I think ideally, universities should have enough professional statisticians and programmers on staff to provide a mandatory but nonbinding review for every paper any researcher at the university decides to publish.

I guess I wasn't clear enough to connect all the dots. These kinds of policies are made by academics. The people with the authority to mandate this are themselves academics who got promoted. The reason the universities will not do it is because of the nature of the people in charge.

Academics don't work for these people. They are these people.

Pretty much the same story with grant approvers and journal editors. Mostly from academia.

The argument to do it is to improve the quality of the research output. But guess who evaluated the quality of the research?

It doesn't matter that you and I know this is an incredibly unreliable way to do research. As long as academics can continue to publish their papers with crappy software practices, they will do so.

As I said, there is a lack of incentive to do this. Who will benefit? Probably over half of researchers do not want someone to find flaws in their methodology. In my discipline, it was common to leave out inconvenient details from a paper. Common enough to the point of psychosis: the researchers convince themselves that the flaw is not a problem, so it's not even a question of integrity any more. It's poor incentives leading to a few generations of researchers who are trained to be blind.


Because it would cost money, and I'm honestly not sure most academics (at least in CS) truly realize the dumpster fire that is their coding.


> programmers on staff

That is generally referred to as "research software engineers/groups" and are unfortunately currently fairly rare in academia, but are becoming more common. RSECon is dedicated to this topic, and hosts a list of research software groups [1]

[1]: https://rse.ac.uk/community/research-software-groups-rsgs/


I’ve tried to make this work for a while now, and it’s definitely not just an issue of getting statisticians/programmers involved. It’s an issue of power and incentives. My immediate boss is great, he understands that things need to take time and there are a multitude of challenges with development and helps me solve them. Some of the other researchers, however, see me only as a typing monkey.


Anecdote: One day a friend asked me to look at their family member's doctorate because she was having trouble with the analysis section.

All I could do to save face was point her to some resources, because if I said what I really thought, it would be in a trash can set on fire.

Or to put it another way: maybe they don't do it because if they did, it would stop the requisite number of papers from being published.


100% of programs have bugs. We can't expect to hold the bleeding edge to the same rigor as, say, a flight computer. It's sometimes literally the first implementation of an algorithm or procedure.

The comments here using words like "disease" and such are not wrong, just tilting at the wrong windmill. The problem, I think, is not the initial bugs, but the lack of maintenance and improvement over time. You should be able to contribute to these projects, and receive credit akin to publishing a new result. Authors and contributors (and bug-fixers) should receive more credit for the subsequent results derived from their software. This aligns incentives more properly: The researchers struggling to gain insight using tools can trust the tools, and they are incentivized to make their new tools useful to the community, rather than publish-and-forget.


Does that make hello world the most perfect program? :)


Same here, not having done a PhD I’ve spent a lot of time just reproducing papers I liked and realized they were often very hand-wavy in the way they achieved their results, often exaggerating the impacts of the research.


I pretty much agree with everything you said, but to me it strikes at something deeper.

To me, you don't have to be a software engineer to write some scientific code.

What you need is a scientific mind: logic, repeat, test, falsify, question. Sure structure and design helps everyone, but if you think about code in this way you'll still find the bugs in a small script.

And to me that's the great problem laid bare with the shoddy coding in science, it reveals the emperor's new clothes: scientists and academics, supposedly the professionals in their field, can't think scientifically or apply such thinking to their own work.

They're chasing false positives, and that's deeply troubling...


I think without in-the-trenches software engineering experience, it is hard to determine what class of issues would be a problem.

In this case, it was a dependency on the ordering of filenames. Whenever your code depends on something being in a certain order, you should ask yourself what puts the items in that order. But it's perfectly reasonable to believe that filenames are returned in order; perhaps for your small sample size or when you run "ls" they're in order (because you have ls set up to sort, but don't know it).

The difficult part that comes with experience is knowing what you can and can't trust. I would trust "select foo from bar order by timestamp desc" to return items in order. I wouldn't trust "readdir" to return items in any particular order (maybe ordered by inode number, but I can't think of any real program where I care about order that would want that order). But you can really go down a rabbit hole when you don't have the right level of trust or experience. (I have seen a lot of unit tests in my time that do something like test a core language primitive or library; scores of tests like assert(append([1], 2) == [1,2]). This comes from a total lack of trust for things that probably can be trusted. Or from people who don't have their homedir full of files like "foo.py" to figure out how some API works and instead put them in their unit test files ;)
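For instance, here's a minimal sketch (the filenames and helper are made up, not from the paper in question) of answering "what puts the items in that order?" by deriving the order from the data itself rather than from whatever readdir happens to do:

    import glob
    import re

    # Hypothetical: if downstream logic needs files in run order, sort on a
    # run number embedded in the name instead of trusting the listing order.
    def output_files_in_run_order(pattern="run_*.out"):
        def run_number(path):
            match = re.search(r"run_(\d+)\.out$", path)
            return int(match.group(1)) if match else float("inf")
        return sorted(glob.glob(pattern), key=run_number)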

The other thing is figuring out what tests would be valuable and knowing that you have to write them, and the researcher on the other end needs to run them. (I remember so many bug reports like, "I force installed that library after 'make test' failed and I'm getting weird results from my program that uses the library." Well yeah.) I see a lot of people run their programs on simple input, check "yeah, that looks fine", and then trust that code forever. But somehow, it breaks, and they don't notice it on their complex inputs, and it leads to problems like the one in the article.

On the other hand, the thing about nondeterminism is that you can't test for it. Sometimes it's very obvious, you do something like (keys(foo) == bar) and because dictionary keys are explicitly randomized by your language, your test fails the first time and you immediately find your error. But if you're talking about something like OS-dependent APIs, you'll write your tests and never see your bug, because the order is consistent on your particular computer. (At least someone on another OS will eventually find it, and there are plenty of services that will CI your code on various OSes, so you do have the potential to uncover that issue yourself.) An even more severe case deals with concurrency; your unit tests run one thread, so you never see the subtle 1 in a million timing issues until you push to production and see the 1 in a million bug every second because you're processing a million records a second.

(And things get even more obscure than that. I once got paged for one replica in a large job not working correctly. Look at everything that could possibly cause the issue, nothing. "I guess I'll restart the task." Problem persists. Someone suggests "Tell the scheduler to schedule the task on a different machine." Problem goes away and never comes back. Would suck if that one-in-a-million obscure machine-specific problem happens to the machine that's producing the results for your scientific paper.)


I'm under no illusion that it's inherently easy (I consider such to be my job and that's partly why I get money), but that's why one should be 'professional' and have to work for it.

If you make a calculation, you should have processes to verify if that calculation or result is correct, that you can repeat that calculation, think about why it might be wrong even if you think it's right etc.

If your technique relies on sorting things in a particular order, then you both explicitly sort them in that order and figure out an independent method to verify that they are sorted in that order.
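Something as small as this sketch covers both halves (purely illustrative, nothing domain-specific):

    def sorted_inputs(paths):
        """Sort explicitly, then verify the invariant with an independent check."""
        ordered = sorted(paths)
        # Independent verification: every element must be <= its successor.
        assert all(a <= b for a, b in zip(ordered, ordered[1:])), "inputs not sorted"
        return ordered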

Indeed, in one of my jobs that was a bug: old bank code where SQL result sets were assumed to be in a consistent, stably sorted order for downstream processing because they used an 'order by' clause. When the vendor upgraded the SQL engine to be distributed/multithreaded, the relative order inside equal keys changed (and indeed was now non-deterministic, when it had been technically deterministic at the time the code was written, even though that was an assumption, not a requirement).

But I don't think I'm unreasonable: yes, there is no black and white rule for how far down the rabbit-hole you go: is it too far to test each individual instance of integer arithmetic, or floating point calcs, or individual os's or processors... in practice probably yes. But you do need higher-level tests to determine if calcs are going wrong (that might then later start you down that route when your results suggest they are going wrong). And yes, there are bugs that are inherently difficult to track down, especially as you go up and up in complexity and abstraction. And there are parts of science where the feasibility of the calculation itself or its outcome is the great unknown. But these are edge cases and not 99% of science papers.

What's alarming is the reuse of code they don't inspect, the use of libraries and techniques they don't understand, and practically no critical thinking or inspection towards their data, results or methods.

They throw stats or a library or a technique at a wall, get a positive result and: bang! into (at least one) paper.

In my eyes that's not how science should work.

I'll also go one step further and say that if you have unexplained bugs, warnings, or errors, and if your calculations are not reproducible, then for 99% of science you should not be publishing just because most of the time when you run it you get a positive result. Would it suck if that one-in-a-million obscure machine-specific problem happens when producing your paper? Sure. Would you be negligent in publishing anyway without a way to prove or assert the correctness of your calculation? Yes.

Is that what everyone's doing? Hell, I think it's orders of magnitude worse. Some languages (like R) are almost built around the philosophy of 'the show must go on' rather than 'be correct or no result for you'. I think that's part of the reason why they're popular.

Again, I don't think that's software engineering specific, I think it's a mindset that is required to make a good scientist.


Could you please tell us more about this R 'the show must go on' philosophy?


In short, R contains silent coercions, vector recycling, lazy evaluation, partial matching on strings in various calls, exceptions to the rules, surprises, unexpected values, inconsistent and weird operations on edge cases and often chooses to continue on with these incorrect and bad values rather than throw errors and stop computation.

I'm not a fan of its writing style (there's a point where being funny hinders readability), but I believe chapter 8 of the R Inferno is replete with examples.

The combination of these factors results in a language where actually trying to reason about the veracity of a computation and coding defensively is exceptionally difficult, but people can load packages and make the wrong computations that look right exceptionally quickly, so they think of it as convenient.


Thanks, bookmarking this answer...


> It's the same category of problem as "enterprise software". Whenever the customer is not the user, the user gets screwed. With research, the customer is the journal.

This is a very good summary of the issues. I work in this space, and I can confirm that for a lot of PIs, the software is at most tangential to the work of writing about the software.

Features that literally have never been run or tested can be written about in papers as if they are 100% foolproof.

The sad part is that these discoveries of program bugs can probably only occur in the top 5% of all research software. The rest are either left unreleased, or are so poorly coded that they will never be vetted in any meaningful way.


I totally agree on this. Having read papers and tried to implement, or even re-implement the same algorithm, even when their source code is available, it's hard to get results that match.

I'm not a hard core TDD guy, but sometimes a few tests go a long way to making your code reproducible...or even just working correctly.


What’s worse, sometimes the bugs are deliberate. A slight tuning of a parameter that can be explained away but has a significant impact on the output data. This, along with statistical bullshitting (there’s always a statistical manipulation that will create the desired result), renders 80% of all research practically worthless.

edit: code change -> parameter tuning.


Or in machine learning if that manipulation is not enough to get state of the art results, just don't cite anyone who is better! Now you're SOTA, congrats. Happens way more than people realize. Not as much in the top 2-3 conferences.


I can imagine this might happen in theory, but have you come across evidence of this in practice?


Parameter tuning is pervasive, but I guess I went too far in suggesting that it's a code change.


Another common trick is to tune your program to a dataset then run all of the other programs with default parameters.

Bam, you've got a benchmark where you're fastest, just like everyone else.


Fortunately it's becoming more common to provide permalinks to gitlab etc so people can access the actual version of code used in a paper. That being said having some incentives to reward this practice would help speed up adoption.

Part of the blame for incorrect code also falls on reviewers. If someone agrees to peer review a paper, that should include running the code as part of the general scrutiny. If someone finds a bug after publication, they should let the appropriate people know so an errata can be published.


> I'd estimate that 100% of sample implementations in published research papers have bugs.

> Researchers, even in computer science, are usually not skilled programmers. The product for them is the paper, not a program.

This is so silly and misleading. Any nontrivial software is likely to have bugs, whoever the producer or customer or user may be. That's not the point. Merely having a bug doesn't invalidate an algorithm any more than lacking one validates it. Moreover, even when there isn't a bug, there's still no guarantee you can reproduce results exactly -- stuff like your compiler's particular handling of floating-point may well change the outputs.

Finding a bug that doesn't change the substance of some research is nothing to be smug about, whether it's obvious or subtle. All you'd be doing is wasting your time trying to prove that you're a good programmer and the author is a bad programmer, when that very 'bad programmer' is being productive and advancing science without getting hung up on irrelevant details. The metric you have to care about is finding a bug that materially affects the conclusion being drawn from the experiment; that's the rate you need to report here.


Decades ago I had a friend who was taking a double major in CS and Physics at a school that shall remain nameless. He decided to do his master's thesis in environmental science. The department in question published a lot of papers based on simulations, showing how different factors could affect weather patterns and things like that.

The simulation program they used was written in fortran by a previous grad student, and there was no version control. It was instead copied from researcher to researcher, and expanded on in turn, then passed to the next researcher in the lab to be modified for their own work.

My friend asked for the version of the code which was used for particular research papers the group published. They couldn't find it. He asked where the test suite was. There were none. He asked if anyone had audited the code to make sure it worked, and - well you see where this is going.

Feeling frustrated, my friend spent the next month or two going through the code and writing tests to check that the code was actually correct. Amongst other things he found a + that should have been a - in a core part of the simulation code, and that change dramatically changed a lot of the results the program produced. Because none of the researchers actually kept the programs they used to generate their results, it was impossible to even tell when that bug was introduced, or which of the papers published by their department had correct results.

Anyway, this caused a big argument between my friend and his supervisor. My friend held that their department wasn't doing real science, and he ended up being asked to leave the team. He did his thesis in an entirely different field.

I don't know how common this sort of thing is, but it's probably way more common than we'd like to admit. We should demand across the board that any paper which relies on source code co-publishes the source code. This is becoming increasingly common in CS papers, but it should happen across the board. If it were up to me it would be demanded by top-tier journals as a condition of publishing.


Yes, I realize. I never disagreed with this. What you said is entirely consistent with the point I was making, which was that "100% of published research has bugs" and "researchers, even in computer science, are usually not skilled programmers" are quite misleading claims to make about research quality.


It seems to me that this discussion has been about bugs that affect the research.

You're right that no one should care very much if there are typos in research code, or if it crashes on some input that it never sees, but the criticism here has been more substantial than that.


100% of programs have bugs.


Most programs don't have significant bugs that affect the entire purpose of the program.

You wouldn't accept a chess program that didn't know how bishops moved, or which only worked when you played a specific opening, but these classes of bug are not uncommon here.


Writing code in research is quite different from writing code in industry. For starters, in research, absolutely no one will read your code. Not your boss, not your peer reviewers, not your colleagues, not your tech-savvy users. No one. They just care about getting the right results and will complain if they don't, which most of the time steers you into producing correct code.

Also, most programs usually have a maintainer team that's exactly one (1) person strong, and that person is usually an underpaid grad student or postdoc with a billion other tasks at hand and not much time left for that issue you opened three months ago.

On the other hand, hey, you never have to fear the dreaded 'code review', so it's not all bad.


> Writing code in research is quite different from writing code in industry

Hah, not always.

A short while back I was tech lead for an AI project, the goal of which was to reduce the weight (and, ergo, cost) of large steel structures.

The megacorp consultancy I work for decided to staff this project only with AI people, about half of which were fairly fresh graduates. Now, they all had a great grasp of AI, both old-skool (neural networks, genetic algorithms, particle swarm optimisation etc) and more recent innovations. But these were not developers.

The customer was a Microsoft shop, and mandated that we use C# (which was cool, it's my favourite language!), but it was immediately apparent that the "devs" had only basic training in Java. The code was a f*cking mess, and we had numerous issues around software engineering concerns such as source control, DevOps and processes - the customer eventually canned the project due to the bugginess of the platform. They really did have some brilliant minds on the project, but those were not the minds of software engineers.

On a more recent AI project, unusually I was consulted first on staffing, and I insisted on a mix of AI research types *and* software engineers - unsurprisingly, this project was far more successful.


Seems like your formatting was messed up by putting an asterisk in "f*cking", which caused everything to be rendered in italics between that word and the "and" which you actually wanted to be in italics.


Bah, on mobile, apologies :/


I just finished an internship at a research lab and was completely stunned by the lack of control over the code people write.

The reasoning is that since it's research code that is used only to get results and not to be used in a product, writing tests is not useful. I disagree completely: being sure that your implementation of your model or equations is doing exactly what it should is very important for evaluating your model and yielding correct results.


I imagine you said that tongue in cheek, but good "code reviews" should not be feared. If they are, it either means the environment you are coding in is toxic, or that you have not made clear with your peers that code review is not about judging.

Code review should not be about being right or wrong, but about working as a team to produce the best work possible. People should accept that everybody makes mistakes, and the goal of the environment/society is to minimize the consequences of those inevitable mistakes, not to judge individuals based on arbitrary metrics.


Code reviews delay gratification. Often by long enough that the visceral relevance drops off a cliff. While (good) code reviews definitely improve the software quality for a project, they can be very jarring breaks in flow. It should be fairly understandable why some of us would regard them as about as pleasant as eating a fibre supplement.


Problem is, we don't want the results the author thinks are right; we want the results that are accurate.


The problem is, it's impossible to know if they are accurate without doing everything the computer was coded to do for you.


This is not entirely true. When I implement a simulation, I specifically look for properties of the simulated system that can be checked, for example the time evolution of total energy or (angular) momentum. If you have a decent set of properties with non-linear relationships between them, it is actually quite hard to have a misbehaving simulation that still produces correct values for these properties. In fact, these checks have led me to the discovery of bugs that would otherwise have been impossible to find, because the sim output was just plausible enough.
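As a toy illustration of that kind of check (a leapfrog-integrated harmonic oscillator, not anyone's actual research code): total energy should stay essentially constant, and a drifting energy is a cheap smoke test for integrator bugs.

    # Toy invariant check: unit-mass harmonic oscillator, kick-drift-kick leapfrog.
    # Total energy E = v^2/2 + x^2/2 should be (nearly) conserved; drift flags a bug.
    def simulate(x=1.0, v=0.0, dt=1e-3, steps=100_000):
        energies = []
        for _ in range(steps):
            v += -x * dt / 2          # half kick
            x += v * dt               # drift
            v += -x * dt / 2          # half kick
            energies.append(0.5 * v * v + 0.5 * x * x)
        return energies

    e = simulate()
    drift = max(abs(ei - e[0]) for ei in e) / e[0]
    assert drift < 1e-4, f"relative energy drift {drift:.2e} suggests a bug"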


This really rings true for me. I've also found that looking for several "known" properties within a complex model's behavior to be an effective way of rooting out subtle bugs. If the model isn't keying in on the obvious, it hardly has a chance of keying in on subtle unknown relationships. I've even gone so far as to optimize hyperparameters based off context specific properties, e.g., hunting for the model that has the most coherent output behavior with regard to a range of inputs (assuming that outputs are expected to have continuous behavior).

https://sproutling.ai/blog/harvest-simulations?jm

https://sproutling.ai/blog/growth-simulations?jm


Well, the premise stated above is that whatever testing you, the "author", judge sufficient doesn't matter; accuracy comes only at the hands of the sacred code review!


I dare you to review dense code for numerical computations and actually spot bugs. This is really hard! Unit tests are actually much more reliable, but they are limited to deterministic algorithms and models of reasonable complexity, that is, where it is viable to compute expected results by alternate means.
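For what it's worth, a sketch of the kind of test I mean, checking a numerical routine against a result computed by alternate means (here, the closed form of an integral):

    import math

    def trapezoid(f, a, b, n=10_000):
        """Composite trapezoid rule for the integral of f over [a, b]."""
        h = (b - a) / n
        return h * (0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n)))

    # Expected value computed by alternate means: the integral of sin over [0, pi] is exactly 2.
    assert abs(trapezoid(math.sin, 0.0, math.pi) - 2.0) < 1e-6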


Code review is more about determining if you have the correct test cases to cover the algorithm, and a solid architecture for maintenance, than it is about algorithmic correctness.

Code review is a tool to push back on your manager ignoring testing: “Steve requested I add X tests.”


That's not necessarily true. We could start building correctness or equivalence proofs for the building blocks of research software, and maybe some day we could prove some meaningful equivalence between how the software is described, and how it actually works.


If only research paid as well as violating the privacy of the American people to trick them into clicking on advertisements, there might be more parity in code quality control between research and industry. But we’ve instead allowed Google to take all the money away from the scientists, who are increasingly doing society's heavy lifting for free.


Generating clickbait ("engagement") and playing roulette ("trading") have become the greatest pastimes of the STEM talent pool.


A grad student faces an existential career risk for every day they spend on anything other than finishing their degree. That's because events beyond their control can throw them out onto the street. This includes loss of funding, loss of their faculty advisor, health issues, visa problems, and so forth.

They may also be the only qualified programmer in their group. A lot of research, while officially housed in some specific department, is quite multidisciplinary.

I work in an R&D group in industry, and our team invited a software engineer to join our group and teach us better practices.


>Writing code in research is quite different from writing code in industry.

Depends on which industry. For example, writing code in research is actually a lot like writing Excel macros in Finance.


> They just care about getting the right results and will complain if they don't

That sounds like a group of people willing to cut corners to get approval from their boss... which isn’t all that different from the incentives in “industry”.


A couple of years ago I looked over the alignment software used in a bunch of genetics studies and spotted a pretty trivial mistake that caused misalignment of the strings. This in turn led to wrong conclusions about the order in which mutations took place. This sort of thing is probably quite common: biologists are not computer programmers (though the combination does for sure occur), and they tend to treat computer programs in roughly the same way as they would treat lab equipment: stuff goes in, other stuff comes out. If it looks good and seems to work then it probably is good.

But the complexity of the software is such that you need to check very carefully whether or not the software operates in the way you expect it to.


You could also choose to use formally verified software for sensitive research.


Full-time developers don't (to a first approximation) use formal verification, much less research scientists.


Other way around. Research scientists are the ones that may (or should) require additional rigor in their results. Formal verification would presumably help in that regard.


I don't disagree, I'm just saying that we as full-time software developers have not made formal verification standard practice, or even approachable. I would bet the average software developer has never even heard of it. So to expect people who do not write software full-time to do so is a little crazy.


You could, but your typical research scientist in biology will not even be aware of what formally verified software is, and even then it remains to be seen whether there is such a thing for their particular application.


There's also the issue of cost, which can go up real quick in some fields.

It can be hard to justify expensive software which might not always be able to accomplish whatever you're aiming to do.


Too bad there's no money in pharma and medical research in general.


There is plenty of money for marketing. Research is a distant second, as you probably well know given your nickname, and if someone put their study on hold for half a decade or so just to make sure that the software they use is properly put together, the funding would evaporate rapidly for that particular team or individual.


Let's institute a ban on pharma ads, at least the ones that disclaim "don't take $NEWDRUG if you're allergic to $NEWDRUG," and give the money to research.


As a physicist who worked in biology for a while, money didn't always make things better and sometimes made it worse. Programs with slick user interfaces tended to be overwhelmingly chosen over better open-source programs which were command line. In bioinformatics a lot of excellent software is open source and updated regularly. In contrast, closed-source software could be nightmarishly opaque in how it handled the data. Also, companies had better salespeople than open-source proponents. Graduate students used to take bets about whether a piece of software would be purchased by looking at how hot the salesperson was.


Even informally verified software would be a revolutionary improvement in quality of science.


Considering formal verification for even simple stuff isn't seeing widespread use, well, anywhere, I don't see this happening


You have to dig really hard to find out what the specific problem was. It turns out this was the fix:

    import glob

    def read_gaussian_outputfiles():
        list_of_files = []
        for file in glob.glob('*.out'):
            list_of_files.append(file)
        list_of_files.sort()
        return list_of_files
That is, the original code never sorted the files and instead trusted that a simple `glob` would return them in the desired order.

It is a reasonable mistake for a total noob to make (just like the hundreds of simulations which use the built-in `rand` function of whatever language they are using without even so much as mentioning this pertinent fact), but it should be called a "programming error" instead of a "code glitch". Also, the error should be publicized up-front.
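On the `rand` point, the bare minimum would look something like this (illustrative only): pin a dedicated, seeded generator and report the generator and seed, so a run can actually be reproduced.

    import random

    SEED = 12345  # report the generator (Python's Mersenne Twister) and this seed with the results

    def run_simulation(n_samples=1_000_000, seed=SEED):
        rng = random.Random(seed)      # dedicated, seeded generator, not the global one
        hits = sum(rng.random() ** 2 + rng.random() ** 2 <= 1.0 for _ in range(n_samples))
        return 4.0 * hits / n_samples  # toy Monte Carlo estimate of pi

    print(run_simulation())            # same seed -> same result on every rerun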


So many comments in this thread are saying that scientists are bad programmers, but even a great programmer can make a mistake like this. The solution isn't better programmers; it's for users to treat codes as instruments, and hence add a verification and calibration step.


I would say it depends on the thought process that caused the bug. There are a couple of possibilities I can think of:

Did the author of the code simply forget to sort? That's equally likely to happen to anyone.

Did they assume that the sort order of glob.glob() was reliable across different OSs and file systems? I don't think this would have happened to an experienced software developer. At least there would have been enough doubt to go read the first line of the docs.

Did they not care whether the code worked anywhere outside their own personal setup? This is perhaps slightly more likely in an academic environment but I'm not completely sure about that. We'd have to know more about the specific circumstances.

What I find more astonishing is that it took so long for this bug to be found. It's not exactly an edge case or a rounding error. It must have caused wildly incorrect results in many cases.

So I think the more important differences may be on an organizational level rather than a question of individual competence.


A reasonable mistake for all who already used the POSIX `glob` function, which returns sorted paths by default.


Seeing the code, I don't get why this was encapsulated in a function, considering it doesn't even take a path argument.

It could easily have been just a one-liner:

    gaussian_outputfiles = sorted(glob.glob('*.out'))
Or, if it needed a function form:

    def read_gaussian_outputfiles(path):
        return sorted(glob.glob(f'{path}/*.out'))


Have you figured out, too, why this gives different results? I can easily imagine that it would affect rounding somewhere, but if so, none of the numbers in the abstract (172.4, 172.7, 173.2, and again 173.2) is incorrect, and the relatively large range points towards numerical instability of the algorithm.

Secondly, I would guess the reported difference between Mac OS Mavericks and Mac OS Mojave is because they tested on a Fusion Drive. (I would expect the filesystem to cause the difference, and that changed with High Sierra, except for Fusion Drives, where it changed with Mojave.)


It’s too bad that the code and correction seem to be locked down, but the linked Twitter discussion has an explanation [1]: the code expects Python’s glob to return a sorted list of files, but it doesn’t guarantee this (the main fix is to sort the result). I had thought reading the Vice article that this would be about case-sensitive vs insensitive filesystems.

[1] https://mobile.twitter.com/bmarwell/status/11818211438352015...


Are they locked down? The article links to the paper [0], which has two zip files to download, first of which is the code, including an explanation of the corrections made [1].

[0] https://pubs.acs.org/doi/10.1021/acs.orglett.9b03216

[1] https://pubs.acs.org/doi/suppl/10.1021/acs.orglett.9b03216/s...


Ahh, I failed to click those. They seemed large enough to be “not the code, but the full dataset”. Thanks for the correction!


To me this just raises more questions like "Why does the order of the files even matter?", and "If it does why wouldn't you make sure it was sorted, or better yet sort it in some reliable fashion?".


If you are writing a paper about the order in which a sequence of events happens, the order matters. It's the difference between cause and effect, the heart of science. Filename is often used as part of the data, as a timestamp.

Obviously reliable sorting is important, which is why people are upset that Python's glob() turned out to be unreliable.
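When the timestamp really is in the filename, the robust move is to sort on it explicitly rather than on whatever order the listing comes back in. A minimal sketch, assuming hypothetical names that embed a timestamp, e.g. result-20190901-1305:

    import glob
    from datetime import datetime

    # Parse the timestamp embedded in names like result-20190901-1305 and sort
    # on it, instead of trusting glob/readdir ordering.
    def files_in_time_order(pattern="result-*"):
        def stamp(path):
            _, date_part, time_part = path.rsplit("-", 2)
            return datetime.strptime(date_part + time_part, "%Y%m%d%H%M")
        return sorted(glob.glob(pattern), key=stamp)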


IIRC glob.glob() is a wrapper around os.listdir(). I used to develop code on an OS X laptop, then push to a Linux box for the actual computational result (big dataset), and discovered at one point that one OS returns a sorted result and the other does not. Sorting is important if you want to track/analyse instrument data over the days in a year.


> Obviously reliable sorting is important, which is why people are upset that python's glob() turned out to be unreliable

A quick glance at the documentation didn't mention it; does glob even promise to sort the result? Relying on that to happen when it's not even promised seems negligent.


I mean, I have learned that people simply won't accept that SQL doesn't guarantee the order of a result set if you don't use "order by". I tried to argue it and eventually gave up. It seems vastly easier to change a language than human nature. People are hardwired to ignore claims that something should not be relied on, because in normal human relations, that is a way of pushing away undue responsibility, not something to be taken literally.


> A quick glance at the documentation didn't mention it

What do you mean? It is the very first sentence of the documentation: "results are returned in arbitrary order"

https://docs.python.org/2/library/glob.html


Ah, see, I missed that, I had just skimmed, then grepped for "sort" and "order" (but mistyped that as oder).


Most developers look at the result from a function call rather than looking at the documentation.


It doesn't sort. But people assume it does, especially if the files are returned in the order in which they were created, which typically will be sorted.

I think this is why it doesn't get noticed; the files are written say

   result-20190901-1303
   result-20190901-1305
   result-20190901-1310
   etc...


Oh man, a file-sorting error due to filesystem differences... A very similar issue happened to me while doing astronomy during undergrad. This was a software pipeline for a specific telescope, and I followed the guide they provided. Essentially there were two directories, one for images and another for error estimates; files in both directories had the same filenames. The software needed text files listing the images and errors, and the guide said to do a basic 'ls images/ > images_list.txt' and 'ls errors/ > errors_list.txt'.

HOWEVER HYPERTHREADING WASN'T ACCOUNTED FOR. Between images_list.txt and errors_list.txt, the file order would be swapped about every 20 or 30 lines, invalidating the analysis and producing poor images.

It only took this undergrad two months to learn what was happening.
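For what it's worth, a small defensive sketch of how the pairing could be built (the directory names here just follow the story above, not the pipeline's actual layout): match image and error files by filename instead of trusting two independent listings to line up.

    import os

    images = sorted(os.listdir("images"))
    errors = sorted(os.listdir("errors"))

    # Fail loudly if the two directories don't contain the same filenames,
    # instead of silently pairing the wrong files.
    assert images == errors, "image/error filenames don't match"

    pairs = [(os.path.join("images", name), os.path.join("errors", name))
             for name in images]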


'Piles-of-files' is itself a bug in an era where sqlite exists and has bindings for virtually every language. Concurrent writing can almost always be reasonably avoided by keeping the processing parallel but serializing the writes. In cases where that earnestly isn't sufficient, a 'proper' RDBMS is a good option.

Pile-of-files is 1960s-era tech. We've learned much since then and our hardware is much more capable. Unless you're shooting for the retro-UNIX aesthetic for artistic reasons, it should be avoided.
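As a rough sketch of the 'parallel processing, serialized writes' idea (the table layout and the results list are made up, standing in for whatever the workers hand back):

    import sqlite3

    # Stand-in for output collected from parallel workers.
    computed_results = [("run-001", 1.23), ("run-002", 4.56)]

    conn = sqlite3.connect("results.db")
    conn.execute("CREATE TABLE IF NOT EXISTS results (name TEXT PRIMARY KEY, value REAL)")
    with conn:  # single writer, one transaction
        conn.executemany("INSERT OR REPLACE INTO results VALUES (?, ?)", computed_results)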


I don't think this is one-size-fits-all advice. Surely once your results are measured in TBs this will fall apart.


This is another great example illustrating the need for reproducible research practices even in the hard sciences, in this case so that the papers in question could be easily checked after Luo's excellent finding.

For an earlier discussion on the same topic, referencing other discussions: https://news.ycombinator.com/item?id=17819420

(Top comment: "This article harkens back to discussions of the "reproducibility crisis" in science, as discussed extensively here just recently (see link below). Where, in this case, not coughing up the code used in the simulations in a timely manner led to an apparently unnecessary multi-year dispute.")


Copy pasting a comment I left almost a year ago:

> A few weeks ago I had a conversation with a friend of mine who is wrapping up his PhD. He pointed out that not one of his colleagues is concerned whether anyone can reproduce their work. They use a home grown simulation suite which only they have access to, and is constantly being updated with the worst software practices you can think of. No one in their team believes that the tool will give the same results they did 4 years ago. The troubling part is, no one sees that as being a problem. They got their papers published, and so the SW did its job.

My own experiences when I was in grad school (engineering, not CS): No one cares about code or code quality. In those days, no one used version control. The attitude all my fellow grad students had was: "I don't need to learn this stuff. Just need to publish this paper. When I become a PI, I'll simply make it my student/post doc's job"


I've always felt like there should be a way to connect experienced software developers with research groups; something like this, for example, would never have happened with code review.

For example, one could imagine a program that connects developers with research groups to help develop good code, or with conferences to test submitted code.

Does anyone know if there already exists a program like that?


there is an organization called Software Carpentry that is basically like this.

however, the real problem has more to do with incentives and funding than it does knowledge of software best practices. right now, if you take the time to write thoroughly-tested clean good code, all you're doing is handicapping your research career against people who can churn out papers faster than you. you can see this because code from computer science departments is nearly as bad as in other fields, even from students who have had real industry jobs and learned how to do the right thing.

it does seem like in this case, though, because the scripts were actually intended as a tool to be used by others, the investment in more careful testing might have been warranted.


It's called an internship (for graduate students), but you're right that it should go the other way too; you'd run into a lot of money, IP, and institutional politics issues, though. Occasionally there are older industry people who work as staff scientists, get paid little, and support the research effort in some capacity, usually IT. The biggest reasons it doesn't happen more often are that in many cases the people who become professors have never done anything in industry, don't care unless funding is part of the conversation, or aren't motivated to make artifacts reproducible. Their research group is a fiefdom over which they wield control, so unless they want to start a company on the side, they don't care about your programming skills and they don't like to give up any control.


A lot of the code used in research is hosted on Github and open for contributions. That would be an easy way to make an impact.

Uni Zürich also has a group that tries to connect non-academics with researchers. The focus goes both ways, but it could be something to look into if you're good with code and want to assist in some way.

Link: https://citizenscience.ch/en/


Why would the researchers bother with this? How does it help them?


Something like this is sorely needed because we are abstracting away layers of implementation details and quirks from the users of computational software. They are in fact abstracted away so much that a doubt that the author might have about the correctness of the software is almost never brought up in a research paper. It is just assumed that whoever wrote the software did their due diligence. You often don't see their code, and even if you do, it is incomplete or impossible to run because while the publication is designed to allow the reader to replicate the study, the same doesn't necessarily apply to the computational component that crunched the numbers to reach the conclusion of the paper.


Software Carpentry would be a good place to start: https://software-carpentry.org/


Do you know many developers with excellent mathematical and high performance computing skills willing to work for 30k a year?

I think this is the underlying reason that the way this connection happens in practice is via commercially available simulation software.


Most scientists are not software engineers, so they would not know about comprehensive testing. This was an eye-opener to me when I moved from academia to a scientific software company.

The situation is improving somewhat as many grad students open-source their software on GitHub, and they often include their makefiles, documentation, and some tests. In earlier days, we software engineers would roll our eyes when management suggested using 'free' university code to save money; sometimes bringing such software up to professional industry standards took more work than starting from scratch from the published papers.


The Reinhart-Rogoff error is probably the most famous Excel error in economics, with huge consequences.

Their 2010 paper, Growth in a Time of Debt (https://www.nber.org/papers/w15639), presented a convincing dataset to show that when external debt reaches 60 percent of GDP, annual growth declines by about two percent.

This one paper was used by top-level politicians in the US and Europe to argue that austerity was the only option, and it helped start widely damaging pro-cyclical austerity policy during the recession. The corrected results don't show any change in growth when the debt-to-GDP ratio goes above 60 or 90 percent.


The paper claimed a -0.1% decline at 90%, not -2% at 60% as you mentioned, and only in advanced countries; the data for emerging countries was much worse. The Keynesian economists who countered showed that the data, when extended over a longer timeframe and without the weighting, was actually +2% at 90%; this too seems to be limited to advanced economies. High debt ratios in emerging markets still seem to have a negative correlation AFAIK.

Regardless, in practice, it looks like the study was completely ignored in the US anyway:

https://fred.stlouisfed.org/fredgraph.png?g=FHS&nsh=1&width=...

Around 2010 the debt just dramatically increased to above 100% instead of declining, and it has continued to grow.

In the EU, comparing debt to GDP in 2012 vs 2019 shows that there has been little decline there either:

https://i0.wp.com/factsmaps.com/wp-content/uploads/2018/02/d...

https://www.statista.com/graphic/1/269684/national-debt-in-e...


That's because the Democratic Party had majorities in both chambers. The paper was at the center of Republican policy led by Paul Ryan.


Oh apologies I misunderstood your comment:

> to show that austerity is the only option. It started widely damaging pro-cyclical austerity policy during regression

I’ve found much of the fear mongering around austerity to be pretty overblown. It’s been pretty rare to find real examples of austerity, except for some cases in the EU, like Greece, where there were attempts to rein in debt that had reached 180% of GDP (and it is still that high today). Otherwise almost all of the advanced countries have kept it around 60-100%, even almost a decade after the recession.

Moderate Keynesian and monetarist policy remains incredibly popular following the recession and in many places higher debt only became more popular even during the good times.

Even looking at the American right, besides Paul Ryan’s failed proposal 9 years ago, there’s little evidence of the debt being lowered even during Republican majorities in the House and Senate.

Yet if you follow Krugman and the like you’d think austerity has been a giant problem in the west.

The world’s a lot more boring and moderate than the fearmongers like to admit.


> Yet if you follow Krugman and the like you’d think austerity has been a giant problem in the west.

I don't follow Krugman closely since he is not a macroeconomist, but his back-of-the-envelope calculation was correct in retrospect.

Obama's stimulus was roughly half of what was needed, and the result reflected that (tax benefits are not proper stimulus). According to the CBO, the US economy was 6.8 percent below its potential, translating into $2.1 trillion of lost production. Lives were destroyed permanently due to the prolonged recession.


There was a rumor when I was in econ grad school that there was a huge bug in Stata in the nineties affecting standard error calculations, rendering a large fraction of the studies from that time invalid.


Stata doesn't even version its packages, so it's literally impossible to reproduce research as you can't retrieve the packages as they were then.

The maintainer of a popular package could choose to change their api today and all code using it would break with no way to revert.

Stata is insanity.


I agree. And it's expensive whereas R is free and a much much much better language.


This should be easy to verify though.

I would be more worried about some of the more popular Stata macros & R libraries. I've spent a good amount of time on several occasions trying to reconcile results from Stata and R, only to be stopped - I hope - by limited or technically dense documentation that I couldn't grok (...or the code was bad)


Should be, but coding standards among professors of econ are generally abysmal. It's a big gap in the educational system for social scientists.


When I mentioned here some time ago that maybe we shouldn't be trusting our climate science models so much, as they consist of millions of lines of poorly tested code, I was heavily criticized...


Computational fluid dynamics (CFD, of which climate modeling is a subfield) actually does software testing better than other scientific fields in my experience. That's not to say that all CFD models are well tested, just that they're doing a fair bit more than nothing. Typically testing is divided into "verification", which is checking the math, and "validation" which is comparing against experimental data. If you can't find anything about a model's verification and validation, don't trust it.

Also, it's actually fairly easy to get apparently good validation by cherry picking the right data, so even doing well here isn't enough. You basically need to be an expert to know the right test cases to look at...

Here's a slide deck that shows the (not great) state of typical practice in validation: https://www.osti.gov/biblio/1141709

Note that some of the "better" journals do worse...

Here's a public domain CFD software that I think does a good job on verification and validation: https://github.com/firemodels/fds

Full disclosure: I have contributed to this project.


Not trusting them is reasonable, but then where does that leave you? You don't know then whether they're underestimating or overestimating things. I'm strongly in favor of building a prior from the geological record and leaning on that as much or more than the models, but that doesn't lead to a very different conclusion.


I guess that, at least compared to society in general, I'm a cognitive nihilist? I highly doubt that it is in our power to predict with useful accuracy something as complex as the climate. If I'm not mistaken, we're still having problems modelling things a billion times less complex than the climate (such as air flow in turbines, etc.)...


Generally, more uncertainty around climate change means you should be more supportive of climate change mitigation rather than less, because it means we can't rule out things that look like the P-T extinction.


There's an infinite number of events which we can't rule out. My approach is basically the opposite: if the evidence for something is super sketchy, maybe we shouldn't spend trillions (or even quadrillions?) on preparing for it.


The evidence is completely airtight. The problem is with the precision of predictions, not the evidence for the mechanism and rough magnitude.


You were heavily criticized because you said nonsense. Cancer research also relies on millions of lines of code but you would not say the same, would you? The science of climate change goes well beyond the models. The models account for only a small fraction of the science and they are produced completely independently by many research groups, leading to overlapping results.


I am a neuroscientist and a good portion of my laboratory's scientific output is code. I can offer a perspective.

Every few months, a case like this one emerges: a published piece of software contained a scientifically relevant bug, and this may have had an influence on other studies that used that software. The reaction is always the same: blaming the naivete of the researchers because they are not professional coders.

While I would certainly love to see more interaction between professional coders and scientists, I don't agree with the overall sentiment. I find it counterproductive and even naive for several reasons:

1) Mistakes happen. Wikipedia has a long and certainly not complete list of software bugs that had major consequences ( https://en.wikipedia.org/wiki/List_of_software_bugs ) and most of them come from professional coders in all kinds of industries. There is no doubt whatsoever that code produced by scientists is in general less streamlined, less reviewed, and way uglier, but is it on average more buggy where it matters? Unless we answer this question, the discussion is moot.

2) Scientists who produce software by themselves do this according to the Open Source philosophy, meaning that the code can be scrutinized by anyone. I'd rather use possibly buggy but open-source software over software that I cannot even scrutinize. I think published results should only be accepted if they rely on open-source software: we would never accept a figure in a paper unless the protocol was 100% disclosed. This is one aspect where we can all immediately act. As a reviewer, I will never accept a paper that made use of new software unless the code is an integral part of the paper. I urge all my colleagues to do the same. Also, as authors, if you have to write software from scratch, take code readability into account for exactly this reason. Python is a good choice for this reason; R much less so.

3) We must encourage scientists to take on more coding, not the opposite. If we keep telling them that they are not going to be good enough, we are not helping. At the moment the trade-off is very clear: either not-professionally-coded software or no software whatsoever. Fields like crystallography have shown that the former option can still be extremely favorable.


I've been asked to consult on a number of programs used for scientific research or environmental modeling.

In one case a science student was expected to work on a C++ simulation for his dissertation. The existing code base had been written by other grad students from previous years. It was clear that no one involved had any experience with C++ and it wasn't going to be possible for him to do a good job with the existing source code. Fortunately, in this case, the C++ based simulation was eventually dropped.

Another time, a team of professional civil engineers couldn't understand why their model was able to make predictions that perfectly matched their experimental data. It predicted a poorly understood physical phenomenon (algae blooms) so perfectly that they asked me for help. In a couple of minutes I was able to explain to them how they had overfitted the model to the small sample of experimental data and that the model likely had no predictive value at all.
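(As a toy illustration of that failure mode, with made-up data rather than anything from their model: a high-enough-degree polynomial will reproduce a handful of points almost exactly while being useless for prediction.)

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 8)
    y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(8)

    # Degree 7 through 8 points: residuals are ~zero, so the "fit" looks perfect...
    coeffs = np.polyfit(x, y, 7)

    # ...but predictions away from the fitted points are typically wildly off
    # compared to the underlying curve (sin(2*pi*1.5) = 0).
    print(np.polyval(coeffs, 1.5))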

I was asked once to debug a large program used to understand the dynamics of floods. The program contained some of the worst coding practices I've ever seen. For example, the same global variables were reused for completely different purposes in different parts of the program: X2DDRATE might mean one thing, and later, on some code paths, it contained a completely different kind of measurement with entirely different units. This was done in an effort to "save memory" by not having too many variables.

A different ecological model I helped repair needed to compute the integral of a function at one point and used the following method: pretend the function was a straight line so that the integral could be computed as the area of a triangle. The function was a simple, uncomplicated exponential, something that a high school math student could integrate exactly.
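(To make the size of that kind of error concrete with made-up numbers, since the actual model isn't shown here: for an exponential decay, the triangle shortcut can be off by a large factor.)

    import math

    A, k, T = 1.0, 0.5, 10.0                  # hypothetical parameters

    # Exact: integral of A*exp(-k*t) from 0 to T.
    exact = A / k * (1.0 - math.exp(-k * T))  # about 1.99

    # The "pretend it's a straight line" version: a triangle with height f(0) = A
    # and base T.
    triangle = 0.5 * A * T                    # 5.0

    print(exact, triangle)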


Researchers should consider making their papers runnable; Jupyter notebooks are a great tool for this. Also, check out Knuth’s literate programming.


I was going to say that would almost make things worse: right now these sorts of bugs are discovered by attempting to re-implement the described algorithm, and discovering that the described algorithm was wrong. If the code was available but wrong, most people would just run the code and find the same wrong answers as the original researchers.

But actually I think that's wrong: far more people would run the code than currently try to re-implement it, and it's likely that someone would notice something fishy about the code and report it.


Well, we don’t have to wonder since this was done with PBRT.

Result? An incredibly high quality codebase with many hundreds of subtle fixes over time coming from readers and researchers.


This is standard practice in some fields. I'm helping an old colleague with one of their projects, and one of the requirements for publication is documented runnable code and all of the underlying data.

I'm still surprised that this isn't expected in other research areas.


> “We all kind of assume that a computer program always spits out the correct answer.”

I'm a bit surprised at this.


I'm not at all. Googling your problem and running your dataset through the first random R package on github you managed to install seems to be an accepted approach in the biomedical fields. If you're lucky they might even list which package it was in the paper's methods.


Retraction Watch has many, many, many examples of this type of thing.

https://retractionwatch.com


“This simple glitch in the original script calls into question the conclusions of a significant number of papers on a wide range of topics in a way that cannot be easily resolved from published information because the operating system is rarely mentioned.”

In computer science it is common practice to mention the experimental setup, along with the OS, hardware configuration, and third-party libraries used, for reproducibility purposes. Still, many papers suffer from reproducibility problems. Now conferences are demanding Docker containers with all dependencies and configs, which improves things a bit. But lack of data is another reason why we can’t reproduce papers.


On this specific bug, I really wish Python just returned a sorted directory listing on every platform. It goes back to Python's original philosophy of passing through OS level behavior rather than standardizing like Java.


Most of the comments on here are talking about repeatability, but I'm concerned about correctness too. It's $40 for 48-hour access to the paper, and I'm not going to do that right now, but hear me out. The order in which the operating system listed the files made a difference, so just having it be the same everywhere doesn't necessarily fix the problem. What if it's the same everywhere, but still isn't the correct order for what the researcher intended? Everyone gets the same error, and that error isn't necessarily 0.


Oh don't tell me it's a spreadsheet app.


It says clearly it was a Python script


Should have used MATLAB.


Why not use formally-verified software?


Usually because the people capable of writing formally-verified software do not understand the specific subject area, and the people that understand the subject area are not generally capable of writing formally-verified software.


For the perspective of a mathematician who came to evangelize theorem provers, I recommend Kevin Buzzard's 2019-09 MS presentation [0] about Lean. He highlights cultural misunderstanding and apathy on both sides of the domain divide. He also references the idea that the people who might make the appropriate tools may not have stayed in academia. So, he has structured his courses around using Lean, with the indirect consequence that power users (undergrads) may choose to become open source committers.

[0] https://www.youtube.com/watch?v=Dp-mQ3HxgDE One hour of presentation and then 15 min of Q&A. My favorite is around 1:04:00, when someone asks a second time why he disprefers Coq, and Buzzard complains that it can't represent some advanced quotient type that he'd have to work around. I'm reminded of [1]

[1] https://prog21.dadgum.com/160.html Dangling by a trivial feature


Having been inspired by that video by Kevin Buzzard and finally finding something that would be worth formalizing, I am trying out Lean at the moment. I have about 2-3 months of Coq experience, so I can say that even without the quotient type, Lean is much better designed than Coq. I can't vouch for how it will do at scale since I've only started it out, but from what I can see, Lean fixes all the pain points that I had with Coq while going through Software Foundations.

It has things like structural recursion (similar to Agda), dependent pattern matching (the biggest benefit of which would be proper variable naming), unicode, `calc` blocks, a good IDE experience (it actually has autocomplete) with VS Code (I prefer it over Emacs, and the inbuilt CoqIDE is broken on Windows), mutually recursive definitions and types, and various other things that I can't recall off the top of my head.

If I were to sum it up, the biggest issue with Coq is that it does not allow you to structure your code properly. This is kind of a big thing for me as a programmer.


>My favorite is around 1:04:00 when someone asks a second time why he disprefers coq, and Buzzard complains that it can't represent some advanced quotient type that he'd have to work around. I'm reminded of [1]

>[1] https://prog21.dadgum.com/160.html Dangling by a trivial feature

Second what abstractcontrol said.[0] I wouldn't say being able to deal with quotient types is a "trivial" feature, because Buzzard does explain why quotient types are crucial for what he does.

The motivation is that he wants to formalise a concept that is the subject of current active research, instead of yet another concept from undergraduate mathematics. In this case, it's the notion of perfectoid spaces,[1] which was only introduced in 2012 by Peter Scholze.[2]

The justification for this is sociological: research mathematicians predominantly find the retreading of old ground boring and trivial. Buzzard wants to sell the power of theorem provers to exactly these mathematicians, so he felt that formalising something that many number theorists would be intensely interested in may have more impact.

Unfortunately, while I'd like to think that he's succeeded somewhat, the reaction I've seen is just more of the "well, isn't that nice" variety, outside of some enthusiasts, many of whom are young and/or not very influential. Mainly, it's because theorem provers still can't help mathematicians improve their productivity, because there's still a lot of stuff that's still missing in the ecosystem, and Buzzard pointed out some of the missing "features" that may help theorem provers take off.

[0] https://news.ycombinator.com/item?id=21239067

[1] https://en.wikipedia.org/wiki/Perfectoid_space

[2] https://en.wikipedia.org/wiki/Peter_Scholze


I watched that video on your recommendation and am glad I did.


Despite what people may have told you, writing formally verified software is a quite mathematical task, requiring the people involved to really engage and write decent software specifications to begin with.


It's kind of like if there were a CPU glitch.


From the article:

> Luo’s results did not match up with the NMR values that Williams’ group had previously calculated, and according to Sun, when his students ran the code on their computers, they realized that different operating systems were producing different results. Sun then adjusted the code to fix the glitch, which had to do with how different operating systems sort files.


I feel like, where feasible, scientists should adopt Kubernetes or something so that the software they use is repeatable.


Kubernetes and “repeatable” don’t exactly seem like they should go together, introducing the extremely high complexity of kubernetes seems likely to result in the opposite of repeatability.

Just using a VM image or docker is simpler to use and also to understand.


There's definitely a big push in that direction in the scientific community right now. More and more tools and pipelines are getting distributed as containerized workflows.

Big projects have realized the need to make their code available and versioned just as they do their input data, side by side with hashes recorded all the way along and reproducibility made as simple as possible. Now we're starting to see it trickle down into less organized/large/disciplined projects as well.


Containers as a technology are nice, but it's easy to fall back into the same traps that make software non-reproducible using containers as well. You can precisely specify all your dependencies, but it often takes a lot of effort to make that happen.

I'm a fan of the approach Guix developers are taking for scientific computing, because it makes reproducible software simple enough for people to use without too many headaches: https://hpc.guix.info/blog/2019/10/towards-reproducible-jupy...


Isn't the whole idea behind containers to eliminate external dependencies?


In a sense. You're right that once a container is built, it has few external dependencies. But you need to get those dependencies from somewhere at build-time, and if you're not careful it's easy to do that in a way that makes it extremely difficult to rebuild that container in the future.

To use a slightly more concrete example: let's say you're using a library in your container that has a severe bug. This bug results in incorrect computations, so you would like to upgrade to a fixed version.

Now let's say that when you built that container initially, you installed packages in the Dockerfile by running e.g. "pip install <package>". The problem is that once this image is built, it's nontrivial to rebuild this image and ensure you're using the same dependencies you were the first time. In a sense, you've lost that information (though you can probably start to figure it out with close inspection of the image).

Yes, there are usually ways around this with the language-specific package managers; Node has package-lock.json, Python has Pipfile.lock, etc. But it's not even close to being the default.
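One small, complementary habit (just a sketch, not a substitute for lock files or pinned base images; the package names are only examples) is to have the analysis script record the exact versions it ran with, so that information survives even if the build recipe was sloppy:

    # Python 3.8+
    import sys
    import importlib.metadata as metadata

    def record_environment(packages=("numpy", "scipy")):
        print("python", sys.version.split()[0])
        for pkg in packages:
            try:
                print(pkg, metadata.version(pkg))
            except metadata.PackageNotFoundError:
                print(pkg, "not installed")

    record_environment()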



