[flagged] Is the Reproducibility Crisis Reproducible? (argmin.net)
67 points by alexmolas 9 months ago | 54 comments



I'll start by saying that if you're going to roast a paper that an econ Nobel winner and one of the most famous and respected working statisticians put their names on, you probably want to turn down the volume and double-check your claims a little more before hitting "post."

A z-score is not at all morally equivalent to a p-value. It's just a standardized measure. Converting measures to z-scores aids interpretation. They can also aid estimation in some cases: using a non-centered parameterization in Bayesian analysis is often crucial to get MCMC to accurately sample from the posterior distribution.

Sure, you can take a z-score, look at the area under the curve, and come up with a p-value. But you don't have to. In the referenced paper, they use z-scores to standardize the measures in the papers they draw from, so they're comparable.
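
To make the distinction concrete, here's a minimal sketch with hypothetical numbers (assuming scipy is available): the z-score itself is just the estimate divided by its standard error, and converting it into a p-value is a separate, optional step that relies on a normal approximation.

  from scipy import stats

  # Hypothetical trial summary: an estimated effect and its standard error.
  effect_estimate = 1.8
  standard_error = 0.9

  # Standardization: z is the estimate in units of its standard error,
  # which puts effects from different studies on a comparable scale.
  z = effect_estimate / standard_error  # 2.0

  # Optionally -- and only under a normal approximation -- the same z
  # can be turned into a two-sided p-value. Nothing forces this step.
  p_two_sided = 2 * stats.norm.sf(abs(z))
  print(f"z = {z:.2f}, two-sided p = {p_two_sided:.3f}")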

The author's other critiques of the paper seem reasonable. It's a problem with all meta-analyses: the amount of work it takes to correctly interpret published papers and then take those results and aggregate them is herculean. To do it for over 20,000 papers is inevitably going to lead to some mistakes. That said, those mistakes may not be fatal to the analysis.

Moreover, saying "no one knows what happened to those 11,285 studies" without checking in with the authors is completely unfair. The first author responded with the code showing exactly how they achieved that figure. Nothing mysterious.

Andrew Gelman responded to him in the comments, as did the first author. I find their responses convincing.


> Moreover, saying "no one knows what happened to those 11,285 studies" without checking in with the authors is completely unfair. The first author responded with the code showing exactly how they achieved that figure. Nothing mysterious.

But that is the whole point! The methodology of how they dropped the 11,285 studies was not even described in the original paper, and even in the comments the author doesn't explain "why", just "how". Hence, I think it's completely fair to call it "irreproducible".

The point of doing "reproducible science" is not that I write a paper, you email me asking how did I come up with my number, and I email you back an explanation. No! The important details should be in the paper already. You may do some magic on your dataset, and that's fair _as long as you detail what magic you did and why_, and given you can defend that practice in front of your peers. Otherwise what is the point of preaching "reproducible science" at all?


> The point of doing "reproducible science" is not that I write a paper, you email me asking how did I come up with my number, and I email you back an explanation. No! The important details should be in the paper already.

This is how it works, people email each other all the time. Why shouldn’t you? You can’t imagine every bit of information that someone would want, and papers have page length limits so you make choices about what to cut.


Don't you think that gives the people who are falsifying results rather an easy time?

If Bob falsifies data and someone e-mails him, asking him to send them rope to hang him with, he can simply delete the e-mail.

Or claim he forgot the details of the analysis. Or claim it was handled by a grad student who left. Or claim the info was lost when a hard drive broke. Or claim the data was the intellectual property of College A and he can't access it now he's at College B. Or claim privacy or copyright rules cover the key data. Or that they don't have a license to the software that can open the data files any more. Or any of a dozen other things.


Any of those responses (or non-response) by Bob would justify the asker in their skepticism of the falsified results. Giving somebody a chance to respond is not the same as relaxing your standards of evidence, it's just an acknowledgement that there might be explanations you haven't thought of and giving an opportunity for those to be brought up.


All anyone can do is try to use that research in their own work, and see if their work supports findings from prior work. Sometimes it doesn’t; I am not sure that means someone lied on purpose. It’s possible they were bad at interpreting the results, or they made bad assumptions. I think poor research standards are the main reason for the reproducibility crisis, and not people lying on purpose.

Typically bad research assumptions or implementations are rooted out during peer review, but it’s an imperfect process.

I do think there needs to be a dedicated, neutral non-profit organization solely responsible for reproducing scientific results across all fields and assigning a reproducibility score to each research finding. This could become an entire field by itself, and would have its own complications, but the reproducibility crisis does exist and needs a solution.


A paper should have enough information for an independent researcher to reproduce the results.

Otherwise, years later, if the author dies, the paper would be basically worthless for reproduction.

(We're still talking about reproducibility crisis, right?)


I think your critiques are fair enough and I think the authors would likely agree they should have been more proactive with the data sharing.

Maybe it sounds like I'm parsing this too much, but I nevertheless still think saying "no one knows what happened" is unfair. They know, they shared, and they justified what they did when called on it with almost no delay. I agree they shouldn't need to be called on it.

Anyone who's advised students, or asked such questions even of presenting researchers, knows that often people will literally not know what happened to all their data.


> Anyone who's advised students, or asked such questions even of presenting researchers, knows that often people will literally not know what happened to all their data.

I am sorry that’s been your experience; maybe it varies by field and quality of research? Most people I’ve questioned have provided reasonable answers about their findings. I don’t understand why anything needs to be assumed to be done in bad faith or shoddily. It’s all a bit Dunning-Kruger to me where everyone assumes that everyone else is doing shoddy or bad work.


Things are complicated.

To be fair: Everyone I've worked closely with in research has gone above and beyond not to cut corners and produce high quality data and research.

What I have in mind here is a situation where people are actually quite careful but can still end up in a place where they don't know what happened because they don't have good systems for creating datasets and storing code.

For example, graduate students are not always taught to work in a reproducible way. It's definitely gotten better from what I can see, but it was normal for people to get source data and work it into its final form in a lot of different steps, not always reproducible ones. E.g., data comes in from a secondary source or other provider. It gets cleaned. That file gets saved as something like "clean data 011234.csv".

More work is done, it gets saved again.

Time passes, things are revisited, and a handful of files exist that, with some care, could likely lead from point A to point B. But the exact process, to say nothing of the dozens, sometimes hundreds, of small data preparation decisions, gets lost to memory.

Code doesn't go in version control. People get new computers. USBs get lost. Universities migrate to new data systems and so on.

All the while, these students and researchers were very careful while doing the work. They were just never trained to use good version control and pipeline processes. They basically do what they do with the papers they write: save and back up while working through the paper, then move on when it's done.

This is made worse when data is proprietary or not legally shareable.

So people aren't necessarily being shoddy or doing bad work, they're just not using good systems.
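
As one hedged illustration of what a "good system" can look like: a single, re-runnable cleaning script that regenerates the final dataset from the raw file, so the preparation decisions live in code (ideally under version control) rather than in a chain of hand-edited files. The file paths, column names, and rules below are hypothetical.

  import pandas as pd

  RAW_PATH = "data/raw/source_export.csv"      # hypothetical provider export, never edited
  CLEAN_PATH = "data/clean/analysis_data.csv"  # always regenerated, never hand-edited

  def clean(raw: pd.DataFrame) -> pd.DataFrame:
      """Every cleaning decision is a line of code, not a memory."""
      out = raw.dropna(subset=["effect", "se"])       # drop rows missing key fields
      out = out[out["se"] > 0]                        # standard errors must be positive
      out = out.drop_duplicates(subset=["trial_id"])  # keep one row per trial
      return out

  if __name__ == "__main__":
      cleaned = clean(pd.read_csv(RAW_PATH))
      cleaned.to_csv(CLEAN_PATH, index=False)
      print(f"{len(cleaned)} rows written to {CLEAN_PATH}")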


> So people aren't necessarily being shoddy or doing bad work, they're just not using good systems.

Agreed. I think there isn’t an incentive to do this because reproducibility takes a back seat to so many other concerns. Unless PIs are told that their publication chances depend on reproducibility, this isn’t going to change.


> It’s all a bit Dunning-Kruger to me

Fantastic article on the reproducibility of the Dunning-Kruger effect: https://replicationindex.com/2020/09/13/the-dunning-kruger-e...

Also see: "The Dunning-Kruger Effect is Autocorrelation" https://economicsfromthetopdown.com/2022/04/08/the-dunning-k...

Lovely quote:

  “These responses to our work have also furnished us moments of delicious irony, in that each critique makes the basic claim that our account of the data displays an incompetence that we somehow were ignorant of.” (Dunning, 2011, p. 247).
Of course Dunning-Kruger is self-referential. Any mention of Dunning-Kruger automatically makes you a victim on the wrong side of the graph.


I think the term is still useful for giving words to a phenomenon people experience. I guess a less charitable and more presumptuous term for such behavior would be calling the person displaying it narcissistic.


> [T]he author doesn't explain "why", just "how".

The below seems like a "why" to me.

> The criteria for selecting the data are an attempt to get the primary efficacy outcome, and to ensure that each trial occurs only once in our dataset. The selection for |z|<20 is because such large z-values are extremely unlikely for trials that aim to test if the effect is zero.
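
For a rough sense of how extreme that cutoff is, here's a quick check under a standard normal approximation (a sketch, not something from the paper):

  from scipy import stats

  # Two-sided tail probability of observing |z| >= 20 under a standard normal.
  p = 2 * stats.norm.sf(20)
  print(p)  # about 5.5e-89 -- essentially impossible if the true effect is near zero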


Ah, I am sorry; it seems a comment was posted later (Jan 9th) providing a justification of their criteria. However, the comment it was posted in reply to wasn't there when I first read the article and the comments on Jan 8th.


Regarding your appeal to authority: high rank in today’s scientific system is an incentive to defend the system.


I'm not sure it qualifies as argumentum ab auctoritate to say that a group comprising Nobel Prize winners and renowned statistics experts is far more likely to be right about statistics than a solitary compsci professor.

Even if it does qualify, it's widely accepted that argument from authority is perfectly valid and often necessary when performing inductive reasoning.

I might even say that if you are not expert enough to judge the matter yourself (or willing to expend enough effort) it should be your default presumption that the more qualified speakers are correct, even if you have legitimate questions about their potential motivations.


My issue here is that Imbens and Gelman are respected because they are good statisticians, but also clear thinkers who have proved themselves dedicated to doing good, careful work. If you have major issues with work their name is on, I would think their reputation would at least lead reasonable people to contact them first, before writing something that leads to hurt feelings, numerous comments and corrections, and more posts on each end.

This whole affair is a case in point. Several authors ended up responding with very reasonable replies, which the author then acknowledged. Then he wrote another post, which is more measured and insightful.

The original post would have in fact been a far better one if the author had first sought those responses and just addressed them all together at once. His points about metascience would not then be dragged down by back and forth in the comments that clearly got personal.


Small quibble: converting a measure to a z-score requires an assumption of normality and that you're considering a population.

For a sample the equivalent is the t-statistic, which indeed IS very often used for p-values (with the ever popular t-test) and has a decently strict list of assumptions (which, like those of [O|W|G]LS are very frequently ignored).


z-scores just require knowing a standard deviation and a mean. You only need an assumption of normality if you want to do things like assign p-values. Of course, that is mostly how they are used.
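
A small sketch of that distinction with purely illustrative data (assuming numpy and scipy): the z-scores themselves need only a mean and a standard deviation, while the p-value step is where a distributional assumption enters (normal for z, Student's t for a small sample).

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(0)
  sample = rng.normal(loc=0.5, scale=1.0, size=30)  # illustrative data only

  # Pure standardization: no distributional assumption is needed yet.
  z_scores = (sample - sample.mean()) / sample.std(ddof=1)

  # Turning a statistic into a p-value is where the assumptions come in:
  # the one-sample t-test assumes the data are (approximately) normal.
  t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)
  print(f"t = {t_stat:.2f}, p = {p_value:.4f}")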


"You Come At the King,You Best Not Miss" ;-)


> I'll start by saying that if you're going to roast a paper that an econ Nobel winner

"By 2005 or so, it will become clear that the Internet's impact on the economy has been no greater than the fax machine's." -- Paul Krugman ( Nobel Prize Winner in Economics )

If you have to pathetically appeal to authority right from the start, you probably have no argument worth considering.



Which are completely ignored by the author of this article. Since the paper was picked up by some right-leaning news outlets, unserious people feel the need to step in and start attacking the authors and the paper out of some dogmatic response.


The next post in the author's blog addresses the comments by the authors of the article [0].

Also, I don't think the arguments written in the post are "unserious"; you might not like them, but the author of the blog makes some interesting points against the original paper.

[0]: https://www.argmin.net/p/arbitrage-in-data-exchange-rates


I would not call that addressing the comments, he merely acknowledged them and rehashed his same argument. The follow up was better left unposted.


Articles talking about the reproducibility crisis focus on soft science, where it is more likely to occur, but there is at least one major hard-science (physics) reproducibility crisis:

We are unable to reproduce measurements of the Gravitational constant.

Reproducibility - distinct from the uncertainty claimed by the experiments - has not advanced at all since the 1940s. This is unlike any other physical constant, where precision and reproducibility have consistently progressed along with our technology:

https://en.wikipedia.org/wiki/Gravitational_constant

> Published values of G derived from high-precision measurements since the 1950s have remained compatible with Heyl (1930), but within the relative uncertainty of about 0.1% (or 1,000 ppm) have varied rather broadly, and it is not entirely clear if the uncertainty has been reduced at all since the 1942 measurement. Some measurements published in the 1980s to 2000s were, in fact, mutually exclusive.[7][30] Establishing a standard value for G with a standard uncertainty better than 0.1% has therefore remained rather speculative.


It seems irresponsible that Ben Recht drafted this article without giving the authors a chance to provide input. In proper journalism, writers reach out to the people they’re reporting on for comment. It would certainly have saved Recht the embarrassment, and now he wouldn’t be unnecessarily doubling down on incorrect claims.

Recht should probably correct himself here (maybe by writing another article in collaboration with Gelman etc.), and people should be okay with other people making mistakes, and accepting apologies or mea culpas. Information is complex, and there are very few geniuses capable of understanding or synthesizing everything in isolation. It’s best to assume good faith work and collaborate to promote collective understanding.

Edit: ah okay, so this is related to politics somehow, and expectedly the discussion is devoid of reason. Or sense, apparently—see one comment in the blog where “the consensus is that there is no consensus”, whatever the fuck that means.


Also, note that Gelman never does that.


Let's just take a step back and realize this article (the blog post) is not peer reviewed and uses generally emotionally charged or accusatory language towards the authors. Even if it's correct, it does not meet the same publication bar as the original paper. We also have comments from the original authors which challenge the claims made here. The two pieces of writing are not on the same playing field.

Honestly, that's enough red flags to dismiss this blog post as "not convincing enough, more work needed".


“Someone just shows you a univariate histogram and wants you to believe that it’s true. But you shouldn’t believe anything.”

It’s interesting that this is basically the same message he was attacking. He and Gelman and most other bright people agree, don’t trust most published research. Of course this is especially true for meta-research as the author seems to suggest.


> "...But if you want to fix experiments in clinical medicine, this is where you start."

You start by not allowing interests with a financial incentive to get a positive result from a clinical trial to run those trials - it's an obvious conflict of interest. However, the University of California has been a long-time leader in the exclusive licensing and privatization of discoveries made with taxpayer dollars, so the UC also has an incentive to promote bogus clinical trials, because it gets a percentage of sales from UC-licensed inventions. A professor at the UC who writes a screed defending these practices (which is what this article amounts to) will thereby ingratiate themselves with the administration - a positive career move!

The real problems come down to deliberate systematic bias in experimental design, which people outside of academia might consider to be fraud - and it should be well-understood that statistical tests have a very hard time revealing such systematic biases.

Let's take a specific example - Gilead's remdesivir, and Gilead's fellowship programs and gifting to UC Berkeley professors. The corporatization of academics in the United States means that trust in the independence of academic researchers is no longer warranted, they're more in the cheerleading section now than in the rigorous science section. Thus (2021):

https://pubmed.ncbi.nlm.nih.gov/34252308/

> "Here, we critically evaluate the assumptions of the models underlying remdesivir's promising preclinical data and show that such assumptions overpredict efficacy and minimize toxicity of remdesivir in humans. Had the limitations of in vitro drug efficacy testing and species differences in drug metabolism been considered, the underwhelming clinical performance of remdesivir for both COVID-19 and Ebola would have been fully anticipated."


"Scientists prove reproducibility crisis by writing irreproducible paper." Quite smart really! Even if you're wrong you are correct.


Aside from the matter of whether the article's criticisms of the paper in question are valid[1], I think there's some equivocation going on here between two kinds of thing that can both be called failures of "replicability" or "reproducibility" but that are very different. (1) Sometimes a paper doesn't make it perfectly clear exactly what the authors did and why. (2) Sometimes it's clear enough that you can try to do the same thing, and when you do you get different results.

[1] As others have pointed out, there are rebuttals in the comments from some of the authors of the paper; make of them what you will.

The "reproducibility crisis" / "replication crisis" is about #2: lots of published work, it seems, reports results that other people trying to do the same experiments can't reproduce. This probably means that their results, or at least a lot of their results, are wrong, which is potentially a big deal.

The article's complaint about the paper by van Zwet, Gelman et al isn't that. It's that some details of the statistical analysis aren't reproducible in sense 1: the authors did some pruning of the data but apparently didn't explain exactly what criteria they used for the pruning.

You could argue that actually #1 is worse than #2, because maybe what the authors did is bad but you can't even tell. Or you could argue that it's not as bad as #2, because most of the time (or: in this specific case) what they do is OK. But I don't think it makes any sense to suggest that they're the same thing, that there's some great irony if a paper about #2 has a #1 problem. That's just making a pun on two completely different sorts of reproducibility.

(Note 1: The article makes another complaint too, which if valid might be a big deal, but it doesn't have anything much to do with reproducibility in either sense.)

(Note 2: One of the authors of the paper does explain, in comments on the OP, exactly how they did the pruning. I have not checked what the original paper says about that; it doesn't seem to be available freely online.)

(Note 3: Even if the Zwet/Gelman/... paper were absolute trash, I don't really see how that would make the reproducibility crisis irreproducible. This is one paper published in 2023; the reproducibility crisis has been going on for years and involves many many unsuccessful attempts at replication.)


If it is reproducible, then it shouldn't be reproducible, as per the crisis. If it isn't reproducible, then it has reproduced itself in the form of not being reproducible. Therefore, accurately answering the question, "Is the Reproducibility Crisis Reproducible?" is equivalent to solving the halting problem ^_^


"This sentence is False" actually has an answer (same with the halting problem). It is indeterminate. You may not like that answer, but it is one. There's a confusion that people think things have to be True or False (halt, not halt) but there's a third answer. This was one of the great logic steps in the 20th century and my name might remind people of that famous theorem.


:O

This reminds me of the brilliant Portal exchange between two AIs:

In the 2011 video game Portal 2, artificial intelligence GLaDOS attempts to use the "this sentence is false" paradox to kill another artificial intelligence, Wheatley. However, lacking the intelligence to realize the statement is a paradox, he simply responds, "Um, true. I'll go with true. There, that was easy."


The point that a study's strength of evidence should not be reduced to a single statistic or parameter is a very, very important one.


Lies, damned lies and (Bayesian) statistics


Bayesian statistics are sold as a panacea to fix the current (many) ills of science.

In reality they don't fix anything, and probably make things even worse. There are no commonly accepted or analyzed criteria for assessing Bayesian analyses. Heck, an analysis being "Bayesian" tells you almost nothing about it, as it can mean a plethora of unrelated things.

The crises of science are sociological, not technical. My take of the core issue is the changing of science from an ethics and reputation based institution into a neoliberal bean counting paper industry.


> My take of the core issue is the changing of science from an ethics and reputation based institution into a neoliberal bean counting paper industry.

Saying it was an "ethics ... based institution" is challenging for me - it implies right and wrong, which requires someone to decide what is right and what is wrong.

Saying it was "reputation based" is even worse - lots of things got delayed because "reputable scientists" were not ready to embrace a change of paradigm.

Science as a system was and is constantly evolving; it had many issues before and it has many issues now. Generally it evolved in a direction that maximizes knowledge acquisition, and it had to deal with the social structures of its time (e.g., don't get burned at the stake for your theories). If it is in a bean-counting phase, that might not be so bad in the grand scheme of things, but it should not get stuck here either.


> Saying it was an "ethics ... based institution" is challenging for me - it implies right and wrong, which requires someone to decide what is right and what is wrong.

Ethics here don't mean anything metaphysical. They mean something like "professional ethics" or even "taking pride in your work". More generally, some shared (and maybe internalized) view of what's good/acceptable/bad behavior in a community. E.g., in medicine this is/was sort of codified in the Hippocratic oath.

In such a system reputation stems (at least ideally) from how well the actors (individual or institutional) behave in terms of these ethics. This of course has its own set of problems and may lead to (other) forms of corruption. But IMHO it's less bad than the neoliberal model where individual actors should focus on their individual career success "by any means necessary" and just trust that "science is self correcting".


Wait, Bayesians are neoliberals now? I thought Bayesians were infamously right wing?


I don't even get what the conflict is between Bayesian and frequentist statistics. To me it seems like Bayesian statistics is useful for some set of use cases, and frequentist statistics for another set.

Frequentist hypothesis testing falls somewhere in the middle, and as long as it's not over-interpreted is fine as it is, with the advantage being it's easy to do. Though the downside is that it's not easy to interpret.

The downside of Bayesian statistics, in my experience, is that it's really hard to teach it to less-than-very-bright people, even in the academy. I mean, those who don't even understand what "falsifying a null-hypothesis" means (and doesn't mean) will have a very hard time doing Bayesian analysis properly.

In a more advanced setting, though, Bayesian approaches provide a much better tool for comparing a set of alternative hypotheses. But it requires that users understand the math, and are not just following some script, and also that those involved are willing to provide their priors before evaluating the data.
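
As a hedged sketch of that last point (not anything from the paper or the post): comparing two simple hypotheses with an explicitly stated prior, via a Bayes factor for a made-up binomial experiment.

  import numpy as np
  from scipy import stats
  from scipy.special import betaln, gammaln

  # Made-up data: 70 successes in 100 trials.
  k, n = 70, 100
  a, b = 1.0, 1.0  # Beta(1, 1) prior under H1, stated before seeing the data

  # H0: the success probability is exactly 0.5 -> plain binomial likelihood.
  log_ml_h0 = stats.binom.logpmf(k, n, 0.5)

  # H1: p ~ Beta(a, b) -> the marginal likelihood is the beta-binomial pmf.
  log_ml_h1 = (gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)
               + betaln(k + a, n - k + b) - betaln(a, b))

  # A Bayes factor well above 1 favors H1 over H0.
  print(f"Bayes factor (H1 vs H0): {np.exp(log_ml_h1 - log_ml_h0):.1f}")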


I think they're suggesting that the problem in science is that it's become accounting and metric driven, just like the rest of the world.

Is this an actual stereotype that Bayesians are right wing? It doesn't seem at all political to me, but I guess I've never met someone who self identified as a Bayesian either.


The author of the blog post makes a similar argument in the comments section about randomized controlled trials - that they're good for "further down the line" efficacy studies, not scientific discovery per se[0].

[0] https://open.substack.com/pub/argmin/p/is-the-reproducibilit...


I read it as clever irony.


I don't think one's philosophical stance toward statistics correlates very strongly with other political views. There are maybe some undercurrents regarding epistemology (Bayesians being relativists/antipositivists and frequentists being more positivist about "the truth"), which would maybe give Bayesians a slight left lean. And the father of frequentism/NHST was pretty HC right wing, but I don't think that means much either.

I'm more of a Bayesian in interpretation of statistics and epistemology, and I'm pretty far left. N=1 of course.

Neoliberalism refers to a social order and doesn't have much to do with probability and statistics.


Throw "neoliberal" into anything and everything to show your disdain for it! And, simultaneously, for whatever you happen to think "neoliberal" means!


Neoliberal means roughly the economic (and social) system/ideology we live in currently. A system that emphasizes relatively weakly regulated ("free") market and market-like structures as the way of allocating resources and production.

This can be contrasted to the preceding systems that had more regulation of markets and planned resource production and allocation. Roughly Keynesian or social democratic economics, with their associated view of social organization.


Okay, so what does this have to do with science, exactly? Back in the days of the original liberalism, often with no regulation of markets whatsoever, science indeed was a tinkerer's world, with ethics and reputation being at the core of an essentially rudderless and bottom-up endeavour. And it worked very well, but then matured and became more bureaucratic, formalized, which is a trade-off, yadda yadda. How is this latter state "neoliberal"? Are you saying this wasn't already happening in the 1970s when changes/reforms that can be described as neoliberal started happening, and then was caused by those reforms?


The most influential change has been the increase in competition for resources in academia, especially via grants and similar competitive funding (to both researchers/research teams and to universities themselves) [1]. The idea that competition breeds efficiency is a core tenet of (neo)liberalism.

There has been a concurrent shift toward bibliometric assessment of researchers and institutions, in large part to have metrics for the competition.

In general this is largely an application of the (neoliberal) New Public Management [2] model applied to academia and science.

Before this shift, funding was based mostly on budgets, akin to how e.g. schools or (public) healthcare and police are funded. Academics were mostly just hired to a (permanent contract) job when there was an opening. Anecdotally, my supervising professor was tenured almost straight after he got his PhD in the 1970s, and he didn't really have to apply for competitive funding until the 2000s or so. This was more or less the norm back then.

In Anglo-American countries the neoliberal turn started already in the 1970s, although there was also a contemporary explosion in the funding of science, in large part due to the role of scientific and technological progress in the Cold War.

Degree inflation likely plays a role too, and in research this shows up as more PhD students who have to churn out papers to get their degrees (which is beneficial to the universities, as they are assessed on how many affiliated publications they output). There's of course plenty of quality PhD-level research, but it's also quite obvious that quality is affected by the students still essentially learning the ropes.

Academia's, science's, and higher education's societal position and "clout" were quite different and smaller in the era of classical liberalism (roughly until WW1), and public funding had a much smaller role.

It's of course totally arguable that competition leads to more efficiency. However, the academic world doesn't really have a natural market, and establishing the comparative value of different lines of research, academic institutions, or individual researchers is very difficult. Because of this, resource allocation is done largely based on the quantity of scholarly outputs (papers and PhD degrees) and salesmanship. These, of course, are very easy to game, and those who don't engage in the game tend not to survive. And it adds huge overheads.

[1] https://link.springer.com/article/10.1007/s10734-008-9169-6

[2] https://en.m.wikipedia.org/wiki/New_Public_Management


> Wait, Bayesians are neoliberals now? I thought Bayesians were infamously right wing?

Well, according to the statistics...


I have read many confirmations of the crisis. Without question it's a problem in physical and psychological medicine, but also in the social sciences to a large degree (or at least the civil war within the social sciences is figuring that out). In my opinion, climate science is one of the worst.

The discussion isn't whether or not there's a crisis; it's about the reasons for the crisis and its root cause, which I don't believe has been answered yet.

Is it because we simply don't actually understand the science and perhaps sometime in the future we will actually understand it? Maybe? But why then do we allege that it's even science at this point?

Is it because of some conspiracy? For example, cPTSD ought to be in the DSM; why isn't it? Because it would wipe out half the book? cPTSD better diagnoses the vast majority of personality disorders and such. How about the major breakthroughs in cancer and heart disease? China, Ecuador, Mexico, etc. have basically confirmed a cheap, highly effective cure for cancer. Seemingly no limitation on the type of cancer; the same cure is effective for breast, renal, and pancreatic. As confirmed by actual cancer organizations, published by NIH.

Is it because the STEM movement has basically robbed all the good people from these fields? The people who would have been superstars in the social sciences are now engineers?

Or, my personal favourite: it's political. They write their 'study' with p=0.00005 but conclude they were correct as opposed to a null value. Then the political publications pick it up as if it were p=1.



