ML is useful for many things, but not for predicting scientific replicability (aisnakeoil.com)
116 points by randomwalker on Aug 12, 2023 | 37 comments



From the OG paper:

>Our machine learning model used an ensemble of random forest and logistic regression models to predict a paper’s likelihood of replication based on the paper’s text.

>We trained a model using word2vec on a corpus of 2 million social science publication abstracts published between 2000 and 2017

>converting publications into vectors. To do this, we multiplied the normalized frequency of each word in each paper in the training sample by its corresponding 200-dimension word vector, which produced a paper-level vector representing the textual content of the paper
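
For concreteness, here's a minimal sketch of that kind of frequency-weighted word2vec paper vector (my reconstruction from the quoted description, not the authors' code), using gensim; the toy corpus is made up:

  from collections import Counter
  import numpy as np
  from gensim.models import Word2Vec

  # Tiny made-up corpus of tokenized abstracts, purely for illustration
  abstracts = [
      ["we", "measure", "the", "effect", "of", "priming", "on", "recall"],
      ["torsion", "tests", "were", "performed", "on", "steel", "specimens"],
      ["participants", "reported", "higher", "wellbeing", "after", "priming"],
  ]

  # 200-dimensional word vectors, as in the quoted description
  w2v = Word2Vec(sentences=abstracts, vector_size=200, window=5, min_count=1)

  def paper_vector(tokens, model):
      # Weight each word's embedding by its normalized frequency in the paper
      counts = Counter(t for t in tokens if t in model.wv)
      total = sum(counts.values())
      vec = np.zeros(model.vector_size)
      if total == 0:
          return vec
      for word, count in counts.items():
          vec += (count / total) * model.wv[word]
      return vec  # any permutation of `tokens` yields exactly the same vector

  print(paper_vector(abstracts[0], w2v).shape)  # (200,)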

If you took a paper and rearranged the words to have a completely different meaning, their method would produce the same prediction. It also has no understanding of, or the ability to differentiate between, quotation and references within the paper and content written by the authors themselves. Good luck with that! It's basically just learning some known shitty combinations of keywords.


> If you took a paper and rearranged the words to have a completely different meaning, their method would produce the same prediction.

It's entirely possible for this to be true and their methodology to still be effective and valid. What that would mean is that the ordering of words adds little predictive power to the classification (whether or not the paper is replicable). I can easily imagine that to be the case. For example, it is a commonly-held belief that it would be a struggle to replicate many papers in the social sciences. If that was the case, this "bag of words" approach would work for those papers, because the mere existence of social science vocabulary would raise the likelihood of failure to replicate.

Second, the same reasoning applies to this point:

> It also has no understanding of, or the ability to differentiate between, quotation and references within the paper and content written by the authors themselves.

In my example, social science papers would reference social science references. So those references in and of themselves could indicate a higher likelihood of failure to replicate.

One of the most important general themes in NLP is that a lot of the structure humans rely on to understand a document (e.g. word ordering, or structural distinctions like your author-text vs. reference vs. quotation point) is sometimes unnecessary for ML to have predictive power.

A common approach is to try something simple like this (bag of words) and then to try something where you add in position vectors (so the model includes word order) or part-of-speech or other structural tagging to see whether that improves classification performance. It's not always required.
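
As a rough sketch of that workflow (nothing from the paper), the baseline-then-add-structure comparison might look like this with scikit-learn; load_corpus is a hypothetical loader returning raw paper texts and 0/1 replication labels:

  from sklearn.feature_extraction.text import CountVectorizer
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import cross_val_score
  from sklearn.pipeline import make_pipeline

  # `load_corpus` is a hypothetical loader: raw paper texts + 0/1 replication labels
  texts, labels = load_corpus()

  # Baseline: pure bag of words, word order discarded entirely
  bow = make_pipeline(CountVectorizer(ngram_range=(1, 1)),
                      LogisticRegression(max_iter=1000))

  # Same classifier, plus bigrams as a cheap proxy for local word order
  with_order = make_pipeline(CountVectorizer(ngram_range=(1, 2)),
                             LogisticRegression(max_iter=1000))

  print("bag of words:", cross_val_score(bow, texts, labels, cv=5).mean())
  print("+ bigrams:   ", cross_val_score(with_order, texts, labels, cv=5).mean())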


> It's entirely possible for this to be true and their methodology to still be effective and valid. What that would mean is that the ordering of words adds little predictive power to the classification (whether or not the paper is replicable). I can easily imagine that to be the case. For example, it is a commonly-held belief that it would be a struggle to replicate many papers in the social sciences. If that was the case, this "bag of words" approach would work for those papers, because the mere existence of social science vocabulary would raise the likelihood of failure to replicate.

If this is true, it would also mean the model is of zero or negative value for actual real-world use. It is not a new insight - least of all to academics - that papers on subjects like cancer cures, consumer prices, or perceptions of well-being are more difficult to replicate than papers on subjects like molecular bonds or the behaviour of steel under torsion.

On the other hand, academics also have an understanding of study power, study construction, how [un]expected the findings are, and the actual significance of the results to the field, all hugely significant information the ML organism does not incorporate. So the model is less useful than the human judgements it purports to replace. Just because a curve fits doesn't mean the model is an improvement on existing understanding.

Indeed, as the model described appears to be based on naive keyword associations across a very small number of papers, it's likely less useful even than a Reddit contrarian saying "bah, economics" to every economics paper. At least the Reddit contrarian might occasionally highlight conspicuously suspect-sounding elements of some abstracts, and won't classify a bad economics paper on the steel industry as better than its peers just because it contains terms associated with engineering papers and the name of the prestigious institution where the author is an undergraduate.


A model doesn't necessarily need to be an improvement on existing understanding to be useful though. It could have other attributes which are valuable. The map of London (where I live) that I use to navigate on my phone is not in any sense the best map of London that has ever been created. It is definitely not an improvement on existing understanding, but it's a great improvement on not having a map.

Say you need to classify all of arxiv for some reason. A model like this might give you the classification performance you need while also being easy to build and not too computationally expensive. It might give better out-of-sample classification performance than a model that attempts to model the problem domain more realistically. Or it might have a distribution of errors that is more tolerable for a particular use case (say you're less worried about misclassifying extreme cases but really need to be accurate in the middle of the distribution, or vice versa). Or it might simply be more robust to overfitting.
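
That last point about error distributions is cheap to check empirically. A rough sketch, where y_true and y_prob stand in for hypothetical held-out labels and predicted probabilities:

  # Sketch: where in the score range do a classifier's errors fall?
  import numpy as np

  def error_by_band(y_true, y_prob, n_bands=5):
      y_true = np.asarray(y_true)
      y_prob = np.asarray(y_prob)
      edges = np.linspace(0.0, 1.0, n_bands + 1)
      for lo, hi in zip(edges[:-1], edges[1:]):
          mask = (y_prob >= lo) & (y_prob < hi) if hi < 1.0 else (y_prob >= lo)
          if not mask.any():
              continue
          preds = (y_prob[mask] >= 0.5).astype(int)
          err = (preds != y_true[mask]).mean()
          print(f"scores [{lo:.1f}, {hi:.1f}): n={mask.sum():4d}  error rate={err:.2f}")

  # Toy, roughly calibrated data just to exercise the function
  rng = np.random.default_rng(0)
  y_prob = rng.uniform(size=1000)
  y_true = (rng.uniform(size=1000) < y_prob).astype(int)
  error_by_band(y_true, y_prob)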

Having a wide analytical toolbox is very useful for data science, and this type of NLP has its place for some use cases. While it may seem fun to poke holes in the approach taken here, I don't think the approach is intrinsically bad in any way. Like any tool, it's bad if misused: a screwdriver can be used as a hammer, but it will suck.


> The map of London (where I live) that I use to navigate on my phone is not in any sense the best map of London that has ever been created. It is definitely not an improvement on existing understanding, but it's a great improvement on not having a map.

But if the "map" is actually composed by a ML process parsing texts about London and arranging the place names based on proximity of mentions and cardinal direction references (this might be a fun exercise, might actually result in something that looks surprisingly like a map with the right data sources, and sufficient tuning, and absolutely won't have any real world use), people are going to ask why you trusted it.

To be fair to the authors of the original paper, their actual claims appear to fall some way short of the blog's insinuation that they were proposing that academics should actually use their model to decide to spend hundreds of hours of work replicating research papers (which would be the perfect example of misuse of ML). But what they've actually done is a very basic statistical ranking of possibly misclassified subfields without any ability to address systematic error.


>> on a corpus of ... abstracts

Wait a minute. Isn't the abstract the summary? I've read accounts of people digging through a paper and seeing impossible, obviously fabricated numbers. I don't think those types of issues would surface in an abstract.


What they are doing is just conditioning on the topic, author, institution and media attention to predict replicability. But they predict averages; they can't make reliable predictions about specific papers. And the study focuses only on psychology, even though they trained their embeddings on 2 million paper abstracts.


Well it's worse than that. It's surprisingly common to read a paper and discover the abstract doesn't match the claims in the body of the paper itself, or that the claims don't match the data in the tables. And of course press releases about a paper sometimes don't match the paper abstract, etc.

The amount of slippage in what's being claimed as papers get summarized is far too large in some fields and a major contributor to distrust in science, as it lets authors have their cake and eat it too. Get challenged on an untrue statement you made to the public? Refer to the body of the paper where the claim isn't made, or is made equivocally, and then blame journalists for "mischaracterizing" your work.

It probably is possible to pick up signs of pseudo-science automatically with ML, but you'd want to use really big GPT-4 style LLMs with large context windows and detailed instructions. We're not quite there yet, but maybe next year.


Yes. It's hard to believe this paper just came out.

It reads like it was written 7 or 8 years ago.


In ML, people don't read the literature, establish precedent, or even formulate sound hypotheses. Why this got posted here, I'll never know... Oh wait, I know: next week someone will publish a paper claiming success with this using ChatGPT. A year later it'll be found to be bogus, but that won't stop the cash from flowing in.


Indeed it does!


Reads like release notes for some early Bayesian spam filter.


I saw the same: I was writing my comment (then I noticed this parent) as «Without the implementation of Understanding, a ML system would see, if not a bag of words, an ordered set of words (anything more?)... May or may not get more input Information than a Bayesian filter».


In general, it's pretty hard to establish that an algorithm is "not useful" for a particular discipline.


bag of words style representations actually can work quite well in many document classification tasks.

wouldn't surprise me at all if affect bleeds through word choice and indicates some emotions like uncertainty or overconfidence that are frequently associated with irreplicable results.

but yes, predictive models of any kind should not take the place of more rigorous validation of past results.


Let’s just throw all this overly political AI stuff to the side. It sounds like what many people have been doing with good old fashioned linear regression for…decades…would severely upset you.

When you’re doing prediction for the sake of prediction, the results speak for themselves. There is no inherent necessity for the indicators to be intuitive. There are certainly contexts in which it does matter, though. When accountability matters.

It’s, as you say, “basically just” doing exactly what it’s doing. It works exactly as well as it works. To argue that it’s good or bad based on some implementation detail is … kind of beside the point.


> Our machine learning model used an ensemble of random forest and logistic regression models to predict a paper’s likelihood of replication based on the paper’s text.

This reads like a description of phrenology.


ML being able to predict the likelihood of replication, which is itself a prediction (about something in Nature), is to me a perfect example of GIGO.


That is an excellent point.


This observation reminds me of trying to counter Betteridge's law of headlines [1] with the observation that if you were to negate the question in the headline, you would clearly get a question whose answer is not No.

Of course, the question you get is no longer an actual _headline_, so there's no reason to expect Betteridge's law to apply.

A similar thing happens here: if you take a scientific article, and rearrange its words to change its meaning, what you get is no longer a scientific article, so outside the scope of the classification task "classify scientific articles based on how likely they are to reproduce". It's not even meaningful to ask whether it will reproduce: what you have is no longer a scientific article reporting on research that actually took place and got a certain result, it's merely a bag of words that you generated by permuting the words of an actual scientific article.

Your observations are irrelevant: they cannot be used to tell apart models which accomplish the stated task (correctly classify scientific articles based on how likely they are to reproduce) from those which do not. This doesn't mean the proposed model actually works (I for one think it almost certainly doesn't), but it cannot be strong evidence either way.

It's like somebody proposing a model that can identify an animal from children's pencil drawings of the animal, and then you countering that if you rearrange a drawing of a man standing next to a horse into a centaur, the model still returns "horse". Such rearranged, synthetic data is well outside the scope of the inputs the model was meant to handle, so you cannot draw any strong conclusions about the model's ability to perform its actual, advertised task.

[1] https://en.m.wikipedia.org/wiki/Betteridge%27s_law_of_headli...


Very interesting, so you’re basically saying that a proper technique could potentially predict a paper’s chance of being successfully replicated?

If so, could be great for the field!


Sure, when we've got something approaching AGI that has in-depth knowledge of a field of research and some "intuition" from experience, then yeah why not.


You don't need AGI. GPT-4's intelligence is good enough. The problem is context window size.

Heck for a lot of things you don't even need ML at all. Regular expressions are good enough:

https://www.irit.fr/~Guillaume.Cabanac/problematic-paper-scr...

Inability to replicate is a very general and vague way to phrase the problem, because it implies that non-replicability is a sort of abstract issue that just randomly emerges. But in practice papers that don't replicate, don't for a reason, and often those reasons can be identified in advance given just the paper.

For a trivial example of this see the GRIM and SPRITE programs. If the numbers in a paper aren't even internally consistent, that's a good sign that something has gone wrong and the paper probably won't replicate.
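
A GRIM-style check is only a few lines, by the way. This is a rough sketch of the idea rather than the original tool: for integer-valued responses, a reported mean has to be some integer total divided by the sample size, so means that can't be reached that way are suspect.

  # Rough sketch of a GRIM-style consistency check (not the original tool)
  import math

  def grim_consistent(reported_mean: float, n: int, decimals: int = 2) -> bool:
      # Is there any integer total k such that k / n rounds to the reported mean?
      half = 0.5 * 10 ** -decimals
      lo = math.floor((reported_mean - half) * n)
      hi = math.ceil((reported_mean + half) * n)
      return any(
          k >= 0 and round(k / n, decimals) == round(reported_mean, decimals)
          for k in range(lo, hi + 1)
      )

  # e.g. a reported mean of 5.19 from n = 28 integer responses is impossible:
  print(grim_consistent(5.19, 28))  # False
  print(grim_consistent(5.18, 28))  # True (145 / 28 = 5.1786 -> 5.18)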

For a less trivial example where you'd benefit from a tool-equipped LLM, consider asking a GPT-4 level AI to cross-check the claims in the abstract, body and conclusion against the data tables. If the claims aren't consistent then you can already assert it won't be possible to replicate because it's not even going to be clear what claim you are attempting to check in the first place.


It would save us a lot of money wasted on experiments, and the motivation is strong, so why are we still experimenting? Because you can't get from the model what is not in the model. But if you just ask an LLM to review papers, it's not completely useless; you can get some manner of feedback. I used the prompt "write an openreview.net style review on this paper" + attached PDF, on Claude with a 100k token window.


AI Linting of papers isn't such a bad idea. There's a lot of crappy papers out there.


Word2vec + random forest: very shallow learning

No, it really doesn't work like this


> They found that the model relied on the style of language used in papers to judge if a study was replicable.

I think the failure here is training the model on published results. I’ve worked with scientists who could write and infer a marvelous amount of information from shit data. And I’ve worked with scientists who poorly described ingenious methods with quality data. The current academic system incentivizes sensationalizing incremental advances that confirm previously published work. I’m not in the least surprised that replication would fail at the manuscript level.

The proper way to do this would be to log experimental parameters in a systematic reporting format specific to each field. With standardized presentation of parameters, I suspect replicability would improve. But this would require a near-impossible degree of coordination between different research groups. It would be feasible, though, for the NIH or NSF to demand such standardized logging as a condition of grant awards over a certain size.
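
Just to make the idea concrete, a machine-readable record along those lines might look something like the sketch below; the field names are illustrative guesses, not an actual NIH/NSF schema.

  # Illustrative sketch of a standardized, machine-readable experiment record.
  # The field names are made up for illustration, not an actual NIH/NSF schema.
  import json
  from dataclasses import dataclass, field, asdict

  @dataclass
  class ExperimentRecord:
      field_of_study: str
      sample_size: int
      preregistered: bool
      randomization: str              # e.g. "block", "simple", "none"
      blinding: str                   # e.g. "double", "single", "none"
      primary_outcome: str
      analysis_plan_url: str
      materials: dict = field(default_factory=dict)  # instruments, stimuli, versions

  record = ExperimentRecord(
      field_of_study="social psychology",
      sample_size=180,
      preregistered=True,
      randomization="simple",
      blinding="single",
      primary_outcome="self-reported well-being (WHO-5)",
      analysis_plan_url="https://osf.io/example",    # hypothetical URL
      materials={"survey_platform": "Qualtrics", "survey_version": "v3"},
  )

  print(json.dumps(asdict(record), indent=2))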


I feel like this is another way of saying "past data can't predict the future"


The book "Your Life in Numbers: Modeling Society Through Data" by Pablo Jensen provides an interesting insight on this, notably the chapter "We are not social atom" [1]. The author argues notably in an interview [2] that for social behaviors, prediction models work hardly better than just saying that next year will be the same as last year.

The main illusion about scientific models for predicting the future is that the approach worked well for groups of simple objects like planets or atoms. Applying it to groups of humans, however, is often disappointing.

[1]: https://link.springer.com/chapter/10.1007/978-3-030-65103-9_...

[2]: https://news.cnrs.fr/articles/why-society-cannot-be-modeled


Yeah. If you could check reproducibility with AI, then you could just randomly generate a million texts and predict which ones will reproduce. But AI is not that kind of magic, since it is trained on past texts.


I hope people start expecting less magic from AI.


Two 800 lb. gorilla truths:

0. AGI is a remote, distant possibility born of science fiction, more remote than commercial fusion power or ubiquitous flying cars.

1. Narrow AI (software) will eat all areas that are reducible, but it cannot completely replace the interactive reasoning of a subject matter expert. That's why there will never be AI lawyers, mechanical engineers, or civil engineers delivering statute or code interpretation reports.

------

IIRC (or someone was pulling my leg ¯\_(ツ)_/¯ ), BMIR at Stanford was doing NLP ML of medical &| biomedical informatics papers and trying to draw new conclusions from automated meta-analyses of existing papers.


> IIRC (or someone was pulling my leg ¯\_(ツ)_/¯ ), BMIR at Stanford was doing NLP ML of medical &| biomedical informatics papers and trying to draw new conclusions from automated meta-analyses of existing papers.

Seems like a silly idea, but it's hard to know if it's really a bad idea until you try it. ML algorithms are specifically good at finding complicated high-order interactions that might not be obvious to a human, and we're still not very good at knowing when there will or won't be something interesting in a new dataset. It's not unreasonable to want to just try everything.


How do you prove a negative?

My own heuristic works pretty well: if the artifacts for the paper are available, and there's either a git repo or a Docker image, I'd say the paper is probably reproducible. Or if the paper, instead of being exactly 10 pages, is 20 or more with an extensive appendix on the methods used, it also has a high likelihood of being reproducible; same if it includes links to data sets.
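
A toy encoding of that heuristic, where the weights and the 20-page threshold are just the rules of thumb above, nothing validated:

  # Toy encoding of the heuristic above; weights and thresholds are rules of thumb
  def reproducibility_score(has_repo_or_docker: bool,
                            page_count: int,
                            has_methods_appendix: bool,
                            links_to_datasets: bool) -> int:
      score = 0
      if has_repo_or_docker:
          score += 2        # code/environment artifacts are the strongest signal
      if page_count >= 20 or has_methods_appendix:
          score += 1        # extensive methods beyond the usual 10-page limit
      if links_to_datasets:
          score += 1
      return score          # higher = more likely to be reproducible

  print(reproducibility_score(True, 24, True, False))  # 3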


> Earlier this year, a paper in the widely read PNAS journal raised the possibility of detecting non-replicable findings using machine learning (ML).

I wonder if they faked their paper too.


I wouldn't doubt it. As I said in a longer comment, I would be incredibly impressed if any statistical model could do this. It isn't something even humans can do. Human reviewers are only capable of determining whether a paper is invalid or indeterminate, not whether it is valid.

The question really is if they intentionally faked the paper or if they tricked themselves.


I would be VERY impressed if __any__ statistical model[0] could __ever__ reliably predict replicability. I just do not believe that there is enough information in papers to do this. Even machine learning papers where the algorithms are given and code and checkpoints are handed out frequently have replicability issues! I just do not believe it is possible to put all the necessary information into writing.

Plus, it is missing two key points about replicability. First, just because you can type up a recipe doesn't mean the recipe is correct. I can tell you how to bake a cake, but if I forget to tell you to add yeast (or the right amount) then you won't get a cake. From the text alone, such a paper would be indistinguishable from one that actually is replicable. You literally have to be an expert paying close attention to determine this.

Second, a big part of replication is that there's variance. You CANNOT ever replicate under exactly the same conditions, and that's what we want! It helps us find unknown and hidden variables. Something confounds or couples with something else and it can be hard to tell, but the difference in setting helps unearth that. You may notice this in some of the LK-99 works: they aren't using the exact same formula as the original authors, but "this should have the same result."

Instead, I think we're all focusing on the wrong problem here. We have to take a step back and look at things a bit more abstracted. As the article even suggests, works never get replicated because it is time consuming. So why is the time not worthwhile? Because replicating a result doesn't get you anything or advance your career. Despite replication being a literal cornerstone of science! Instead our model is to publish in top venues, publish frequently, and that publishing in venues requires "novelty" (whatever that means).

We live in Goodhart's Hell, and I think academia makes it much clearer. A bunch of bean counters need numbers on a spreadsheet to analyze things. They find correlations, assert that correlations are causations, make those the goalposts, and then people hack those metrics. It is absolutely insane and it leaves a hell of a lot of economic inefficiency on the table. I'm sure if you think about it you can find this effect is overwhelmingly abundant in your life and especially at work (it is why it looks like your boss read the CIA's Simple Sabotage Field Manual and thought it was full of great ideas).

Here's how you fix it, though: procedurally. You ask yourself what your goals are. You then ask yourself if those goals are actually your intended goals (subtle, but very different). Then you ask yourself how they can be measured, as well as __if__ they can be measured. If they can't be measured exactly, then you continue forward, but with extra caution, and you need to constantly ask yourself whether these metrics are being hacked (hint: they will be, but not always immediately). In fact, the more susceptible your metric is to hacking, the more work you have in store for yourself. You're going to need to make a new metric that tries to measure the hacking and tries to measure alignment with the actual desired outcome. In ML we call all this "reward hacking." In business we call it bureaucracy.

The irony of it all is that, more often than not, implementing a metric in the first place puts you in a worse position than if you had just let things run wild. The notion sounds crazy, but it does work out. The big driver is that there's noise in the system. People think you can decouple this noise, but it is often inherent. Instead you need to embrace the noise and make it part of your model. But if we're talking about something like academia, there's too much noise to even reliably define good metrics. Your teaching output can't be quantified without tracking students over at least a decade, by which point they've been influenced by many other things (first job isn't nearly enough). For research, it can take decades for a work to become worthy of a Nobel, even after publication. Most good work takes a lot of time, and if you're working on a quarterly milestone you're just making "non-novel" work that needs to filter through a vague and noisy novelty filter. You wonder why the humanities write about a lot of BS? Well, take this effect over 50 years and it'll make perfect sense. The more well-studied an area is, the crazier it is now.

So here's how you fix it in practice: let the departments sort it out. They are small enough that accountability is possible. They know who is doing good work and working hard, regardless of tangible output. If your department is too big, make subgroups (these can be about 100 people, so they can be pretty big). Get rid of this joke of a thing called journals. They were intended to make distribution easier, and reviewers have never been able to verify a work just by reading it (a reviewer is only capable of determining whether a work is invalid or indeterminate, not whether it is valid). Use your media arm to celebrate works that include replication! Reward replication! It's all a fucking PR game anyways. If your PR people can't spin that into people getting excited, hire new PR people. If Apple can sell us on groundbreaking and revolutionary $700 caster wheels and a $1k monitor stand, you can absolutely fucking sell us on one of the cornerstones of science. Hell, we've been seeing that done with LK-99.

/rant

[0] without actually simulating the algorithms and data in the study, which we might just call replication...



