>Our machine learning model used an ensemble of random forest and logistic regression models to predict a paper’s likelihood of replication based on the paper’s text.
>We trained a model using word2vec on a corpus of 2 million social science publication abstracts published between 2000 and 2017
>converting publications into vectors. To do this, we multiplied the normalized frequency of each word in each paper in the training sample by its corresponding 200-dimension word vector, which produced a paper-level vector representing the textual content of the paper
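For concreteness, here's a minimal sketch of the paper-vector construction that passage describes, assuming a pretrained 200-dimensional gensim word2vec model (the file path and variable names are mine, not the authors'):

```python
# Sketch of the described bag-of-words paper embedding: weight each word's
# 200-dim word2vec vector by its normalized frequency in the document and
# sum. Note the result is invariant to word order.
from collections import Counter

import numpy as np
from gensim.models import KeyedVectors

# Assumption: a pretrained 200-dimensional word2vec model saved locally
# (hypothetical path, not the authors' actual model).
wv = KeyedVectors.load("social_science_w2v_200d.kv")

def paper_vector(text: str, wv: KeyedVectors) -> np.ndarray:
    tokens = [t for t in text.lower().split() if t in wv]
    counts = Counter(tokens)
    total = sum(counts.values())
    vec = np.zeros(wv.vector_size)
    if total == 0:
        return vec
    for word, count in counts.items():
        vec += (count / total) * wv[word]  # normalized frequency x word vector
    return vec

# The resulting vectors would then feed the random forest / logistic
# regression ensemble described in the quoted text.
```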
If you took a paper and rearranged the words to have a completely different meaning, their method would produce the same prediction. It also has no understanding of, or the ability to differentiate between, quotation and references within the paper and content written by the authors themselves. Good luck with that! It's basically just learning some known shitty combinations of keywords.
> If you took a paper and rearranged the words to have a completely different meaning, their method would produce the same prediction.
It's entirely possible for this to be true and their methodology to still be effective and valid. What that would mean is that the ordering of words adds little predictive power to the classification (whether or not the paper is replicable). I can easily imagine that to be the case. For example, it is a commonly-held belief that it would be a struggle to replicate many papers in the social sciences. If that was the case, this "bag of words" approach would work for those papers, because the mere existence of social science vocabulary would raise the likelihood of failure to replicate.
Secondly, the same argument applies to this point:
> It also has no understanding of, or the ability to differentiate between, quotation and references within the paper and content written by the authors themselves.
In my example, social science papers would reference social science references. So those references in and of themselves could indicate a higher likelihood of failure to replicate.
One of the most important general themes in NLP is that a lot of the structure that humans find important to understand a document (e.g. word ordering, or parsing structural differences like your text vs reference vs quotation distinction) is sometimes unnecessary for ML to have predictive power.
A common approach is to try something simple like this (bag of words) and then to try something where you add in position vectors (so the model includes word order) or part-of-speech or other structural tagging to see whether that improves classification performance. It's not always required.
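To make that concrete, here is a minimal sketch of such a comparison with scikit-learn; the toy texts and labels are purely illustrative stand-ins, not the paper's data:

```python
# Compare a plain bag-of-words baseline against the same model with
# bigram features (a crude way to add some local word order) and see
# whether the extra structure improves classification at all.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy stand-in data (made up): 1 = replicated, 0 = failed to replicate.
texts = [
    "priming effects on consumer behaviour in a small survey",
    "ego depletion and willpower in undergraduate participants",
    "perceived well-being and social media use questionnaire",
    "torsional behaviour of steel beams under cyclic load",
    "bond energies measured by infrared spectroscopy",
    "fatigue crack growth in aluminium alloy specimens",
]
labels = [0, 0, 0, 1, 1, 1]

def mean_auc(ngram_range):
    pipe = make_pipeline(
        TfidfVectorizer(ngram_range=ngram_range),
        LogisticRegression(max_iter=1000),
    )
    return cross_val_score(pipe, texts, labels, cv=3, scoring="roc_auc").mean()

print("bag of words:", mean_auc((1, 1)))  # unigrams only
print("with bigrams:", mean_auc((1, 2)))  # adds some local word order
# If the gap is negligible, word order adds little predictive power here,
# which is exactly the comparison described above.
```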
> It's entirely possible for this to be true and their methodology to still be effective and valid. What that would mean is that the ordering of words adds little predictive power to the classification (whether or not the paper is replicable). I can easily imagine that to be the case. For example, it is a commonly-held belief that it would be a struggle to replicate many papers in the social sciences. If that was the case, this "bag of words" approach would work for those papers, because the mere existence of social science vocabulary would raise the likelihood of failure to replicate.
Even if this is true, it would also mean that the model is of zero or negative value for actual real-world use. It is not a new insight - least of all to academics - that papers on subjects like cancer cures, consumer prices, or perceptions of well-being are more difficult to replicate than papers on subjects like molecular bonds or the behaviour of steel under torsion.
On the other hand, academics also have an understanding of study power, study construction, how [un]expected the findings are, and the actual significance of the results to the field, all hugely significant information the ML model does not incorporate. So the model is less useful than the human judgements it purports to replace. Just because a curve fits doesn't mean the model is an improvement on existing understanding.
Indeed, as the model described appears to be based on naive keyword associations across a very small number of papers, it's likely less useful even than a Reddit contrarian saying "bah, economics" to every economics paper. At least the Reddit contrarian might occasionally be able to highlight conspicuously suspect-sounding elements of some abstracts, and won't classify a bad economics paper on the steel industry as better than its peers because it contains terms associated with engineering papers and the name of the prestigious institution where the author is an undergraduate.
A model doesn't necessarily need to be an improvement on existing understanding to be useful though. It could have other attributes which are valuable. The map of London (where I live) that I use to navigate on my phone is not in any sense the best map of London that has ever been created. It is definitely not an improvement on existing understanding, but it's a great improvement on not having a map.
Say you need to classify all of arxiv for some reason. A model like this might give you the classification performance you need while also being easy to build and not too computationally expensive. It might give better out of sample classification performance than a model that attempts to more realistically model the problem domain. Or it might have a distribution of errors that is more tolerable for a particular use case (like say you're less worried about misclassifications of extreme cases but you really really need to be accurate in the middle of the distribution or vice versa for some reason). Or it might be more robust to over fitting or whatever.
Having a wide analytical toolbox is very useful for data science, and this type of NLP has its place for some use cases. While it may seem fun to poke holes in the approach taken in this case, I don't think the approach is intrinsically bad in any way. Like any tool, it's obviously bad if misused, in the same way that a screwdriver can be used as a hammer but will do the job badly.
> The map of London (where I live) that I use to navigate on my phone is not in any sense the best map of London that has ever been created. It is definitely not an improvement on existing understanding, but it's a great improvement on not having a map.
But if the "map" is actually composed by a ML process parsing texts about London and arranging the place names based on proximity of mentions and cardinal direction references (this might be a fun exercise, might actually result in something that looks surprisingly like a map with the right data sources, and sufficient tuning, and absolutely won't have any real world use), people are going to ask why you trusted it.
To be fair to the authors of the original paper, their actual claims appear to fall some way short of the blog's insinuation that they were proposing that academics should actually use their model to decide to spend hundreds of hours of work replicating research papers (which would be the perfect example of misuse of ML). But what they've actually done is a very basic statistical ranking of possibly misclassified subfields without any ability to address systematic error.
Wait a minute. Isn't the abstract the summary? I've read accounts of people digging through a paper and seeing impossible, obviously fabricated numbers. I don't think those types of issues would surface in an abstract.
What they are doing is just conditioning on the topic, author, institution and media attention to predict replicability. But they predict averages; they can't make reliable predictions about specific papers. And the study focuses only on psychology, even though they trained their embeddings on 2 million paper abstracts.
Well it's worse than that. It's surprisingly common to read a paper and discover the abstract doesn't match the claims in the body of the paper itself, or that the claims don't match the data in the tables. And of course press releases about a paper sometimes don't match the paper abstract, etc.
The amount of slippage in what's being claimed as papers get summarized is far too large in some fields and a major contributor to distrust in science, as it lets authors try to have their cake and eat it. Get challenged on an untrue statement you made to the public? Refer to the body of the paper where the claim isn't made, or is made equivocally, and then blame journalists for "mischaracterizing" your work.
It probably is possible to pick up signs of pseudo-science automatically with ML, but you'd want to use really big GPT-4 style LLMs with large context windows and detailed instructions. We're not quite there yet, but maybe next year.
In ML, people don't read the literature, establish precedent, or even formulate sound hypotheses. Why this got posted here, I'll never know... Oh wait, I know: next week someone will publish a paper claiming success with this using ChatGPT. A year later it'll be found to be bogus, but that won't stop the cash from flowing in.
I saw the same thing: I was drafting my comment (before I noticed this parent) as «Without the implementation of Understanding, an ML system would see, if not a bag of words, an ordered set of words (anything more?)... It may or may not get more input information than a Bayesian filter».
Bag-of-words style representations actually can work quite well in many document classification tasks.
It wouldn't surprise me at all if affect bleeds through word choice and indicates some emotions, like uncertainty or overconfidence, that are frequently associated with irreplicable results.
But yes, predictive models of any kind should not take the place of more rigorous validation of past results.
Let’s just throw all this overly political AI stuff to the side. It sounds like what many people have been doing with good old fashioned linear regression for…decades…would severely upset you.
When you’re doing prediction for the sake of prediction, the results speak for themselves. There is no inherent necessity for the indicators to be intuitive. There are certainly contexts in which it does matter, though. When accountability matters.
It’s, as you say, “basically just” doing exactly what it’s doing. It works exactly as well as it works. To argue that it’s good or bad based on some implementation detail is … kind of beside the point.
Our machine learning model used an ensemble of random forest and logistic regression models to predict a paper’s likelihood of replication based on the paper’s text.
This observation reminds me of trying to counter Betteridge's law of headlines [1] with the observation that if you were to negate the question in the headline, you would clearly get a question whose answer is not No.
Of course, the question you get is no longer an actual _headline_, so there's no reason to expect Betteridge's law to apply.
A similar thing happens here: if you take a scientific article, and rearrange its words to change its meaning, what you get is no longer a scientific article, so outside the scope of the classification task "classify scientific articles based on how likely they are to reproduce". It's not even meaningful to ask whether it will reproduce: what you have is no longer a scientific article reporting on research that actually took place and got a certain result, it's merely a bag of words that you generated by permuting the words of an actual scientific article.
Your observations are irrelevant: they cannot be used to tell apart models which accomplish the stated task (correctly classify scientific articles based on how likely they are to reproduce) from those which do not. This doesn't mean the proposed model actually works (I for one think it almost certainly doesn't), but it cannot be strong evidence either way.
It's like somebody proposing a model that can identify an animal based on children's pencil drawings of the animal, then you countering that if you rearrange the photo of a man standing next to a horse into a centaur, the model still returns "horse". Such rearranged, synthetic data is well outside the scope of the inputs the model was meant to handle, so you cannot draw any strong conclusions about the model's ability to perform its actual, advertised task.
Sure, when we've got something approaching AGI that has in-depth knowledge of a field of research and some "intuition" from experience, then yeah why not.
Inability to replicate is a very general and vague way to phrase the problem, because it implies that non-replicability is a sort of abstract issue that just randomly emerges. But in practice papers that don't replicate, don't for a reason, and often those reasons can be identified in advance given just the paper.
For a trivial example of this see the GRIM and SPRITE programs. If the numbers in a paper aren't even internally consistent, that's a good sign that something has gone wrong and the paper probably won't replicate.
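A rough sketch of the GRIM idea (this is just an illustration of the principle, not the actual GRIM tool, and it assumes standard rounding of integer-valued data like Likert responses):

```python
# GRIM-style check: for integer-valued data, a reported mean must equal
# some integer total divided by the sample size, once rounded to the
# reported precision. If no such total exists, the number is impossible.

def grim_consistent(reported_mean: float, n: int, decimals: int = 2) -> bool:
    """Return True if some integer total / n rounds to the reported mean."""
    target = round(reported_mean, decimals)
    centre = round(reported_mean * n)
    # Check integer totals near reported_mean * n.
    for total in range(centre - 1, centre + 2):
        if round(total / n, decimals) == target:
            return True
    return False

# Example: a mean of 2.57 reported for n = 20 integer responses is impossible,
# since 51/20 = 2.55 and 52/20 = 2.60 and nothing in between is reachable.
print(grim_consistent(2.57, 20))  # False -> internally inconsistent
print(grim_consistent(2.55, 20))  # True
```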
For a less trivial example where you'd benefit from a tool-equipped LLM, consider asking a GPT-4 level AI to cross-check the claims in the abstract, body and conclusion against the data tables. If the claims aren't consistent then you can already assert it won't be possible to replicate because it's not even going to be clear what claim you are attempting to check in the first place.
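As a hedged sketch of what that kind of cross-check could look like today (the model name, the prompt, and the assumption that the abstract and tables have already been extracted to plain text are all mine):

```python
# Illustrative only: ask an LLM whether a paper's abstract claims are
# supported by its data tables. Assumes some other step has already
# extracted the abstract and tables as plain text.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def check_consistency(abstract: str, tables: str, model: str = "gpt-4o") -> str:
    prompt = (
        "Below are the abstract and the data tables of a research paper.\n"
        "List any claims in the abstract that are not supported by, or that "
        "contradict, the numbers in the tables. If everything is consistent, "
        "say so.\n\n"
        f"ABSTRACT:\n{abstract}\n\nTABLES:\n{tables}"
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

Whether the answers are reliable enough to act on is a separate question, but the check itself is easy to wire up.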
It would save us a lot of money wasted on experiments, and the motivation is strong, so why are we still experimenting? Because you can't get out of the model what isn't in the model. But if you just ask an LLM to review papers, it's not completely useless; you can get some manner of feedback. I used the prompt "write an openreview.net style review on this paper" plus the attached PDF, on Claude with a 100k token window.