Many Psychology Findings Not as Strong as Claimed, Study Says (nytimes.com)
99 points by mcgwiz on Aug 27, 2015 | 34 comments



Take the time to read the story (many comments reflect nothing more than the headline); it is far more complex and interesting than the headline suggests. For one thing, almost all of the studies' effects were reproduced, but they were generally weaker.

* Most importantly, from the Times: Strictly on the basis of significance — a statistical measure of how likely it is that a result did not occur by chance — 35 of the studies held up, and 62 did not. (Three were excluded because their significance was not clear.) The overall “effect size,” a measure of the strength of a finding, dropped by about half across all of the studies. Yet very few of the redone studies contradicted the original ones; their results were simply weaker.

* Also: The research team also measured whether the prestige of the original research group, rated by measures of expertise and academic affiliation, had any effect on the likelihood that its work stood up. It did not.

* And: The only factor that did [affect the likelihood of successful reproduction] was the strength of the original effect — that is, the most robust findings tended to remain easily detectable, if not necessarily as strong.

* Finally: The project’s authors write that, despite the painstaking effort to duplicate the original research, there could be differences in the design or context of the reproduced work that account for the different findings. Many of the original authors certainly agree.

* According to several experts, there is no reason to think the problems are confined to psychology, and it could be worse in other fields. The researchers chose psychology merely because that is their field of expertise.

* I haven't seen anything indicating the 100 studies are a representative sample of the population of published research, and at least one scientist raised this question.


> The only factor that did [affect the likelihood of successful reproduction] was the strength of the original effect — that is, the most robust findings tended to remain easily detectable, if not necessarily as strong.

This is probably just regression to the mean. The comment above suggests to me that the tendency for the findings in replicated experiments to be weaker does NOT necessarily come from any flaw in the experimental design, but from the criteria for findings to be published.

You would expect any given effect to show some variation around a mean effect size. My lab and your lab might arrive at slightly different results, varying around some mean/expected result. If your lab's results meet statistical significance, you get to publish. If mine don't, I don't get to. So the published results come from the studies that, on average, show a stronger effect than you would see if you ran the study 100 times.

> Yet very few of the redone studies contradicted the original ones; their results were simply weaker.

If a third lab replicates the experiment, their results are more likely to be close to the (possibly non-publishable) mean value than the (publishable) outlier value. So on average, repeating an experiment will give you a LESS significant result.

If the strength of the original effect (and thus probably the mean effect strength over many repeated experiments) is larger, the chance of replicated experiments also being statistically significant is higher.

In other words, these new results are very predictable and don't necessarily indicate that anything is wrong.
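
A quick simulation makes the selection effect concrete. This is just a toy sketch; the true effect size, sample size, and significance cutoff below are made-up numbers, not anything taken from the project:

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(0)
  true_d, n, alpha, labs = 0.3, 30, 0.05, 10000   # assumed values, purely for illustration

  published, replications = [], []
  for _ in range(labs):
      x = rng.normal(true_d, 1.0, n)               # one lab's "original" study
      res = stats.ttest_1samp(x, 0.0)
      if res.pvalue < alpha and res.statistic > 0: # publication filter: significant, right direction
          published.append(x.mean())
          replications.append(rng.normal(true_d, 1.0, n).mean())  # independent replication

  print(f"true effect:             {true_d}")
  print(f"mean published effect:   {np.mean(published):.2f}")
  print(f"mean replication effect: {np.mean(replications):.2f}")

With these made-up numbers the published average lands well above the true 0.3, while the replications hover near it: the replications look "weaker" even though nothing about the experiment changed.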


Yes, however:

My expectation is that this regression to the mean should not apply to strong effects, where I define "strong" as: the effect is large enough that the significance threshold and publication criteria don't matter. In that case, I would expect half the replicated results to come back stronger.

The first result was a random sample, and the second result was a random sample. If there's no outside bias from the publication cut-off, each has a 50% chance of being the higher one.

It's concerning if the strong results consistently re-test weaker; that would indicate a systematic bias.
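
That 50/50 expectation is easy to sanity-check. A toy sketch, with an assumed effect large enough that essentially every original study clears the significance bar, so the publication filter has no bite:

  import numpy as np

  rng = np.random.default_rng(1)
  true_d, n, trials = 1.5, 30, 10000   # assumed large effect: virtually every study is significant

  original  = rng.normal(true_d, 1.0, (trials, n)).mean(axis=1)
  replicate = rng.normal(true_d, 1.0, (trials, n)).mean(axis=1)

  # With no selection in play, the replication beats the original about half the time.
  print(f"replication stronger: {np.mean(replicate > original):.1%}")

So if even the strong originals re-test weaker far more than half the time, something other than the significance cut-off is at work.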


For an example of this effect in physics, look at the Millikan experiment

https://en.wikipedia.org/wiki/Oil_drop_experiment#Millikan.2...


Oh, the Millikan experiment. I had to do it in a university lab as a four-hour experiment. It's impossible to gather enough data in that short a time, and your eyes practically fall out from staring into the microscope measuring the drops' velocities. I can assure you that this is the worst experiment one can do as a student.


> * According to several experts, there is no reason to think the problems are confined to psychology, and it could be worse in other fields. The researchers chose psychology merely because that is their field of expertise.

There's a tendency to stop collecting data once there are publishable findings. There's also a tendency to ignore (find reason to discount) results that aren't reproducible or don't make sense. That's even the case in physics. There's a tendency to debug until you get results that are consistent with prior work.
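
The "stop collecting data once there are publishable findings" part can be put in numbers. A rough sketch with assumed batch sizes and thresholds, showing how peeking at the p-value after every batch of subjects inflates the false-positive rate even when there is no effect at all:

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(2)
  batch, max_n, alpha, runs = 10, 100, 0.05, 2000   # assumed design parameters

  false_positives = 0
  for _ in range(runs):
      data = np.array([])
      while data.size < max_n:
          data = np.append(data, rng.normal(0.0, 1.0, batch))   # true effect is zero
          if stats.ttest_1samp(data, 0.0).pvalue < alpha:       # peek; stop if "publishable"
              false_positives += 1
              break

  print(f"false-positive rate with optional stopping: {false_positives / runs:.1%}")
  print("nominal rate without peeking: 5.0%")

With ten peeks the rate comes out well above the nominal 5%. Optional stopping is one of the "researcher degrees of freedom" the False-Positive Psychology paper linked further down the thread quantifies.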


> There's also a tendency to ignore (find reason to discount) results that aren't reproducible or don't make sense. ... There's a tendency to debug until you get results that are consistent with prior work.

A lot of people fall into that trap actually. It's called confirmation bias: https://en.wikipedia.org/wiki/Confirmation_bias

Even Einstein (Einstein, of all people), when his own equations suggested a non-static universe, adjusted them so that they would fit the then-believed-to-be-true static universe model. He called it his greatest mistake. Confirmation bias.


> The overall “effect size,” a measure of the strength of a finding, dropped by about half across all of the studies.

In general, if you conduct an experiment, then conditional on finding significance, the estimate of the effect size will tend to overestimate the true effect size. This is because data extreme enough to produce a low p-value is also likely to be more extreme than the population average. This matters less when the true effect size is large enough, i.e. in high-powered studies. So the drop might not actually be so alarming: it may simply mean that the effect sizes estimated in the original papers were too high.
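
A toy sketch of that power dependence (assumed one-sample design, assumed numbers): condition on p < .05 and compare the average estimated effect with the true one for a small versus a large true effect.

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(3)
  n, alpha, trials = 20, 0.05, 20000              # assumed sample size and threshold

  for true_d in (0.2, 1.0):                       # low-powered vs high-powered scenario
      samples = rng.normal(true_d, 1.0, (trials, n))
      means = samples.mean(axis=1)
      pvals = stats.ttest_1samp(samples, 0.0, axis=1).pvalue
      sig = means[(pvals < alpha) & (means > 0)]  # keep only the "publishable" estimates
      print(f"true d = {true_d}: mean estimate given significance = {sig.mean():.2f}")

With the small true effect the conditional estimate lands far above 0.2; with the large one it barely moves, which is the "less important in high-powered studies" point.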


As a marketer, I have long given up on psychology as a field. The practitioners are too wishy-washy, the experiments too prone to confirmation bias, and the results too often unusable.

I feel behavioral economists cover many of the same subjects, but their work is much more interesting and scientific. The experiments are tighter and ask better questions. And they focus more on specific characteristics in decision making, and less on asking big questions.


> And they focus more on specific characteristics in decision making, and less on asking big questions.

It seems more that behavioural economists happen to be the ones doing studies relevant to your marketing; that's not a flaw in psychology itself. Psychology is an incredibly broad field, and it asks questions both big and small. Plenty of psychology studies ask very specific questions looking for very specific effects.


My place of employment was awarded a grant and did a study to show that teens were using mobile devices more often than they had years earlier (I don't remember the timeframe). I believe they got something like a $200k+ grant for this. A team of roughly 10 PhD psychologists, to show something we already knew.

Not saying psychology as a field is pointless, or anything, just that they do occasionally do some silly studies.


Therein lies the problem. On many of the questions that psychology studies, people are deeply committed to an intuitive answer.

A layperson reading any psychology experiment will either say "That's obvious! You spent how much establishing something everyone already knows?" or "No way. You must be twisting the data to make a name for yourself." Particularly when it touches hot-button issues like "to what extent are [people I don't like] in control of their own situations?"

You don't get that around physics quite so much.


You're probably oversimplifying the study considerably, but even so, even things that "everyone knows" need to be formally studied. All across science, non-intuitive results happen all the time. In psychology, it's even more common, since biology doesn't play by the relatively clean rules of materials physics.

After all, "everyone knows" that blacks are inferior and only useful as slaves (they even want to be, deep down); "everyone knows" that women are too temperamental to vote sensibly; "everyone knows" that people of that religion over there eat babies and we should destroy them before they destroy us...

For a more recent example, "everyone knows" that young people use condoms more often now, given the higher levels of sex education about pregnancy and STIs they've grown up with - yet regional studies often show significantly decreased levels of condom use. Another is the assumption that the current crop of young'uns are fantastic with computers because they grew up with them, yet this hasn't been borne out in studies. It turns out that consuming content on a device doesn't mean you understand how it works any better.

Such studies are particularly important for ferreting out what's happening with people who aren't society's favourites - we all know what a man is, right? Always looking to get laid, not afraid to get physical, plays sport, drinks beer. Most men are like that, right? Not really; there's actually a huge variety of interests and desires. Studies of 'obvious!' things are just as necessary as studies of fringe things, because sometimes the results are really quite unexpected.


"After all, "everyone knows" that blacks are inferior and only useful as slaves (they even want to be, deep down); "everyone knows" that women are too temperamental to vote sensibly; "everyone knows" that people of that religion over there eat babies and we should destroy them before they destroy us..."

This is simply not true at all these days, and I don't like that HN users would push a normal conversation into one where you imply your "opponent" is racist, sexist, or a religious bigot. Might as well have called me a Nazi or brought Hitler into the conversation.

And I don't believe you need psychologists or a peer-reviewed study to discern mobile platform usage; we have other ways of getting said metrics.


She was saying that society in general was racist back in the day, when people believed in conditions such as drapetomania.


Thanks for the downvote. If she was saying that, why did she say "everyone knows" rather than "everyone knew"? No, I see a distinct verbal jab in that statement. But whatever; I'd downvote her if I could, but those of you in power like to keep everybody outside the clique down by downvoting even when we contribute to the conversation. Hey, I have 351 karma, why don't you get your friends over here and bring me down to nothing!


It's interesting that you say that. In my time studying psychology, one of the things I've found interesting is the way various subfields flee the label. Many I/O psychologists take up the label of behavioral economist. Their work is relatively opaque to mainstream psychologists, however, in that they have their own professional societies separate from the American Psychological Association (APA). The professional label of "psychologist" is actually strictly regulated in most states for the purposes of providing counseling and therapy, and typically requires a PhD from an APA-accredited university plus clinical experience to practice, yet anyone can call themselves a counselor in most states (IIRC). Despite this, many psychologists (the PhD ones) will use the counselor label for many purposes. (I think family and marriage counselors require accreditation in most states as well, but I know less about that area.) In the area where I did research, people more typically consider themselves cognitive scientists, often doing work that overlaps with machine learning techniques and perceptual measurement.

I wanted to ask, though, since some of the statistical techniques we're using seem like they could have use cases in your field: what kind of adoption (if any) is there in marketing research of techniques such as [Multidimensional Scaling of similarity ratings](https://en.wikipedia.org/wiki/Multidimensional_scaling) or other perceptual mapping techniques? It seems like a relatively straightforward way to figure out what dimensions people use when making classification judgments about products or companies...
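
For the curious, here's roughly what I mean, as a minimal sketch with scikit-learn (the brands and the dissimilarity numbers are made up purely for illustration): feed a matrix of averaged pairwise dissimilarity ratings into metric MDS and read the relative positions off a 2-D perceptual map.

  import numpy as np
  from sklearn.manifold import MDS

  # Hypothetical averaged dissimilarity ratings (0 = identical, 10 = nothing alike)
  # for four made-up brands; in practice these would come from survey respondents.
  brands = ["BrandA", "BrandB", "BrandC", "BrandD"]
  dissim = np.array([
      [0, 2, 7, 8],
      [2, 0, 6, 7],
      [7, 6, 0, 3],
      [8, 7, 3, 0],
  ], dtype=float)

  mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
  coords = mds.fit_transform(dissim)   # 2-D perceptual map coordinates

  for name, (x, y) in zip(brands, coords):
      print(f"{name}: ({x:+.2f}, {y:+.2f})")

Interpreting what the recovered axes actually mean (price vs. prestige, say) is still a judgment call, which is presumably where it gets interesting for marketing.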


> As a marketer, I have long given up on psychology as a field.

This very much depends on what you're referring to as psychology. Bear in mind that when you throw that term around, you're including very rigorous research such as this: http://cavlab.net/?lang=en

This article concerns social psychology, not cognition (mostly).

The underlying problem is that social psych has become increasingly politicized.


Perhaps the biggest takeaway from my time studying psychology is that a lot of research in the field just can't be trusted until you've vetted it. Political pressures are too great a corrupting factor. Words would be redefined. Conclusions overstated or over applied. This is on top of the already existing 'publish or perish' issue that impacts science as a whole.

The closer one was to neurology (like physiological psychology), the better it became. The closer one was to sociology (like IO psychology) or to a politically charged issue, the worse it became.

This also applies to psychiatry. I remember reading some of the papers published in relation to the DSM-V, and at one point it looked like little more than a peer-reviewed version of two siblings fighting (though that was the worst case, not the average).

One big thing is to look at how the researchers defined their terms, and at how things translate when multiple languages were involved.


> Words would be redefined. Conclusions overstated or over applied.

Even neuroscience has a hearty helping of this. The data itself is one thing, but what it's sold as showing is quite another. For example if you read a claim that neuroscience has discovered something about "addiction", "free will", "motivation", "friendship", "love", etc., dig into how they've chosen to define those terms for the purpose of the study.

There's often a bit of concept-laundering going on, where a common term is defined using a very narrow (and often conveniently chosen) definition to show a result, but then the implications of that result are discussed with reference to the original, more general sense of the term. News articles about neuroscience are the worst, of course, but even quite a few scientists themselves do this.


Here is the paper the Reproducibility Project just released, "Estimating the reproducibility of psychological science": http://www.sciencemag.org/content/349/6251/aac4716


Here's a nice, approachable treatment of some of the ways psych studies can go wrong; it also offers concrete suggestions for fixes:

  Joseph P. Simmons, Leif D. Nelson, and Uri Simonsohn
  "False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant"
  Psychological Science, November 2011, 22: 1359-1366
Free full-text (no paywall): http://pss.sagepub.com/content/22/11/1359

(via: http://bahanonu.com/brain/#c20150315 )


There has been a backlash against attempts at replication in psychology: http://phenomena.nationalgeographic.com/.../failed.../ http://andrewgelman.com/2013/12/17/replication-backlash/ and even a backlash to the backlash: http://phenomena.nationalgeographic.com/.../failed.../


One of the weird criticisms I've heard (and reflected in your first link) is "but the reproduced study had some slight difference." If a slight difference can make an effect disappear, then it's not as interesting or general an effect as the original study claimed to show! Why didn't they already test for the effect of slight differences themselves?


You can't evaluate every single slight variation in your testing environment. That's just the nature of the field.


Your last URL is the same as the first.


The replication effect sizes are almost all positive [1]. This study represents very strong validation that the published results were qualitatively true, in the main.

The regression to the mean effect is unsurprising and doesn't diminish this finding. Given the difficulty of performing this kind of research, this is a very positive result for the field.

[1] http://m.sciencemag.org/content/349/6251/aac4716/F1.expansio...


For anyone interested in more information on this topic, the book "Psychology Gone Wrong" by Tomasz Witkowski and Maciej Zatonski is a very in-depth look at the problems in this field.


This is how I evaluate studies: "Are the data and code made available?" It's a simple request, and this study is the first I've ever seen that offers both. Bravo!


"Study says other studies are flawed"....

Didn't read the article, but I couldn't help commenting on the title. Good laughs.


"Many Findings About Psychology Findings Not Being as Strong as Claimed Not as Strong as Claimed, Study Says"


This chain of logic was covered in a 2011 fiction piece published in Science's competing journal Nature: http://www.nature.com/nature/journal/v477/n7363/full/477244a...

“Although it is nonsensical to rely on evidence provided by human-based research when judging whether humans are themselves inept, in doing so, the editors (all human, I note) provide a perfect example of the feebleness of human reasoning, thereby validating their claims.”


Psychology is more like religion than science.


And of the ~40 that returned the same results, a fair chunk might have done so just by chance.



