The Effect of Statistical Training on the Evaluation of Evidence (2016) [pdf] (blakemcshane.com)
73 points by luu on July 3, 2017 | 37 comments



The issue really surfaces when you dissect "statistical training". I had a conversation with a colleague getting her PhD and about to publish a paper. She mentioned to me that she had tried "a bunch" of different specifications and the one that was significant was [I forget what the variables were, but this is in the education field].

I pointed out to her that, since she tried a bunch of specifications and chose the one which yielded the "positive" result, wouldn't that fundamentally alter the interpretation of the p-value, and possibly invalidate her claim?

She looked at me like I was a moron, uneducated, and trying to be difficult.

I'm forever indebted to my econometrics professor who helped me build a tremendous mental model of how to think about these issues. But between the really poor training (most people I encounter in the field just run Stata/SPSS commands with limited or no understanding of what they do mathematically) and the terrible incentives facing researchers (you don't see many negative results published, do you?), it's no surprise that when people come in and attempt to seriously replicate even foundational findings, they often come away disappointed (thinking about the Reproducibility Project - https://en.wikipedia.org/wiki/Reproducibility_Project).

I, too, don't have an answer to this. But I can only imagine that the way to prevent it is to push the industry in a direction where truth-seeking, not novelty, is rewarded. I've mentioned this before here, but I know professors who established their entire careers on the backs of research that was later completely refuted -- even in cases where their actions were fraudulent. And they are still teaching -- and successful.

As academics -- like the media and everyone else -- are chasing our ever-dimming attention spans, I think this'll be a really tough nut to meaningfully crack.


When you've invested 4 or more years in Ph.D. research, there are powerful motivations to "find" results that support your thesis. Perhaps they are subconscious, but they are powerful.


I think the real problem is that positive results are seen as better than negative results. Running a bunch of studies and showing that your theory doesn't work is just as much work as proving that it does work and is still a significant contribution.


The people who end up like that are the exact people who the PhD process is supposed to filter out...


No university or professor wants to be known for having any sizable number of their PhD candidates filtered out in the middle of the process. They'll delay poor performers for sure, but it just looks really bad when their candidates essentially fail out.

Also consider that often it's not even the candidate's fault that their research went nowhere. Sometimes it's just due to chance that a particular path became a dead-end, but other times it's due to a professor obstinately pushing them to focus on something unproductive, perhaps well after it's almost entirely evident that the dead end is coming.

It's a messed up system. I think the entire field of academic research needs a complete overhaul.


Yes, the current system rewards that behavior and as a side effect rewards misunderstanding stats, etc.

The people who have no qualms about "statistically significant" = "my theory is correct" (either due to maliciousness or ignorance) will graduate much sooner and be able to pump out more papers.


Fake it till you make it significant (p < 0.05).


It's really sad because there are more tools and methods to squeeze statistically meaningful results from data than before.

Even if the researchers could use these new tools, in many areas peer reviewers are not equipped to understand them.

Maybe it's time to divide research into different specialties. The main subject-area research team must always consult statistical experts for model selection, experiment design, and interpretation of the results. The time when one group could do all these things alone may be over.


This already happens. It's not uncommon for a grant for, say, something in medicine to include as a budget item funds to hire a biostats person.


There's really no excuse for p-value fishing when procedures for controlling false-discovery and familywise error rates exist and can be implemented in a single line of code in R [0].

0: https://stats.stackexchange.com/q/63441/36229
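For a rough sense of what that one-liner looks like, here is a minimal sketch in Python using statsmodels' multipletests (the linked answer is about R's p.adjust; the p-values below are invented for illustration):

    # Adjust a batch of raw p-values for multiple comparisons.
    import numpy as np
    from statsmodels.stats.multitest import multipletests

    pvals = np.array([0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205])

    # Benjamini-Hochberg control of the false-discovery rate
    reject_fdr, pvals_fdr, _, _ = multipletests(pvals, alpha=0.05, method='fdr_bh')

    # Holm control of the familywise error rate
    reject_fwer, pvals_holm, _, _ = multipletests(pvals, alpha=0.05, method='holm')

    print(reject_fdr, pvals_fdr)
    print(reject_fwer, pvals_holm)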


> I'm forever indebted to my econometrics professor who helped me build a tremendous mental model about how to think about these issues

Can you tell me what that is?

Stats/probability is fairly unintuitive to me, but I can't help thinking there is a better way of framing it.

Like, a 'probability' is often stated as if it were a concrete thing, when it's really more of a 'best guess' based on the evidence, possibly with an in-built assumption that the most conceptually likely thing (itself a 'best guess') did indeed happen, and inferring the events (the most likely sequence of events) that led to the observed results.

This is quite hard to unpack, and reason about in the same way as 'logic', which many minds seem to gravitate to...


It seems like you're using this example to imply that statistical education is likely to blind people to obvious logical problems with their interpretation of evidence. If that's the case, it isn't a very good example, since multiple comparisons is a standard issue commonly covered in a first or second course in statistics.


Why can't journals set a bar and flag some of these practices in peer review?

It's disillusioning to see that academia seems to face as many of these issues as any other profession or industry.

So much productivity is lost due to behavior driven by long-standing incentives to benefit institutions, corporations, and established power, rather than incentives that hold people accountable for quality and the efficient acquisition of knowledge.

Is it so much different than problems that come up with politicians, police officers, medical doctors, or corporate bad behavior?

I think my naivete was that a community dedicated to truth and discovery would somehow be less susceptible to these problems, when in fact, it's just human nature reacting to an environment that can be as unhealthy as any other.

At least I can allow myself an occasional fantasy of earning the 30B I'd need to take over Elsevier and turn it into a non-profit interested only in effecting positive change.


> Why can't journals set a bar and flag some of these practices in peer review?

Kind of a chicken-and-egg problem. Peer review is done by other researchers working in the field, often on a volunteer basis. If the average researcher in the community is committing common statistical errors, they won't know to flag them in the paper.


At least speaking as an academic (4th-year PhD student), one of the challenges with researchers' over-reliance on NHST seems to be the apparent lack of a more compelling alternative. One candidate that is gaining traction is Bayes factors, but there are challenges with this approach as well, e.g. with suitable specification of priors. The best way forward will involve fundamentally restructuring how we educate incoming researchers, because it will necessarily mean embracing uncertainty over dichotomizing results into yes/no. Andrew Gelman and John Carlin write elegantly about ways forward here: http://www.stat.columbia.edu/~gelman/research/published/jasa...
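As one hedged illustration of what a Bayes-factor-style summary can look like without hand-specifying priors, here is a sketch using the BIC approximation to the Bayes factor (Wagenmakers 2007); the simulated data, the model, and the use of statsmodels are my own assumptions, not anything from the Gelman and Carlin paper:

    # Sketch: approximate Bayes factor from BIC, BF01 ~= exp((BIC_alt - BIC_null) / 2).
    # Simulated data with a small true effect, for illustration only.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.normal(size=200)
    y = 0.2 * x + rng.normal(size=200)

    m0 = sm.OLS(y, np.ones_like(y)).fit()        # null model: intercept only
    m1 = sm.OLS(y, sm.add_constant(x)).fit()     # alternative: intercept + x

    bf01 = np.exp((m1.bic - m0.bic) / 2)         # evidence for the null over the alternative
    print(m1.pvalues[1], bf01)                   # contrast the p-value with the Bayes factor

Reading that output as a continuous weight of evidence, rather than comparing a p-value to 0.05, is the kind of shift away from dichotomization being described.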


The alternative is to stop testing a default "nil" hypothesis and instead come up with various mathematical/computational models for the process that generated the data, then test that on future data.

The model that makes the most precise and accurate predictions with the least assumptions/complexity should be used until a better one comes along.

This is quite honestly just how science used to be done before NHST. It still is done this way in many areas of physics and engineering.
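For concreteness, a toy version of that workflow (entirely invented, just to illustrate "predict held-out data, prefer the simplest adequate model") might look like:

    # Sketch: score candidate models by out-of-sample prediction error instead of a null test.
    import numpy as np

    rng = np.random.default_rng(1)
    x = np.linspace(0, 1, 200)
    y = 1.5 * x + rng.normal(scale=0.2, size=x.size)     # underlying process is linear

    idx = rng.permutation(x.size)
    train, test = idx[:150], idx[150:]                   # hold out "future" observations

    for degree in (1, 3, 5):                             # candidate models of rising complexity
        coefs = np.polyfit(x[train], y[train], degree)
        mse = np.mean((np.polyval(coefs, x[test]) - y[test]) ** 2)
        print(degree, round(mse, 4))
    # Keep the simplest model whose predictive error is not beaten by anything more complex.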


Whoever chose the letter p (suggesting probability) over s (significance) or e1 (type 1 error value) did a major disservice to the world of science.

The p value is a measure of error (and not a probability, but maybe one kind of likelihood), not evidence: how likely it is that we are seeing the result of chance. (How that chance is defined is important and hidden in the method used to compute the p value.) And any hard cutoff is really missing the point; there is a reason this is a number, not a binary value.

Effect size estimates or risk ratios are much more useful numbers anyway.
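For what it's worth, both are easy to report alongside a p-value; a small sketch with invented numbers:

    # Sketch: a risk ratio from a 2x2 table and Cohen's d as a standardized effect size.
    import numpy as np

    # treated: 12 events out of 200; control: 30 events out of 200 (made-up counts)
    risk_ratio = (12 / 200) / (30 / 200)
    print(risk_ratio)                      # 0.4: treated risk is 40% of control risk

    a = np.array([5.1, 4.8, 5.5, 5.0, 4.9, 5.3])
    b = np.array([4.2, 4.6, 4.4, 4.1, 4.5, 4.3])
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)   # equal group sizes assumed
    cohens_d = (a.mean() - b.mean()) / pooled_sd
    print(cohens_d)                        # separation in units of pooled standard deviation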


>"P value is a measure of error (And not a probability but maybe one kind of likelihood), not evidence, how likely it is we are seeing the result of a chance"

Nope, please stop trying to "help" people understand p-values without understanding them yourself. Reminds me of this just a few days ago: https://news.ycombinator.com/item?id=14646844

Honestly, most of the problem is people who don't know what they are talking about teaching statistics/science to each other... They have created an insane vortex of misinformation and ignorance that keeps sucking in new areas of research.

PS: If you or someone you know is doing research and doesn't recognize the almost unbelievably huge problems with the way things are done, you are most likely part of the problem.


P is a kind of likelihood with a very strong assumption: the likelihood of a type 1 error (accidentally getting a result) if the error distribution matches Student's t squared -- if you are using the Fisher test (ANOVA, that is) to compute it. And if the hypotheses are independent, which is supposed to be the case for the null.

Specifically a logarithm of a likelihood ratio of two hypotheses given the assumption.

Note the three specific ifs. Then read what the difference between likelihood and probability is before correcting people. Or what the difference between a likelihood and a likelihood ratio is.

(Not much in the case of null hypothesis testing, unless you didn't check that the null is true in the untreated population, or how true it is, as in you didn't get a likelihood for the placebo. You get the actual likelihood approximation from Wilks' theorem after relating the F distribution to chi-squared.)

Now this fails if:

The actual distribution of the observations is not close to T or normal. (There are some more technical reasons, this is a shorthand.) For example multimodal or highly skewed. This is typically glossed over. (Will usually cause a false positive.) Has to be checked for null too.

The hypotheses are related. (Say multiple observations over different dosages.) There are different tests that should be used instead.

A few more technical reasons.


This is wrong in multiple ways. The p-value is not an error rate even if the null model is correct; "alpha" (the cutoff, usually 0.05) is the error rate.

The p-value also has nothing to do with "how likely it is we are seeing the result of a chance." As you note regarding the t-distribution, the p-value calculation assumes the null model is true (which not necessarily, but usually amounts to "chance did it").

How can a procedure at once assume something is true and give you the probability that it is true? You are also using the term "likelihood" in a strange way (it has a specific meaning when it comes to statistics). Anyway, your comments seem to indicate you hold a couple of the standard misinterpretations of p-values (the full dozen from the reference below):

1 If P = .05, the null hypothesis has only a 5% chance of being true.

2 A nonsignificant difference (eg, P >= .05) means there is no difference between groups.

3 A statistically significant finding is clinically important.

4 Studies with P values on opposite sides of .05 are conflicting.

5 Studies with the same P value provide the same evidence against the null hypothesis.

6 P = .05 means that we have observed data that would occur only 5% of the time under the null hypothesis.

7 P = .05 and P <=.05 mean the same thing.

8 P values are properly written as inequalities (eg, “P <=.02” when P =.015)

9 P = .05 means that if you reject the null hypothesis, the probability of a type I error is only 5%.

10 With a P = .05 threshold for significance, the chance of a type I error will be 5%.

11 You should use a one-sided P value when you don’t care about a result in one direction, or a difference in that direction is impossible.

12 A scientific conclusion or treatment policy should be based on whether or not the P value is significant.

https://www.ncbi.nlm.nih.gov/pubmed/18582619
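To see the distinction by simulation (my own sketch, not from the cited paper): with the null true by construction, individual p-values scatter uniformly, and it is only the long-run fraction below the cutoff that equals the 5% type I error rate.

    # Sketch: repeat a two-group experiment many times with no real difference.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    pvals = []
    for _ in range(10_000):
        a = rng.normal(size=30)      # both groups drawn from the same distribution,
        b = rng.normal(size=30)      # so the null hypothesis is true by construction
        pvals.append(stats.ttest_ind(a, b).pvalue)

    print(np.mean(np.array(pvals) < 0.05))   # ~0.05: that long-run rate is alpha, not any single p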


Alpha is the error rate of the binary significance test, if anything. It is not related to any of the observations.

However, since p is a log-likelihood ratio, it is related to the observation and the null themselves. To get the actual likelihood you need to exponentiate, know the degrees of freedom, and know one of the actual likelihoods. (Typically the null is easier to come by, as the other side is assumed to be null + treatment.) This exact likelihood is of course tiny, which is why the inequality is supposed to be used.

The process is not valid if any of the test's assumptions is violated in a big way.

Other than this, all the points are true.


The standard meaning of likelihood is P(data|Hypothesis)[1] where, for the purposes here, hypothesis will refer to "chance/accident generated the data". Do you agree with this? (I understand you say "kind of likelihood" because we are dealing with the inequality)

If so, can you clarify what you mean by "P is ... Likelihood of type 1 error (accidentally getting a result)..." ?

[1] https://en.wikipedia.org/wiki/Likelihood_function


The standard meaning of likelihood is P(parameters|data).


Source?


Actually, I should have written L(parameters|data); I'm sorry for the confusion. It is equal, for a given value of the data and parameters, to P(data|parameters). But the likelihood function is not a function of the data (with the parameters fixed); it is a function of the parameters (with the data fixed). Its "meaning" is not a probability distribution over different outcomes given the hypothesis. But maybe I misinterpreted your comment.


How is the p-value a log-likelihood ratio?


A mathematical sleight of hand depending on big N, chi-squared statistics, and Wilks' theorem. This is obviously inaccurate for small N.

Otherwise p depends on the exact test employed. (Specifically, the distributions employed. For Fisher's exact test it's the hypergeometric, notoriously hard to calculate, but if you can, you can recover the likelihoods too.)
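To make the Wilks'-theorem connection concrete, here is a small sketch (my own construction, with the scale treated as fixed at its sample estimate to keep it short) of a likelihood-ratio test whose p-value is a chi-squared tail area:

    # Sketch: likelihood-ratio test of H0: mu = 0 for roughly normal data.
    # By Wilks' theorem, 2*(loglik_alt - loglik_null) is approximately chi-squared(1) for large n.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    x = rng.normal(loc=0.3, scale=1.0, size=100)

    sigma = x.std(ddof=1)                                    # simplification: plug-in scale
    loglik_null = stats.norm.logpdf(x, loc=0.0, scale=sigma).sum()
    loglik_alt = stats.norm.logpdf(x, loc=x.mean(), scale=sigma).sum()

    lr_stat = 2 * (loglik_alt - loglik_null)
    p_value = stats.chi2.sf(lr_stat, df=1)                   # upper tail of chi-squared(1)
    print(lr_stat, p_value)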


A p-value can be calculated for any statistic. The p-value is often calculated using the sample average. I wouldn't say that it is an average, though.


What's wrong about #6? Is it just blurring the difference between "less than .05" and "only .05"?

#7 isn't clear to me either, though I'm guessing it's a pithy way of saying that two results that are both "significant" can easily have different evidential value.


Here are the reasons given by Goodman (2008):

>"Misconception #6: P = .05 means that we have observed data that would occur only 5% of the time under the null hypothesis. That this is not the case is seen immediately from the P value’s definition, the probability of the observed data, plus more extreme data, under the null hypothesis. The result with the P value of exactly .05 (or any other value) is the most probable of all the other possible results included in the “tail area” that defines the P value. The probability of any individual result is actually quite small, and Fisher said he threw in the rest of the tail area “as an approximation.” As we will see later in this chapter, the inclusion of these rarer outcomes poses serious logical and quantitative problems for the P value, and using comparative rather than single probabilities to measure evidence eliminates the need to include outcomes other than what was observed.

[...]

Misconception #7: P = .05 and P <= .05 mean the same thing. This misconception shows how diabolically difficult it is to either explain or understand P values. There is a big difference between these results in terms of weight of evidence, but because the same number (5%) is associated with each, that difference is literally impossible to communicate. It can be calculated and seen clearly only using a Bayesian evidence metric.16"
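A tiny numeric illustration of misconception #6 (my own example, not from the paper): for 16 heads in 20 flips of a supposedly fair coin, the one-sided p-value is the whole upper tail, not the probability of the observed result alone.

    # Sketch: p-value = P(observed or more extreme | null), here P(X >= 16) for X ~ Binomial(20, 0.5).
    from scipy import stats

    n, k = 20, 16
    p_exact = stats.binom.pmf(k, n, 0.5)       # probability of exactly the observed result
    p_value = stats.binom.sf(k - 1, n, 0.5)    # observed plus more extreme outcomes
    print(p_exact, p_value)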


Thanks for liberating those quotes from the shackles of Elsevier (and fixing the parent).

About the letter 'p': I'd ban the word 'significant' instead if we're going to go there. You could say something like 'null-improbable' if you insisted on using NHST -- such a term would make it harder to forget that it's a fiddly technical concept.


There was a problem with pasting the symbols (= vs < vs <=, etc), I think I have it fixed now. Please check again.


It seems to me that model selection is usually the problem when good evidence in the data is discarded.

Most researchers are taught statistics using linear models and linear correlation. They learn how to use them well, and they look for statistical significance using linear models. Linear models match only linear patterns. Everyone knows about Anscombe's quartet, but it seems there is a very strong tendency to overdo linear models.

You sometimes see a scatter plot that shows a clear, interesting nonlinear pattern that differs significantly from the null hypothesis, but the authors use a linear model to get p-values.
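As a quick invented example in the spirit of Anscombe's quartet: a strong but U-shaped relationship can look like "no effect" to a straight-line fit.

    # Sketch: a clear nonlinear pattern that a linear fit reports as a near-zero slope.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    x = np.linspace(-3, 3, 200)
    y = x**2 + rng.normal(scale=0.5, size=x.size)   # strong, but nonlinear, dependence

    result = stats.linregress(x, y)
    print(result.slope, result.pvalue)              # slope near zero, large p-value

    quad = np.polyfit(x, y, 2)                      # a quadratic term recovers the pattern
    print(quad)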


The letter p was chosen to suggest probability because the p value is a probability.


I guess this could be considered kind of OT. On the other hand, if people were introduced to it sooner, before and outside the context of a personal investment, i.e. "proving" their own research, we might also be better off with regard to our science.

----

Delaying statistics until college -- for most public school programs -- is a mistake.

It leaves a large portion of the U.S. population with a fundamental mathematical illiteracy.

And given the degree to which statistics rule our lives and collective decisions these days, a societal illiteracy.

Just look around you.


Indeed, probability and statistics make a lot of sense to include in a high school curriculum. In addition to the practicality, they are very colorful topics that can serve to make students more interested in math.


Okay, this is really interesting. I'm pretty skeptical of their methodology for assessing "statistical training." The setup could result in studying a different correlation entirely. For example, we might instead be studying the difference between a researcher who's unwilling to send a survey back without Googling or double-checking the work and a researcher who is busy or willing to be wrong... among researchers who are willing to respond in the first place. I don't expect that to be a productive line of discussion, though, so I'll try to ignore it.

I can't help but think of Thomas Kuhn, who argued that the institutions of science (from researchers to reviewers to the press) tend to ignore study results that conflict with the current paradigm. So as long as our paradigm is correct, science progresses extremely quickly. When our paradigm and assumptions are wrong we spend a lot of time floundering because we ignore conflicting data that should instead lead to refinements or more questions. We want to slap a label of right/wrong on something and move on to the next question. That mentality can really hinder our ability to assess anomalies later on.

This is similar to the problem discussed in the paper. When researchers decide a paradigm or test result is true or false with no further thought to the confidence level or details, you end up with a system where uncertainty and anomaly are both ignored. It then takes a tremendous amount of momentum to overcome the assumptions, which are sometimes several steps back into "accepted science" at that point. We might even throw out some good ideas from the previous paradigm as we transition. I'm not a physicist, but it seems like they're seeing this happen right now. We've certainly seen this several times with dietary health.

A decent analogy might be a hike with unclear trails and markers. At the first crossroads we might decide that left is the correct path to our destination. Once down that path, there are dozens of other paths. If we find trails that continually lead nowhere, the common human response is to keep trying well past the point where we should have gone back to our original assumption. When we finally decide to go back and try the right-side path, we completely give up on the left side, despite the fact that we haven't checked every possible sub-trail on the left side.

Wandering may be impossible to avoid, even with good judgement, but we can avoid a lot of wasted time by looking back at each of our crossroads/assumptions, assessing the probability that each is correct, then moving forward in a way that's most likely to answer a new question while testing the previous assumptions. By and large this is not how science is working. We get a few studies, accept something is true, and then wander off randomly down the trails that look most interesting.




