The Statistical Crisis in Science (americanscientist.org)
73 points by jonathansizz on Jan 16, 2015 | 35 comments



So at least two people reading this seem to think it's about using science in the context of their pet peeves. It's not.

It's about using a statistical test for a data dependent hypothesis and interpreting the test as if it were used for a data-independent hypothesis. That's all.

It's not about using statistics in politics or finance. It's about first looking at the data, then formulating a hypothesis, then running a standard test whose guarantees assume you chose the hypothesis independently of the data. This is a problem in any field.


Indeed!

I have a particularly relevant horror story. For one of my graduate classes, I built a game where two people (a liar and a truth teller) would try to convince a third person (the judge) that they had experienced something. The goal was to see if people used different language when lying than when telling the truth. It was a fun project, and we were focused more on the NLP/ML side of things. It turned out to be reasonably possible to separate our ~500 examples with a linear SVM, with some interesting separating words. That's where our claims stopped.

Then I took the results to a psychology prof. They loaded the data into SAS or something and it proceeded to perform HUNDREDS of independent t-tests. The results came out in a few seconds and the professor exclaimed "Oh look! Pronouns are statistically significant! Oh and possessive nouns too!" -- I cringed.

The flip side of this is that as statisticians, we do know how to handle different kinds of testing. If you're going to be looking at all these different outcomes, that's fine, but you just need to correct for it. We've had Bonferroni correction since the 1960s; Benjamini-Hochberg and related false discovery rate methods have been around for almost 20 years now. In fact, there are even situations where the data can help define a prior for your hypothesis testing [1].
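
To make the "hundreds of t-tests" point concrete, here is a minimal sketch (Python, assuming numpy, scipy and statsmodels are available; the feature count and group sizes are invented, not the actual study):

  import numpy as np
  from scipy import stats
  from statsmodels.stats.multitest import multipletests

  rng = np.random.default_rng(0)
  n_features = 300   # e.g. one t-test per word category, as in the SAS run above

  # Two groups with no real difference on any feature, by construction.
  pvals = np.array([
      stats.ttest_ind(rng.normal(size=50), rng.normal(size=50)).pvalue
      for _ in range(n_features)
  ])

  print((pvals < 0.05).sum())        # roughly 15 features look "significant" by chance alone
  reject, _, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
  print(reject.sum())                # after Benjamini-Hochberg, essentially none survive

Bonferroni is the same call with method="bonferroni"; it is more conservative, since it controls the family-wise error rate rather than the false discovery rate.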

Lastly, there is this quote from the article:

> There is no statistical quality board that could enforce such larger analyses—nor would we believe such coercion to be appropriate

I'm not sure that's such a bad idea. In theory, that should be the job of the journal reviewers and editor. In practice, it's often the blind leading the blind, with confirmation bias thrown in to boot. Maybe we need a statistical review board (SRB) as a companion to the IRB.

[1] http://arxiv.org/abs/1411.6144


"Then I took the results to a psychology prof. They loaded the data into SAS or something and it proceeded to perform HUNDREDS of independent t-tests. The results came out in a few seconds and the professor exclaimed 'Oh look! Pronouns are statistically significant! Oh and possessive nouns too!' -- I cringed."

This wasn't, by any chance, James W. Pennebaker from UT Austin? (Or somebody related to him?)

I was just reading The Secret Life of Pronouns, and his chapter on lying does point to pronouns as a key factor in distinguishing lies from truth. On the other hand, he does take some pains to note that getting to 65% correctness is reasonably easy, but getting much higher has so far been impossible outside of restricted environments such as laboratory conditions.


So actually, I think the idea that data-dependent hypotheses are bad is fundamentally wrong, and is based on a misunderstanding of probability.

The reason you'd avoid data-dependent hypotheses is simple: if the data comes from a process with some sort of randomness in it, then there will usually be things that appear to be interesting patterns but are in fact random artifacts. If you look at your data, you may be tempted to formulate a hypothesis based on these random artifacts. It may pass statistical tests (because the data you have does contain the pattern), but it is not, in fact, causal. To avoid this, you maintain the discipline of only making hypotheses before you look at your data, which means that you can't see a random effect and then guess that it's real.

The problem is, this doesn't mean that if your hypothesis passes a statistical test, the result must be causal. It only lowers the probability - there is still a chance that your hypothesis was wrong, but your data happens to contain a random fluctuation that makes it look right. The only way to protect against this danger is to continuously gather data and re-evaluate your hypotheses, while understanding that there is always some probability that the effect you think you see is really random noise.

And once you're doing this continuous monitoring anyway, then there's no reason to reject data-dependent hypotheses. By definition, if the effect you're seeing is a random occurrence, then it should go away with more data. If it doesn't go away, then maybe you've found something that you wouldn't have been able to guess in advance, which is good! And if you see a random effect, form a hypothesis that passes some test, and then assume that your hypothesis must be true, then the problem is not your data, but rather that you misunderstand how probability works.

In short, avoiding data-dependent hypotheses is a hack that only reduces the probability of an error that you should be avoiding entirely anyway. Once you accept this and start avoiding the error, there's no reason to avoid data-dependent hypotheses, and they can be quite useful.
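
To make this concrete, here is a rough simulation (an illustrative sketch with invented numbers, not anyone's actual study): pick the most extreme of many noise-only group means after looking at the data, test it on the same data, then re-test the same hypothesis on fresh data.

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(1)
  n_groups, n, trials = 20, 30, 1000

  post_hoc_hits, fresh_hits = 0, 0
  for _ in range(trials):
      data = rng.normal(size=(n_groups, n))          # pure noise: no real effects anywhere
      best = np.argmax(np.abs(data.mean(axis=1)))    # hypothesis chosen after seeing the data
      if stats.ttest_1samp(data[best], 0).pvalue < 0.05:
          post_hoc_hits += 1
          fresh = rng.normal(size=n)                 # gather new data for that same hypothesis
          if stats.ttest_1samp(fresh, 0).pvalue < 0.05:
              fresh_hits += 1

  print(post_hoc_hits / trials)                  # well above the nominal 5% false positive rate
  print(fresh_hits / max(post_hoc_hits, 1))      # back near 5%: the "effect" washes out on new data

The point is not that the post-hoc pattern is worthless, only that its evidential weight has to come from data that did not suggest it in the first place.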


Yes! I think any good statistician would agree with all of that and emphasize the importance of how you present your conclusions. Forming "data-dependent hypotheses" is part of what most statisticians call "exploratory data analysis" (EDA). When presenting EDA findings, we should use terms like "association", "relationship", "correlation", "possible", and not words like "cause", "effect", "p-value", "test", etc.


No, the problem is that p(reject|null hypothesis) != p(reject|null hypothesis, experimenter saw pattern).

It's true that continuous observation eventually fixes things. In the limit of large N, we can skip the stats and just do division.


I upvoted your comment but a problem in "any field" seems too strong a statement.

AFAIK this isn't much of a problem in CS (my field), and I've never heard math or physics people complaining...

Seems to only be a problem in Social "Sciences" & Bio/Med, where many (most?) results are statistical significance tests.


I think yummifajitas means that if done in any field, it would be a problem, not that it is necessarily observed in every field.


Yes probably what he was saying.

Also, my understanding - not expert, likely wrong - is that it is sometimes done in physics (e.g. analyzing data from sensors). But I haven't heard any complaints (which does not mean they don't exist). Someone else would need to weigh in, as this is just supposition and hearsay :)


I bet you could find a fair number of statistics errors in machine learning and other empirical branches of CS.


Was about to post this. Thanks.


Not only errors, misuse of statistics, and misapplication of probability theory, but also abstract modeling in general.

The very idea of modeling dynamic abstract processes such as financial markets, which are themselves mere abstractions, is non-science; it is a misuse of pseudo-scientific methods and mathematics, and what we have seen so far is nothing but failures.

Overly abstract or flawed abstractions and wrong premises cannot be fixed by any amount of math or modeling. They can only be discarded.

The famous "subject/object" false dichotomy in philosophy is a good example too. People can spend ages modeling reality using non-existent abstractions.

Today all these multiverse "theories" are mere speculations about whether Siva, Brama or Visnu is the most powerful, forgetting that all these were nothing but anthropomorphic abstractions of the different aspects of one reality.

The notion that so-called "modern science" is a new religion (a contest of unproven speculations) is already quite old.

Btw, a good example of the reductionist mindset (instead of piling up abstractions) could be the Upanishadic reduction of all the Gods to one Brahman, for which Einstein accidentally discovered a formula, E = mc^2, where c is a constant, implying that there is no time in the Universe.


You are throwing the baby out with the bath water, with respect to financial modelling. Yes, there are failures, and, yes, the models are severely imperfect.

We reduced the risk on a portfolio of 2 billion Euro from about a billion Euro to about 50 million Euro using hedging. The remaining 50 million was mostly basis risk, i.e., the mismatch between the underlying instruments in the liabilities and the hedge assets.

Using a similar logic to yours, senior management argued that we introduced a new risk called basis risk by trading derivatives.


This might be naive of me, but how did you have a billion euro risk on a portfolio of 2 billion euros? And how do you assess that risk to now be 50 million euros instead?

With respect, this seems like the numbers were plucked out of somewhere. I'd propose that, if they come from a statistical model, it's the model that paints your position in the most optimistic light.

I don't judge you for that, but you should really read this article and look out for other fallacies of statistics.


There are other factors affecting the magnitude of risk that are independent of the model. One of them is leverage. Let's say, for simplicity's sake, there is a stock worth $100 and the assumption is that in a week it has a 50% chance of becoming $90 and a 50% chance of becoming $110. This assumption is your model, and your risk is $10.

Now let's say you are bullish on the stock and borrow $400 to buy 5 shares. Under the same model, after the gain or loss is realized and the $400 is returned to the creditor after one week, you are either $50 poorer or $50 richer: your magnitude of risk grew 5-fold through leverage.
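
The same arithmetic as a quick sketch (a hypothetical snippet; the numbers are taken from the example above):

  price, own_capital, borrowed = 100, 100, 400
  shares = (own_capital + borrowed) // price          # 5 shares

  for outcome in (90, 110):                           # the two equally likely prices
      pnl = shares * outcome - borrowed - own_capital
      print(outcome, pnl)                             # -50 or +50, vs -10/+10 without leverage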


How do you model events such as yesterday's EUR/CHF?


I think the Swiss National Bank intervention is the classic black swan story. The model really only works until somebody puts their foot in it.

Risk management looks at the usual processes via stochastic modelling, but you also need to do scenario analysis, where you impose a deterministic event and then measure the effect on the value of your portfolio.


Stochastic volatility and jump diffusion models
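
For anyone curious what that looks like, here is a minimal sketch of a Merton-style jump-diffusion path (the parameters are invented for illustration, not calibrated to EUR/CHF):

  import numpy as np

  rng = np.random.default_rng(0)

  mu, sigma = 0.0, 0.10                 # drift and diffusive volatility (annualised)
  lam = 0.5                             # expected number of jumps per year
  jump_mu, jump_sigma = -0.15, 0.05     # distribution of log-jump sizes
  T, steps = 1.0, 252
  dt = T / steps

  S = np.empty(steps + 1)
  S[0] = 1.0
  for t in range(steps):
      diffusion = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * rng.standard_normal()
      jumps = rng.normal(jump_mu, jump_sigma, size=rng.poisson(lam * dt)).sum()
      S[t + 1] = S[t] * np.exp(diffusion + jumps)

The jump term is what puts fat tails and sudden gaps into the distribution; a pure diffusion assigns essentially zero probability to a 15-20% move in a single step.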


The risk wasn't 2 billion Euro, just the portfolio. The risk, according to Solvency II, is 30-50% of the equities, approximately a 1% rise or drop in interest rates, a 25% change in FX, etc.

The risk is probably about 40% of the 2 billion.
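
Spelled out with the numbers in this thread (a back-of-the-envelope sketch, assuming the roughly 40% stress applies to the whole 2 billion):

  portfolio = 2_000_000_000        # EUR, the portfolio size quoted upthread
  stress = 0.40                    # middle of the 30-50% Solvency II equity shock mentioned above
  print(portfolio * stress)        # ~800 million, i.e. the "about a billion" pre-hedge risk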


What is wrong with financial modeling, in my opinion, is not only that models cannot grasp an overly complex "reality", but that the reality is changing while you are finishing your model, so no statistical "snapshot" or data set is even close to correct.

Also, so-called Black swans could occur only within such models. There is no chance that one day c or even g could change (no matter what "scientists" used to say in journals).

Btw, finance is a business, not a science.


> There is no chance that one day c or even g could change (no matter what "scientists" used to say in journals).

Well the thing is, c is defined in metres per second. And since 1983, the metre "has been defined as 'the length of the path travelled by light in vacuum during a time interval of 1/299,792,458 of a second.'".[0]

So since 1983, c has been constant by definition.

[0] https://en.wikipedia.org/wiki/Metre


You don't actually want your model to be a fit to the data -- which I think you are implying it is.


Yeah, so, I'm guessing you don't do a lot of financial modelling.


Not everyone is so lucky.


Scientists have tried over at least the past few hundred years (depending on your definitions) to build, from scratch, a perspective on the world which is as free from human bias as possible. At the moment, the jewel in the crown is quantum physics: an inherently statistical theory, so detached from human biases and assumptions that many smart people have struggled to understand or accept it, despite its incredible predictive power.

At the heart of the whole process is statistical inference: generalising the results of experiments or observations to the Universe as a whole. A "statistical crisis in science" would be terrible news. We may have been standing on the shoulders of the misinformed, rather than giants. Our "achievements", from particle accelerators to nukes and moon rockets, could have been flukes; if the underlying statistical approach of science was flawed, the predicted behaviour and safety margins of these devices could have been way off. We may be routinely bringing the world to the edge of catastrophe, if we don't understand the consequences of our actions.

Oh wait, it seems like some "political scientists" have noticed that their results tend to be influenced by external factors. I hope they realise the irony in their choice of examples:

> As a hypothetical example, suppose a researcher is interested in how Democrats and Republicans perform differently in a short mathematics test when it is expressed in two different contexts, involving either healthcare or the military.

The article criticises scientists' ability to navigate the statistical minefield of biases, probability estimates, modelling assumptions, etc. in a world of external, political factors like competitive funding, positive publication bias, etc. and they choose an example of measuring how political factors affect people's math skills!

To me, that seems the sociological equivalent of trying to measure the thermal expansion of a ruler by reading its markings. What do you know, it's still 30cm!


Saying that quantum mechanics is an inherently statistical theory is a blatant misrepresentation. Precisely the point that makes QM so weird is that it is not caused by statistics. In a (properly set up) double slit experiment, a single electron is simultaneously travelling through both slits and causing an interference pattern.


>Saying that quantum mechanics is an inherently statistical theory is a blatant misrepresentation.

Every axiomatic formulation of QM that I have ever seen (for example, [0]) has as an axiom a statistical statement, usually about observables.

Can you provide a set of axioms reproducing QM without such an axiom?

If not, then it's inherently a statistical theory.

In your example, where a single electron lands is completely governed by a probability distribution that is determined by the setup of the test.

[0] http://en.wikipedia.org/wiki/Dirac%E2%80%93von_Neumann_axiom...


Actually, that's an interpretation of quantum mechanics, not quantum mechanics itself. QM itself is in fact a statistical theory: the result of a calculation is a probability distribution of possible outcomes (even in the cases where the outcome might be deterministic). Taking your example of the double slit experiment, the calculation yields, in the case of two slits, a probability distribution with peaks and nulls corresponding to interference fringes. Close one slit and you get a different distribution without said interference.
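
To make that concrete, the standard textbook form of the two-slit distribution (in LaTeX notation) is

  P_{12}(x) = |\psi_1(x) + \psi_2(x)|^2
            = |\psi_1(x)|^2 + |\psi_2(x)|^2 + 2\,\mathrm{Re}\left[\psi_1^*(x)\,\psi_2(x)\right]

where the cross term produces the fringes; close one slit and the prediction collapses to P_1(x) = |\psi_1(x)|^2, with no interference. Either way, what the theory hands you is a probability distribution over where the electron lands.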

What you describe is an interpretation of those probability distributions, and there are many conflicting views on the interpretation of QM (though, to date, these interpretations have not led to conflicting numerical results in the calculations). Wiki has a reasonable list of the extant interpretations [1] but I'll also draw your attention to an interpretation that is inherently statistical: Quantum Bayesianism (Nature has a nice overview [2]). To be crude, the idea is to look at QM from the point of view of the experimentalist: what information does she have available to her? In the case of the double slit experiment, it's impossible to say from a measurement of the electron at the screen which slit it passed through, so you cannot condition the measurement on either one individually, but both together.

QBism is only one of the many interpretations of QM; others include the many-worlds interpretation (only deterministic in the crudest sense; any collapse of a wavefunction is thought to cause as many divergent universes as there are possible outcomes, but the one the experimentalist observes is still a statistical result), the Copenhagen interpretation (also statistical; while the wavefunction is thought to represent the "true" state of a particle, its evolution under measurement is still non-deterministic), etc.

Finally, in case anyone is interested in more in-depth reading, the Stanford Encyclopedia of Philosophy has some fantastic readings on these, and related, topics. Articles worth checking out are that on measurement in QM [3] and those on the various interpretations (e.g. [4-6]).

[1] http://en.wikipedia.org/wiki/Interpretations_of_quantum_mech...

[2] http://www.nature.com/news/physics-qbism-puts-the-scientist-...

[3] http://plato.stanford.edu/entries/qt-measurement/

[4] http://plato.stanford.edu/entries/qm-copenhagen/

[5] http://plato.stanford.edu/entries/qm-manyworlds/

[6] http://plato.stanford.edu/entries/qm-modal/


"all models are wrong, but some are useful." - George Box [1]

George Box expressed my general feeling about statistics early on: it is a very useful tool, but remember the limitations of the methods, the data, and the people applying them. I would like to see an emphasis on openness and transparency with data, so others can replicate the analysis and the community can come up with ways to make best practices accessible to anyone.

[1] http://en.wikiquote.org/wiki/George_E._P._Box


A good (in my opinion) trend in physics in the past decade or two has been the rise of "blind" analyses[1]. Basically, the entire analysis is predetermined, before looking at the data. Once all the details are nailed down and everyone agrees with the approach, the blinds are taken off. There's no room for "p-hacking".

This has some disadvantages, though. It requires a good understanding of the experiment so that you can figure out what an analysis will actually tell you. It's difficult to do a blind analysis on a brand new apparatus, since there can always be unanticipated problems with the data. As an example, one dark matter experiment invited a reporter to their unblinding. At first, it looked like they'd detected dark matter, but then they had to throw out most of the events because they were due to unanticipated noise in one of the photomultiplier tubes[2].

[1] http://www.slac.stanford.edu/econf/C030908/papers/TUIT001.pd... is a quick review.

[2] http://www.nytimes.com/2011/04/14/science/space/14dark.html
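
One common flavour of blinding is to develop and freeze the analysis against data with a hidden offset added to the quantity of interest. A minimal sketch (the helper names, seed handling, and numbers here are hypothetical, not taken from the linked review):

  import numpy as np

  SEALED_SEED = 42    # hypothetically held by someone outside the analysis team

  def blind(values, seed=SEALED_SEED):
      # Add a secret constant so analysts cannot tune cuts toward a preferred answer.
      offset = np.random.default_rng(seed).uniform(-1.0, 1.0)
      return values + offset

  def unblind(blinded_estimate, seed=SEALED_SEED):
      # Applied exactly once, after the analysis is frozen and signed off.
      offset = np.random.default_rng(seed).uniform(-1.0, 1.0)
      return blinded_estimate - offset

  raw = np.random.default_rng().normal(loc=0.3, scale=0.1, size=1000)  # pretend measurements
  blinded_mean = blind(raw).mean()     # all cuts, fits and checks are developed on this
  true_mean = unblind(blinded_mean)    # revealed only at the very end

The crucial part is procedural rather than technical: the offset stays sealed until every analysis choice has been fixed.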


I really don't understand the meaning of this sentence (below). Perhaps somebody could explain?

> As a hypothetical example, suppose a researcher is interested in how Democrats and Republicans perform differently in a short mathematics test when it is expressed in two different contexts, involving either healthcare or the military.


There was a paper recently that found people did poorly on math problems if the naive, wrong solution confirmed their political views: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2319992

It's been making its rounds in the popular media with headlines like "Politics makes you dumb".


As yummyfajitas said, this is just an example of how the actual issue manifests, but here's what that sentence means:

Democrats and Republicans have their own biases. These biases may skew their thought processes and make them perform differently on mathematics tests that are worded differently. For example, a Republican may (unconsciously) overshoot the numbers for a question about healthcare costs, or a Democrat for a question about military expenditure. Although this shouldn't happen, since mathematics tests are quite rigorously worded, it might, and a researcher is interested in investigating further.


> In general, p-values are based on what would have happened under other possible data sets. As a hypothetical example, suppose a researcher is interested in how Democrats and Republicans perform differently in a short mathematics test when it is expressed in two different contexts, involving either healthcare or the military. [...] At this point a huge number of possible comparisons could be performed, all consistent with the researcher’s theory. For example, the null hypothesis could be rejected (with statistical significance) among men and not among women—explicable under the theory that men are more ideological than women.

The meaning of a p-value is expressed in terms of what would have happened with a different data set, yes, but that different data set would have arisen through a different random sampling from the population. The explanation above seems to completely misunderstand the issue.


Between the failings in statistics and those in modeling, there's a whole lot of science that's on shaky ground.




