Here's a good example of how p-values are misleading:
Consider a bag containing two coins, one that's fair and one that always shows heads. Pick one coin at random and flip it ten times. Suppose you see ten heads. Would you be comfortable saying that you have the biased coin? Most likely you would. Formally, you'd compute the p-value (in this case something like 1/1024 or 2/1024, depending on whether you do a one-sided or two-sided test). You'd compare this with some cutoff, perhaps 0.05, and claim that you can reject the null hypothesis.
So far so good.
Now let's perform that experiment again, but this time, instead of having one fair coin and one biased coin in the bag, suppose we have one biased coin and 10^1000 fair coins. Suppose again that you randomly pick a coin from the bag and flip it ten times. Again you see ten heads. Do you think you have the biased coin? Of course not! What's the chance that you just happened to choose the single biased coin instead of one of the 10^1000 fair coins? Practically zero. In this case it's much more likely that the ten heads were a statistical fluke from a fair coin. However, if you compute the p-value you still get 1/1024 (or 2/1024 for a two-sided test).
This example should be reasonably convincing that one needs access to a prior in order to correctly interpret p-values. One way of looking at it is that you need to set the threshold appropriately based on the plausibility of the alternative hypothesis. Another way to phrase it is that we actually need to compute the posterior Bayesian probability of the null hypothesis being true.
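To make that concrete, here is a small sketch of the posterior calculation for both bags (my own illustration, using Python's exact fractions so the 10^1000 denominator doesn't underflow). The p-value is 1/1024 either way, but the posterior probability of holding the biased coin swings from roughly 0.999 to roughly 10^-997:

    from fractions import Fraction

    def posterior_biased(n_fair, n_heads=10):
        # P(biased | all heads) with one always-heads coin and n_fair fair coins
        p_biased = Fraction(1, n_fair + 1)        # prior: chance we drew the biased coin
        p_fair = 1 - p_biased
        like_biased = Fraction(1)                 # the biased coin always shows heads
        like_fair = Fraction(1, 2) ** n_heads     # a fair coin: (1/2)^10 = 1/1024
        evidence = like_biased * p_biased + like_fair * p_fair
        return like_biased * p_biased / evidence

    print(float(posterior_biased(1)))                            # two-coin bag: ~0.999, biased is very likely
    print(posterior_biased(10**1000) < Fraction(1, 10**996))     # True: posterior is roughly 1/10^997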
There are tons of other confusing / surprising examples where a naive p-value calculation will be misleading. I wrote about these things here: https://mindbowling.wordpress.com/2016/07/19/p-values/
Except this is not what a p-value is supposed to tell you in the first place. It's supposed to tell you how compatible your observations are with the ASSUMED theory. The minute you use it as an interpretation of a posterior probability (let alone a malformed one), you have violated the assumption on which you rely to interpret it.
To use your own example: if you picked a coin from said bag, which is mostly full of otherwise unbiased coins, and your hypothesis is "this coin is not biased", a sequence of ten heads would be largely "incompatible" with your hypothesis (since, if the hypothesis of unbiasedness were true, you should only expect this result 1 in 1024 times).
It is up to you to then place this result in the context of other prior assumptions etc, and say "yes it's incompatible but still very likely". The p-value does not preclude you from performing a post-hoc analysis of the base-rate if you so wish. It just so happens that the 'assume a hypothesis' approach is a useful one in the absence of all other information. Which is not so different from a Bayesian saying 'I'll use an uninformative prior and let the data speak for itself'.
But using it as a test of posterior probability, especially when prior information exists, and saying "ha, see, it sucks" is not a fair criticism of it (whereas criticising its use as a test of posterior probability is).
Yes, that's exactly my point. Saying that it's really supposed to be the probability of the data assuming the null hypothesis is true is pretty confusing to most people, and oftentimes the gut feeling people keep is that it still somehow tells you how likely it is that the null hypothesis is true. I just like this example because it highlights that there is a distinction between these two probabilities.
In fact, if you look at the document I link to above, you'll see the correct Bayesian calculation where we actually compute p(Biased | 10H). In the first case it's pretty high but in the second case it's very small.
I 100% agree that p-values are hard to get right (not that Bayesian analysis is necessarily more straightforward, mind you). What I disagree with is the assertion that a) the two scenarios listed above are equivalent, and b) that you cannot incorporate a prior analysis for use with p-values (which are best thought of simply as measures of compatibility of the data with respect to a hypothesis). E.g. in the biased vs unbiased example, if your null hypothesis is "The coin is biased", then it is simply the wrong hypothesis for the problem (or, more specifically, the wrong null/alternative pair, since these two only make sense as a pair; a null by itself is not sufficient to define the assumptions for the problem). If your alternative hypothesis is "ok, it's one of the 10^1000-1 unbiased ones", then your null should have necessarily been "it's the biased coin with 1/10^1000 chance of being chosen", which effectively forces you to apply your prior probabilities on the hypotheses themselves. Once you do that, you will easily see that the compatibility of the data with the alternative hypothesis is much higher, exactly as Bayesian analysis would predict. In fact, mathematically it does not make sense for a Bayesian result to contradict a frequentist one when looking at the same outcomes; if this happens, one should suspect that the two problems were formulated using different assumptions, and are thus not equivalent, or that one is looking at different outcomes.
I also disagree with the 'mindblowing' example in your blogpost linked above (though I did enjoy reading it, thank you for linking!). Your example only shows the 'superiority' of Bayesian reasoning because you (presumably unintentionally) misrepresented the alternative hypothesis by the wrong 'tail' (to keep things simple - one could go further and say that there are more effective alternative hypothesis formulations than a 'tailed' one here, since we are not testing simply against the alternative of 'higher bias' but against a known specific bias of 90%). One's choice of alternative hypothesis is part of one's assumptions reflecting prior knowledge. The frequentist analysis is misleadingly poor, in that it is equivalent to the Bayesian analogue of saying "Assume we have a coin that gives heads 90% of the time and I toss it 10 times; let's represent this with a prior probability Binom(10, 0.1)". What? Of course you can expect such a poor prior to give wrong results with high certainty! It makes no sense to assume that if your null is not true, then you should look for prior knowledge in the lower values.
Furthermore, the exact definition of a p-value allows great flexibility in the definition of the 'or worse' criterion. You are free to choose a criterion that takes into account prior information (and that forms part of your assumptions). In the 'mindblowing' example, you see that as soon as you choose the ranking measure to be P/(P+Q), and use it to calculate p-values by summing over the probabilities of the 'or worse' cases with regard to that ranking, you get very clear patterns that agree with your Bayesian analysis (i.e. that 0 HEADS are more compatible with the unbiased coin hypothesis than 1 HEADS, and similarly that the reverse applies for the biased hypothesis).
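As a rough sketch of that ranking-based p-value (my own construction, assuming a fair-coin null against a specific 90%-heads alternative over 10 tosses, and using scipy only for the binomial pmf):

    from scipy.stats import binom

    n = 10
    P = [binom.pmf(k, n, 0.5) for k in range(n + 1)]   # outcome probabilities under the fair null
    Q = [binom.pmf(k, n, 0.9) for k in range(n + 1)]   # ...and under the 90%-heads alternative
    rank = [p / (p + q) for p, q in zip(P, Q)]          # higher = more compatible with the null

    def p_value(k_observed):
        # sum the null probabilities of every outcome ranked no more
        # compatible with the null than the observed one ("or worse")
        return sum(p for p, r in zip(P, rank) if r <= rank[k_observed])

    for k in (0, 1, 9, 10):
        print(f"{k} heads: p = {p_value(k):.4g}")

With this ordering, ten heads still gets p ≈ 1/1024, while zero heads gets p = 1: it is the outcome most compatible with a fair coin once the specific 90%-heads alternative is taken into account.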
I think this example misleads one's intuition for the following reason: in the proposed scenario, you'd only see the coin come up heads 10 times in a row about 1 in 1024 times you ran the experiment. While your conclusion would likely be incorrect, you almost never run into that scenario.
For example, if you conducted a study every week for 20 years, you'd both be extremely prolific and expect to have drawn about one wrong conclusion.
The example is a case of an absurd premise (i.e., a fair coin comes up heads 10 times in a row) leading to an absurd conclusion (that the coin is biased). Of course, this is exactly the guarantee the hypothesis test provides: under robust assumptions, you'll draw the wrong conclusion only rarely.
A 1 in 1024 chance is not an absurd premise! Something that has a 1 in 1024 chance of occurring to a person happens to about 7 million people. If you get that evidence, you need to be capable of interpreting it correctly.
The OP's example would also have worked well enough with even just 5 heads; you'd already pass p < 0.05, despite the actual probability of having picked the biased coin still being minuscule.
You're actually being misled. Formally, you have an observed event, ten heads (10H), and you want to know the probability that your coin is the biased one, conditioned on having seen ten heads. Call picking a fair coin F and picking the biased coin B.
You have P(10H | F) = 1/1024 and P(10H | B) = 1.
The prior probability that you picked the biased coin, P(B), is 1/10^1000. The prior probability that you picked a fair coin, P(F), is 1 - 1/10^1000.
All in all, what is P(B | 10H)? Bayes' rule tells us P(B | 10H) = P(10H | B) * P(B) / P(10H), and the marginal P(10H) = P(10H | B) * P(B) + P(10H | F) * P(F) ≈ 1/1024. So P(B | 10H) ≈ 1 * (1/10^1000) / (1/1024) = 1024 / 10^1000, or approximately 1 / 10^997.
> For example, if you conducted a study every week for 20 years, you'd both be extremely prolific and expect to have drawn about one wrong conclusion.
This is a prior! In the kind of experimentation I do, I run literally tens of thousands of experiments a day.
I've been playing a board game daily for the last two weeks. As part of this game, a player has to draw one out of six unique cards. In the last five games, I've repeatedly drawn the same card. This is a real example.
You can't rely on odds after an event has happened to determine probability. I could shuffle a deck containing 1000 unique cards and then look at their order. The probability that this exact order occurs is astronomically low, but it did happen.
When discussing conditional probability, this is absolutely what you can do.
Let's use your example. Conditional probability is P(A|B), the probability of event A, given that event B was observed. What's the probability that I am a magician, given that I shuffled the deck and when you saw it it was still in new deck order?
Now certainly, there is an astronomically small chance that this was observed due to random chance. And yet if you observed this, I'd expect that you would, with relatively high confidence, believe that I stacked the deck.
I find it fallacious to see this as a criticism of p-values.
If I wanted to identify whether a particular coin was the [most] biased out of 10^1000, flipping one coin 10 times isn't a reasonable setup for the experiment.
And, my conclusion to your hypothetical experiment would be: the person who told me that there was a 99.9...% chance this coin was fair was likely incorrect about that.
I am an engineering manager in a large ecommerce company, overseeing machine learning and our in-house A/B testing framework.
One of the big tenets in my organization is that null-hypothesis significance testing and any associated methodologies (random effects, frequentist experiment design, and various enhancements to FGLS regression) are simply not applicable and not useful for answering questions of policy (any kind of policy, which subsumes pretty much all use cases of running tests).
We take a Bayesian approach from the ground up, put lots of research into weakly informative priors for all tests, develop meaningful posterior predictive checks and we state policy inference goals up front to understand what we are looking for (predictive accuracy? measurement of causal effect sizes? understanding risk in choosing between different options that have differently shaped posteriors?).
We also draw on the literature around “type m” (magnitude) and “type s” (sign) error probabilities in Bayesian analyses, and how that can provide some benefits and flexibility that NHST methods and cookie-cutter power designs cannot.
Your mileage may vary, but my org has found this to be night and day better than frequentist approaches and we have no interest in going back to frequentist testing really for any use case.
For feature analysis (should we enable this new feature) I think Bayesian approaches are far better.
I work in a slightly different space, which is qualification of builds (this assumes you have feature flags, so a build's code shouldn't enable new features by itself, but it may include refactoring, as well as code that will later be enabled by a flag, etc.).
For this, I originally wanted to push my org to do things in a more Bayesian way, but it didn't work. What did work was forcing org leaders to sit down and actually decide how costly true positives were, and therefore how much developer time they were willing to sink into chasing ghosts.
If you can say "we expect to spend 4 hours of developer time investigating an alert here", and 1/10 alerts will be real at our current sensitivity, then you can decide if these things are alright.
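As a toy version of that trade-off, using just the numbers above (everything else is whatever your org decides):

    hours_per_alert = 4        # expected investigation time per alert (from the comment above)
    real_alert_rate = 0.10     # 1 in 10 alerts is a real issue at the current sensitivity
    print(hours_per_alert / real_alert_rate)         # 40 dev-hours of investigation per real issue caught
    print(hours_per_alert * (1 - real_alert_rate))   # 3.6 of the 4 hours per alert go to chasing ghosts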
Ultimately we can't control all the variables (our sensitivity requirements are informed by things like SLOs that we don't directly control), but it does help make informed decisions about prioritization.
The type m/s stuff is really neat, although in my space we don't care too much about those; that may be because we have conventionally gargantuan sample sizes.
Something that helped me understand where frequentist logic sat was Computer Age Statistical Inference. A lot of the tools of frequentism were developed for a time that is quite different to our own.
Just personally, I don't think frequentism is incompatible...it seems to have just been built up to frame everything in terms of a problem that is tractable (i.e. t-tests), and that is effective but comes with pitfalls. In economics, as an example, it seems that this toolset has gone pretty far.
What I like about Bayesianism, to my ill-informed mind, is that Bayes' theorem feels parallel to other parts of statistics that are effective: Kalman filters, Metropolis-Hastings, MDPs; even Elo is Bayesian (Glicko makes this explicit). And, though this is in no way empirical, when the results are directly comparable I have seen better results (even when the Bayesian model is at an information deficit). No idea if that generalises beyond my activities (largely, sports modelling), there are still many pitfalls, and implementation can be tricky (I still don't understand Bayesian Data Analysis)...but it is pretty useful.
I am philosophically Bayesian, but given that I work with large datasets, I am practically frequentist.
I actually think that many of the people who promote Bayes everywhere have never tried to run a simple Stan regression on 100k+ data-points (pro-tip: sample, then sample some more, give up as it's taking too long).
That being said, from a philosophical point of view, Bayes is definitely the way I think.
In my org we frequently run Bayesian regression fitting on datasets of millions of samples.
The key is: don’t use pymc or stan for large data, just actually write your own MCMC code and write log likelihoods for your own models. It’s very easy and very fast, even in Python.
We do still use pymc and stan for other, smaller modeling tasks.
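For anyone wondering what "write your own MCMC" looks like in practice, here is a minimal random-walk Metropolis sketch for a Bayesian linear regression in plain numpy. It is illustrative only: the synthetic data, the weakly informative prior, the step size, and the chain length are placeholders, not what we actually run:

    import numpy as np

    rng = np.random.default_rng(0)

    # Fake data so the sketch runs end to end; in practice X and y come from your dataset.
    n, p = 1_000_000, 3
    X = rng.normal(size=(n, p))
    true_beta = np.array([0.5, -1.0, 2.0])
    y = X @ true_beta + rng.normal(size=n)

    def log_posterior(beta, sigma=1.0):
        # Gaussian likelihood with a known noise scale, plus a weakly
        # informative N(0, 10^2) prior on each coefficient.
        resid = y - X @ beta
        return -0.5 * np.sum(resid ** 2) / sigma ** 2 - 0.5 * np.sum(beta ** 2) / 100.0

    def metropolis(n_samples=2000, step=0.001):
        # Start near the mode (the least-squares fit) so a short chain only
        # has to explore locally; the step size is a placeholder to tune.
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        current = log_posterior(beta)
        draws = []
        for _ in range(n_samples):
            proposal = beta + step * rng.normal(size=p)
            lp = log_posterior(proposal)
            if np.log(rng.uniform()) < lp - current:
                beta, current = proposal, lp
            draws.append(beta.copy())
        return np.array(draws)

    samples = metropolis()
    print(samples[500:].mean(axis=0))   # posterior means, close to true_beta here

Vectorizing the log-likelihood over the full dataset is what keeps this fast; the Metropolis loop itself is trivial.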
Yeah, but it's not worth it for my purposes. Given the kinds of data and the wide variety of problems we deal with, it would be an investment of too much resources relative to the rewards.
Great comment. I can see how the Bayesian setting uniquely equips you for dealing with non-point testing settings in which reasoning about alternatives to do power, type m, and type s design would be hard.
I'd be very curious about two follow-up questions here:
1 - A frequentist approach to the above could still be formulated in a minimax sort of way, but then you have to deal with alternatives which are close to your null. It's not like this problem goes away for Bayesians; it still seems like the final sample sizes you calculate could end up being very sensitive to the prior. Does this happen in practice?
2 - What kind of optimization goals do your users prefer when trading off power, type m, and type s? My guess based on this formulation is that it's something of the flavor "max power s.t. P(type m or type s or type I) <= alpha", but I wanted to check.
For 1 - keep in mind that frequentist inference cannot be used to support a statement like “the probability that variant A is better than variant B is X” or “the distribution of improvement from variant B over variant A is Y” - the only question a frequentist analysis can answer is, “Assuming the null hypothesis distribution is true, the unlikeliness of the observed data is Z.” From that point of view, we find frequentist analysis is simply epistemologically unsuited for comparing multiple (or a continuum of) policy options, period. Given this, we start to ask totally new types of experiment design questions. We no longer ask an unphysical question such as, “assuming effect size X and sample size N, what is the probability of falsely rejecting the null” - the idea of “rejecting the null” doesn’t map to any notion of optimal policy selection, so we just don’t care about such a question when designing an experiment. Instead we ask ourselves, “if the true effect size is X, what is the probability I will make a mistake in estimating that effect size, and how large a mistake?” or “if the true relationship between the target and the covariate is positive, what is the probability I’ll mistakenly think it’s negative?” - basing experiment design on how my physical beliefs can be wrong helps me make decisions. Basing it on tail properties of a null distribution does not.
2. It really depends on the experiment. For causal inference, both type s and type m need to be very low, but type I can be high (I don’t care about rejecting a null). For inference where I only care about final predictive accuracy, this may not matter. For example in an extreme case I could have two perfectly collinear predictors, which means their coefficients can be arbitrary as long as the sum yields the true coefficient on the underlying linear component. If my goal is causal inference of the effect size, this would ruin it, since type m error can be unboundedly bad. But if the goal is overall predictive accuracy, it doesn’t matter at all - I just ignore the coefficients.
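For readers who haven't met type s / type m errors before, here is a rough simulation sketch of the two quantities, in the spirit of the usual design-analysis calculation. It conditions on a conventional significance filter purely to show the effect, and the numbers are invented:

    import numpy as np
    from scipy import stats

    def design_errors(true_effect, se, alpha=0.05, n_sims=100_000, seed=0):
        # Simulate an estimator that is roughly normal around the true effect,
        # keep only the "significant" results, and look at what survives.
        rng = np.random.default_rng(seed)
        est = rng.normal(true_effect, se, size=n_sims)
        z_crit = stats.norm.ppf(1 - alpha / 2)
        sig = est[np.abs(est) > z_crit * se]
        type_s = np.mean(np.sign(sig) != np.sign(true_effect))   # wrong sign, given significance
        type_m = np.mean(np.abs(sig)) / abs(true_effect)          # exaggeration ratio
        return type_s, type_m

    # A small true effect measured noisily: sign errors are common and
    # the significant estimates exaggerate the effect several-fold.
    print(design_errors(true_effect=0.1, se=0.5))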
> “ it still seems like the final sample sizes you calculate could end up being very sensitive to the prior.”
I prefer to flip this around. A frequentist model has a prior too whether anyone wants to admit it or not - usually it is some unrealistic flat / uninformative prior or improper prior. The results of a frequentist method are equally sensitive to this implicit (huge) assumption. A Bayesian approach at least makes it explicit, admits the sensitivity, puts the range of prior choices out in the open for skeptical review, lets you carry out sensitivity analysis, and very often relies on real data and domain expertise to posit a much more physically plausible prior.
This looks really neat but I unfortunately don't have the necessary background knowledge to understand what this is even visualizing. Is there a tutorial to help acquire the necessary math/background knowledge to be able to comprehend what's being modeled?
For example, what are "Power", "Alpha", "n", and "d"?
Also "Type I" and "Type II" errors? What's the intuition as to how these all relate each other?
Is there a blog post or chapter that explains all of this? Or would this require significant learning, i.e. a full semester course or a textbook?
I'm going to try my best to explain these in an intuitive way. Statistics has lots of terms with names that are arbitrary and confusing.
To set the context, we are trying to use data to help us test a hypothesis. An example might be: "if we give this pill to a person, they will be cured of their disease". Statisticians test this by setting up two groups: Group A gets nothing (or a placebo), Group B gets the pill.
In statistics, you assume the "Null Hypothesis", in other words, that there is no difference between the two groups. You use hypothesis testing to help you "reject" the null hypothesis, to say that the groups are actually different. If the groups are different, that means the pill cures the disease. So we take a bunch of data about the two groups, run some math on that data, and use the result of that math to help us decide if we can reject the null hypothesis.
Statistics is a bunch of tradeoffs between certainty, making the wrong call, and data volume. The terms you have mentioned are either "knobs" (tradeoffs) we can make or measures that helps us understand our results.
Here's what those terms mean:
Type 1 Error: also known as "False Positive". You thought the pill cured the disease, but it does not.
Type 2 Error: also known as "False Negative". You thought the pill did nothing, but it actually works.
Power: the chance to avoid Type 2 error (false negative). The higher your power value, the lower chance you incorrectly assume your pill is ineffective.
N: The number of "observations", in our case, the number of patients in the trial for our pill.
The others are a little trickier to explain.
Alpha: Statisticians use a "confidence interval" as a way to communicate how uncertain they are about a particular result. In our trial we might say "patients were 15% less likely to have the disease after taking the pill, give or take 2%". We don't think the decrease is exactly 15% (what we observed) but is instead somewhere in that neighborhood. Alpha is a measure of the chance the real effect is OUTSIDE of your confidence interval. So in this case, the chance the effect is < 13% or > 17%.
Cohen's D: In our trial, we might measure "the number of times the patient coughed in a day" in addition to "do they have the disease anymore yes/no". In order to compare our two groups, we might look at the average number of coughs per day in group A vs group B. This is called measuring the "difference in means". Cohen's D expresses that difference in means in units of the pooled standard deviation, so that effect sizes can be compared across studies and measurement scales.
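To tie those knobs together: once you fix d, n (per group), and alpha, the power follows. Here is a rough normal-approximation sketch (not necessarily the exact formula the visualization uses):

    from scipy import stats

    def power_two_sample(d, n, alpha=0.05):
        # Approximate power of a two-sided two-sample test with n subjects per
        # group and standardized effect size d, using a normal approximation.
        z_crit = stats.norm.ppf(1 - alpha / 2)
        ncp = d * (n / 2) ** 0.5     # how far the true effect sits from zero, in standard-error units
        return (1 - stats.norm.cdf(z_crit - ncp)) + stats.norm.cdf(-z_crit - ncp)

    print(power_two_sample(d=0.5, n=64))   # ~0.80: the classic "medium effect, 64 per group" figure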
> Alpha: Statisticians use a "confidence interval" as a way to communicate how uncertain they are about a particular result. In our trial we might say "patients were 15% less likely to have the disease after taking the pill, give or take 2%". We don't think the decrease is exactly 15% (what we observed) but is instead somewhere in that neighborhood. Alpha is a measure of the chance the real effect is OUTSIDE of your confidence interval. So in this case, the chance the effect is < 13% or > 17%.
I know this sounds intuitive, but it is wrong.
The true effect is not a random variable.
The random variable is the statistic.
When we say "95% confidence interval", we are referring to the fact that 95% of the confidence intervals constructed this way, over repeated samples, will contain the true effect - not to the chance that the true value is in the specific confidence interval you constructed.
Edit: The latter is either 0 or 1 but you don't get to find out in the context of a single test.
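A quick simulation makes this concrete: the 95% describes the long-run behavior of the interval-constructing procedure, not any single interval (toy numbers, just for illustration):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    true_mean, n, n_experiments = 5.0, 30, 10_000
    covered = 0
    for _ in range(n_experiments):
        sample = rng.normal(true_mean, 2.0, size=n)
        half_width = stats.t.ppf(0.975, n - 1) * sample.std(ddof=1) / n ** 0.5
        covered += sample.mean() - half_width <= true_mean <= sample.mean() + half_width
    print(covered / n_experiments)   # ~0.95: the long-run coverage of the procedure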
I absolutely loathe statistics terminology. It’s such a road block for people and an example of erudite traditionalism not being challenged enough.
(although the current terms in question are very mild and not knowing them probably speaks to how the vocab in general blocks people from learning stats rather than these particular terms doing it)
It’ll probably require 2-3 semesters in stats to really understand - most intro courses barely cover the basics and don’t even reach power. You really have to apply regression to many real datasets to truly understand the concepts.
I am sure some statisticians have good reasons for hating on descriptivist statistics but I wish they would express it with less vitriol. I am not generally sympathetic when someone says a tool sucks when the tool seems to do an okay enough job for most purposes.
I find this visualisation to be one of the better-looking ones for demonstrating this concept. I wish the author had included the formula, however, since it is helpful to see exactly which one they have used.
There are two main reasons why frequentist[1] methods are losing popularity:
1. Your tests make statements about some pre-specified null hypothesis, not the actual question you’re usually interested in: Does this model fit my data?
2. The probability statements from frequentist methods are usually about the properties of the estimator, not about the thing you’re estimating. This is very unintuitive and confusing.
So why are these big deals? Let’s tackle the null hypothesis first. The null hypothesis isn’t necessarily that the effect is null, or that there is no effect. Its exact meaning is very specific to the test.
The easiest way to think about it is imagine you have some test that you’ve designed that tests if a person’s height is 5’ 10”. You build some measurement error into the test, so if a person is reported at slightly above or below 5’ 10” they won’t be automatically eliminated. This is how you design a frequentist test. You set up a scenario like this, and you study the properties of your test so you can make a statement like if a person is 5’ 10”, I’ll only be wrong with this test 5% of the time (false positive rate, or Type I error) or, alternatively, if they’re not 5’ 10”, this test will say they’re not 5’ 10” 80% of the time (statistical power).
The problem with this approach is only a problem if your situation doesn’t match the conditions the test was originally designed for. You probably don’t care if someone was 5’ 10” exactly. You’re probably more interested in knowing what the likely height of a person is, while also accounting for measurement error. You also might not necessarily have the same type of measurement error. You have different measuring tools and different people collecting your data. If you know frequentist statistics, this isn’t such a huge deal because you can make your own test by modifying the conditions of this one. If you don’t and you just use this original test, then you might end up with some really odd results (like the replication crisis in the early 2000s in psychology). Most frequentist tools are exactly this kind of thing. The null hypothesis of the test is usually that some effect is zero, but that effect being zero or non-zero is not really that meaningful. Your model could be easily misspecified and still give you non-zero effects.
If we look back at the story of the height test, then you can already see the beginnings of our second point. The probability statements concerning the test aren’t statements about the likelihood of the quantity of interest. They’re statements about the behavior of the test. Our test, repeated enough times is theoretically guaranteed to only make errors of one type, false positives, 5% of the time, and only make errors of another type, false negatives, 20% of the time. We don’t actually say anything about the value of the height and its likelihood with this.
Isn’t that weird?
This is the exact problem with confidence intervals and p-values (which are basically two ways of talking about the same thing). The 95% confidence interval isn’t saying that there’s a 95% chance that the true height is in there. It’s saying that the procedure used to construct the interval covers the person’s true height 95% of the time over repeated experiments.
This is so strange and unintuitive that people, even trained statisticians, incorrectly interpret them. If anyone says they’ve never looked at a CI and thought about the endpoints as rough expectations for the possible ranges of “height”, they’re lying. Obviously we don’t go publishing those unnaturally easy mistaken interpretations as findings in academic papers, but anyone that does by mistake (!) should be forgiven or at least sympathized with.
I don’t say all of this to convince you that they’re useless. In fact they’re not at all! Bayesian statistics actually relies quite heavily on the frequentist theory of estimators to evaluate the results it gets from its conditioning procedures (Hamiltonian Monte Carlo, Metropolis-Hastings, and other Markov Chain Monte Carlo techniques). Frequentist statistics is very useful and very valuable. It’s just very easy to misunderstand and abuse. The estimators are like little domain-specific tests that get too broadly applied because they’re very easy to fire and forget. Unfortunately a lot of decisions went into developing the exact scenario that they’re testing, and most people don’t know how to assess those conditions or modify them to better suit their needs.
I hope that was helpful, sufficiently interesting to read, and easy enough to follow!
[1]: I believe you meant frequentist, not descriptive. Descriptive statistics are just things that describe data, like how many modes does this have, what’s the mean or median, etc.
>I hope that was helpful, sufficiently interesting to read, and easy enough to follow!
I am studying biostatistics and currently struggling to wrap my head around the minutiae of estimators, so I enjoyed your response. It's all stuff I've heard before, but it seems every time I read such an argument I understand a little more :)
Don’t worry. I have an MS in biostatistics and I’m in my sixth year of a PhD in statistics. I and everyone I know struggled with these concepts for a really long time. It just takes time and some hard work. It’s hard, unintuitive stuff!