>The type of analysis being banned is often called a frequentist analysis
I find that there is a trend of associating "bad statistics" with "Frequentist statistics", which isn't really fair. If you found a statistician trained only in Frequentist methods and asked their opinion on experiment design in psychological research, they would likely be just as appalled as any Bayesian.
I'm a big fan of Bayesian methods, but the solution of "we'll solve the problem of misunderstanding p-values by removing them!" is itself a problem of misunderstanding p-values! The misunderstanding is the issue, not the p-value.
I think what this journal is doing is probably a good thing, but only as the lesser of several evils. The truth is something more like... it is easier for soft sciences to abuse frequentist statistics than Bayesian. Both have merit, that much can't be argued, but it is simply easier to produce meaningless conclusions with frequentist statistics done wrong.
This situation is so bad that it merits banning frequentist methods in this journal, and I think that's reasonable. This doesn't mean that every journal in every field should, but perhaps it will be a useful temporary measure to improve quality.
The problem is that p-values are begging to be misunderstood, and in fact you cannot use them as a decision-making procedure without "misinterpreting" them – after all, you're deciding whether to accept the hypothesis HA, a question about P(HA|D), based on 1-P(D|H0), on the grounds that, while they're not the same, they're proportional. (In that sense the p-value is like the poor man's likelihood ratio.) There's nothing wrong with p-values as a concept, but there's everything wrong with p-values in hypothesis testing. The misunderstanding is baked in.
You can update your posterior based on the p-values yourself though. "Well those eggheads may have disproved X, but X is just common sense, so I'm gonna keep believing it anyway. U-until I see more studies confirming the finding I mean."
I think the problem is not "if you find a frequentist (as opposed to bayesian) statistician", but "if you find a frequentist (as opposed to bayesian) e.g. biologist".
Non-statisticians have been trained using bad, frequentist methods, and one way of forcing them to retrain is by forcing them to learn new statistical tools to get published.
Banning p-values makes sense to me, as they force you to declare an effect as either significant or not significant, rather than looking at the preponderance of the evidence and building knowledge over multiple experiments. It also leads us to focus on statistical uncertainty at the expense of all of the other kinds of uncertainty researchers are faced with: do I have the theory to back this up, am I actually measuring what I am trying to measure, is this coherent with other findings in the field? I do think the editors might be right when they say banning p-values will make the quality of the research go up, not down.
But if you read the original editorial, up at http://www.tandfonline.com/doi/pdf/10.1080/01973533.2015.101..., you can see that they also reject confidence intervals and Bayesian reasoning with uniform priors (which is really the same thing) without providing any guidance at all on better procedures. I fear that will just lead readers to try and guess the reliability of the data themselves, or worse, interpret the sample statistics as numbers without any associated uncertainty.
So they're doing away with poor statistical procedures, but at what cost? It's like that old joke: we've found a 100% reliable cure for cancer – bombing the planet until everybody's dead.
Methodological flexibility in statistical modeling has gone through the roof. New tools and estimation methods make Bayesian methods easy to use: JAGS and Stan, Hamiltonian Monte Carlo, variational Bayesian approximations, expectation propagation and even probabilistic programming (note: Bayesian in this context does not mean subjectivity. It means the ability to quantify uncertainty in the results and increased flexibility in the modeling).
New methods and tools are faster to use and give better results, but they require more statistical knowledge. If the scientists applying these methods, or the peer reviewers, can't understand the advances, it's all for nothing. Many sciences are methodologically very conservative, to the extent that it holds the science back.
How do you increase the statistical knowledge of the field so much that peer reviewers and researchers can be expected to understand and use the new methods, if they can't even be trusted with p-values?
I spent about an hour explaining p-values to a fellow graduate-level researcher a few weeks ago. I pretty much just kept rephrasing the definition in slightly different ways until the person finally got it. In undergrad, hypothesis testing was more or less taught as "do this inscrutable calculation and if the result is 0.05 or less, you win". The point is, in my experience, a lot of people really don't get p-values, even graduate-level and post-graduate scientists.
Let me replay it to you and see if I understand it, because I don't know if I do:
You assume a null hypothesis that (usually) represents the status quo of no influence between the theory and the data. You then collect data. The p value then describes the probability of that data aligning with / being as a result of the null hypothesis.
In other words, a p <= 0.05 says that you have <5% chance that the data came from the theory stated in the null hypothesis - that is, you have a 95% confidence that you can reject the null hypothesis in favour of your new theory.
Is that correct? I may have minced terms there because my stats training is woefully inadequate, but I think I adequately conveyed the concept?
Not quite. The 5% represents the chance that if the null hypothesis is true, you would draw data at least as extreme as the data you just saw in a repeated experiment.
Computing the probability that the data came from the theory stated in the null hypothesis would require a (Bayesian) prior.
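If it helps, the definition can be made mechanical. Here's a minimal simulation sketch (all numbers invented, assuming a simple one-sample setup where the null distribution is fully specified): the p-value is just the fraction of datasets generated under H0 whose summary statistic is at least as extreme as the one you observed.

```python
# Minimal sketch: the p-value as "how often would data at least this extreme
# arise if H0 were true?", estimated by simulation. Numbers are made up.
import numpy as np

rng = np.random.default_rng(0)

observed_mean = 0.5          # hypothetical observed sample mean
n = 30                       # hypothetical sample size
sigma = 1.5                  # assume H0: samples come from N(0, sigma^2)

# Draw many datasets under H0 and record their sample means
null_means = rng.normal(loc=0.0, scale=sigma, size=(100_000, n)).mean(axis=1)

# Two-sided p-value: fraction of null datasets at least as extreme as observed
p_value = np.mean(np.abs(null_means) >= abs(observed_mean))
print(f"simulated p-value: {p_value:.3f}")
```

Nothing in that calculation ever mentions the alternative hypothesis or how plausible it was to begin with; that's the part that needs a prior.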
Also, Tloewald's reply is completely and inexorably wrong. Tloewald seems to want a Bayesian answer, which frequentist statistics can't give you.
"Given the evidence, there is a >=5% probability of the null hypothesis being true"
and
"There is a >=5% probability that if the null hypothesis were true, that your data would be at least as extreme"
The only difference I see is how you avoided saying anything about the null hypothesis, but I don't see how you can avoid saying anything about it.
If H0 were true, then the result we got would be unlikely; how can you not conclude that H0 is unlikely? What step are you missing other than collecting a preponderance of evidence against it?
The article never enters into this distinction. It makes it clear that people misinterpret evidence against the null hypothesis as evidence of the alternative, which is a false dichotomy.
I am confused. I also have sympathy for Tloewald at this point.
Sure. Let H0 be the null hypothesis and D be the data you observed. The first statement is P(H0|D) = 0.05. The second is P(D|H0) = 0.05.
The two quantities are related to each other via Bayes rule:
P(H0|D)=P(D|H0)P(H0)/P(D)
So indeed, as P(D|H0) goes down, so does P(H0|D). But if P(H0)/P(D) is sufficiently large, you can easily have P(H0|D) high while P(D|H0) is low.
I too have sympathy for everyone confused by frequentist stats - they tend to answer the exact opposite question that one really wants answered. In contrast, Bayesian stats tend to answer the question that most people ask.
P(D) is the probability of observing the data you just saw, due to either the null or non-null hypothesis. It's a strictly Bayesian quantity, since it's dependent on a prior. If your model has only a null and alternative hypothesis, then P(D) = P(D|H0)P(H0) + P(D|HA)P(HA).
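To make the gap between the two quantities concrete, here's a toy calculation (all numbers invented): P(D|H0) comes out "significant" by the usual threshold, yet P(H0|D) stays high because the prior on H0 is strong.

```python
# Toy Bayes-rule calculation with invented numbers: a small P(D|H0)
# does not force a small P(H0|D) if the prior on H0 is strong.
p_d_given_h0 = 0.04   # "significant" by the usual threshold
p_d_given_ha = 0.10   # the alternative doesn't predict this data well either
p_h0 = 0.95           # strong prior belief in the null

p_d = p_d_given_h0 * p_h0 + p_d_given_ha * (1 - p_h0)   # law of total probability
p_h0_given_d = p_d_given_h0 * p_h0 / p_d                 # Bayes rule

print(f"P(D|H0) = {p_d_given_h0:.2f}, but P(H0|D) = {p_h0_given_d:.2f}")
# -> P(H0|D) is about 0.88: the null is still probably true despite "p < 0.05"
```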
Since I just learned yesterday that statisticians precisely distinguish between "probability" and "confidence", kudos for using it correctly. At least I believe you used them correctly.
Others have already explained it well enough, but I'll just add a few important concepts. First, if you remember nothing else, remember this: the p-value has absolutely nothing to do with the alternative hypothesis. In the context of p-values, there is no alternative hypothesis; it doesn't exist. The p-value only says anything about the relationship of your data to the null hypothesis. The second thing to keep in mind is that any null hypothesis likely has lots of assumptions built into it, and if any of those is violated, the associated p-value is invalid. So you should always be aware of what those assumptions are, and try to verify that your data satisfies them if possible. (Almost no one does this, of course.) For example, the standard t-test assumes, among other things, that your data is normally distributed and that all groups have equal variance. If either of these assumptions is violated, the test is invalid.
Lastly, if you're in a situation where you're performing a large number of tests, always correct for multiple testing (i.e. false discovery rate or some similar method). Also, you can take advantage of your large data set to construct a negative control (e.g. by shuffling samples, the exact method will vary) and verify that your chosen statistical test gives a flat distribution of p-values, which is the expected result when all of the null hypotheses are true. If you have an excess of small p-values in the null data set, this indicates your test is producing false positives and is not reliable (presumably because one or more assumptions have been violated).
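As a rough sketch of that negative-control idea (toy data, assuming a two-group t-test): shuffle the group labels so the null is true by construction, and check that the resulting p-values look roughly uniform.

```python
# Rough sketch of a negative control: with shuffled labels the null is true
# by construction, so the t-test p-values should be roughly uniform.
# Data and group sizes here are made up.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(size=60)            # pooled measurements, no real group effect
null_pvals = []
for _ in range(2000):
    shuffled = rng.permutation(data)  # random relabeling: 30 "A" vs 30 "B"
    res = stats.ttest_ind(shuffled[:30], shuffled[30:])
    null_pvals.append(res.pvalue)

# An excess of small p-values here would suggest the test's assumptions are off.
print("fraction of null p-values below 0.05:", np.mean(np.array(null_pvals) < 0.05))
```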
When you reject H0, it means that you can be somewhat confident that there was some kind of distortion in your data that moved the mean away from (in this case) 0.
You can create a number of theories that purport to explain this mechanistically, but you'll often need particular setups like a randomized controlled trial that can eliminate alternative explanations. When you've eliminated all the competing reasonable hypotheses, there can only be one. If you can use that hypothesis to make non-obvious predictions, that's further proof that it's right as well.
Hypothesis testing is there to tell you when to take an effect seriously; it doesn't tell you whether your explanation is right outside of very carefully constructed circumstances (i.e. where if you see a particular effect, only one theory can explain it).
But the additional theories that you will disprove in further experiments are a subset of Ha, so how can disproving H0 not be seen as evidence that Ha is correct? How can
H0: the two means are equal
and
Ha: the two means are not equal
not encompass the whole universe?
For practical purposes NHST is a function that returns either H0 or Ha.
In my opinion the most useful answer would be, for example for a mean:
The mean is within this interval. We used a model that has these properties, 95% of the...
I think it is ridiculous that the main focus always seems to be arcane properties of the statistical algorithm and not the answer that it delivers.
This is false. The null hypothesis is either true or false. The veracity of the null hypothesis does not change as long as the experiments are repeated the same way.
How does this contradict what I said? Enlighten me?
I said the theory -- by which I meant the non-null hypothesis -- is not endorsed by a low p value (indeed, this is a major point of the article). A low p-value says "hey, this doesn't look like random data", not "your brilliant hypothesis is probably true". The data might not look random because of a methodological error, outright fraud, or a confound.
This is particularly important when you consider people looking at data over and over again trying to find "an effect". Theoretically, the tests are supposed to get tougher and tougher each time you examine the data (add one degree of freedom) but in practice this doesn't happen. It doesn't matter much with large data sets, but the social sciences often use datasets where n is roughly 100, and you might only have 20 subjects in a cell.
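Here's a quick simulation of that repeated-peeking problem (entirely made-up numbers): test after each new batch of subjects, stop at the first p < 0.05, and the false positive rate climbs well past 5% even though there is no effect at all.

```python
# Toy simulation of peeking: test repeatedly as subjects trickle in and stop
# at the first p < 0.05. Even with no real effect, the chance of ever
# "finding" one is much higher than the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
false_positives = 0
n_experiments = 2000
for _ in range(n_experiments):
    data = rng.normal(size=100)               # no true effect
    for n in range(20, 101, 10):              # peek after every 10 new subjects
        res = stats.ttest_1samp(data[:n], 0.0)
        if res.pvalue < 0.05:
            false_positives += 1
            break

print("false positive rate with peeking:", false_positives / n_experiments)
# lands well above the nominal 0.05
```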
I agree. I'm in the pool of people who don't really understand p-values even after taking 2 statistics courses. The main issue I've noticed is that compared to many other calculations, you just have no idea if your result is correct. You can make wrong assumptions, wrong calculation, apply wrong methods, but in the end you get a number... and that's it. It may be a wrong number and you'll never know. This is completely different to many other practical applications of math where you can verify your result in various ways, or validate your answer against the initial assumptions, or test your program against lots of inputs.
> The main issue I've noticed is that compared to many other calculations, you just have no idea if your result is correct. You can make wrong assumptions, wrong calculation, apply wrong methods, but in the end you get a number... and that's it.
The same is true in Bayesian statistics, and even simple formal reasoning with no statistics in sight. If you make wrong assumptions, you'll get the wrong result.
The only thing you can expect statistics to do is help you change your opinion about the relative merits of opposing theories. If both your opposing theories are wrong, you will still be equally wrong.
The true flaw with frequentist statistics is that it goes out of its way to hide this fact from you. In contrast, Bayesian stats forces you to explicitly choose a prior, enumerate your assumptions, and accept that your conclusion is based on these things.
It's actually entirely possible to do some "checking of your answer" for p-values, as well. As you mentioned, for practical math, you can often validate your answer against the initial assumptions. This is true for statistical testing too, as it typically relies on many theoretical assumptions. So what you can do in practice is propose a different set of assumptions, perform the hypothesis test in a manner that follows those new assumptions, and see if you obtain a similar result. Typically, for any given hypothesis that you want to test, there are several possible methods for performing that test, so you can redo your test many times. This is one type of robustness checking, which includes many other things as well (e.g. running your test over subsamples or resamples of the data, checking for sensitivity to outliers, etc). Good statisticians generally like to do lots and lots of robustness checking.
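As a small illustration of that kind of robustness check (made-up data): run the same comparison with tests that rest on different assumptions and see whether they broadly agree.

```python
# Small illustration of a robustness check: compare two groups with tests
# that rest on different assumptions and see if the conclusions agree.
# The data here is invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
group_a = rng.normal(loc=0.0, scale=1.0, size=40)
group_b = rng.normal(loc=0.5, scale=1.0, size=40)

p_t = stats.ttest_ind(group_a, group_b).pvalue                        # assumes normality, equal variance
p_welch = stats.ttest_ind(group_a, group_b, equal_var=False).pvalue   # drops the equal-variance assumption
p_mw = stats.mannwhitneyu(group_a, group_b, alternative="two-sided").pvalue  # nonparametric

print(f"t-test p={p_t:.3f}, Welch p={p_welch:.3f}, Mann-Whitney p={p_mw:.3f}")
# Wildly different answers would be a hint that some assumption is doing
# a lot of the work.
```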
I understand p-values, but I always have real problems understanding the thing of 95% confidence interval not meaning 95% probability of the true parameter being in the interval. I once grasped it, but then I forgot the reason. And now I look at this paragraph:
"the problem is that, for example, a 95% confidence interval does not indicate that the parameter of interest has a 95% probability of being within the interval. Rather, it means merely that if an infinite number of samples were taken and confidence intervals computed, 95% of the confidence intervals would capture the population parameter"
And I say, "What? If out of every 100 random samples, in 95 of them the parameter is in the interval, then surely the probability of the parameter being in the interval is 95% by definition?"
Where was the catch? I remember there was one (which is enough for practical purposes because it means I won't say in a paper that the probability of the parameter being in the interval is 95%) but I feel dumb for not being able to see at first glance something that is supposed to be basic statistics...
OK, I think I more or less got it and I can answer my own question... correct me if I'm wrong, but I think the interpretation "What? If out of every 100 random samples, in 95 of them the parameter is in the interval, then surely the probability of the parameter being in the interval is 95% by definition?" is flawed because we are not talking about the same interval every time, right?
If I have a variable that is always positive, then I could have a weird procedure to generate confidence intervals that gives me the interval [-inf,0] 5% of the time and the interval [0,inf] 95% of the time.
This would meet the definition of confidence intervals perfectly, and yet when I get [-inf,0] the real probability of the parameter being in the interval is 0%, and when I get [0,inf], it's 100%.
I wonder how large this discrepancy may be in practice (as this is obviously a made-up extreme case).
The key thing to note is that the parameter being estimated is a fixed (but unknown) quantity. Unlike in Bayesian inference, where we assume the parameter (or our belief in it) is random. The distinction is important. We want inference about the parameter, yet because the interval is the random quantity with confidence intervals, we cannot assign probability statements to the parameter.
As a completely fabricated example, suppose the true proportion in a coin flip experiment is 40%. If my confidence interval is [.35, .65], what's the probability that .40 is in [.35, .65]? It's 1. The probability that a fixed, but unknown parameter will lie in any confidence interval will be either 0 (it isn't in the interval) or 1 (it is). The _proportion_ of times the interval contains the true parameter is the level of confidence (95%).
To your always-positive example, that procedure is not particularly weird. There's always a balancing act with CI's about length and confidence level (otherwise, I could choose all reals as my interval and get 100% confidence level). That your [0, inf) interval has 100% coverage means that you could probably shrink that interval so that it has finite upper bound without losing more than 5% confidence. Hard to say without a specific distribution in mind or mild assumptions, but an application of either Markov's or Chebyshev's Inequality would allow you to make really loose bounds with only relatively minor assumptions.
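A small simulation of that "proportion of intervals" reading (parameters invented): the true mean stays fixed, the intervals bounce around from sample to sample, and about 95% of them happen to cover it – but each individual interval either does or doesn't.

```python
# Small coverage simulation: the parameter (true mean) is fixed, the
# intervals are random, and roughly 95% of intervals cover the parameter.
# Parameters here are made up.
import numpy as np

rng = np.random.default_rng(4)
true_mean, sigma, n = 0.4, 1.0, 50
covered = 0
trials = 10_000
for _ in range(trials):
    sample = rng.normal(loc=true_mean, scale=sigma, size=n)
    half_width = 1.96 * sigma / np.sqrt(n)        # known-sigma 95% interval
    lo, hi = sample.mean() - half_width, sample.mean() + half_width
    covered += (lo <= true_mean <= hi)

print("coverage:", covered / trials)   # ~0.95, but any single interval either
                                       # contains 0.4 or it doesn't
```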
IIRC, in bayesian terms, confidence intervals express `P(X ∈ [A,B] | X=x) = 0.95`, where X is the unknown parameter and [A,B] is the interval, assumed to be some fixed function of the data. Ordinarily I think they expect this to be satisfied for all values x. So this is P = 95% where the parameter is known but the interval is not (because the interval depends on the data, which is not known yet).
On the other hand credible intervals express `P(X ∈ [A, B] | A=a,B=b) = 0.95` (or more generally `P(X ∈ [a,b] | the data) = 0.95`). The latter is what is intuitively meant by "95% probability" of the true parameter being in the interval, because you do know a and b but not the parameter.
The example with random sampling of confidence intervals from {ℝ⁺, ℝ⁻} is indeed a good illustration of the difference.
"The probability of this parameter being in the interval" is what's called a credible interval, and it will coincide with the confidence interval if you assume a uniform prior, that is, if before running the experiment you figure every result is equally likely.
In general, frequentists just don't like to talk about the properties of this particular sample, only about long-term frequencies – hence the name. Why? Because they object to the idea of probability as a degree of belief rather than as an objective measure, and given that attitude the statement that "there's a 95% probability the parameter is in this interval" doesn't make any sense: either it's in the interval or it isn't.
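For what it's worth, the simplest case where that coincidence holds exactly is a normal mean with known variance and a flat (improper) prior. A quick numerical sketch (all numbers invented):

```python
# Simplest case where credible and confidence intervals coincide: normal data
# with known sigma and a flat (improper) prior on the mean. Numbers invented.
import numpy as np

rng = np.random.default_rng(5)
sigma, n = 2.0, 25
sample = rng.normal(loc=1.0, scale=sigma, size=n)
xbar = sample.mean()
se = sigma / np.sqrt(n)

conf_int = (xbar - 1.96 * se, xbar + 1.96 * se)   # frequentist 95% CI

# With a flat prior, the posterior for the mean is N(xbar, se^2);
# sample from it and take the central 95% credible interval.
posterior_draws = rng.normal(loc=xbar, scale=se, size=100_000)
cred_int = np.percentile(posterior_draws, [2.5, 97.5])

print("confidence interval:", conf_int)
print("credible interval:  ", cred_int)   # numerically the same, up to noise
```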
> it will coincide with the confidence interval if you assume a uniform prior, that is, if before running the experiment you figure every result is equally likely
Huh? Unless you can bound the set of potential results, this isn't possible. Say I want to estimate the half-life of some material (bounded below, but not above). A uniform prior doesn't exist. How will the credible interval relate to the confidence interval?
There was a question yesterday about Evidence Based Medicine vs Science Based Medicine. The SBM criticism of EBM is the over-reliance on Randomized Controlled Trials that meet p=0.05, without looking at the prior probability that a treatment would help.
For example, EBM would say that if you have an RCT that shows that a lucky rabbit's foot works, then you have reasonable evidence to put that into practice. The issue is that, as this article points out, even with a statistically significant result for such research, there's no plausible mechanism by which a rabbit's foot makes you lucky. Therefore the RCT is just one part of the whole picture, and subject to a specific type of manipulation.
Something related to keep in mind when it comes to p-values is the false discovery rate:
> If you run lots of phase 2 trials with different drug candidates where only a minority (lets say 10%) actually work, then with standard trial statistics (80% power and 5% false positive rate) you will get 4.5% false positive and 8% true positives – so less than 2 out of every 3 positive trial results were real. A much lower success rate than the 5% error rate commonly assumed.
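Spelling the quoted arithmetic out (same numbers as above, just restated):

```python
# The quoted numbers, spelled out: 10% of candidate drugs work, 80% power,
# 5% false positive rate.
prior_true = 0.10
power = 0.80
alpha = 0.05

true_positives = prior_true * power              # 0.08  -> 8% of all trials
false_positives = (1 - prior_true) * alpha       # 0.045 -> 4.5% of all trials

ppv = true_positives / (true_positives + false_positives)
print(f"fraction of positive trials that are real: {ppv:.2f}")   # 0.64
```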
> The SBM criticism of EBM is the over-reliance on Randomized Controlled Trials that meet p=0.05, without looking at the prior probability that a treatment would help.
The strength of significance testing is that it purposely doesn't try to tell you how likely something is to be true, only how likely the data you got was the result of chance assuming the treatment is no better than placebo. You're still taking the prior probability into account when trying to figure out the truth, you're just not putting a number on it.
My concern with bayesian approaches is that, like with frequentist approaches, the truth is still fundamentally unknowable, only now you're encouraged to put a number on that and pretend that it's science. While bayesian approaches totally make sense in trying to determine a patient's likelihood of having some disease when there is already data available for the prevalence in a population and the sensitivity and specificity of the tests, using bayesian logic to weight clinical trials strikes me as being highly dubious.
It would be one thing if SBM actually developed a framework to give a weight to each methodological feature of a trial, but so far I haven't seen much work to build a functioning system. Though if you're really honest about all the ways that you can have positive results without something actually being true, it seems like almost no amount of research will ever have a significant effect on the prior.
You're absolutely correct that Bayesian approaches are not magical and do not suddenly supply you with vastly more information than frequentist approaches (particularly when you have a really poor prior, in which case the Bayesian approach will be similarly poor). Bayesian statistics is certainly very popular right now, but it should not be looked at as some sort of panacea for all statistical problems.
However, I would say that Bayesian approaches do have a big advantage in terms of helping with the interpretation problems that plague frequentist significance testing. Namely, as the OP article points out, Bayesian approaches reformulate the testing question in a way that is more intuitive, i.e. "what is the probability of the hypothesis given both the prior probability and the new data?". So yes, Bayesian methods surely do not fix everything, but since interpretation of statistics is such a major concern, they can be quite beneficial.
That number will be based on some unproveable assumptions. That's a fact of Gödel's incompleteness theorem, if nothing else. So given this, why is it "dubious" to make those assumptions explicit and obvious?
> The fact is that to make a good decision, eventually you need to compute a single number.
Your linked blog post states that if you make a good decision, then there is a process computing a single number which is equivalent to your process. This is not equivalent to what you claim. As a matter of fact, it's the same kind of confusion that exists around the p-value.
It's not the case that a process explicitly computing such a number automatically makes good decisions, which is what you seem to claim implicitly.
Also, Gödel has nothing whatsoever to do with this.
Eventually you need to compute a number which is either above or below your go/no go threshold. That's the number I'm referring to.
I don't claim you can't arrive at it by some perfect heuristic. I merely claim that you are better off being explicit about your assumptions and formalizing your reasoning. That just makes mistakes more obvious, makes your strong assumptions more clear, and makes it more likely that you will correctly update your beliefs rather than incorrectly discounting/overvaluing evidence.
You are right about Gödel, it's a separate theorem I'm referring to which says you need unproveable axioms. I misremembered, sorry, wrote that before my coffee.
I see where you're coming from, and I agree with you in large part, especially about making your assumptions explicit.
However, I think it's important to notice that an explicit formula for your thought processes can be difficult (computationally expensive) to find. Our brains have evolved to use heuristics and "gut feelings" to make decisions, and the approach you propose forces you to throw all that away and use the much slower general purpose processing part of your brain to emulate those processes. So there's a tradeoff there.
> That number will be based on some unproveable assumptions.
Given that, would you support using a random number generator as part of the drug approval process to remind people of the importance of the unknown and unknowable?
[edit: I previously said I didn't understand Alex's point.] I now understand the point you were trying to make. A better way to put it - if a random number generator were used in a decision process, I'd favor making the algorithm and random seed explicit.
Any procedure you use will have assumptions. You can't escape this. The only question is whether we show or hide them. Can you give an argument in favor of hidden assumptions and non-explicit procedures?
> Can you give an argument in favor of hidden assumptions and non-explicit procedures?
So as counterintuitive as it sounds, I think there are actually a couple of good arguments that can be made here:
1) With significance testing, the burden of supplying the assumptions and determining meaning is largely on the reader. With bayesian, it's transferred to the author. While it might make sense to use Bayesian for things like the Cochrane report, it's not obvious to me that each person who designs a research study and collects/analyzes data should also be in the business of trying to say whether some phenomena is real when looking at all other studies.
Essentially each study now becomes a metastudy, with all of the practical and epistemological problems that entails. The fact that it's difficult to figure out what that even means should be a red flag. (And yes, I realize this is the Chewbacca defense.)
2) So TokenAdult actually turned me onto this book Measurement In Psychology, which is all about the epistemological problems with assuming that anything you can assign a number to is a measurement. That is, having the property of being meaningful when interpreted on a ratio scale. The exact argument is kind of esoteric, but the basic takeaway is that it's very easy to trick yourself into thinking that just because you can assign a number to something that it's a measurement, to the point where assigning numbers to things in the first place tends to lead to worse decision making than if you had just used a green/yellow/red system or whatever.
Regarding (1), with significance testing the burden of supplying assumptions is not placed on the reader. The assumptions are implicitly built into the NHST rather than explicitly built into the prior.
As for each study becoming a meta-study, that's silly. This is indeed the chewbacca defense. Rather, each empirical study provides Bayes factors which the reader can then use to update their posteriors.
Regarding (2), obviously not every number is a measurement. In Bayesian stats, numbers representing probabilities are quite explicitly opinions. They are meaningful on a ratio scale, and are even asymptotically known to be correct. But they aren't measurements.
(They are correct if your priors are absolutely continuous w.r.t. reality. If you hold a religious belief so strong that evidence can't change it ("100% certainty"), that's not an absolutely continuous prior.)
You have heard "19 times out of 20" described in the news? That is the 0.05 restated for laypeople. 1 time out of 20 you will get a false positive, in this case that the rabbit's foot worked.
Nobody. The problem is that if you conduct 20 trials of the efficacy of rabbit feet, you'd expect 1 trial to show a significant effect. (if you're interpreting P-values the way many people do, incorrectly.)
The article is certainly correct that p-values and confidence intervals (or confidence sets, in multi-dimensional contexts) are widely misunderstood, not just in psychology or other social sciences, but in the hard sciences as well. The problem is even worse when you look outside of academia at common practices in more applied settings.
As suggested, a good approach is to take p-values not as conclusive or decisive, but rather as a tool that must be supplemented by other statistics. In particular, the article emphasizes Bayesian methods, which can certainly provide additional information, but this approach can also be rather limited when priors are not well-defined or are entirely unknown, which is unfortunately often the case in many problem domains.
One potential question is how to determine the nature of the distinction mentioned in the conclusion between "preliminary research" and "confirmatory research", particularly in cases where statistics provide the primary evidence, as in, e.g. psychology. Further studies in the same vein as the preliminary research can certainly provide additional supporting statistical evidence, but this doesn't escape the problem that all of the evidence is probabilistic in nature. The key issue here is that since statistical approaches can only give probabilistic evidence that a hypothesis is correct, then they strictly cannot tell you what is certainly true, so even confirmatory research is quite open to falsification. So we wouldn't want the label of "confirmatory research" to somehow suggest to the public the idea that it is certainly correct.
Is the cautious approach then to treat a p-value in the absence of priors on the same level as a p-value in presence of unfavorable priors? When someone tests positive for a cancer test, the priors are known (probability of cancer in the general population is usually very low, and the false positive rate of the test may be relatively high), and so usually that first test is merely indication that further tests are needed. So when you don't know the prior and you observe a low p-value on something, isn't that just "preliminary research" that needs to be further confirmed with other methods or at least the same test but using other data?
> Is the cautious approach then to treat a p-value in the absence of priors on the same level as a p-value in presence of unfavorable priors?
In the presence of a poor prior the Bayesian probability would be biased in some way, so frequentists would say that the p-value in the absence of priors is actually superior in this case. Bayesians would reply that if they thought the prior might be poor then they would simply consider multiple different priors, but it's not clear how this would improve things much over the frequentist approach that simply assumes that the prior is unknown.
> So when you don't know the prior and you observe a low p-value on something, isn't that just "preliminary research" that needs to be further confirmed with other methods or at least the same test but using other data?
Yes, when you observe a p-value with low significance it should definitely indicate to you that more testing is necessary, either by using different testing methods, gathering new samples, or even just increasing the original sample size if that's possible. What I was trying to suggest in my last paragraph was that this should be the case even when we have highly significant p-values, because even significant p-values are not decisive. So even when we have "confirmatory research" that is highly statistically significant, we should still do all of the things that we would do when we have a p-value with low significance. It is sometimes the case that this subsequent research will overturn even very highly statistically significant results (though often this is unfortunately because mistakes in the original statistical methodology are uncovered).
The paper talks about how it seems researchers are "hacking" (their word) p-values. If researchers lack the ethics to use one form of statistics honestly, what is really stopping them from misusing Bayesian analysis?
Sidenote: While we are talking about bayesian stuff.
I recently ran into the sleeping beauty problem ( http://en.wikipedia.org/wiki/Sleeping_Beauty_problem ) and it showed how different interpretations of an experiment can lead to people believing in two very different answers in an almost religious way. Some people argued this thought experiment shows a clear flaw in bayesian thinking. I'm not sure either way, I'm still thinking about the puzzle.
That's a cool problem. It took a while to understand the angle.
I'd say I'm a halfer. I think the extreme sleeping beauty problem highlights how if you're woken up, it's not any more probable it's a head or a tails awakening, since, for a reason I can't explain, the million tails awakenings "don't accumulate".
In a sense, it's like the opposite of the Monty Hall problem, since here the sleeping beauty receives no information whatsoever during the experiment.
<quote>
BASP will require strong descriptive statistics, including effect sizes. We also encourage the presentation of frequency or distributional data when this is feasible. Finally, we encourage the use of larger sample sizes than is typical in much psychology research, because as the sample size increases, descriptive statistics become increasingly stable and sampling error is less of a problem.
</quote>
In other words: report the effect size, plot the data, and increase the N.
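A minimal sketch of what that reporting might look like (toy two-group data, everything here invented): the group means, spreads, and a standardized effect size travel across studies better than a lone p-value.

```python
# Minimal sketch of "report the effect size and increase the N": descriptive
# statistics plus Cohen's d for toy two-group data.
import numpy as np

rng = np.random.default_rng(6)
group_a = rng.normal(loc=0.0, scale=1.0, size=200)
group_b = rng.normal(loc=0.3, scale=1.0, size=200)

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = (a.var(ddof=1) * (len(a) - 1) + b.var(ddof=1) * (len(b) - 1)) / (len(a) + len(b) - 2)
    return (b.mean() - a.mean()) / np.sqrt(pooled_var)

print("means:", round(group_a.mean(), 2), round(group_b.mean(), 2))
print("standard deviations:", round(group_a.std(ddof=1), 2), round(group_b.std(ddof=1), 2))
print("Cohen's d:", round(cohens_d(group_a, group_b), 2))
```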
The problem is this: You need a solid background in statistics and some mathematics to perform these tests properly. Evidently, people from that field lack these skills. If someone can't hit a nail on the head with a hammer, blaming and banning the use of hammers won't solve a thing.
Depends on what you are trying to solve - people's understanding of p-values, or many false results? If you care about getting true results and there are easier ways to get there than p-values, why would it be a bad solution?
A similar thing happens in software security. Sure, you could teach people how to use memory management properly - and then watch them fail again and again. Or you could just provide an alternative solution like Rust, which changes the issue completely.
Taking hammers away will solve the problem of holes in things which aren't supposed to have them, including thumbs, walls, people's heads, and candy jars.
I never understood where 0.05 came from. It seems like an arbitrary, magic number. Aren't those bad in science? Shouldn't we have some reason for every number we use? Why 0.05 instead of 0.049?
It is somewhat arbitrary. 0.05 corresponds to roughly 2 standard deviations, and 0.01 to roughly 2.6, under the normal distribution (two-sided). In high energy physics, for example, in the past a result with 3 sigma significance was considered "a discovery", but it turned out there were a lot of false signals, so the bar was raised to 5 sigma (~3 x 10^{-7} probability of getting at least as large an effect assuming there is no effect in reality), while a 3 sigma (or 4) result is considered "evidence of an effect". Unfortunately in softer sciences, given their generally poor state of research standards, we have no idea if the 0.05 or 0.01 levels really are good standards of "discovery".
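For reference, a quick way to check where those thresholds sit on the normal curve, using scipy's normal distribution functions (two-sided for the 0.05/0.01 levels, one-sided for the physics 5-sigma convention):

```python
# Where the common thresholds sit on the normal curve.
from scipy import stats

# Two-sided: how many standard deviations correspond to p = 0.05 and p = 0.01?
print(stats.norm.ppf(1 - 0.05 / 2))   # ~1.96
print(stats.norm.ppf(1 - 0.01 / 2))   # ~2.58

# One-sided: what probability does the particle-physics 5-sigma rule correspond to?
print(stats.norm.sf(5))               # ~2.9e-07
```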
If the stats are based on samples of real-world data, rather than very rigorously designed experimental comparisons, I think there's another reason to be wary of p-values. [Note, in what follows I probably use some terminology wrong, because I'm not a statistician, but I do think the point is important and I don't see much written about it.] In the real world data is not a bunch of independent events, but events (or data points) that are interconnected in ways that are difficult to quantify. I once heard a presentation by an expert in AB testing at a leading tech company who became wary of his results and brought his concerns about non-independence of events that were being tested to corporate statisticians. (He was concerned that, even though the AB testing procedure supposedly randomized the test, interconnectedness within website traffic was not being accounted for.) By his account, the statisticians agreed it was a problem but recommended that he assume that variations due to non-independence would more or less balance each other out. He wasn't satisfied with this and said that he went ahead and did some more measurements, then ran out all the binomial expansions rather than relying on approximations. When he did this more detailed work, he found out that with at least some web-based AB tests, where conventional statistical formulas showed a p-value of .05, he was getting a confidence level of more like 30%. (I don't think you could call his measurement a p-value because he wasn't using the formulas normally used to compute p-values, but I think what he was saying was more or less that the p-value from the formulae was .05 when it should have been .30 from a more rigorous look)
As I'm not a statistician by trade I don't keep up on the literature very well, but interconnectedness of data does seem to me to be a very important issue. I'm wondering if anyone can point me to some helpful reading to understand this side of the issue better. In particular, is there any approach to AB testing that can reliably address the issue of data interconnectedness in the kind of situation described above?
Could you expand on what you mean by "interconnectedness within website traffic"?
If one person visiting the site has no influence on other people visiting the site, then measurements of their behavior will be independent. If Facebook tests a different interface on half of their users and the changed behavior of those users indirectly has some impact on the behavior of the control group, then your measurements would have some level of dependence – I can imagine that this could happen but it's not clear to me that this would be a common scenario. The same would happen if you measure the behavior of the same person more than once – but in this case there's many procedures for working with paired or autocorrelated data.
The presenter I mentioned did not go into details about what interconnectedness he found, but I think it's quite obvious that people visiting a site do have influence on other visitors, which is at least a part of the underlying issue. On the simplest level, most web sites have share buttons to make it as easy as possible for visitors to influence other traffic. Or, other examples, a trending tweet can massively influence patterns of usage of a website or Facebook page (or many tweets with small reach can cause many small influences); or an RSS feed might influence patterns of tweeting or posting elsewhere. I think there are myriad other interconnections within web traffic. We do a great deal of work to drive traffic that is based on the premise that different visitors to websites are mostly interconnected. It's these factors that give me pause when I think about statistical measures that are premised on an assumption that we're measuring independent events.
Hmmm, again, there are certainly ways that interactions between visitors can cause statistical dependence, but not in the specific case you mention. Let's take an A/B test on a referral funnel. If a user invites all of his friends, and his friends then visit the site, they will be randomized over A and B just like the original user, and so any effect that is not due to changes in the referral experience will simply not matter because it will contribute equally to both groups.
Without better examples it's very hard to judge whether this is a real problem.
I understand if you think this is a non-issue, though I don't agree. The speaker I referenced above asked the statisticians at his company about this and they said it was a non-issue because things balanced out. He thought that was an idealization and claimed to have tested it, building in some real world data, and reported that interconnected data of this kind drastically affected confidence levels. He didn't get into the details of how he measured interconnectedness, however.
The example you give seems to me to oversimplify the issue of complex interconnections between data points, as if the traffic on a real website came from one set of referrals, while in reality it's much more complex, with referrers inducing other referrers and a variety of campaigns, postings, etc. influencing each other, and over time, overlaid in a fairly complex pattern. In other words, a bunch of interrelated data, very little of which is actually independent of other items.
I'm not really asking for an explanation of this in the comment thread here; what I'd like to know is, if there are any studies or other publications that deal with the issue of how to evaluate tests run on interconnected data of this kind.
There are absolutely ways to deal with what you call interconnected data, as I mentioned earlier: paired tests, corrections for autocorrelation, nonparametric and bootstrap methods for non-normal data and so on. But barring any examples of what you mean with interconnectedness in this context, it's hard to recommend any studies or publications because there is no One Method Of Interconnectedness Correction.
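As one concrete version of the concern (entirely simulated, parameters made up): if visits arrive in correlated clusters – say, bursts of traffic coming from the same share or tweet – and a test treats every visit as independent, it will reject far more often than 5% even when A and B are identical. Analyzing at the cluster level, or using one of the corrections mentioned above, is the usual remedy.

```python
# Simulated illustration: visits arrive in correlated clusters (e.g. bursts
# from the same share), but a naive test treats every visit as independent.
# With no real A/B difference, the false positive rate blows past 5%.
# All parameters here are made up.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def clustered_metric(n_clusters=40, cluster_size=25):
    """Each cluster shares a random offset, so visits within it are correlated."""
    offsets = rng.normal(scale=1.0, size=n_clusters)
    noise = rng.normal(scale=1.0, size=(n_clusters, cluster_size))
    return (offsets[:, None] + noise).ravel()

false_positives = 0
trials = 2000
for _ in range(trials):
    a = clustered_metric()
    b = clustered_metric()                      # identical process: no true difference
    p = stats.ttest_ind(a, b).pvalue            # naive test, pretends visits are i.i.d.
    false_positives += (p < 0.05)

print("naive false positive rate:", false_positives / trials)
# Comparing cluster means instead (one number per cluster) restores ~5%.
```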
Also, statistics deals with many idealizations but the idea that randomization allows you to cleanly measure the effect of an intervention in the face of what would otherwise be confounding is simply not one of them. Sorry to disappoint, but with all you're telling us it simply sounds like the speaker was clueless.
Well, if he was clueless then two very large and successful tech companies had a clueless guy running their AB testing and showing great results in each context.
I'm certainly not looking for "One Method for Interconnectedness Correction" (especially not, as you put it, with each word capitalized). I'm looking for studies or papers that might have addressed anything like the effect of interconnectedness of web data on AB testing. I think you're saying, you don't know of any, and also that you personally don't think it's a real issue.
I kind of wish that the widely read publications, like newspapers, that have turned out science articles based on all these now pretty much discredited research articles would issue retractions, or at least clarifications, for each of the articles that has been challenged in this regard.
This is very interesting for me. I recently switched from AI to Human Computer Interaction which is more cross discipline, specifically with a strong influence from psychology. I'll gladly admit that I don't have the best understanding of NHST but I think I understand it well enough.
Interestingly enough there was a post on HN somewhat recently about Bayesian alternatives (BEST). The paper that was linked was:
ftp://ftp.sunet.se/pub/lang/CRAN/web/packages/BEST/vignettes/BEST.pdf
And the recommended book I settled on was by the author of that paper (Doing Bayesian Data Analysis)
I feel like I'm "ahead of the curve" thanks to HN :D
I wonder if it just boils down to this: Exclusive reliance on any single tool by an entire field for a long enough time period will eventually lead to a proliferation of bad results.
More like exclusive reliance on a single tool will lead to a requirement for everybody to use it, even if they don't understand how to use it properly, which leads to both unintentional misuse (by experimenters who don't understand it) and the inability to catch intentional misuse (because adjudicators don't understand it either).
I think this is also important on a conceptual level. There's just too much literature in the life sciences nowadays which, instead of formulating bold hypotheses with clearly distinct effects, is happy to report minuscule effects that just crossed the arbitrary significance barrier.
the main reason to ban significance testing in all fields is because of logicallee's getcher scientific results agency.
Our prices are:
$10,000 random study, no results guaranteed. Not recommended! Highly likely to be damaging.
$20,000 basic study, p=0.5 - study inconclusive but implies it's at least not "more likely" that the damaging/negative result is correct. (Assuming uniform priors.) No scientific value.
$100,000 weak FUD. Suggests that the damaging/negative results (assuming bayesian reasoning or uniform priors) may be incorrect, at a suggestive p<0.10 level. Not conclusive and invites further studies which can strengthen damaging/negative result! Not recommended unless further studies are unlikely. Consists of ten studies, one of which is published with remaining buried. Unscientific.
$200,000 basic refutation. Refutes the damaging/negative result at a statistically significant p<0.05. Likely to be referenced and accepted. However, due to significance level, links to logicallee's getcher scientific results agency should still be avoided! (Invites skepticism.) Scientific. May take up to six months.
$500,000 Silver refutation. Highly significant refutation at p<0.02. Study can be extremely rigorous and sponsorship can be public. The results should be referred to and referenced as widely as possible. Unpublished results (the other 49 studies in the series) to remain unpublished and unreferenced. Due to the number of parallel studies to be involved, may take up to one year for these results. Highly scientific.
$1,000,000 Gold refutation. Highly significant refutation at p<0.01, with extreme amounts of data to be published. Should become the gold standard for data in the field, and the most significant effect published. Full refutation of damaging/negative result, with full scientific rigor. Should be widely promoted. May take up to 24 months to produce data. Gold standard of science.
ACADEMIC BONUS: prices are free for tenured professors, thousands of whom can do all the studies they want and only publish if they see some significant effect.