Is this not the definition of "confidence interval"? The first few Google results all define it this way...



Unless I've forgotten more than I hope, I believe the formal definition of a 95% confidence interval is that "if the model is true, 95% of the experiments would result in a point estimate within the interval." This is distinctly different from "a 95% probability that the true value is contained within the confidence interval", but that is typically what is loosely inferred.


It is distinctly different, but I have to think really hard to follow the practical difference between "If we ran 100 more identical experiments, 95 of them would estimate a population mean result somewhere in this range", and "There's a 95% probability that the population mean falls somewhere in this range."

The thing that makes it hurt is that what seems like a minor rewording of that statement, "We're 95% confident that the range we calculated contains the true population mean," would be correct. (It's not the definition of a CI, but it is implied by the definition of a CI.) And the reason why one is true and the other is false comes down to the technical distinction that the true population mean is a fixed value and not a draw from a random variable, so, in a strictly mathematical sense, it is nonsensical to apply probabilities to it. By the same token, an even more subtle rewording gets us back to falsehood: "There's a 95% chance that the range we calculated contains the true population mean."

Delving into that level of hair-splitting makes for interesting math, but it also leaves me with the opinion that, sure, the most common intuitive interpretation is wrong, but it's wrong in a way that isn't really of much practical importance.

By contrast, the most common intuitive interpretation of the p-value is disastrously wrong.


Nope. It is (in the frequentist model of statistics) exactly what the article is claiming it isn't: if the model is true and we repeat the experiment many times, 95% of the intervals we calculate will contain the true value. The actual CI you get in each experiment will differ.

Another discrepancy between frequentist statistics and the article is that yes, the values at the boundary of your interval are as credible as in the center.


This is the first post I've seen that states the definition of a CI correctly.

Another note: 'confidence interval' typically refers to the frequentist meaning, whereas 'credible interval' is the term used in the Bayesian setting to describe an interval containing 95% of the posterior probability (which is arguably more interpretable). Usage of the two terms does not seem to be strict in general, however.


> the values at the boundary of your interval are as credible as in the center.

What would that mean in the frequentist framework?


To be explicit, and using an example similar to the one in the article, if your CI is (2, 40), with the center being 21, there is no reason to believe that the true value is closer to 21 than to, say, 3.

To provide an extreme case, during the Iraq war, epidemiologists did a survey and came up with an estimated number of deaths. The point value was 100K, and that's what all the newspapers ran with. But the actual journal paper had a CI of (8K, 194K). There's no reason to believe the true value is closer to 100K than it is to 10K. Or to 190K.


You're right: from the frequentist definition of a confidence interval of (2, 40), we can't say that the true value is more likely to be closer to 21 than to 3.

But neither can we say that the true value is equally likely to be closer to 21 or to 3.

The point is that, from the frequentist definition of a confidence interval, there is nothing at all that we can say about how likely the true value is to be here or there.

It could be 3, 21, or 666 and there is nothing that can be said about the likelihood of each value (unless we go beyond the frequentist framework and introduce prior probabilities).


>The point is that, from the frequentist definition of a confidence interval, there is nothing at all that we can say about how likely the true value is to be here or there.

Yes - sorry if I wasn't clear. I did not mean to imply that each value in the interval is equally likely (and looking over my comments, I do not think I did imply that).

The complaint is that the article is stating otherwise as fact.

>One practical way to do so is to rename confidence intervals as ‘compatibility intervals’

>The point estimate is the most compatible, and values near it are more compatible than those near the limits.

They simply are not in a frequentist model (which is the model most social scientists use). I agree with the main thrust of the article in that there are many problems with P values. But I am surprised that a journal like Nature is allowing clearly problematic statements like these.

I don't know enough about the Bayesian world to be able to state if his statement is wrong there as well, but if it is correct there, it is problematic that the authors did not state clearly that they are referring to the Bayesian model and not the frequentist one.

(Not to get into a B vs F war here, but I remember a nice joke amongst statisticians. There are 2 types of statisticians: Those who practice Bayesian statistics, and those who practice both).


> I did not mean to imply that each value in the interval is equally likely (and looking over my comments, I do not think I did imply that).

When you said that "the values at the boundary of your interval are as credible as in the center" you kind of implied that, which is why I asked.

I won't defend the article being discussed, but you opposed their statement that "the values in the center are more compatible than the values at the boundary" with an equally ill-defined "the values at the boundary are as credible as in the center".


I do not read my statement to imply uniform distribution.

What I meant was "there is no reason to prefer values at the center more than values at the boundary" based on the CI (there may be external reasons, though). To me, this is equivalent to your:

>there is nothing that can be said about the likelihood of each value


Ok, we agree. "As credible as" implies uniform "credibility". "More compatible than" implies non-uniform "compatibility". Without any clear definition of "credibility" or "compatibility", it's impossible to interpret precisely what those claims are supposed to mean.


> there is no reason to believe that the true value is closer to 21 than to, say, 3.

I find this very silly. Suppose we ditch the arbitrary 0.95 and go with a 0.999... confidence interval of, say, [-998, 1040]. How can one say that one cannot tell which value is more likely, 21 or 1040?

If this is an actual limitation of the frequentist model, as you said, then everybody should be a Bayesian thinker, and the "confidence interval" is just a quick way to communicate where the posterior bell curve is and how wide it is.


Confidence intervals can be applied to point estimates, estimates of means, and estimates of other things, including higher order moments.

The difference is hugely important: the central limit theorem, and the normal distribution it implies, commonly applies to sample means.

You can calculate confidence intervals for most (not all) other statistics, like point estimates, but the distributions might not be normal.


"for most (not all)" - yes, if analytically. If you can afford bootstrapping, then it is just "for all".


Technically no for the standard frequentist confidence intervals, but if they use the Bayesian Credible Interval then I believe that would be the correct interpretation.



This is one of many Bayesian vs. frequentist blog posts where the frequentist example is presented in such bad faith, or is so wrong, that it's impossible to take seriously. Why is the sample mean used for the frequentist CI when it is not a sufficient statistic, especially since it appears after the section discussing a "common sense approach" in which the author does mention a sufficient statistic: min(D)? All this blog post shows is that reasonable Bayesian approaches are better than frequentist approaches where common sense isn't allowed.


Unless I'm missing something, the author answers that here:

> Edit, November 2014: ... Had we used, say, the Maximum Likelihood estimator or a sufficient estimator like min(x), our initial misinterpretation of the confidence interval would not have been as obviously wrong, and may even have fooled us into thinking we were right. But this does not change our central argument, which involves the question frequentism asks. Regardless of the estimator, if we try to use frequentism to ask about parameter values given observed data, we are making a mistake.


In other words, the author has no mathematical examples to support the argument, and the objection is purely philosophical...


Not as I understand it. Note that the author's argument there isn't "frequentism is bad because it gives an unreasonable answer here", it's "the fact that frequentism gives a different answer here demonstrates that it really is answering a different question".


But the frequentist answer is only different when the frequentist can't use common sense. If you use min(D) as the frequentist estimator, you would get a very different confidence interval, as it would have the form [min(D) - constant, min(D)]. The CDF of the truncated exponential is F(x) = 1-exp(theta-x), and the CDF of the minimum of three samples is 1-(1-F(x))^3. I get that the frequentist 95% CI is [9.00142, 10], which for all intents and purposes is the same as the credible interval the author computes.

I agree that credible intervals and confidence intervals answer different questions. I don't think that it's obvious that the confidence interval approach is wrong, and the example in the blog post is definitely not evidence towards this.
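
For what it's worth, a quick numerical check of the min(D)-based interval described above, assuming the blog post's truncated exponential model p(x|theta) = exp(theta - x) for x >= theta, with N = 3 observations and min(D) = 10 (the numbers quoted in the parent):

    # min(D) - theta is Exponential with rate N, so an interval of the form
    # [min(D) - c, min(D)] covers theta with probability 1 - exp(-N*c).
    import numpy as np

    N = 3          # number of observations (from the parent comment)
    min_D = 10.0   # minimum of the sample (from the parent comment)

    c = -np.log(1 - 0.95) / N
    print((min_D - c, min_D))   # approximately (9.00142, 10.0)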


A 95% confidence interval will contain the true mean 95% of the time (across an infinite number of replications of the experiment/study). For a single confidence interval, you have either captured the mean in your confidence interval, or you've not -- there's no probability about it.


I believe this is the correct frequentist interpretation. To quote wikipedia:

> A 95% confidence level does not mean that for a given realized interval there is a 95% probability that the population parameter lies within the interval (i.e., a 95% probability that the interval covers the population parameter).[10] According to the strict frequentist interpretation, once an interval is calculated, this interval either covers the parameter value or it does not; it is no longer a matter of probability.


This is where I get lost:

> For a single confidence interval, you have either captured the mean in your confidence interval, or you've not -- there's no probability about it.

Isn't there? The underlying truth is that you either definitely have or have not captured the population mean in any specific confidence interval. But you can't know this truth. In the long run, if "a 95% confidence interval contains the true mean 95% of the time across an infinite number of replications of the experiment/study," then isn't it true that any single specific experiment's CI has a 95% probability of containing the true value?!

In my untrained mind, this is exactly equivalent to flipping an unfair coin with a 95% chance of heads. Sure, before flipping, the outcome of heads has a 95% probability. After flipping, you either get heads or tails. But if you flip a coin and hide the outcome without looking at it, doesn't it still have a 95% chance of being heads as far as the experimenter can tell?


That cannot be the result of an experiment without some sense of the prior probability, and I don't think CIs as suggested in the article account for that.

For example, if I perform an experiment on a light source and find a 95% CI of 400-1200 lumens for the brightness, the actual probability of this being true is much higher if the light source is a 60W incandescent bulb than if the light source is the sun.


Except in real experimentation you control for all the variables except the one you're investigating, so it is not a very useful thing to compare a bulb with the Sun.

Instead, if you state that "luminescence OF THE 60W INCANDESCENT LIGHTBULB is 400-1200 lumens with 95% CI", then that's the useful information that lets you set the right expectations when designing lighting for your new house, for example.


I think you missed the point of my example. I was suggesting that an experiment performed on the sun showing a 95% CI of the brightness being 400-1200 lumens should result in a reasonable person believing that the probability of the Sun's brightness falling in that range is approximately zero, while the same result for a 60W light-bulb should result in a reasonable person being more than 95% certain that the bulb's brightness falls in that range.

Just like a large number of people misinterpret a P value of .01 to mean a 1% chance of the results being due to chance[1], CIs can be similarly misinterpreted.

1: A .01 P value actually means that if the null hypothesis is true, then you would get the result 1% of the time. The analogy to my above example would be that if I run an experiment and get a result that "the sun is less bright than a 60W light-bulb" with a P value of .01, it's almost certainly not true that the sun is less bright than the light-bulb, since the prior probability of the sun being less bright than a 60W light bulb is many orders of magnitude smaller than 1%.
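
To make the footnote concrete, here is a toy base-rate calculation. Every number in it (the prior, the power) is invented for illustration, and a p-value is not literally the probability of the data under H0, so treat this as an analogy rather than a strict computation:

    # Hypothetical base-rate sketch: even a "significant" result at the 1% level
    # barely moves an extremely small prior.
    prior_h1 = 1e-9          # made-up prior that the sun is dimmer than the bulb
    false_positive = 0.01    # chance of a "significant" result if H0 is true
    power = 0.8              # made-up chance of a "significant" result if H1 is true

    posterior_h1 = (power * prior_h1) / (
        power * prior_h1 + false_positive * (1 - prior_h1)
    )
    print(posterior_h1)      # ~8e-8: still essentially zero despite p < .01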


To see how absurd that definition is, think about this: the CI itself is random! So if you conduct 100 experiments, you'll get 100 (non-overlapping) CIs! So in which of those 100 CIs does the "true population mean" lie? All 100 of them? 95 of them?! You tell me.


>So if you conduct 100 experiments, you'll get 100 (non-overlapping) CIs! So in which of those 100 CIs does the "true population mean" lie? All 100 of them? 95 of them?! You tell me.

I don't know why you get the idea that all 100 will be non-overlapping. That's simply false.

And yes, if your assumptions are correct, regular (i.e. frequentist) statistics says that roughly 95% of the CIs will contain the true mean. There is nothing absurd about it.
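
A quick simulation makes the repeated-sampling claim concrete. The normal population, sample size, and seed here are arbitrary choices for illustration, not anything specified in the thread:

    # Fraction of 95% t-based CIs that cover the true mean; should come out near 0.95.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    true_mean, sigma, n, trials = 5.0, 2.0, 30, 10_000
    t_crit = stats.t.ppf(0.975, df=n - 1)

    covered = 0
    for _ in range(trials):
        sample = rng.normal(true_mean, sigma, n)
        half_width = t_crit * sample.std(ddof=1) / np.sqrt(n)
        covered += (sample.mean() - half_width <= true_mean <= sample.mean() + half_width)

    print(covered / trials)   # close to 0.95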


Speaking as a Stat TA, literally over 90% of the students taking the class will conduct 1 experiment, not 100, which gives you 1 CI, not 100, & then say that particular CI has a 95% chance of containing the population mean! Then when I tell them the mean is either in that CI or not (either 100% in, or 0% in), they google the CI definition & point me to that. That's why I said that definition doesn't work for the masses. It can be interpreted as "if you conduct 100 experiments & get 100 CIs, then roughly 95 of those will contain the true mean", but then nobody ever conducts 100 experiments, so from their pov it's an absurd definition.

It's imperative to understand that these definitions are written not for the average user of statistics, but for a trained statistician. Unfortunately, the average stat consumer vastly outnumbers the professional. Papers are littered with statements like "the p value proves H0" or "proves H1". I have had numerous conversations with scientists (not statisticians, but pharma/epidem/engg people who show up to the stat lab for consult) explaining that their p value doesn't prove H0 or H1. "What do you mean you can't prove H1? Oh, you mean it only rejects H0? Ok, but isn't that the same as proving H1? It isn't?! Well, in my field if I just state it rejects H0 it won't be well understood, so I am going to instead say H1 has been proved!" So there's little the statistician can do.

Regarding overlap, I meant total/exact overlap, as in no two CIs will be identical on any continuous distribution.


>Speaking as a Stat TA, literally over 90% of the students taking the class will conduct 1 experiment, not 100, which gives you 1 CI, not 100, & then say that particular CI has a 95% chance of containing the population mean!

And as a TA, I hope you are marking them wrong! My stats professor went through great pains to point it out, as does my stats textbook.

>but then nobody ever conducts 100 experiments, so from their pov it's an absurd definition.

The definition is not absurd. It's just a definition. You can argue that CI's are being abused and are not as helpful as people think they are and I'll probably agree with you. But that's no excuse for people to use it and get a pass for not even knowing what it means!

I contend that it was defined as such because at the time no one had anything better. I suspect that much is true even now. I'm not aware of any obviously better alternatives, and articles like these even suggest there aren't any - just that we need to be mindful that the CI alone doesn't allow for reliable conclusions.

At the end of the day, this is not a problem with the CI definition. It's not a statistics/technical problem. It's a social/cultural one. As such, the solution isn't to change statistics, but to fix the cultural problem: Why do we keep letting people get away with such analyses? Are there any journals that have a clear policy on these analyses? Are referees rejecting papers because of an over reliance on p-values?

Let's not change basic statistics definitions and concepts because the majority of non-statisticians don't understand them. When the majority of the public can't understand basic rules of logic (like A => B does not mean that (not A => not B)), we don't argue for a change to the discipline of logic. When huge numbers of people violate algebra (e.g. sqrt(a^2+b^2) = a + b), we don't blame mathematics for their sins. (I know I'm picking extreme cases for illustration, but the principle is the same). I had only one stats class in my curriculum. If you want people to perform correct statistics as part of the profession, make sure it is fundamental to much of their work. It would have been trivial to add a statistics component to most of my engineering classes, and that would hammer in the correct interpretation. Yet while we were required to know calculus and diff eq for most of our classes, none required any statistics beyond the notion of the mean (and very occasionally, some probability).

Statistics is a tool. It will always be the responsibility of the person invoking the tool to get it right.

>Regarding overlap, I meant total/exact overlap, as in no two CIs will be identical on any continuous distribution.

Isn't that the whole point of inferential statistics? You have a population with a true mean. You cannot poll the whole population. Hence you take a sample. This is inherently random. There is variance in your estimate (obviously). What should be clear is that the CI should move with your point estimate. Furthermore, you never know the true stddev, so you estimate the stddev from your sample. Now both your center and the width of the CI will vary with each sample. I can't comprehend how you could hope to get the same interval from different samples, given that it is quite possible to get all your points below the mean in one sample and above the mean in the other.

I think people are bashing statistics because it isn't helping them come to a clear conclusion (which is fair). But as I said, all the proposals I've seen appear to be as problematic.



