Confidence intervals are weird because of their very minimal definition. My favorite confidence interval procedure for iid data demonstrates why you need a tighter definition for a useful interval.
For a 93.75% confidence interval, draw 8 points (iid) and look at them as four consecutive pairs. If within every pair the later point is greater than the earlier one, your CI is the empty set; otherwise it’s the whole real number line.
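A quick simulation sketch of that coverage claim (my illustration, not part of the original comment; the standard normal draws are an arbitrary choice, since any continuous distribution works):

    # Brute-force check of the 93.75% coverage, for several unrelated values of m.
    import random

    def silly_ci(data):
        # Four within-pair comparisons: (1,2), (3,4), (5,6), (7,8).
        pairs = [(data[i], data[i + 1]) for i in range(0, 8, 2)]
        if all(b > a for a, b in pairs):
            return None                       # empty set
        return (float("-inf"), float("inf"))  # whole real number line

    def coverage(m, trials=100_000):
        hits = 0
        for _ in range(trials):
            data = [random.gauss(0, 1) for _ in range(8)]
            ci = silly_ci(data)
            if ci is not None and ci[0] <= m <= ci[1]:
                hits += 1
        return hits / trials

    for m in (0.0, 3.2, -100.0):   # population mean, median, your dog's age...
        print(m, coverage(m))

The printed coverage hovers around 0.9375 for every m, which is all the bare definition demands.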
Once you draw some actual data and get a specific interval, you want to ask about some degree of belief that your specific interval contains the actual parameter. In the case that your CI is all numbers, you know for a fact that it contains the true parameter value. In the case that your CI is the empty set, you know for a fact that it doesn’t contain the true parameter value.
I like this CI procedure because it demonstrates two things. 1) The reasoning that runs forward from an unknown parameter to a random interval is very different from the reasoning that runs backwards from a specific interval to the parameter. That asymmetry can be WEIRD. 2) The weirdness is allowed if you limit yourself to the CI definition alone, meaning that if you want a CI to be useful, you need something that rules out weird shit like my example.
The properties of specific CI procedures people actually use are generally much much better than what is allowed by the definition of a CI. If you want useful reasoning backwards from the interval, don’t try to reason solely from the definition of a CI.
I'm having trouble understanding your example. What value is the 93.75% confidence interval for? Is it for the population mean? If so, why does the sample order influence the result?
An x% confidence interval procedure takes data D and produces an interval i(D). For a data generating process that goes from a parameter m to a random dataset D (random in the sense that if you run the same experiment again, you’ll get different data), the probability that i(D) contains m is x%. That’s the definition of an x% CI.
In my example i(D) is a function of the data (a function of the within-pair orderings), and D is a random dataset. Since the points are iid by assumption (sneakily also assuming the probability of an exact tie is zero), each pair comparison is an independent fair coin flip, so the probability that the interval contains all numbers is 93.75% (1 - 1/2^4). Otherwise it’s the empty set.
Unpacking that, suppose you have a real number m. The probability that i(D) will contain m (with D as the random variable) is 93.75%, so it is a valid confidence interval for m.
m could be the population mean, the population median, your dog’s age, whatever. The interval depends on the data, but not on the parameter, and the definition of a CI says that’s fine.
It’s a demonstration that the definition of a CI alone isn’t really useful for reasoning about a parameter given an interval. You need to know more about the specific data generating process and the function i that produced the interval in order to make sure it’s useful.
I'm confused by your example also ... if the points are iid then how can the order influence the estimate? They are independent of each other, so there is no way for the earlier points to influence the next data value sampled from the distribution.
Or to put it another way, if i(D) is a function of the ordering, then isn't the random process you ultimately observe through i(D) by definition not iid, even if D is iid?
that is not a correct interpretation of a confidence interval. what you describe refers to a posterior probability, which is not what CIs do. See Greenland et al for a good paper on this.
It is most definitely not a posterior probability, but it’s always hard to tersely write out which process you’re describing in plain English and no formalism. All the probabilities I mention are probabilities in the sense of a CI. And I’m too lazy to write it out thoroughly.
This paper describes the situation more thoroughly
My main objection was to the iffy sentence that "the probability that i(D) contains m is x%", as opposed to saying that you would expect "x% of those randomly generated intervals to contain m".
But, fair enough, I appreciate your sentiment about 'terseness'. One can only be so nitpicky with words when trying to communicate, before starting to sound like a criminal defense lawyer ...
I posted the following as a comment on the article, but it's stuck in moderation there...
I'm not clear on what it is that you [the post's author] don't understand about interpretation of Bayesian credible intervals.
Both "objective" and "subjective" Bayesians interpret them as degrees of belief - that, for instance, one would use to make bets (supposing, of course, that you have no moral objection to gambling, etc.).
The difference is that "objective" Bayesians think that one can formalize "what one knows" and then create an "objective" prior on that basis, that everyone "with the same knowledge" would agree is correct. I don't buy this. Formalizing "what one knows" by any means other than specifying a prior (which would defeat the point) seems impossible. And supposing one did, there is disagreement about what an "objective" prior based on it would be. To joke, "The best thing about objective priors is there are so many of them to choose from!".
Many simple examples can illustrate that the objective Bayesian framework just isn't going to work. One example is the one-way random effects model, where the prior on the variance of the random effects will sometimes have a large influence on the inference (e.g., on the posterior probability that the overall mean is positive), but where there is no sensible "objective" prior - you just have to subjectively specify how likely it is that the variance is very close to zero. Another even simpler example is inference for theta given an observation x ~ N(theta, 1), when it is known (with certainty) that theta is non-negative, and the observed x is -1. There's just no alternative to subjectively deciding how likely a priori it is that theta is close to zero.
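A grid-approximation sketch of that last example (my own, purely illustrative): the posterior for theta given x = -1 under a flat prior on [0, inf) versus a subjective prior that puts more mass near zero. The exp(-2*theta) prior below is an arbitrary stand-in for such a subjective choice.

    # Posterior for theta >= 0 given x ~ N(theta, 1), observed x = -1,
    # under two different priors, via a simple grid approximation.
    import math

    def posterior_mass_below(cutoff, prior, x=-1.0, grid_max=10.0, n=100_000):
        step = grid_max / n
        thetas = [i * step for i in range(n)]
        weights = [prior(t) * math.exp(-0.5 * (x - t) ** 2) for t in thetas]
        total = sum(weights)
        below = sum(w for t, w in zip(thetas, weights) if t < cutoff)
        return below / total

    flat = lambda t: 1.0                     # flat prior on [0, inf)
    decaying = lambda t: math.exp(-2.0 * t)  # subjective prior favoring small theta

    print(posterior_mass_below(0.5, flat))      # roughly 0.6
    print(posterior_mass_below(0.5, decaying))  # roughly 0.8; noticeably different

How much posterior mass ends up near zero hinges entirely on the prior there, which is exactly the subjective choice being described.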
Frequentist methods also don't give sensible answers in these examples. Subjective Bayesianism is the only way.
It appears to me that the reason Bayesian probability is somewhat elusive is due to its metaphysical underpinning: what do we mean by probability?
If we live in a materialist deterministic world - which many would cite as an axiom for simplicity - then there really is no probability. Everything happens with 100% certainty.
Then, what is probability? If everything will happen with 100% certainty, but probability certainly appears to exist, then probability must reflect something about our information about something occurring.
The author refers to two foundational approaches to our state of knowledge. The first is the objectivist approach, which states that everyone who has the same state of knowledge about a system will evaluate the same probability of something occurring. The second is the subjectivist approach, which states that a given individual with a certain state of knowledge will evaluate some probability of something occurring. To me, these appear to be the same thing except insofar as the former requires a consensus of many while the latter a consensus of one.
The author asks how we might actually define Bayesian probability without resorting to the frequentist approach (i.e. hypothetically simulating many trials of the same event, however infrequent in reality it may be).
First, he says this would mean "interpreting [the credible interval] like a confidence interval". I am no statistician, but is that necessarily true? I don't see why confidence intervals would suddenly emerge in order to interpret a credible interval.
Second, I am not sure the frequentist interpretation is so problematic. When we interpret the plain-English definition of a probability, are we not mentally simulating repeated trials in order to evaluate something's occurrence? What else could a probability imply? If something has a 20% chance of occurring, then it does not occur 80% of the time, and so we must envision 80% of universes (part of the hypothetical trials) where it does not occur. I don't see any other way around this, frequentist or not.
(Note: I am not a statistician, while the author is, and the above is simply my layman's understanding of the article.)
There's no definition of probability that doesn't involve philosophy or metaphysics [0]. Calling frequentist stats "objective" really bugs me. There is no such thing. Every inference procedure involves subjective choices.
Of course, frequentist and Bayesian stats are completely mathematically equivalent. The choice just affects our mental patterns.
No, frequentist and Bayesian statistics are not equivalent.
There are some special cases in which a frequentist 95% confidence interval and a Bayesian 95% credible interval based on some sort of default prior are numerically the same, but that doesn't happen in general.
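One such special case, sketched below (my example, not the commenter's): a normal mean with known sigma and a flat prior on the mean, where the 95% confidence interval and the 95% equal-tailed credible interval are numerically identical.

    # Normal mean, known sigma: frequentist CI vs flat-prior credible interval.
    import random
    from math import sqrt
    from statistics import NormalDist

    random.seed(1)
    sigma, n = 2.0, 50
    data = [random.gauss(3.0, sigma) for _ in range(n)]
    xbar = sum(data) / n
    se = sigma / sqrt(n)

    z = NormalDist().inv_cdf(0.975)
    conf_int = (xbar - z * se, xbar + z * se)     # frequentist 95% CI

    # With a flat prior, the posterior for the mean is N(xbar, se^2):
    post = NormalDist(mu=xbar, sigma=se)
    cred_int = (post.inv_cdf(0.025), post.inv_cdf(0.975))

    print(conf_int)
    print(cred_int)   # same numbers here, but only because of the special setup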
Statisticians would hardly have been vigorously debating the issue for two centuries if it didn't really matter.
> Let’s now suppose that we’ve done a Bayesian analysis. We’ve specified a prior distribution for the parameter, based on prior evidence, our subjective beliefs about the value of the parameter, or perhaps we used a default ‘non-informative’ prior built into our software package.
At first blush the difference is that the Bayesian is using more information. Now don't get me wrong, if Bayes' theorem and its progeny give us useful tools for incorporating that information in our analyses, so much the better.
Every frequentist technique has a Bayesian interpretation and vice-versa. Confidence intervals are equivalent to credible intervals with certain priors.
The result of Bayesian inference is a whole posterior distribution, not just a confidence interval. Any attempt to produce a frequentist version of a posterior distribution is either going to end up just being Bayesian inference in disguise, or be inconsistent.
But even if we focus just on confidence intervals and credible intervals, there needn't be the equivalence you state. A comment elsewhere here discusses a ridiculous confidence interval that is either the whole parameter space or the empty set. That's never going to be what a Bayesian credible interval gives you.
My point is that you cannot ever actually claim to ground a probability value in objective features of the world. Many events aren't repeated, for example.
I think a further problem may be that probability could take on different meanings under different applications. I admit that I may be influenced by how I learned statistics -- in a math class that was primarily focused on proofs and not applications. But I've formed the view that math is math, we choose a math technique that works for the situation at hand, then we choose an interpretation that works for guiding and explaining what we're trying to do.
It's just a perspective, but Bayesian inference makes the prior explicit, whereas it is implicit with frequentist inference. You can't not have a prior.
That's fair, it forces you to document your assumptions. But at some point as you drill down into assumptions, the trail will grow cold. The article even talks about inserting priors that are either subjective or some default function provided by the software.
A possible third approach is to create a hypothetical model and feed random data through it, to get an idea of the spread of the outcomes. Modeling doesn't stumble on conditional terms, and if you're unclear on an assumption, it won't run. The computer doesn't know whether it's a frequentist or a Bayesian, or something different from both.
I'm not a statistician. Whenever I need to do something with statistics, I always test my computation with random data.
In fact I wonder, if modeling had been possible since the birth of statistics, if we would even bother with things like elaborate formulas for statistical tests.
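As a small example of that habit (my sketch, not the commenter's code), here is the textbook standard-error formula for a sample mean checked against nothing but repeated random data:

    # Check a formula by simulation: standard error of the mean of n Exp(1) draws.
    import random
    from math import sqrt

    random.seed(0)
    n = 40

    def sample_mean():
        return sum(random.expovariate(1.0) for _ in range(n)) / n

    formula_se = 1.0 / sqrt(n)   # Exp(1) has sigma = 1, so SE = sigma / sqrt(n)

    means = [sample_mean() for _ in range(20_000)]
    mu = sum(means) / len(means)
    sim_se = sqrt(sum((m - mu) ** 2 for m in means) / (len(means) - 1))

    print(formula_se, sim_se)    # should agree to a couple of decimal places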
Bayesian stats actually let you integrate stochasticity into deterministic models fairly easily, in a way you can't really do with frequentist stats. Bayesian methods are common for geophysical inverse modelling, for example, and probabilistic programming does exactly this. Exact Bayesian inference is impossible in the general case, but approximate inference often works well.
Subjective priors are one of the main advantages of Bayesian stats. Regularization used in ML corresponds to using subjective priors, for example. L2 regularization finds the MAP with a normal prior, favoring parsimonious solutions.
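A toy sketch of that correspondence (my own, one feature only, with made-up numbers): for y ~ N(w*x, sigma^2) and prior w ~ N(0, tau^2), the MAP estimate is exactly ridge regression with lambda = sigma^2 / tau^2.

    # Ridge (L2-regularized least squares) == MAP with a normal prior, 1-D case.
    import random

    random.seed(42)
    sigma, tau = 1.0, 0.5
    true_w = 2.0
    xs = [random.uniform(-1, 1) for _ in range(30)]
    ys = [true_w * x + random.gauss(0, sigma) for x in xs]

    lam = sigma ** 2 / tau ** 2

    # Closed-form 1-D ridge estimate: argmin_w sum (y - w x)^2 + lam * w^2
    w_ridge = sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

    # Brute-force MAP: maximize log-likelihood + log-prior on a grid.
    def log_post(w):
        ll = -sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / (2 * sigma ** 2)
        lp = -(w ** 2) / (2 * tau ** 2)
        return ll + lp

    w_map = max((w / 1000 for w in range(-5000, 5001)), key=log_post)
    print(w_ridge, w_map)   # agree up to the grid resolution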
It seems a Bayesian interpretation of probability is more general. The frequency of events over an infinite number of trials is one way of interpreting probability for things that can be repeated. But that wouldn’t make sense to apply to an election that is only going to happen once, where one still wants to be able to quantify the uncertainty.
There are a few things intermingled in this election example.
1. The outcome of the election here is not a probability. It is a population value: the proportion of people voting for candidate X on election day. It doesn't have to be repeated, in the same way that measurements of height for everyone in the United States would not have to be repeated if we were measuring heights instead of votes.
2. Frequentist probability doesn't require you to physically repeat things. It can reason about what would happen under repeated sampling given certain conditions, and then draw inferences about those assumed conditions. With the election example: if a survey of 100 people shows 70% voting for candidate "A", we don't need to repeat the survey in order to know how likely (how frequent) this result would be if the real proportion of people voting for candidate "A" across the US were 50%.
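The calculation that point 2 gestures at can be done directly (my sketch, exact binomial, no repeated surveys involved):

    # How often would a 100-person survey show at least 70 supporters
    # if the true proportion were 50%?
    from math import comb

    n, p = 100, 0.5
    p_at_least_70 = sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
                        for k in range(70, n + 1))
    print(p_at_least_70)   # on the order of 1e-5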
If you’re trying to quantify your uncertainty about who will win the election, a poll would only be part of it. You want to be able to combine disparate sources of information. Maybe there is preference falsification and you want to incorporate it as some sort of prior. As things get further from simple sampling from a population, the frequency interpretation makes less and less sense to me.
You’re correct that probability still works in a hypothesized deterministic universe. It’s a point that’s too often forgotten, causing discussions to go down unnecessary rabbit holes debating the foundations of quantum mechanics when the topic is the roll of a six-sided die.
Statisticians and mathematicians have gone very far down the path you’re discussing, and you might be interested in some sets of axioms that have come up around probability and relaxations of true/false logic.
The Kolmogorov axioms [0] are the “standard” probability axioms, and are phrased in terms of set theory and measure theory (not requiring any mention of physics or a physical universe!).
There are other ways to quantify degree of belief, however, and they are very interesting. Apparently Cox’s theorem [1] justifies a popular probability framework for Bayesians. But there are many more interesting ways to do degree of belief, like Dempster-Shafer theory [2], which I understand to be a plausibility calculus.
Everybody seems to find a single system and decide it’s the only one out there.
> If we live in a materialist deterministic world - which many would cite as an axiom for simplicity - then there really is no probability. Everything happens with 100% certainty.
I don't know who "the many" are - but I thought determinism had already been disproved.
I am not a physicist so I will not go into quantum mechanics - but I will take a simple example from Science Fiction, and that is the Temporal Paradox. https://en.wikipedia.org/wiki/Temporal_paradox
Sorry - I was being flippant because your argument involved time travel. After the fact an event has 100% probability that it did occur but before the event it does not.
The entire field of chaos theory is about making apparent randomness deterministic, so you are in good company. There is no general mechanism (yet) to do this. Quantum mechanics is the most interesting area.
The measure of determinism is that it predicts the future before it happens. People have been trying to do this for years in weather and the stock market. This is where the concept of Chaos came from.
Leaving math and physics - there is philosophy. To apply determinism to people you would have to decide there is no free will. Maybe this is true and maybe not.
There seems to be a nice duality between Bayesian and Frequentist inference [1]:
Assume that both the system state and the observation are drawn from some joint probability distribution.
There is some function γ of the system state which we seek to estimate. The experimenter applies some decision procedure d to the observation to get their result.
A Frequentist will analyze the situation by conditioning on the model parameter θ. As a result, we get a single target value γ and probability distributions for the observation and decision, depending on θ.
If d results in an interval, the Frequentist calculates the confidence level as the probability that the decision procedure d produces an interval containing γ, under worst-case assumptions for θ. Unbiasedness of the decision procedure means that γ is indeed the function it estimates the best, and it is not a better estimator for any other function γ'(θ).
A Bayesian, on the other hand, will condition the joint distribution on the observation. Consequently, γ is a random variable, while the observation and decision are known.
If d is an interval, its credibility is the probability that γ is within this interval, given the observation. Optimality of the decision procedure means that no other estimator d' produces better results.
[1]: S. Noorbaloochi, Unbiasedness and Bayes Estimators, users.stat.umn.edu/~gmeeden/papers/bayunb.pdf
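A toy version of that duality (my own discrete example, not from the paper): build one joint distribution over (theta, x) and condition it in both directions.

    # theta takes two values with a 50/50 prior; x ~ Binomial(10, theta).
    from math import comb

    thetas = [0.3, 0.7]
    prior = {t: 0.5 for t in thetas}
    n = 10

    def likelihood(x, t):
        return comb(n, x) * t ** x * (1 - t) ** (n - x)

    joint = {(t, x): prior[t] * likelihood(x, t) for t in thetas for x in range(n + 1)}

    # Frequentist direction: condition on theta -> sampling distribution of x.
    sampling = {t: {x: likelihood(x, t) for x in range(n + 1)} for t in thetas}

    # Bayesian direction: condition on the observed x -> posterior over theta.
    def posterior(x):
        norm = sum(joint[(t, x)] for t in thetas)
        return {t: joint[(t, x)] / norm for t in thetas}

    print(sampling[0.3][2])   # P(x = 2 | theta = 0.3)
    print(posterior(2))       # P(theta | x = 2)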
> Unbiasedness of the decision procedure means that γ is indeed the function it estimates the best, and it is not a better estimator for any other function γ'(θ).
Are you suggesting that unbiased estimators are necessarily better than biased ones? If so, check out Stein’s phenomenon for a counterexample. It’s common for biased estimators to dominate unbiased ones in terms of error rates. That’s where the bias-variance trade-off in ML comes from.
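A quick simulation of Stein's phenomenon (my sketch; the true mean vector and dimension are arbitrary choices): in 10 dimensions the biased, shrunk estimator beats the unbiased MLE in total squared error.

    # Positive-part James-Stein vs the MLE for X ~ N(theta, I) in 10 dimensions.
    import random

    random.seed(0)
    p = 10
    theta = [1.0] * p   # any fixed true mean vector works

    def squared_error(est):
        return sum((e - t) ** 2 for e, t in zip(est, theta))

    trials = 20_000
    mle_err = js_err = 0.0
    for _ in range(trials):
        x = [random.gauss(t, 1.0) for t in theta]
        shrink = max(0.0, 1.0 - (p - 2) / sum(v * v for v in x))
        mle_err += squared_error(x)
        js_err += squared_error([shrink * v for v in x])

    print(mle_err / trials)   # close to p = 10 for the unbiased MLE
    print(js_err / trials)    # noticeably smaller for the biased estimator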
Indeed, the unbiasedness condition for an estimator says nothing about its quality relative to other estimators. Instead, it requires that you cannot change the target function γ without the expected loss becoming worse.
The paper I referenced even includes a theorem saying that, as long as the value to be estimated cannot be exactly deduced from the observation, no estimator is both Bayes-optimal and unbiased.
However for many observations, the Bayes-optimal estimator becomes asymptotically unbiased.
Is this at all related to the debate around 538's use of probability in their forecasts? I've been peeking at some of that debate and curious how it will turn out.
This article kind of helps in establishing that it is a hard question to answer. Clearly harder with intervals.
I can't help but think much of this gets overcomplicated because we don't take everything in intervals. In large part because it is hard, yes; but we should be more comfortable with things not being known to an exact value.