I don't want to discount the usefulness of log probabilities. But this post gets off to a poor start.
> Probabilities are very rarely added together, and probability distributions even more rarely.
Umm, we _often_ add probabilities when we want to know the probability that any one of a finite set of mutually exclusive outcomes will happen.
We often add distributions (perhaps with weights) when we think about mixture models, or when there are multiple mechanisms which can both give rise to the same outcomes.
A point the author didn't make about log probabilities, but which is likely relevant to this audience: we often think about the product of many probability terms when modeling a complex outcome of many steps (e.g. a language model where words are 'drawn' from a distribution, so the probability of a sentence is a product of many terms corresponding to many generative steps). In these cases log probabilities are also important computationally, to avoid underflow. I.e., when you have to reason about large, complex, composed structures generated from stochastic processes, _everything_ can be astoundingly unlikely in absolute terms, and log probabilities let us keep track of the important differences between astoundingly unlikely things.
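To make the underflow point concrete, here's a minimal sketch (the 500-token count and the 0.01 per-token probability are made-up numbers):

```python
import math

# Per-token probabilities for a hypothetical 500-token generation.
probs = [0.01] * 500

# The naive product underflows to 0.0 in double precision (it is 10^-1000).
naive = 1.0
for p in probs:
    naive *= p

# Summing logs keeps the magnitude representable.
log_prob = sum(math.log(p) for p in probs)

print(naive)     # 0.0 -- underflow
print(log_prob)  # about -2302.6, i.e. 500 * log(0.01)
```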
Sure, but mixture models and mutually exclusive events are less common to work with than just "models." Obviously there are times when you add probabilities together, but usually you don't. Even when you're writing a mixture model, or working with Bayes nets, most of the work is done with log probabilities in the various branches, and the results are only transformed back into linear space when the branches are combined.
Mixture models are surprisingly common, and addition/subtraction are not limited to mutually exclusive events: for overlapping events one uses the inclusion-exclusion formula.
I think what you meant was that in computation one often needs to multiply probabilities and in such cases it is helpful to work in the log semiring https://en.wikipedia.org/wiki/Log_semiring
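For concreteness, a sketch of the two log-semiring operations ('addition' is log-sum-exp, 'multiplication' is ordinary addition); the helper names are mine:

```python
import math

def log_add(a: float, b: float) -> float:
    """Semiring 'addition': log(exp(a) + exp(b)), computed stably."""
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))

def log_mul(a: float, b: float) -> float:
    """Semiring 'multiplication': log(exp(a) * exp(b)) = a + b."""
    return a + b

# Sanity check against ordinary arithmetic on representable values.
assert math.isclose(log_add(math.log(0.5), math.log(0.25)), math.log(0.75))
assert math.isclose(log_mul(math.log(0.5), math.log(0.25)), math.log(0.125))
```

The `log1p` trick in `log_add` is what keeps the semiring usable even when `a` and `b` are hugely negative.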
> Obviously there are times when you add probabilities together, but usually you don't.
Maybe you don’t, but it’s hardly unusual. Integrating out nuisance parameters from a posterior distribution is adding probabilities. Monte Carlo methods are based on adding probabilities.
>Sure, but mixture models and mutually exclusive events are less common to work with than just "models."
What do you mean by this? A mixture model is just a model.
Also, how can you do absolutely anything without "working with mutually exclusive events"? You need mutual exclusivity to support the concept of a decision or a hypothesis.
Shameless plug: I wrote some high quality (I hope) notes on HMC, because I couldn't find an explanation that was (a) rigorous, (b) explained the different points of view of MCMC, and (c) was visual! Here's the notes: https://github.com/bollu/notes/blob/master/mcmc/report.pdf
Looks cool. Also, I wish that when I started university somebody would have forced me to start a "notes" repo like yours. Looks great and very smart to have.
Your code contains a small bug: in the leapfrog HMC function, the line pnext = -p should be pnext = -pnext. That said, this line is only of theoretical importance; it has no effect on the final result (unless you use a weird asymmetric kinetic energy).
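For readers without the repo open, the fix in context looks roughly like this (a generic leapfrog sketch with my own names, not the notes' exact code):

```python
import numpy as np

def leapfrog(q, p, grad_U, eps, n_steps):
    """One leapfrog trajectory for HMC: q is position, p is momentum,
    grad_U returns the gradient of the potential energy U(q)."""
    qnext, pnext = np.copy(q).astype(float), np.copy(p).astype(float)
    pnext -= 0.5 * eps * grad_U(qnext)        # initial half step in momentum
    for _ in range(n_steps - 1):
        qnext += eps * pnext                  # full step in position
        pnext -= eps * grad_U(qnext)          # full step in momentum
    qnext += eps * pnext
    pnext -= 0.5 * eps * grad_U(qnext)        # final half step in momentum
    # Negate the *final* momentum (pnext), not the initial p, so the map
    # is its own inverse; with a symmetric kinetic energy K(p) = K(-p)
    # the flip cancels in the accept ratio, hence "theoretical importance".
    return qnext, -pnext
```

On a harmonic potential U(q) = q^2 / 2 this conserves the Hamiltonian to O(eps^2), which makes for an easy sanity check.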
Oh man, these are great notes, very succinct! I'm not much into proofs, but finally picked up the essence of Hamiltonian M-H thanks to this. (Also, I can't imagine what kind of genius it takes to rigorously derive such a method...)
Well, my method so far has been to at least be familiar with some of the theoretical stuff and then try to figure out how to apply it in depth when the opportunity to do so arises. Not sure if there are references where you can directly jump to the applied bit.
Most of those things are stuff you'll encounter if you read up on Hamiltonian MC and information field theory, but it might take some additional reading to get all the required background knowledge.
The first two are just tricks really. For the weighted estimates you just go from:
log(P) = \sum log(P_i)
to
log(P) = \sum w_i log(P_i)
and for the sufficient statistics you get derivations like:
I wonder how much he knows; interpreting it as a thermodynamic ensemble is just going back to thinking about probabilities, really. Information field theory is something very different. The cheesy explanation is that it is just quantum field theory in imaginary time.
>it is just quantum field theory in imaginary time.
Quantum field theory is just thermodynamics in imaginary time, not sure why you object to calling it a thermodynamic ensemble. Besides, I could hardly introduce a subject as 'quantum field theory in imaginary time' now could I?
Interesting article, but with some factual imprecisions:

1. A bit nit-picky, but the Beta distribution has support [0, 1], so the distribution in the graph is not a Beta. It's probably a shifted and scaled Beta.

2. The probability density can be arbitrarily high, which means that the log of the density can't have an upper bound at 0. Probability is bounded at 1, but probability density isn't. The fact that none of the example distributions passes zero is a mere coincidence.

I'm surprised to see these beginner's mistakes in an otherwise insightful article.
It's possible that I misunderstood the text but what's the alternative interpretation of "The log function asymptotes at 0"? Also note that the text says log probabilities when it actually refers to log densities.
Ah, that's likely the interpretation that the author had in mind. But it didn't help that he wrote log probability when he actually talked about log probability density.
You're absolutely right! I meant it the other way around:
As x approaches 0, log(x) goes to minus infinity.
I think that's what he means (the next sentence about "visualiz[ing] very small values of a function by expanding their range" makes sense in that context), but I agree it is not very clear.
I really wish Taleb would employ a different editor. His insights – the ones relayed to me by other people – are brilliant, but I honestly have not been able to get through his writing to them.
Hey! Just wanted to say thanks for posting this, I wasn't aware of it but am excited to read it -- I've been craving a more technical treatment of Taleb's subject matter.
What kind of operation happens when you pointwise multiply two probability density functions? The result doesn't even integrate to 1.0 any more (it can even integrate to 0.0, as in example 4). Contrast that with taking the convolution, which corresponds to adding the two independent random variables that have those densities.
Pointwise multiplication of, say, two density functions gives the likelihood function of two independent random variables with these densities. This function is not a density function, as you say, but is still an important object, as it is used for inference based on the likelihood principle.
But then the term would be P(x1) P(x2), and the correct summation/integration to use would be over the Cartesian product of the sample spaces. Under such integration it's a valid density.
I'm sorry, but I'm lost, are we talking about this likelihood function? I'm guessing no, since that's an object defined in a statistical space, not in a probabilistic space:
Oftentimes, a “likelihood function” is generalized from that definition to refer to any non-normalized function which is still treated like a probability distribution. Any function with a finite integral over the real line can be turned into a probability distribution easily, and there are other functions that can be used in places where you would normally use a probability function, even when they don’t integrate to a finite value over the real line. (For example, improper priors in Bayesian analysis.)
“Function that can in some contexts be treated like a probability distribution” is just too long.
Yes, that's the right likelihood function. To get probability distributions out of the products in the post, all you have to do is normalize them. That would be equivalent to applying Bayes' rule, combining independent evidence about a parameter under an uninformative prior. Normalizing doesn't change the shapes, so I left out the details.
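A quick numeric sketch of that normalization step (the grid and the two Gaussian likelihoods are my own choices, not the post's):

```python
import numpy as np

x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Two independent pieces of evidence about the same parameter.
like1 = gauss(x, -1.0, 2.0)
like2 = gauss(x, 1.5, 1.0)

# Pointwise product = un-normalized posterior under a flat prior.
product = like1 * like2
posterior = product / (product.sum() * dx)   # renormalize to integrate to 1
```

With these numbers the posterior peaks at the precision-weighted mean, (-1/4 + 1.5/1) / (1/4 + 1/1) = 1.0, and the normalization only rescales the vertical axis.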
Joint probability of independent events, especially the common case of independent, identically distributed data. (Or exchangeable observations, in the Bayesian framework).
For example, flipping a coin n times, observing heights of people in a population, etc.
I'm not sure I understand what you are trying to convey.
If we have two continuous random variables in a probability space, we have a mapping P(X, Y): R^2 --> R, such that the integral sums to 1.
If X and Y are independent, then P(X, Y) == P(X)P(Y). The examples in the linked article are a bit confusing because some of them have P(X) and P(Y) as different distributions. But the plot only shows the 2D domain along the 1D line X == Y, which isn't a terribly common occurrence in actual practice.
P(X,Y), when integrated along all possible diagonals given by x+y=z where z goes through all real numbers, is what a convolution is (of P(X) and P(Y) generated by those variables). This gives you the distribution of the variable Z=X+Y.
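Numerically, that diagonal integration is just discrete convolution; a sketch with two arbitrary Gaussians (my numbers):

```python
import numpy as np

x = np.linspace(-20, 20, 4001)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

pX = gauss(x, -2.0, 1.0)  # density of X
pY = gauss(x, 3.0, 2.0)   # density of Y

# Density of Z = X + Y for independent X, Y: the convolution of pX and pY.
# With this symmetric grid, mode="same" keeps pZ aligned with x.
pZ = np.convolve(pX, pY, mode="same") * dx

mean_Z = (x * pZ * dx).sum()  # close to -2 + 3 = 1
```

The result is again a proper density: means add, and for independent variables the variances add too.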
So, we're talking about the same thing, namely that multiplying two probability density functions pointwise roughly corresponds to a single point where X==Y (the same as X-Y==0), but otherwise doesn't seem terribly useful.
Wouldn't you multiply like you do for functions, as an inner product in a Hilbert space?
Something like:
$ f \cdot g = \int_{-\infty}^{\infty} f(x)\,g(x)\,dx $
No?
The Cauchy-Schwarz inequality bounds that inner product by the product of the functions' L2 norms, and for densities it's non-negative. ...Not sure what it means if the value is small, though. What does orthogonality mean when it comes to probability density functions?
Usually orthogonal means uncorrelated random variables (not necessarily independent). A good example is sine and cosine - they are uncorrelated in the interval 0 to pi. But clearly they are not independent.
Sine and cosine can be negative; density functions cannot. Thus orthogonality of two density functions implies their pointwise product is zero almost everywhere, or, said otherwise, the measure of the intersection of their supports is 0.
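A tiny numeric illustration of that point (two uniform densities I picked with disjoint supports):

```python
import numpy as np

x = np.linspace(0, 4, 4001)
dx = x[1] - x[0]

# Uniform densities on [0, 1] and [2, 3]: no overlap in support.
f = np.where((x >= 0) & (x <= 1), 1.0, 0.0)
g = np.where((x >= 2) & (x <= 3), 1.0, 0.0)

inner = (f * g * dx).sum()
print(inner)  # 0.0: orthogonal because the supports never intersect
```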
Along these lines, in Probability Theory: The Logic of Science, Jaynes suggests thinking in decibels (which are just scaled log odds). Here are some excerpts of this:
"... it is very cogent to give evidence in decibels. When probabilities approach one or zero, our intuition doesn't work very well. Does the difference between the probability of 0.999 and 0.9999 mean a great deal to you? It certainly doesn't to the writer. But after living with this for only a short while, the difference between evidence of plus 30 db and plus 40 db does have a clear meaning to us. It's now in a scale which our minds comprehend naturally... In the original acoustical applications, it was introduced so that a 1 db change [would be] perceptible to our ears. With a little familiarity and a little introspection, we think that the reader will agree that a 1 db change in evidence is about the smallest increment of plausibility that is perceptible to our intuition."
"... What probability would you assign to the hypothesis that Mr Smith has perfect extrasensory perception? ... To say zero is too dogmatic. According to our theory, this means that we are never going to allow [our] mind to be changed by any amount of evidence, and we don't really want that. But where is our strength of belief in a proposition like this?
"... We have an intuitive feeling for plausibility only when it's not too far from 0db. We get fairly definite feelings that something is more than likely to be so or less likely to be so. So the trick is to imagine an experiment. How much evidence would it take to bring your state of belief up to the place where you felt very perplexed and unsure about it? Not to the place where you believed it -- that would overshoot the mark, and again we'd lose our resolving power. How much evidence would it take to bring you just up to the point where you were beginning to consider the possibility seriously?
"So, we consider Mr Smith, who says he has ESP, and we will write down some numbers from one to ten on a piece of paper and ask him to guess which numbers we've written down. We'll take the usual precautions to make sure against other ways of finding out. If he guesses the first number correctly, of course we will all say 'you're a very lucky person, but I don't believe you have ESP'. And if he guesses two numbers correctly, we'll still say 'you're a very lucky person, but I still don't believe you have ESP'. By the time he's guessed four numbers correctly -- well, I still wouldn't believe it. So my state of belief is certainly lower than -40 db.
"How many numbers would he have to guess correctly before you would really seriously consider the hypothesis that he has extrasensory perception? In my own case, I think somewhere around ten. My personal state of belief is, therefore, about -100 db. You could talk me into a +-10db change, and perhaps as much as +-30 db, but not much more than that."
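Jaynes's scale is easy to play with; here is a sketch (helper names are mine), where evidence is 10 times the log10 of the odds:

```python
import math

def evidence_db(p: float) -> float:
    """Evidence for a hypothesis of probability p, in decibels of odds."""
    return 10 * math.log10(p / (1 - p))

def prob_from_db(db: float) -> float:
    """Invert: the probability corresponding to evidence in decibels."""
    odds = 10 ** (db / 10)
    return odds / (1 + odds)

print(evidence_db(0.999))   # about +30 db
print(evidence_db(0.9999))  # about +40 db
print(prob_from_db(-100))   # about 1e-10: Jaynes's prior for the ESP claim
```

This makes Jaynes's 0.999 vs 0.9999 example concrete: they sit a full 10 db apart even though the probabilities look nearly identical.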
Bringing it closer to home for many software developers: when we talk about the reliability of a system, we also use a variation on log probability (here, of failure), but we call it "nines": 99% reliability is two nines, 99.99% is four nines, etc. So
nines(x) = -log_10 (1-x)
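A sketch of that in code:

```python
import math

def nines(x: float) -> float:
    """Number of nines for availability x, e.g. x = 0.999 -> 3 nines."""
    return -math.log10(1 - x)

print(nines(0.99))    # about 2.0
print(nines(0.9999))  # about 4.0 (roughly 53 minutes of downtime a year)
```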
The bioinformatics community often uses error probabilities that are PHRED scaled, -10 * log10 of the probability. I didn't realize this was so close to dB!
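The two conversions, as a sketch (function names mine):

```python
import math

def phred(p_error: float) -> float:
    """PHRED quality score from an error probability."""
    return -10 * math.log10(p_error)

def error_prob(q: float) -> float:
    """Error probability from a PHRED quality score."""
    return 10 ** (-q / 10)

print(phred(0.001))    # about 30 (Q30: 1 error in 1000)
print(error_prob(20))  # about 0.01 (Q20: 1 error in 100)
```

For small error probabilities the odds are close to the probability itself, which is why PHRED scores line up so closely with Jaynes's decibels.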
1000 dB of skepticism is watching someone win the lottery ten times in a row, with ten tickets, and saying "impressive trick, but I bet you can't do it an eleventh time".
> The financial crisis that brought the world economy to its knees was caused largely by bad statistics. Analysts assumed that the values of mortgages are Gaussian, and that the tails of mortgage values aren’t more correlated than mortgage values are on average.
Analysts did use models that take into account that the tails are more correlated. Arguably overconfidence on these sophisticated models brought complacency and led to excessive risk taking.
> Despite the fact that these distributions had very similar looking shapes, their products are entirely different. The distributions are shifted so that the center of one is 8 standard deviations from the center of the other
Is this last part correct?
"the distributions" means the distributions of the resulting products, correct?
I might misunderstand, but the centers of all three distributions are identical, at zero, aren't they?
This is just the setting of the example. For each of those distributions he takes two of them which don't overlap much and looks at the convolution (i.e. the distribution of the sum of two independent random variables described by them).
Edit: by the way, the irony is that the convolution calculates products of probabilities and adds them!