I don't want to discount the usefulness of log probabilities. But this post gets off to a poor start.
> Probabilities are very rarely added together, and probability distributions even more rarely.
Umm, we _often_ add probabilities when we want to know the probability that any one of a finite set of mutually exclusive outcomes will happen.
We often add distributions (perhaps with weights) when we think about mixture models, or when there are multiple mechanisms which can both give rise to the same outcomes.
A point the author didn't make about log probabilities, but which is likely relevant to this audience: we often think about the product of many probability terms when modeling a complex outcome of many steps (e.g. a language model where words are 'drawn' from a distribution, so the probability of a sentence is a product of many terms corresponding to many generative steps). In these cases log probabilities are also important computationally, to avoid underflow. I.e., when you have to reason about large, complex, composed structures generated from stochastic processes, _everything_ can be astoundingly unlikely in absolute terms, and log probabilities let us keep track of the important differences between astoundingly unlikely things.
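To make the underflow point concrete, here's a minimal sketch (the 500-token count and the 0.01 per-token probability are made-up numbers):

```python
import math

# Per-token probabilities for a hypothetical 500-token generation.
probs = [0.01] * 500

# The naive product underflows to 0.0 in double precision (it is 10^-1000).
naive = 1.0
for p in probs:
    naive *= p

# Summing logs keeps the magnitude representable.
log_prob = sum(math.log(p) for p in probs)

print(naive)     # 0.0 -- underflow
print(log_prob)  # about -2302.6, i.e. 500 * log(0.01)
```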
Sure, but mixture models and mutually exclusive events are less common to work with than just "models." Obviously there are times when you add probabilities together, but usually you don't. Even when you're writing a mixture model, or working with Bayes nets, most of the work is done with log probabilities in the various branches, and the results are only transformed back into linear space when the branches are combined.
Mixture models are surprisingly common, and addition/subtraction are not limited to mutually exclusive events: for overlapping events one uses the inclusion-exclusion formula.
I think what you meant was that in computation one often needs to multiply probabilities and in such cases it is helpful to work in the log semiring https://en.wikipedia.org/wiki/Log_semiring
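For concreteness, a sketch of the two log-semiring operations ('addition' is log-sum-exp, 'multiplication' is ordinary addition); the helper names are mine:

```python
import math

def log_add(a: float, b: float) -> float:
    """Semiring 'addition': log(exp(a) + exp(b)), computed stably."""
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))

def log_mul(a: float, b: float) -> float:
    """Semiring 'multiplication': log(exp(a) * exp(b)) = a + b."""
    return a + b

# Sanity check against ordinary arithmetic on representable values.
assert math.isclose(log_add(math.log(0.5), math.log(0.25)), math.log(0.75))
assert math.isclose(log_mul(math.log(0.5), math.log(0.25)), math.log(0.125))
```

The `log1p` trick in `log_add` is what keeps the semiring usable even when `a` and `b` are hugely negative.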
> Obviously there are times when you add probabilities together, but usually you don't.
Maybe you don’t, but it’s hardly unusual. Integrating out nuisance parameters from a posterior distribution is adding probabilities. Monte Carlo methods are based on adding probabilities.
>Sure, but mixture models and mutually exclusive events are less common to work with than just "models."
What do you mean by this? A mixture model is just a model.
Also, how can you do absolutely anything without "working with mutually exclusive events"? You need mutual exclusivity to support the concept of a decision or a hypothesis.
Shameless plug: I wrote some high quality (I hope) notes on HMC, because I couldn't find an explanation that was (a) rigorous, (b) explained the different points of view of MCMC, and (c) was visual! Here's the notes: https://github.com/bollu/notes/blob/master/mcmc/report.pdf
Looks cool. Also, I wish that when I started university somebody would have forced me to start a "notes" repo like yours. Looks great and very smart to have.
Your code contains a small bug: in the leapfrog HMC function, the line pnext = -p should be pnext = -pnext. That said, this line is only of theoretical importance; it has no effect on the final result (unless you use a weird asymmetric kinetic energy).
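For readers without the repo open, the fix in context looks roughly like this (a generic leapfrog sketch with my own names, not the notes' exact code):

```python
import numpy as np

def leapfrog(q, p, grad_U, eps, n_steps):
    """One leapfrog trajectory for HMC: q is position, p is momentum,
    grad_U returns the gradient of the potential energy U(q)."""
    qnext, pnext = np.copy(q).astype(float), np.copy(p).astype(float)
    pnext -= 0.5 * eps * grad_U(qnext)        # initial half step in momentum
    for _ in range(n_steps - 1):
        qnext += eps * pnext                  # full step in position
        pnext -= eps * grad_U(qnext)          # full step in momentum
    qnext += eps * pnext
    pnext -= 0.5 * eps * grad_U(qnext)        # final half step in momentum
    # Negate the *final* momentum (pnext), not the initial p, so the map
    # is its own inverse; with a symmetric kinetic energy K(p) = K(-p)
    # the flip cancels in the accept ratio, hence "theoretical importance".
    return qnext, -pnext
```

On a harmonic potential U(q) = q^2 / 2 this conserves the Hamiltonian to O(eps^2), which makes for an easy sanity check.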
Oh man, these are great notes, very succinct! I'm not much into proofs, but finally picked up the essence of Hamiltonian M-H thanks to this. (Also, I can't imagine what kind of genius it takes to rigorously derive such a method...)
Well, my method so far has been to at least be familiar with some of the theoretical stuff and then try to figure out how to apply it in depth when the opportunity to do so arises. Not sure if there are references where you can directly jump to the applied bit.
Most of those things are stuff you'll encounter if you read up on Hamiltonian MC and information field theory, but it might take some additional reading to get all the required background knowledge.
The first two are just tricks really. For the weighted estimates you just go from:
log(P) = \sum log(P_i)
to
log(P) = \sum w_i log(P_i)
and for the sufficient statistics you get derivations like:
I wonder how much he knows; interpreting it as a thermodynamic ensemble is just going back to thinking about probabilities, really. Information field theory is something very different. The cheesy explanation is that it is just quantum field theory in imaginary time.
>it is just quantum field theory in imaginary time.
Quantum field theory is just thermodynamics in imaginary time, not sure why you object to calling it a thermodynamic ensemble. Besides, I could hardly introduce a subject as 'quantum field theory in imaginary time' now could I?
Interesting article, but with some factual imprecisions:

1. A bit nit-picky, but the Beta distribution has support [0, 1], so the distribution in the graph is not a Beta. It's probably a shifted and scaled Beta.

2. The probability density can be arbitrarily high, which means that the log of the density can't have an upper bound at 0. Probability is bounded at 1, but probability density isn't. The fact that none of the example distributions passes zero is a mere coincidence.

I'm surprised to see these beginner's mistakes in an otherwise insightful article.
It's possible that I misunderstood the text but what's the alternative interpretation of "The log function asymptotes at 0"? Also note that the text says log probabilities when it actually refers to log densities.
Ah, that's likely the interpretation that the author had in mind. But it didn't help that he wrote log probability when he actually talked about log probability density.
You're absolutely right! I meant it the other way around:
As x approaches 0, log(x) goes to minus infinity.
I think that's what he means (the next sentence about "visualiz[ing] very small values of a function by expanding their range" makes sense in that context), but I agree it is not very clear.
I really wish Taleb would employ a different editor. His insights – the ones relayed to me by other people – are brilliant, but I honestly have not been able to get through his writing to them.
Hey! Just wanted to say thanks for posting this, I wasn't aware of it but am excited to read it -- I've been craving a more technical treatment of Taleb's subject matter.
What kind of operation happens when you pointwise multiply two probability density functions? The result doesn't even integrate to 1.0 any more (it can even integrate to 0.0, as in example 4). Contrast that with taking the convolution, which corresponds to adding the two independent random variables that have those densities.
Pointwise multiplication of, say, two density functions gives the likelihood function of two independent random variables with these densities. This function is not a density function, as you say, but is still an important object, as it is used for inference based on the likelihood principle.
But then the term would be P(x1) P(x2), and the correct summation/integration to use would be over the Cartesian product of the sample spaces. Under such integration it's a valid density.
I'm sorry, but I'm lost, are we talking about this likelihood function? I'm guessing no, since that's an object defined in a statistical space, not in a probabilistic space:
Oftentimes, a “likelihood function” is generalized from that definition to refer to any non-normalized function which is still treated like a probability distribution. Any function with a finite integral over the real line can be turned into a probability distribution easily, and there are other functions that can be used in places where you would normally use a probability function, even when they don’t integrate to a finite value over the real line. (For example, improper priors in Bayesian analysis.)
“Function that can in some contexts be treated like a probability distribution” is just too long.
Yes, that's the right likelihood function. To get probability distributions out of the products in the post, all you have to do is normalize them. That would be equivalent to applying Bayes' rule, combining independent evidence about a parameter under an uninformative prior. Normalizing doesn't change the shapes, so I left out the details.
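A quick numeric sketch of that normalization step (the grid and the two Gaussian likelihoods are my own choices, not the post's):

```python
import numpy as np

x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Two independent pieces of evidence about the same parameter.
like1 = gauss(x, -1.0, 2.0)
like2 = gauss(x, 1.5, 1.0)

# Pointwise product = un-normalized posterior under a flat prior.
product = like1 * like2
posterior = product / (product.sum() * dx)   # renormalize to integrate to 1
```

With these numbers the posterior peaks at the precision-weighted mean, (-1/4 + 1.5/1) / (1/4 + 1/1) = 1.0, and the normalization only rescales the vertical axis.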
Joint probability of independent events, especially the common case of independent, identically distributed data. (Or exchangeable observations, in the Bayesian framework).
For example, flipping a coin n times, observing heights of people in a population, etc.
I'm not sure I understand what you are trying to convey.
If we have two continuous random variables in a probability space, we have a mapping P(X, Y): R^2 --> R, such that the integral sums to 1.
If X and Y are independent, then P(X, Y) == P(X)P(Y). The examples in the linked article are a bit confusing because some of them have P(X) and P(Y) as different distributions. But the plot only shows the 2D domain along the 1D line X == Y, which isn't a terribly common occurrence in actual practice.
P(X,Y), when integrated along all possible diagonals given by x+y=z where z goes through all real numbers, is what a convolution is (of P(X) and P(Y) generated by those variables). This gives you the distribution of the variable Z=X+Y.
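Numerically, that diagonal integration is just discrete convolution; a sketch with two arbitrary Gaussians (my numbers):

```python
import numpy as np

x = np.linspace(-20, 20, 4001)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

pX = gauss(x, -2.0, 1.0)  # density of X
pY = gauss(x, 3.0, 2.0)   # density of Y

# Density of Z = X + Y for independent X, Y: the convolution of pX and pY.
# With this symmetric grid, mode="same" keeps pZ aligned with x.
pZ = np.convolve(pX, pY, mode="same") * dx

mean_Z = (x * pZ * dx).sum()  # close to -2 + 3 = 1
```

The result is again a proper density: means add, and for independent variables the variances add too.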
So, we're talking about the same thing, namely that multiplying two probability density functions pointwise roughly corresponds to a single point where X==Y (the same as X-Y==0), but otherwise doesn't seem terribly useful.
Wouldn't you multiply like you do for functions, as an inner product in a Hilbert space?
Something like:
$ f \cdot g = \int_{-\infty}^{\infty} f(x)\,g(x)\,dx $
No?
The Cauchy-Schwarz inequality bounds that inner product by the product of the functions' L2 norms, and for densities it's non-negative. ...Not sure what it means if the value is small, though. What does orthogonality mean when it comes to probability density functions?
Usually orthogonal means uncorrelated random variables (not necessarily independent). A good example is sine and cosine - they are uncorrelated in the interval 0 to pi. But clearly they are not independent.
Sine and cosine can be negative; density functions cannot. Thus orthogonality of two density functions implies their pointwise product is zero almost everywhere, or, said otherwise, the measure of the intersection of their supports is 0.
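A tiny numeric illustration of that point (two uniform densities I picked with disjoint supports):

```python
import numpy as np

x = np.linspace(0, 4, 4001)
dx = x[1] - x[0]

# Uniform densities on [0, 1] and [2, 3]: no overlap in support.
f = np.where((x >= 0) & (x <= 1), 1.0, 0.0)
g = np.where((x >= 2) & (x <= 3), 1.0, 0.0)

inner = (f * g * dx).sum()
print(inner)  # 0.0: orthogonal because the supports never intersect
```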
Along these lines, in Probability Theory: The Logic of Science, Jaynes suggests thinking in decibels (which are just scaled log odds). Here are some excerpts of this:
"... it is very cogent to give evidence in decibels. When probabilities approach one or zero, our intuition doesn't work very well. Does the difference between the probability of 0.999 and 0.9999 mean a great deal to you? It certainly doesn't to the writer. But after living with this for only a short while, the difference between evidence of plus 30 db and plus 40 db does have a clear meaning to us. It's now in a scale which our minds comprehend naturally... In the original acoustical applications, it was introduced so that a 1 db change [would be] perceptible to our ears. With a little familiarity and a little introspection, we think that the reader will agree that a 1 db change in evidence is about the smallest increment of plausibility that is perceptible to our intuition."
"... What probability would you assign to the hypothesis that Mr Smith has perfect extrasensory perception? ... To say zero is too dogmatic. According to our theory, this means that we are never going to allow [our] mind to be changed by any amount of evidence, and we don't really want that. But where is our strength of belief in a proposition like this?
"... We have an intuitive feeling for plausibility only when it's not too far from 0db. We get fairly definite feelings that something is more than likely to be so or less likely to be so. So the trick is to imagine an experiment. How much evidence would it take to bring your state of belief up to the place where you felt very perplexed and unsure about it? Not to the place where you believed it -- that would overshoot the mark, and again we'd lose our resolving power. How much evidence would it take to bring you just up to the point where you were beginning to consider the possibility seriously?
"So, we consider Mr Smith, who says he has ESP, and we will write down some numbers from one to ten on a piece of paper and ask him to guess which numbers we've written down. We'll take the usual precautions to make sure against other ways of finding out. If he guesses the first number correctly, of course we will all say 'you're a very lucky person, but I don't believe you have ESP'. And if he guesses two numbers correctly, we'll still say 'you're a very lucky person, but I still don't believe you have ESP'. By the time he's guessed four numbers correctly -- well, I still wouldn't believe it. So my state of belief is certainly lower than -40 db.
"How many numbers would he have to guess correctly before you would really seriously consider the hypothesis that he has extrasensory perception? In my own case, I think somewhere around ten. My personal state of belief is, therefore, about -100 db. You could talk me into a +-10db change, and perhaps as much as +-30 db, but not much more than that."
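Jaynes's scale is easy to play with; here is a sketch (helper names are mine), where evidence is 10 times the log10 of the odds:

```python
import math

def evidence_db(p: float) -> float:
    """Evidence for a hypothesis of probability p, in decibels of odds."""
    return 10 * math.log10(p / (1 - p))

def prob_from_db(db: float) -> float:
    """Invert: the probability corresponding to evidence in decibels."""
    odds = 10 ** (db / 10)
    return odds / (1 + odds)

print(evidence_db(0.999))   # about +30 db
print(evidence_db(0.9999))  # about +40 db
print(prob_from_db(-100))   # about 1e-10: Jaynes's prior for the ESP claim
```

This makes Jaynes's 0.999 vs 0.9999 example concrete: they sit a full 10 db apart even though the probabilities look nearly identical.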
Bringing it closer to home for many software developers: when we talk about the reliability of a system, we also use a variation on log probability (here, of failure), but we call it "nines": 99% reliability is two nines, 99.99% is four nines, etc. So
nines(x) = -log_10 (1-x)
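A sketch of that in code:

```python
import math

def nines(x: float) -> float:
    """Number of nines for availability x, e.g. x = 0.999 -> 3 nines."""
    return -math.log10(1 - x)

print(nines(0.99))    # about 2.0
print(nines(0.9999))  # about 4.0 (roughly 53 minutes of downtime a year)
```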
The bioinformatics community often uses error probabilities that are PHRED scaled, -10 * log10 of the probability. I didn't realize this was so close to dB!
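The two conversions, as a sketch (function names mine):

```python
import math

def phred(p_error: float) -> float:
    """PHRED quality score from an error probability."""
    return -10 * math.log10(p_error)

def error_prob(q: float) -> float:
    """Error probability from a PHRED quality score."""
    return 10 ** (-q / 10)

print(phred(0.001))    # about 30 (Q30: 1 error in 1000)
print(error_prob(20))  # about 0.01 (Q20: 1 error in 100)
```

For small error probabilities the odds are close to the probability itself, which is why PHRED scores line up so closely with Jaynes's decibels.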
1000 dB of skepticism is watching someone win the lottery ten times in a row, with ten tickets, and saying "impressive trick, but I bet you can't do it an eleventh time".
> The financial crisis that brought the world economy to its knees was caused largely by bad statistics. Analysts assumed that the values of mortgages are Gaussian, and that the tails of mortgage values aren’t more correlated than mortgage values are on average.
Analysts did use models that take into account that the tails are more correlated. Arguably overconfidence on these sophisticated models brought complacency and led to excessive risk taking.
> Despite the fact that these distributions had very similar looking shapes, their products are entirely different. The distributions are shifted so that the center of one is 8 standard deviations from the center of the other
Is this last part correct?
"the distributions" means the distributions of the resulting products, correct?
I might misunderstand, but the centers of all three distributions are identical, at zero, aren't they?
This is just the setting of the example. For each of those distributions he takes two of them which don't overlap much and looks at the convolution (i.e. the distribution of the sum of two independent random variables described by them).
Edit: by the way, the irony is that the convolution calculates products of probabilities and adds them!