Logs, Tails, Long Tails (2013) (moultano.wordpress.com)
214 points by moultano on Oct 22, 2020 | 59 comments



I don't want to discount the usefulness of log probabilities. But this post gets off to a poor start.

> Probabilities are very rarely added together, and probability distributions even more rarely.

Umm, we _often_ add probabilities when we want to know the probability that any one of a finite number of mutually exclusive outcomes will happen.

We often add distributions (perhaps with weights) when we think about mixture models, or when there are multiple mechanisms which can both give rise to the same outcomes.

A point the author didn't make about log probabilities which is likely relevant to this audience is that we often think about the product of many probability terms when modeling a complex outcome of many steps (e.g. a language model where words are 'drawn' from a distribution, and so the probability of a sentence is a product of many terms corresponding to many generative steps). In these cases log probabilities are often also important for computations to avoid underflow issues. I.e. when you have to reason about large, complex, composed structures generated from stochastic processes, _everything_ can be astoundingly unlikely in absolute terms, and log probabilities let us keep track of the important differences between astoundingly unlikely things.
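To make the underflow point concrete, here's a tiny sketch (toy numbers, not tied to any particular model): the naive product of a thousand per-token probabilities underflows to exactly 0.0 in float64, while the sum of logs still distinguishes the two sequences.

    import numpy as np

    rng = np.random.default_rng(0)

    # Invented "per-token" probabilities for two candidate sequences of 1000 tokens.
    probs_a = rng.uniform(0.05, 0.15, size=1000)
    probs_b = rng.uniform(0.10, 0.20, size=1000)

    # The naive products both underflow to exactly 0.0 in float64 ...
    print(np.prod(probs_a), np.prod(probs_b))              # 0.0 0.0

    # ... but the log probabilities keep the (large) difference between them visible.
    print(np.sum(np.log(probs_a)), np.sum(np.log(probs_b)))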


Log-probabilities are added so much that there are often special-purpose functions for adding them, e.g. https://numpy.org/doc/stable/reference/generated/numpy.logad...
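For what it's worth, a tiny demo of why such a function is needed (using np.logaddexp): these log probabilities correspond to numbers that underflow to 0.0 the moment you exponentiate them, yet their sum is still computable entirely in log space.

    import numpy as np

    # Probabilities exp(-2000) and exp(-2001) underflow to 0.0 if you exponentiate ...
    log_p, log_q = -2000.0, -2001.0
    print(np.exp(log_p) + np.exp(log_q))     # 0.0

    # ... but log(exp(log_p) + exp(log_q)) is fine in log space.
    print(np.logaddexp(log_p, log_q))        # about -1999.69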


Sure, but mixture models and mutually exclusive events are less common to work with than just "models." Obviously there are times when you add probabilities together, but usually you don't. Even when you're writing a mixture model, or doing Bayes nets, most of the work is done with log probabilities in the various branches, which are only transformed back into linear space when the branches are combined.


Mixture models are surprisingly common, and addition/subtraction are not limited to mutually exclusive events. In such cases one goes the route of the inclusion/exclusion formula.

Summation is so fundamental to probabilities that it is how probabilities are even defined axiomatically in the Kolmogorov system. https://en.wikipedia.org/wiki/Sigma_additivity

I think what you meant was that in computation one often needs to multiply probabilities and in such cases it is helpful to work in the log semiring https://en.wikipedia.org/wiki/Log_semiring
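To make the semiring point concrete, here's a minimal sketch (toy numbers): the semiring's "multiplication" is ordinary addition of logs, its "addition" is logaddexp, and a two-component mixture computed that way matches the linear-space answer.

    import numpy as np

    def otimes(a, b):            # log-space product: log(exp(a) * exp(b))
        return a + b

    def oplus(a, b):             # log-space sum: log(exp(a) + exp(b))
        return np.logaddexp(a, b)

    # Mixture p(x) = 0.3 * 0.02 + 0.7 * 0.05, done entirely in log space.
    log_w = np.log([0.3, 0.7])
    log_px = np.log([0.02, 0.05])
    log_mix = oplus(otimes(log_w[0], log_px[0]), otimes(log_w[1], log_px[1]))
    print(np.exp(log_mix), 0.3 * 0.02 + 0.7 * 0.05)   # both 0.041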


> Obviously there are times when you add probabilities together, but usually you don't.

Maybe you don’t but it’s hardly unusual. Integrating out nuisance parameters from a posterior distribution is adding probabilities. Monte Carlo methods are based on adding probabilities.


>Sure, but mixture models and mutually exclusive events are less common to work with than just "models."

What do you mean by this? A mixture model is just a model.

Also, how can you do absolutely anything without "working with mutually exclusive events"? You need mutual exclusivity to support the concept of a decision or a hypothesis.


Other fun stuff you can do:

- Easy weighting for observations (just put a weight in front of the corresponding term in the sum)

- Identify and isolate sufficient statistics (by breaking up the summation until you've got terms that only contain data, no parameters)

- Just add a quadratic term and pretend it's kinda Gaussian now (esp. fun if you pretend one of your parameters is generated with a Gaussian process). (A small sketch of this one follows after the list.)

- Take the expected value of it w.r.t. some other distribution (Cross entropy / Variational Bayes)

- Pretend the negative log probability is a Hamiltonian and add some momentum + kinetic energy terms (Hamiltonian MC)

- Same but skip the kinetic bit and pretend it's a thermodynamic ensemble (information field theory)

- Same but perform a Legendre transformation to get the Effective action / Free energy (ends up being the same as the KL-divergence)
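One hedged reading of the quadratic-term item above, as a sketch (all numbers invented, known-variance Gaussian likelihood assumed): adding a Gaussian log prior to a log likelihood literally just adds a quadratic penalty to the sum, and the MAP estimate comes out shrunk toward the prior mean.

    import numpy as np
    from scipy.optimize import minimize_scalar

    x = np.array([1.9, 2.3, 2.1, 2.6, 1.8])   # invented observations
    sigma, prior_sd = 0.5, 1.0

    def neg_log_posterior(m):
        # Gaussian log likelihood plus a quadratic (Gaussian prior) term in m,
        # dropping constants that don't depend on m.
        log_lik = -0.5 * np.sum((x - m) ** 2) / sigma**2
        log_prior = -0.5 * m**2 / prior_sd**2
        return -(log_lik + log_prior)

    print(np.mean(x), minimize_scalar(neg_log_posterior).x)   # MAP is pulled toward 0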


Shameless plug: I wrote some high quality (I hope) notes on HMC, because I couldn't find an explanation that (a) was rigorous, (b) explained the different points of view of MCMC, and (c) was visual! Here are the notes: https://github.com/bollu/notes/blob/master/mcmc/report.pdf

Feedback is much appreciated.


Looks cool. Also, I wish that when I started university somebody had forced me to start a "notes" repo like yours. Looks great and very smart to have.


Your code contains a small bug: in the leapfrog HMC function, the line pnext = -p should be pnext = -pnext. That said, this line is only of theoretical importance; it has no effect on the final result (unless you use a weird asymmetric kinetic energy).
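For readers without the notes open, this isn't the code from the PDF, just a generic leapfrog proposal in the usual Neal-style presentation, showing where that final momentum negation lives:

    import numpy as np

    def leapfrog_proposal(q, p, grad_neg_log_prob, eps=0.1, n_steps=20):
        """One HMC proposal: leapfrog integration followed by a momentum flip."""
        q, p = np.array(q, dtype=float), np.array(p, dtype=float)
        p -= 0.5 * eps * grad_neg_log_prob(q)        # half step for momentum
        for _ in range(n_steps - 1):
            q += eps * p                             # full step for position
            p -= eps * grad_neg_log_prob(q)          # full step for momentum
        q += eps * p                                 # last full step for position
        p -= 0.5 * eps * grad_neg_log_prob(q)        # final half step for momentum
        # Negate the *updated* momentum (the "pnext = -pnext" above) so the proposal
        # is its own inverse; with a symmetric kinetic energy K(p) = K(-p) this has
        # no effect on the accept/reject probability, which is why the bug was invisible.
        return q, -p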


Nice, thank you for the bug report. Fixed!


Oh man, these are great notes, very succinct! I'm not much into proofs, but finally picked up the essence of Hamiltonian M-H thanks to this. (Also, I can't imagine what kind of genius it takes to rigorously derive such a method...)


Minor correction: it's "detailed balance" (not "detail balance").


Thank you, Fixed!


I love your real-world usage and, dare I say, "tricks"?

If I want to learn more about "applied statistics", what's the best source you would recommend that isn't dry as a bone?


Well, my method so far has been to at least be familiar with some of the theoretical stuff and then try to figure out how to apply it in depth when the opportunity to do so arises. Not sure if there are references where you can directly jump to the applied bit.


Would you mind saying some more about these topics? Or provide some references? I’d like to learn a bit more on all of these!


Most of those things are stuff you'll encounter if you read up on Hamiltonian MC and information field theory, but it might take some additional reading to get all the required background knowledge.

The first two are just tricks really. For the weighted estimates you just go from:

    log(P) = \sum log(P_i)
to

    log(P) = \sum w_i log(P_i)
and for the sufficient statistics you get derivations like:

    -log(P) = \sum_i (1/2) (x_i - m)^2
            = \sum_i ((1/2) x_i^2 - x_i m + (1/2) m^2)
            = (\sum_i (1/2) x_i^2) - (\sum_i x_i) m + (n/2) m^2
to show that (\sum_i (1/2) x_i^2) and (\sum_i x_i) fully determine the posterior distribution.
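A quick numeric check of that claim, sketched in numpy (unit variance assumed, constants dropped): once you've computed the two sums, you can recover the negative log likelihood at any m without touching the raw data again.

    import numpy as np

    x = np.random.default_rng(1).normal(3.0, 1.0, size=1000)
    n, s1, s2 = len(x), np.sum(x), np.sum(0.5 * x**2)    # the sufficient statistics

    def neg_log_lik_full(m):                 # uses all the data
        return np.sum(0.5 * (x - m) ** 2)

    def neg_log_lik_suff(m):                 # uses only n, s1, s2
        return s2 - s1 * m + 0.5 * n * m**2

    for m in (0.0, 2.5, 3.1):
        print(np.allclose(neg_log_lik_full(m), neg_log_lik_suff(m)))   # True each time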


I wonder how much he knows; interpreting it as a thermodynamic ensemble is just going back to thinking about probabilities, really. Information field theory is something very different. The cheesy explanation for it is that it is just quantum field theory in imaginary time.


>it is just quantum field theory in imaginary time.

Quantum field theory is just thermodynamics in imaginary time, not sure why you object to calling it a thermodynamic ensemble. Besides, I could hardly introduce a subject as 'quantum field theory in imaginary time' now could I?


Interesting article, but with some factual imprecisions:

1. A bit nit-picky, but the Beta distribution has support [0, 1], so the distribution in the graph is not a Beta. It's probably a shifted and scaled Beta.

2. The probability density can be arbitrarily high, which means that the log of the density can't have an upper bound at 0. Probability is bounded by 1, but probability density isn't. The fact that none of the example distributions passes zero is a mere coincidence.

I'm surprised to see this beginner's mistake in an otherwise insightful article.
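To illustrate the second point, a tiny example using scipy.stats: a narrow Gaussian has a density far above 1 (and hence a positive log density), and a Beta(0.5, 0.5) blows up near the edges of its support.

    import numpy as np
    from scipy.stats import norm, beta

    print(norm(loc=0, scale=0.01).pdf(0))           # ~39.9, well above 1
    print(np.log(norm(loc=0, scale=0.01).pdf(0)))   # ~3.7, a positive log density
    print(beta(0.5, 0.5).pdf(0.001))                # ~10.1, and unbounded near 0 and 1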


> the log of the density can't have an upper bound at 0

I don't think he claims that. He even shows charts where positive log-densities appear.


It's possible that I misunderstood the text but what's the alternative interpretation of "The log function asymptotes at 0"? Also note that the text says log probabilities when it actually refers to log densities.


The log function has a vertical asymptote as x approaches 0.


Ah, that's likely the interpretation that the author had in mind. But it didn't help that he wrote log probability when he actually talked about log probability density.


If the author himself tells you so, it’s more than likely. :-)


When x goes to minus infinity log(x) approaches zero.


No, the logarithm is only defined for positive values. Plus probability densities are strictly non-negative. https://en.m.wikipedia.org/wiki/Logarithm


You're absolutely right! I meant it the other way around:

As x approaches 0, log(x) goes to minus infinity.

I think that's what he means (the next sentence about "visualiz[ing] very small values of a function by expanding their range" makes sense in that context), but I agree it is not very clear.


More on this subject is in the book Statistical Consequences of Fat Tails [1].

[1] https://arxiv.org/abs/2001.10488


I really wish Taleb would employ a different editor. His insights – the ones relayed to me by other people – are brilliant, but I honestly have not been able to get through his writing to them.


Hey! Just wanted to say thanks for posting this, I wasn't aware of it but am excited to read it -- I've been craving a more technical treatment of Taleb's subject matter.


What kind of operation happens when you pointwise multiply two probability density functions? The result wouldn't even integrate to 1.0 any more (it can integrate to essentially 0.0, as in example 4), unlike the convolution, which corresponds to adding two independent random variables with said density functions.


Pointwise multiplication of, say, two density functions gives the likelihood function of two independent random variables with these densities. This function is not a density function, as you say, but is still an important object, as it is used for inference based on the likelihood principle.


But then the term would be P(x1) P(x2), and the correct summation/integration to use would be over the Cartesian product of the sample space. Under such summation/integration it's a valid density.


I'm sorry, but I'm lost, are we talking about this likelihood function? I'm guessing no, since that's an object defined in a statistical space, not in a probabilistic space:

https://en.m.wikipedia.org/wiki/Likelihood_function


Oftentimes, a “likelihood function” is generalized from that definition to refer to any non-normalized function which is still treated like a probability distribution. Any function with a finite integral over the real line can be turned into a probability distribution easily, and there are other functions that can be used in places where you would normally use a probability function, even when they don’t integrate to a finite value over the real line. (For example, improper priors in Bayesian analysis.)

“Function that can in some contexts be treated like a probability distribution” is just too long.


Yes, that's the right likelihood function. To get probability distributions out of the products in the post, all you have to do is normalize them. That would be equivalent to applying Bayes' rule and combining independent evidence about a parameter with an uninformative prior. Normalizing doesn't change the shapes, so I left out the details.
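A sketch of that normalization step on a grid (toy densities, not the ones from the post): multiply pointwise, divide by the numerical integral, and the result integrates to 1 with the same shape as the raw product.

    import numpy as np
    from scipy.stats import norm

    x = np.linspace(-10, 10, 2001)
    dx = x[1] - x[0]
    p1 = norm(loc=-1.0, scale=1.0).pdf(x)        # two toy "evidence" densities
    p2 = norm(loc=2.0, scale=1.5).pdf(x)

    product = p1 * p2                            # pointwise product: not a density ...
    posterior = product / (product.sum() * dx)   # ... until you normalize it

    print(product.sum() * dx)                    # well below 1 here
    print(posterior.sum() * dx)                  # 1.0 (up to grid error)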


Joint probability of independent events, especially the common case of independent, identically distributed data. (Or exchangeable observations, in the Bayesian framework).

For example, flipping a coin n times, observing heights of people in a population, etc.


But we're talking about "continuous variables", random variables with infinite support for which P(X=a)=0, for all a.

It's kind of like just a single point in the convolution for "X_1 - X_2", corresponding to "X_1 - X_2 = 0" ?


I'm not sure I understand what you are trying to convey.

If we have two continuous random variables in a probability space, we have a mapping P(X, Y): R^2 --> R that integrates to 1.

If X and Y are independent, then P(X, Y) == P(X)P(Y). The examples in the linked article are a bit confusing because some of them have P(X) and P(Y) as different distributions. But the plots only show the 2D density along the 1D line X==Y, which isn't a terribly common thing to look at in actual practice.


Integrating P(X,Y) along the diagonals given by x+y=z, as z ranges over all real numbers, is exactly the convolution of P(X) and P(Y). This gives you the distribution of the variable Z=X+Y.

So, we're talking about the same thing, namely that multiplying two probability density functions pointwise roughly corresponds to a single point where X==Y (the same as X-Y==0), but otherwise doesn't seem terribly useful.


Wouldn't you multiply like you do for functions, as an inner product in a Hilbert space?

Something like: $ f \cdot g = \int_{-\infty}^{\infty} f(x)g(x)\,dx $

No?

The Schwarz inequality gives us that the product then has a value between 0 and 1. ...Not sure what it means if the value is less than one. What does orthogonality mean when it comes to probability density functions?


Usually orthogonal means uncorrelated random variables (not necessarily independent). A good example is sine and cosine - they are uncorrelated in the interval 0 to pi. But clearly they are not independent.


Sine and cosine can be negative. Density functions cannot, and thus orthogonality of the density functions implies that their supports essentially don't overlap; said otherwise, the measure of the intersection of their supports is 0.
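A small numeric illustration with toy uniform densities and crude grid integration: overlapping supports give a positive inner product, disjoint supports give zero.

    import numpy as np
    from scipy.stats import uniform

    x = np.linspace(0, 10, 10001)
    dx = x[1] - x[0]

    f = uniform(loc=0, scale=4).pdf(x)      # supported on [0, 4]
    g = uniform(loc=2, scale=4).pdf(x)      # supported on [2, 6], overlaps f
    h = uniform(loc=6, scale=4).pdf(x)      # supported on [6, 10], disjoint from f

    print(np.sum(f * g) * dx)               # ~0.125: overlapping supports
    print(np.sum(f * h) * dx)               # 0.0: "orthogonal" densities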


A scalar product doesn't give you a function or density, just a number (which can be important, but this doesn't address the question).


You might be interested in the idea of a copula and Sklar's theorem

https://en.wikipedia.org/wiki/Copula_(probability_theory)#Sk...


Along these lines, in Probability Theory: The Logic of Science, Jaynes suggests thinking in decibels (which are essentially scaled log odds). Here are some excerpts:

"... it is very cogent to give evidence in decibels. When probabilities approach one or zero, our intuition doesn't work very well. Does the difference between the probability of 0.999 and 0.9999 mean a great deal to you? It certainly doesn't to the writer. But after living with this for only a short while, the difference between evidence of plus 30 db and plus 40 db does have a clear meaning to us. It's now in a scale which our minds comprehend naturally... In the original acoustical applications, it was introduced so that a 1 db change perceptible to our ears. With a little familiarity and a little introspection, we think that the reader will agree that a 1 db change in evidence is about the smallest increment of plausibility that is perceptible to our intuition."

"... What probability would you assign to the hypothesis that Mr Smith has perfect extrasensory perception? ... To say zero is too dogmatic. According to our theory, this means that we are never going to allow [our] mind to be changed by any amount of evidence, and we don't really want that. But where is our strength of belief in a proposition like this?

"... We have an intuitive feeling for plausibility only when it's not too far from 0db. We get fairly definite feelings that something is more than likely to be so or less likely to be so. So the trick is to imagine an experiment. How much evidence would it take to bring your state of belief up to the place where you felt very perplexed and unsure about it? Not to the place where you believed it -- that would overshoot the mark, and again we'd lose our resolving power. How much evidence would it take to bring you just up to the point where you were beginning to consider the possibility seriously?

"So, we consider Mr Smith, who says he has ESP, and we will write down some numbers from one to ten on a piece of paper and ask him to guess which numbers we've written down. We'll take the usual precautions to make sure against other ways of finding out. If he guesses the first number correctly, of course we will all say 'you're a very lucky person, but I don't believe you have ESP'. And if he guesses two numbers correctly, we'll still say 'you're a very lucky person, but I still don't believe you have ESP'. By the time he's guessed four numbers correctly -- well, I still wouldn't believe it. So my state of belief is certainly lower than -40 db.

"How many numbers would he have to guess correctly before you would really seriously consider the hypothesis that he has extrasensory perception? In my own case, I think somewhere around ten. My personal state of belief is, therefore, about -100 db. You could talk me into a +-10db change, and perhaps as much as +-30 db, but not much more than that."


Bringing it closer to home for many software developers, when we talk about the reliability of a system, we also use a variation on log probability (of success), but we call it "nines:" 99% reliability is two nines, 99.99% is four nines, etc. So nines(x) = -log_10 (1-x)
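A quick sanity check of that formula (outputs are approximate due to floating point):

    import math

    def nines(x):
        """Nines of reliability: 99% -> 2, 99.99% -> 4."""
        return -math.log10(1 - x)

    print(nines(0.99), nines(0.9999), nines(0.999999))   # ~2, ~4, ~6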


This is a great insight.

The bioinformatics community often uses error probabilities that are PHRED scaled, -10 * log10 of the probability. I didn't realize this was so close to dB!


I really like this. Are there upper or lower bounds to belief decibels? What does 1000 db of evidence or belief "sound" like vs 100 db?


1000 dB of skepticism is watching someone win the lottery ten times in a row, with ten tickets, and saying "impressive trick, but I bet you can't do it an eleventh time".


This takes my phrase 'they have signal' to a whole new level. Wow. Remarkable thoughts by Jaynes. Thank you for sharing this.


> The financial crisis that brought the world economy to its knees was caused largely by bad statistics. Analysts assumed that the values of mortgages are Gaussian, and that the tails of mortgage values aren’t more correlated than mortgage values are on average.

Analysts did use models that take into account that the tails are more correlated. Arguably, overconfidence in these sophisticated models bred complacency and led to excessive risk-taking.


  Despite the fact that these distributions had very similar 
  looking shapes, their products are entirely different. The 
  distributions are shifted so that the center of one is 8 
  standard deviations from the center of the other
Is this last part correct?

"the distributions" means the distributions of the resulting products, correct?

I might misunderstand, but the centers of all three distributions are identical, at zero, aren't they?


This is just the setting of the example. For each of those distributions he takes two of them which don't overlap much and looks at the convolution (i.e. the distribution of the sum of two independent random variables described by them).

Edit: by the way, the irony is that the convolution calculates products of probabilities and adds them!
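Concretely, for densities discretized on a grid the convolution is exactly that multiply-then-add pattern. A toy sketch (unit Gaussians 8 standard deviations apart, echoing the setup but not the post's actual distributions):

    import numpy as np
    from scipy.stats import norm

    x = np.linspace(-10, 10, 2001)
    dx = x[1] - x[0]
    p = norm(loc=-4, scale=1).pdf(x)        # two barely-overlapping densities
    q = norm(loc=4, scale=1).pdf(x)

    conv = np.convolve(p, q) * dx           # density of X + Y, on a wider grid
    print(conv.sum() * dx)                  # ~1: the convolution is again a density
    print((p * q).sum() * dx)               # ~3e-8: the pointwise product nearly vanishes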


Oh! Pre-combination! Now I understand; thank you!


I remember from either Linear Systems or Discrete Math (I forget which) that the negative of the log probability is the information content of an event.


Yes! Also called the surprisal.
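In bits, for instance:

    import math

    def surprisal_bits(p):
        """Information content (surprisal) of an event with probability p, in bits."""
        return -math.log2(p)

    print(surprisal_bits(0.5))      # 1.0 bit: a fair coin flip
    print(surprisal_bits(1 / 8))    # 3.0 bits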



