A Conceptual Introduction to Hamiltonian Monte Carlo (arxiv.org)
155 points by gwern on Feb 28, 2017 | 33 comments



Radford Neal's expository paper [1] on HMC is a beautiful overview of the area, and honestly one of the most enlightening papers I've ever read. Conceptually, HMC really isn't that difficult. The broad idea is as follows:

Borrowing from statistical physics, it's possible to link energy and probability via P(x) ∝ exp(-E(x)), i.e. E(x) = -log P(x). So you can interpret your MCMC target as an "energy landscape". Highpoints in the landscape correspond to low probabilities in the target distribution, and valleys to high-probability regions.

Now add a source of kinetic energy (an auxiliary probability distribution), and think of your MCMC sampler as a hockey puck sliding frictionlessly through the energy landscape. Since kinetic + potential energy is conserved, it follows that you move through contours of your joint (target, auxiliary) distribution. That means you never have to reject a sample, regardless of how far the hockey puck travels.

This interpretation also shows why you need to use gradient information. To know where the hockey puck is going to move next, you need to know how steep the walls of your energy landscape are: that is, you need the gradient!
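
Here's a rough sketch of what that looks like in code -- just an illustrative leapfrog implementation of the idea above (the function names, step size, and trajectory length are mine, not from the paper), not something tuned for real use:

  import numpy as np

  def hmc_step(q, log_prob, grad_log_prob, step_size=0.1, n_steps=20, rng=np.random):
      # Potential energy is V(q) = -log p(q); kinetic energy comes from an
      # auxiliary Gaussian momentum ("the hockey puck's velocity").
      p = rng.normal(size=q.shape)
      current_H = -log_prob(q) + 0.5 * np.sum(p**2)

      q_new, p_new = q.copy(), p.copy()
      # Leapfrog integration of the Hamiltonian dynamics.
      p_new = p_new + 0.5 * step_size * grad_log_prob(q_new)
      for _ in range(n_steps - 1):
          q_new = q_new + step_size * p_new
          p_new = p_new + step_size * grad_log_prob(q_new)
      q_new = q_new + step_size * p_new
      p_new = p_new + 0.5 * step_size * grad_log_prob(q_new)

      # Metropolis correction for the discretization error of the integrator.
      proposed_H = -log_prob(q_new) + 0.5 * np.sum(p_new**2)
      if rng.uniform() < np.exp(current_H - proposed_H):
          return q_new
      return q

One caveat on "you never have to reject a sample": that holds for the exact continuous-time dynamics, where energy is conserved. The discretized leapfrog trajectory drifts slightly off the energy level set, so in practice there is still an accept/reject step; it just accepts very often when the step size is reasonable.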

[1] https://arxiv.org/pdf/1206.1901.pdf


So V(x) = -log pi(x) and you use HMC to explore those potential energy level sets? But what happens when you change coordinates and get a different density, pi(x), a different potential energy, V(x), and hence different level sets? And why does the gradient help, when it points transverse to the level sets rather than along them, which is the direction you would need to move in?

Maybe the momenta fix that somehow? So do you just add any IID Gaussian variables to the target? Or do you need special variables? Even so, why IID Gaussians?

The problem with the "hockey puck" intuition is that it is incredibly shallow and obfuscates the underlying principles that really make the method work. Unfortunately the false intuition also motivates bad tuning strategies and method extensions which have real practical consequences.

Radford's review was really, really important, but it was written when we didn't yet know what was going on at a foundational level. The linked paper has the benefit of the last five years of research in which we've figured those foundations out, and consequently it can identify the principles important to robust use of the method in practice.


> So V(x) = -log pi(x) and you use HMC to explore those potential energy level sets?

Not quite. You explore the level sets of the augmented system of kinetic-plus-potential energy. The use of a Gaussian distribution as auxiliary variable is arbitrary, so it's not surprising that it's suboptimal.

That gradient information is needed follows from physical intuition. I roll a marble along an inclined plane with velocity V. What determines the trajectory of the marble? The gradient of the plane. Now since differentiable functions look locally like planes, that's more or less all you need, at least instantaneously.

I'm sure you know more about this topic than I do, but you'll have to work quite a bit harder to convince me that the hockey puck intuition is "incredibly shallow".


The point of those rhetorical questions is that HMC is _not_ just about augmenting the system with additional parameters and following the gradients of the larger system.

Even in your marble system, for example, the gradient of the plane does not affect the position of the marble directly. The gradient affects the velocity which then moves the marble. If the interactions aren't just right then you no longer have a physical system with energy conservation and all the other nice properties one might expect.
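
Concretely, this is just the textbook form of Hamiltonian dynamics (nothing specific to this thread): with H(q, p) = V(q) + |p|^2/2,

  dq/dt =  dH/dp =  p          (the momentum moves the position)
  dp/dt = -dH/dq = -dV/dq      (the gradient of the potential changes the momentum)

so the gradient never acts on the position directly, only through the momentum.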

In both the physical system of the marble and the pseudo-physical system created by HMC there is a deeper mathematical structure that relates gradients to motion to probability in exactly the right way. If that structure is compromised, say in a naive implementation of HMC, then all of the magic is lost and you don't get the performance that you might expect.

It may sound like I'm being pedantic here, but it's only by understanding and respecting the underlying math far beneath the "hockey puck" description that we've been able to take HMC to the next level, especially in tools like Stan, with faster and more robust performance as well as sensitive diagnostics of failures.


Yep, I can absolutely agree with that. It's not just the intuition that matters, the deeper details are important too.


+1. As someone who has struggled to understand math for a long time, I have come to appreciate that it's not about finding any "intuitive" description of a system but rather the one that's compatible with the underlying mathematics.


I also like Neal's introduction. I find the discussion about the typical set in the paper linked in the submission more confusing than illuminating. Why wouldn't the typical set include the mode of the distribution, given that its contribution is higher than for any other region of the same volume? Of course the "outside half" of a hypersphere (say 0.5<R<1) will contribute more than the "inside half" (R<0.5) as N grows, but it will also be exponentially larger. It's not clear to me how that defines a neighbourhood (or what one gains from doing so; it seems the neighbourhood could be defined to be the full hypersphere without any loss).


Because probability has to sum to one. The higher you go in dimension, the more that unit probability is distributed across more and more "boxes", until any single box, including the one around the mode, becomes irrelevant. When you are computing expectations you need to focus on those boxes that dominate the contributions to the integrals, which are all in the typical set.

The typical set is a vaguely defined notion, as you cannot define an explicit neighborhood without defining some explicit probability threshold, which is often done in the information theory of discrete systems. The point here, however, is not to define an explicit neighborhood but rather to motivate that whichever neighborhood of high probability mass you would define would not be near the mode and would instead extend out into a compact neighborhood surrounding it.

You are welcome to define a neighborhood as the convex hull of the typical set, which would then include the mode (subject to some technical conditions). That wouldn't add any appreciable probability mass, however, but would introduce a whole bunch of space that your statistical algorithm would have to explore at additional computational cost.


How would that add a whole bunch of space if the only reason why the probability mass is not appreciable is that the volume is minuscule?


Yes, "a whole bunch" is not a technical term. :-p You are correct that including the entire convex hull would not itself be absurdly costly. A naive search (say grid search) would still spend most of its time evaluating points at the boundary of the hypersphere, although the additional evaluations near the mode would strictly add more cost. Similarly, any algorithm targeting probability mass (like MCMC) would avoid the interior of the hypersphere altogether. In practice you don't exclude the mode explicitly, you build an algorithm that targets mass and it ends up inherently avoiding the mode. The important conceptual take aways are that, for example, algorithms that try to utilize information just in the neighborhood around the mode (for example so-called MAP estimators) will never be able to accurately quantify a probability distribution and that algorithms that explore neighborhoods of high probability mass will behave in seemingly-counterintuitive ways.


> although the additional evaluations near the mode would strictly add more cost.

No, one evaluation near the mode is more efficient than one evaluation at the boundary (because the density is higher).

I agree that the region immediately around the mode is naturally "avoided" (because the volume is small). But your paper makes it look like we increase efficiency by explicitly avoiding it and concentrating somewhere else. That's what I found confusing.


"No, one evaluation near the mode is more efficient than one evaluation at the boundary (because the density is higher)."

Incorrect in general. Firstly, one evaluation anywhere does not yield any reasonably accurate estimate of expectations. We'll always need an ensemble of evaluations, in which case the relevant question really is "how should I distribute these evaluations". And for any scalable algorithm none of them will be near the mode.

This is often hard to grok because people implicitly fall back to the example of a Gaussian density, where the mode and the Hessian at the mode fully characterize the density, which can then be used to compute analytic integrals. But for a general target distribution we do not have any of that structure and instead have to consider general computational strategies.

"I agree that the region immediately around the mode is naturally "avoided" (because the volume is small). But your paper makes it look like we increase efficiency by explicitely avoiding it and concentrating somewhere else. That's what I found confusing."

In high-dimensions the probability mass of any well-behaved probability distribution concentrates in (or _around_ if you want to acknowledge the fuzziness) the typical set. Hence accurate estimation of expectations requires quantifying the typical set. Any evaluation outside of the typical set is wasted because it offers increasingly negligible contributions to the integrals -- a few additional evaluations may not drive the cost up appreciably but they will still be wasted.

So it's not that the only thing that matters is avoiding the mode. Rather, what matters is that, contrary to many people's intuition, the neighborhood around the mode does not meaningfully inform expectation values, and so exploring that neighborhood is insufficient for estimating expectations. That then motivates the question of which neighborhoods do matter, which is answered by concentration of measure and the existence of the typical set.

And in practice, we don't actually do any of this explicitly. Instead we construct algorithms that somehow quantify probability mass (MCMC, VB, etc) and they will implicitly avoid the mode and end up working in the typical set.


> We'll always need an ensemble of evaluations, in which case the relevant question really is "how should I distribute these evaluations". And for any scalable algorithm none of them will be near the mode.

Why not distribute them according to the probability distribution? If I sample one million points and one happens to be near the mode, this doesn't mean any additional cost. This is no worse than sampling one million points none of which happens to be near the mode.

> Any evaluation outside of the typical set is wasted because it offers increasingly negligible contributions to the integrals

This is not a reason to exclude the mode from the typical set, because one evaluation at the mode offers a larger contribution to the integrals than one evaluation elsewhere.

> Instead we construct algorithms that somehow quantify probability mass (MCMC, VB, etc) and they will implicitly avoid the mode and end up working in the typical set.

The algorithms don't have to avoid the mode, they just have to sample from it fairly (not much, but more than from any other region of the same size in the rest of the typical set).


May I suggest that you try sampling from a high-dimensional distribution and see how many samples end up near the mode? For example, try a 50-dimensional IID unit Gaussian and check how often r = sqrt(x_1^2 + ... + x_50^2) < 0.25 * sqrt(D). You can also work this out analytically -- it's the classic example of concentration of measure.
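
Something like this, for example (a quick illustrative check with numpy/scipy; the 0.25 threshold is just the one from above):

  import numpy as np
  from scipy import stats

  D, n_samples = 50, 100000
  rng = np.random.default_rng(0)
  r = np.linalg.norm(rng.normal(size=(n_samples, D)), axis=1)

  print((r < 0.25 * np.sqrt(D)).mean())      # empirical fraction near the mode: 0.0
  print(stats.chi2.cdf(0.25**2 * D, df=D))   # analytic probability: roughly 1e-21
  print(r.mean() / np.sqrt(D))               # typical radius is close to sqrt(D)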

By construction, samples from a distribution will concentrate in neighborhoods of high probability mass, and hence in the typical set.


It was you who said that it was better to exclude the mode from the typical set because "the additional evaluations near the mode would strictly add more cost". Now you tell me that evaluations near the mode are extremely unlikely. Something that, believe it or not, I understand. Maybe you would have liked it better if I had talked about one sample in one trillion being near the mode. And in that case that evaluation wouldn't be wasted because its contribution to the computed integral would be more important than if I had sampled by chance another point in the typical set with lower probability density.

Excluding the region with the highest probability density from the typical set is a bit like saying that the population of New York is concentrated outside NYC because most people live elsewhere.


The only illustration that I know of a "typical set" is as follows:

Take a sample from an N-dimensional standard Gaussian. The square of its distance from the origin is x_1^2 + x_2^2 + ... + x_N^2. Each component of this sum has mean 1, so the law of large numbers says that for large N, the sum will be about equal to N.

In other words, I take a sample from a high-dimensional Gaussian and with very high probability it will live a distance of about sqrt(N) from the origin. Which is counterintuitive as you point out, since the most likely value is the 0 vector.
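
To sharpen that slightly (these are just the standard moments of a chi-squared random variable): the sum x_1^2 + ... + x_N^2 has mean N and variance 2N, so

  r^2 = N ± O(sqrt(N))   and hence   r = sqrt(N) ± O(1),

i.e. the samples concentrate in a shell of roughly constant thickness around radius sqrt(N), whose relative width shrinks as N grows.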


Yes, the typical point is far from the origin (but unless there is spherical symmetry the only thing those points have in common is being far from the origin). What do you gain by excluding the origin from the typical set?

Edit: I'm not sure if your comment contradicts or complements what I wrote. As I said, taking a hypersphere of radius 1 the "inner" hypersphere of radius 0.5 accounts for just 1/2^n of the hypervolume. Would you say that the "outer" shell is a neighbourhood? How is that defined? What is the advantage of doing so instead of using the whole (only marginally bigger) hypersphere?


There's some information theory that goes over my head, but I think the point is that you can approximate E[f(x)] with an integral over the typical set and expect good results, for non-pathological functions f.

Defined in this way, you could of course leave the origin in the typical set. But then there would be another, smaller, typical set that would do the job just as well.


Thanks. I'm sure the formalism makes sense if properly developed and can lead to useful approximations. It's just that the "conceptual introduction" in that paper makes no sense to me... I understand it's just a simplification, but simplifying too much removes the essence of things.

> A grid of length N distributed uniformly in a D-dimensional parameter space requires N^D points and hence N^D evaluations of the integrand. Unless N is incredibly large, however, it is unlikely that any of these points will intersect the narrow typical set, and the exponentially-growing cost of averaging over the grid yields worse and worse approximations to expectations.

In fact any point within the hypersphere is as good as (or better than) any point in that narrow typical set not including the mode. The problem is not missing the "narrow typical set", it is missing the whole hypersphere (because the volume of the hypersphere is small). I fail to see where in that introduction the fact that the typical set excludes the mode of the distribution makes any difference.
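
(To be clear about the scale involved in the quoted passage: N = 10 points per dimension in D = 50 dimensions is already 10^50 evaluations, so the uniform grid is hopeless either way; my question is only about which region it is said to be missing.)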


> So you can interpret your MCMC target as an "energy landscape". Highpoints in the landscape correspond to low probabilities in the target distribution, and valleys to high-probability regions.

Thinking of things in terms of energy landscapes was the only way I could understand this stuff during grad school. Plus they had some of the coolest visualizations (e.g., Fig 2 from [1]) that kept me interested.

[1] http://www.pnas.org/content/101/41/14766.full.pdf


Also worth checking out is the author's talk at this year's StanCon:

  https://youtu.be/DJ0c7Bm5Djk?t=16810
Title: Everything you should have learned about Markov Chain Monte Carlo.


There are also talks focusing on HMC available online, the most recent of which are https://www.youtube.com/watch?v=VnNdhsm0rJQ (more introductory) and https://icerm.brown.edu/video_archive/#/play/1107 (more mathematically formal).


Great talk, thanks for the link!


As far as I know, you need to be able to compute the gradient of the function being explored to use this method. Is that correct?

Most of the likelihood functions I use are nasty and complex. I've mostly been using an affine-invariant sampler, emcee, http://dan.iel.fm/emcee/current/, which works well on many problems.


Yes, you need the gradient. And despite what the autodiff people say, that might be hard and/or not very useful in practice, if your likelihood is based on running a lengthy astrophysical simulation.

You can run HMC on any differentiable surrogate surface. It's only the accept/reject step that has to use your real model and data. You might consider that if you have a less hairy approximation to your model.
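
In terms of the rough hmc_step sketch further up the thread (an illustrative helper, not anything from the paper), that amounts to one change: integrate along the cheap surrogate's gradient but keep the true model in the energies used for accept/reject:

  # leapfrog on the surrogate, Metropolis correction against the real model
  q = hmc_step(q,
               log_prob=true_log_posterior,            # expensive, used only in the accept/reject step
               grad_log_prob=surrogate_grad_log_post)  # cheap approximate gradient

The leapfrog map stays reversible and volume-preserving whatever gradient field you feed it, so the correction against the true log posterior keeps the chain exact; a poor surrogate only costs you acceptance rate.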

Emcee is popular in astronomy, partly because of network effects, partly because it's a really nice package, and partly because the method's a great fit for some of the posteriors you have: often unimodal-ish even if the likelihood is nasty to compute, and often in not that many dimensions. It would fall flat on its face for something like a Bayesian neural network.


"And despite what the autodiff people say, that might be hard and/or not very useful in practice, if your likelihood is based on running a lengthy astrophysical simulation."

Only if you treat the simulation as an un-interrogatable black box. All of those astro simulations are ODE or PDE systems of one kind or another which can be very cleanly autodiffed to calculate exact gradients. N-bodies are particularly well-suited to this approach.

"You can run HMC on any differentiable surrogate surface surface. It's only the accept/reject step that has to use your real model and data. You might consider that if you have a less hairy approximation to your model."

You _can_, but it won't work above O(10) dimensions. The problem is that the more dimensions you have, the more any surrogate surface (and its gradients) will drift away from the true surface and the worse your acceptance probabilities will be.

With exact gradients the optimal performance of HMC will drop negligibly with dimension (you need to get up to O(2^100) or so before you really start to see problems), assuming the target distribution itself doesn't get more complex as its dimension increases.


You need the gradient of the log-likelihood in all Hamiltonian MCMC implementations I've seen so far (you don't need it in classical MCMC though). However I'm reasonably sure that the method will still work even if you use the gradient of some approximate log-likelihood (the proof only requires the dynamics to be reversible and preserve volume, although the closer you can get to an exact solution the better so use this information at your own risk).


I don't think this is right - small deviations from the true gradient will result in large deviations along the Hamiltonian path. The magic trick of HMC is that you can safely make large jumps in state space because you are following the right path - doesn't feel safe to break this assumption.

But I'm not an expert on HMC. My intuition comes mainly from my past career as a physicist. So if someone knows why it would be safe I'd love to hear.


This intuition is correct -- symplectic integrators admit highly accurate energy conservation only when you have access to the exact gradients. When the gradients are inexact you get a bias in the integrator that pulls you away from the target energy level set and decreases the acceptance probabilities (or rather the transition probabilities away from the initial point, if you are using a more sophisticated scheme as advocated in the paper).

Note that symplectic integrator trajectories stay near the target energy level set but they need not stay near the true trajectory itself. In general the two will diverge pretty quickly (a manifestation of Hamiltonian chaos) within the level sets, but that's not a problem as we don't really care how they explore the level sets, just that they're staying on them.


Implementations of HMC are approximating ideal Hamiltonian dynamics by discretizing time. The "symplectic" integrators that are used have the property that they correspond to simulating continuous-time Hamiltonian dynamics exactly on some other energy surface, which is not too far from the one you had in mind. So, in some sense, an approximation like the grandparent had in mind is the usual case.

You are right that a bad approximation of the surface could lead to proposals that get rejected a lot. But at least the dynamics are likely to be stable. It could be worth trying.


It will result in changes in total 'energy', but that's fine if you account for this in the acceptance probability.
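
For reference, the standard correction (written in the same notation as the sketch further up the thread): with total energy H(q, p) = -log p(q) + |p|^2/2, a proposal produced by any reversible, volume-preserving integrator is accepted with probability

  min(1, exp(H(q_current, p_current) - H(q_proposed, p_proposed)))

so integration error, including error from inexact gradients, shows up as a lower acceptance rate rather than as bias in the stationary distribution.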


I've already responded regarding the need to calculate the gradient, but after reading up on the affine-invariant sampler I'm wondering whether it would be an idea to use a hybrid setup where you have several chains running alongside each other, as in the affine-invariant sampler, and use those to approximately solve the Hamiltonian dynamics... not sure what would happen, but it removes some of the arbitrariness in the proposal of steps for the affine-invariant method, and it actually uses the already computed values of the distribution, which should (in theory) be better.

Suppose I'll have to try it now.


I have used the NUTS sampler via PyMC. If your function is smooth you are OK, even if it's complex, because both PyMC and, I think, Stan will auto-differentiate your function. PyMC in particular uses Theano under the hood.
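
A minimal sketch of what that looks like in PyMC3 (the model and data here are made up for illustration; pm.sample uses NUTS by default for continuous models):

  import numpy as np
  import pymc3 as pm

  data = np.random.normal(loc=1.0, scale=2.0, size=200)  # toy data

  with pm.Model():
      mu = pm.Normal('mu', mu=0, sd=10)
      sigma = pm.HalfNormal('sigma', sd=5)
      pm.Normal('obs', mu=mu, sd=sigma, observed=data)
      # Theano builds the gradient of the joint log-density automatically,
      # so NUTS gets exact gradients without any hand-written derivatives.
      trace = pm.sample(1000, tune=1000)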



