Variational Inference for Machine Learning [pdf] (shakirm.com)
79 points by alex_hirner on Nov 3, 2016 | 40 comments



Stan and PyMC3 both implement automatic differentiation based variational inference, so you can write down your statistical model and not care "much" about derivatives.

http://mc-stan.org https://github.com/pymc-devs/pymc3
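
For anyone who hasn't tried it, a minimal PyMC3 sketch looks something like this (function and argument names have shifted a bit across PyMC3 releases, so treat it as illustrative rather than exact):

    import numpy as np
    import pymc3 as pm

    # toy data: 1,000 draws from N(2.5, 1)
    data = np.random.randn(1000) + 2.5

    with pm.Model():
        mu = pm.Normal('mu', mu=0.0, sd=10.0)           # prior on the mean
        pm.Normal('obs', mu=mu, sd=1.0, observed=data)  # likelihood

        # ADVI: gradients of the ELBO come from Theano's autodiff,
        # so no derivatives are written by hand.
        approx = pm.fit(n=10000, method='advi')
        trace = approx.sample(1000)   # draws from the fitted approximation

    print(trace['mu'].mean())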


There's also the TensorFlow-based Edward: https://github.com/blei-lab/edward


stan and edward dev here. happy to answer any questions.

(shakir's blog posts are amazing; i recommend them all.)


Very cool, many questions

1) Why create a project distinct from Stan? Was it the prospect of benefiting from all the work going into TF and focusing solely on the sampling procedures rather than autodiff or GPU integration?

2) Are you implementing NUTS?

3) Any plans to implement parallel tempering?

4) Any plans to handle "tall" data using stochastic estimates of the likelihood?


great questions.

1. you touch upon the right strengths of TF; that was certainly one consideration. edward is designed to address two goals that complement stan. the first is to be a platform for inference research: as such, edward is primarily a tool for machine learning researchers. the second is to support a wider class of models than stan (at the cost of not offering a "works out of the box" solution).

our recent whitepaper explains these goals in a bit more detail:

https://arxiv.org/pdf/1610.09787.pdf

2) no immediate plans. but we have HMC and are looking for volunteers :)

3) same answer as above :) should be relatively easy to implement tempering.

4) this is already in the works! stay tuned!


Why TF instead of Theano, as PyMC3 has done? Shouldn't it be straightforward to port PyMC3's algos over to TF?

My main gripe with Theano is that OpenCL support is near non-existent, but this is also the case with TF.


4) which approach are you using? Generalized Poisson Estimator, or estimating the convexity effect of the exponential by looking at the sample variance of the log likelihood? The former is more pure, the latter may be more practical if ugly.


these are great insights.

our first approach is the simplest: stochastic variational inference. consider a likelihood that factorizes over datapoints. stochastic variational inference then computes stochastic gradients of the variational objective function at each iteration by subsampling a "minibatch" of data at random.
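
as a quick sanity check of the rescaling trick (a plain-numpy toy, not how edward implements it): rescaling the minibatch gradient by N/M keeps it unbiased for the full-data gradient.

    import numpy as np

    # toy check that the minibatch gradient, rescaled by N/M, is unbiased
    # for the full-data gradient (here: gaussian log-likelihood w.r.t. its mean)
    np.random.seed(0)
    N, M = 10_000, 100
    x = np.random.randn(N) + 3.0
    theta = 1.0   # current value of the parameter

    full_grad = np.sum(x - theta)
    minibatch_grads = [(N / M) * np.sum(np.random.choice(x, M, replace=False) - theta)
                       for _ in range(2000)]

    print(full_grad, np.mean(minibatch_grads))   # the two should be close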

i reckon the techniques you suggest would work as we move forward!


Edit: ah never mind, variational inference, got it! I was thinking stochastic HMC!

---

Ok, but that gives an unbiased estimate of the log-likelihood. MCMC and HMC do work with noisy estimators, but they require unbiased estimates of the likelihood itself.

At the very least, you need to do a convexity adjustment by measuring the variance inside your minibatch. Or you can use the Poisson technique, which will get you unbiased estimates of exp(x) from unbiased estimates of x (albeit at the cost of introducing a lot of variance).
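
A toy illustration of that convexity adjustment (generic Gaussian noise, not tied to any particular sampler): if Z is an unbiased estimate of log L with variance s^2, then exp(Z) overestimates L by a factor exp(s^2/2).

    import numpy as np

    np.random.seed(0)
    true_loglik = -1234.0     # some fixed log-likelihood value
    noise_sd = 1.0            # std of a noisy but unbiased estimator of it

    z = true_loglik + noise_sd * np.random.randn(100_000)
    print(z.mean())           # ~ -1234: unbiased for log L

    # exp(z) is biased upward for L = exp(true_loglik):
    # E[exp(z)] = L * exp(noise_sd**2 / 2), the Jensen / convexity correction.
    ratio = np.mean(np.exp(z - true_loglik))   # work relative to L to avoid underflow
    print(np.log(ratio), noise_sd**2 / 2)      # both ~ 0.5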


great points; yes, the problem becomes considerably harder with MCMC!


if I can bug you also since you're working with Alp, how does Edward handle ADVI covariance? Is it diagonal or dense or some sparse structure estimated?


you may bug me on this. i work too closely with alp :)

edward does not completely implement advi yet. the piece that is missing is the automated transformation of constrained latent variable spaces to the real coordinate space. however, edward offers much more flexibility in specifying the dependency structure of the posterior approximation. diagonal is, just like in stan, the fastest and easiest. however, introducing structure (e.g. assigning a dense normal to some of the latent variables while assigning a diagonal to others) is much easier in edward.
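
roughly, the kind of thing that is easy to express (class and argument names moved around between edward releases, so this is a sketch of the idea rather than copy-paste code):

    import tensorflow as tf
    import edward as ed
    from edward.models import Normal, MultivariateNormalTriL

    D1, D2 = 5, 20   # say z1 is strongly correlated a posteriori, z2 is not

    # dense (cholesky-parameterized) gaussian approximation for z1
    qz1 = MultivariateNormalTriL(
        loc=tf.Variable(tf.zeros(D1)),
        scale_tril=tf.Variable(tf.eye(D1)))

    # mean-field / diagonal gaussian approximation for z2
    qz2 = Normal(loc=tf.Variable(tf.zeros(D2)),
                 scale=tf.nn.softplus(tf.Variable(tf.zeros(D2))))

    # with a model defining latent variables z1, z2 and data x (omitted here):
    # inference = ed.KLqp({z1: qz1, z2: qz2}, data={x: x_train})
    # inference.run(n_iter=1000)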


OK I would be interested in seeing how to do that. Are there any examples or hints on how to start? I worked a lot with time series models (think nonlinear autoregressive), where there's strong short term autocorrelation, and the coercion to diagonal covariance seemed inappropriate.

I also have a naïve question: why not use the graphical structure of the model itself to add structure to the covariance? For example, in an AR model, each time point places a prior on the next time point, so why not assume a banded covariance? More generally, one could use a cutoff on shortest path length (through the model's graphical structure) between parameters to decide whether they should have nonzero covariance entries.
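
To make that concrete, the sparsity pattern itself would be easy to build (illustrative numpy only, not an existing Edward or Stan feature; strictly speaking it's the precision matrix of an AR model that is banded, but the masking idea is the same):

    import numpy as np

    D = 8           # latent time points in an AR-style model
    bandwidth = 2   # allow coupling only between parameters within 2 steps

    # lower-triangular, banded mask for a Cholesky factor
    i, j = np.indices((D, D))
    mask = (j <= i) & (i - j <= bandwidth)

    L = np.tril(np.random.randn(D, D)) * mask   # banded Cholesky factor
    cov = L @ L.T                               # implied banded covariance
    print(mask.astype(int))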


I came across the examples in the repo and commented on

https://github.com/blei-lab/edward/issues/211

so I'll try to do my homework before asking more questions ;)


Whoa, cool, but the build is failing, tsk tsk.

A big issue I ran into with Stan, even with ADVI, was scaling to large datasets, since it (and Eigen) is single-threaded. Would Edward answer all my prayers?

When is Riemannian HMC going to arrive?


i'm assuming you're referring to building edward? installation is a bit of a pain because tensorflow is not on pypi yet.

please take a look here: http://edwardlib.org/troubleshooting

edward should answer some of your prayers :) there's still some time until stan goes parallel/gpu, though there's lots of interest there.

riemannian hmc is likely just around the corner!


I was (shallowly) referring to the Travis CI badge on the GitHub page.

I've been working with both Stan & PyMC3 on some large datasets and will definitely try Edward on them.


ah ok. i agree. we're working on that. :)

give it a shot and let us know!


What's the impediment to porting these variational algorithms over to a Turing-complete probabilistic programming language?


Stan is Turing complete or you mean.. inception??


So I can write a Church to Stan compiler, in principle?


Yes, you would either generate Stan code or rewrite your AST into something the Stan compiler understands.

However, if Church allows you to express non-differentiable models (e.g. models containing a Heaviside function), then those models will either fail or not work well with HMC or ADVI (the variational inference algorithm Stan uses), because both assume that the gradient of the posterior can be computed and is informative about the posterior.


> Many samples needed, especially in high dimensions

This isn't true. For Monte Carlo sampling, the convergence of unbiased estimators (for example the expectation) is independent of the dimension of the state space. In fact, this is exactly the reason to prefer Monte Carlo integration over, say, a Riemann sum.


That's true asymptotically, but the dimension finds a way to rear its head in the constant of the big O, through the variance of your estimator. It's still better than Riemann integration, but it sucks nonetheless.

For instance, integrate the function f(x1, x2, ..., xd) = 6^d * x1 * (1-x1) * ... * xd * (1-xd) over the d-dimensional unit cube (the answer is 1 for all d). The variance of a single-point estimator is (6/5)^d - 1, which increases exponentially with the number of dimensions. That's the multiplicative constant in your big O.

For a probabilistic example let x_i ~ N(0,1) for i in 1..d and let s = Sum x_i. Try estimating the probability that -0.1 < s < 0.1 by Monte-Carlo sampling.
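
That first example is easy to check numerically (a quick numpy toy):

    import numpy as np

    np.random.seed(0)
    n = 100_000
    for d in (1, 5, 10, 20):
        x = np.random.rand(n, d)
        f = 6.0 ** d * np.prod(x * (1 - x), axis=1)   # true integral is 1 for every d
        # sample std of a single-point estimate vs. the theoretical sqrt((6/5)**d - 1)
        print(d, f.mean(), f.std(), np.sqrt((6 / 5) ** d - 1))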

If most of the probability mass is located near a lower-dimensional manifold, as tends to be the case for natural data, your variance will be huge.

MCMC and HMC both improve this state of affairs by letting you walk or "glide" along the manifold, but you still have to contend with curvature, multiple modes, etc.


What course did you study to get this knowledge?


AlexCoventry is right, these are "the mathematics of Bayesian inference". To unpack that a bit, my recommendation is:

- Get a strong grip on linear algebra and Euclidean spaces. You should have an intuitive feel for the key theorems (e.g. Cayley-Hamilton, the spectral theorem) and should be comfortable proving them (at least once). The point isn't that you need to check that they're correct (they are), but that if you can prove them, you've learned a certain amount of prerequisite knowledge and gained enough familiarity with the topic that they aren't just theorems you apply: they make sense and you understand the intuition behind them.

- Get the same feel for multivariate calculus. The intuition there is generally easier to acquire than for linear algebra, but you need to be comfortable with the mechanics of it. Learn to prove your results rigorously, but also to quickly derive formulas by treating infinitesimals as variables, like a physicist.

- Study integration, measures, distributions and the fundamentals of probability. If the course talks about sigma-algebras, you're in the right place. Finally, study Bayesian statistics, Monte Carlo integration, Markov chain Monte Carlo, and some information theory.

You don't need that level of rigor or fundamentals to do machine learning. Some linear algebra, some calculus and some probability theory that you pick up along the way will generally do. However, if you're interested in ML, I think the effort is worth it, because it will make most of the math seamless. It's a lot of math to learn, but it's not particularly "advanced" math: the underlying intuitions are relatively concrete and a lot of the procedures are relatively mechanical.


Thanks for the wonderful reply. How long do you estimate it took for you to get to the level you are now?


That's hard to answer because there wasn't a precise point where I started, it's not the only thing I've been studying in my life, and I've studied a lot of this while working full-time jobs that didn't really require this knowledge (constructive procrastination ftw). I started getting interested in sequential Monte Carlo methods around 2008 while working at a hedge fund, and I think I had a pretty solid grasp by 2011. But I started with solid math fundamentals, so picking it up wasn't too troublesome.

I think a talented high-schooler could learn this topic in three to four years by studying it (and nothing else) intensively. I think it tends to happen more organically in general. You become proficient along the way, it's more of a lifelong thing. I think every piece you'll learn will be valuable and useful on its own.


Any course on the mathematics of Bayesian inference will introduce you to the necessary tools.


How does MCMC improve this state of affairs? From what I can see, MCMC only worsens things since you can't be sure you are sampling the entire manifold.


That's a good question. If your distribution isn't multimodal but still suffers from having its mass concentrated in one place, and if the variable you're integrating isn't fat-tailed, then MCMC will help you by sticking around the right area once it finds it.


I don't even think that's true. Consider a distribution in R^2 that is concentrated on [0,1] x [0,100].

If you have an O(1) jump size it'll take time O(100^2) for the MCMC to fully sample the support of this distribution. If you had a jump size of O(100), then you'll be rejecting 99% of your jumps due to the narrowness in the x-direction.

This is hardly an uncommon scenario.
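
Here's a toy version of that trade-off (random-walk Metropolis targeting the uniform density on [0,1] x [0,100]; illustrative only):

    import numpy as np

    def rw_metropolis(step, n_steps=20_000, seed=0):
        """Random-walk Metropolis targeting the uniform density on [0,1] x [0,100]."""
        rng = np.random.default_rng(seed)
        x = np.array([0.5, 50.0])
        accepted, ys = 0, []
        for _ in range(n_steps):
            prop = x + step * rng.normal(size=2)
            # uniform target: accept iff the proposal stays inside the box
            if 0.0 <= prop[0] <= 1.0 and 0.0 <= prop[1] <= 100.0:
                x = prop
                accepted += 1
            ys.append(x[1])
        ys = np.array(ys)
        return accepted / n_steps, ys.std()   # true std of the long axis is ~28.9

    # small steps: high acceptance but slow exploration of the long axis;
    # large steps: most proposals rejected because of the narrow axis
    for step in (1.0, 30.0):
        print(step, rw_metropolis(step))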


It's easy to design examples that trip up MCMC, but they also trip up straight Monte Carlo integration even more.

You can, of course, construct examples where MCMC will do worse than simple Monte Carlo integration, but these are uncommon. They mostly illustrate the difficulty of picking an appropriate jump size.


If you directly draw samples from the underlying distribution, you don't have to worry at all about whether the Markov chain or other process (e.g. HMC) is fully sampling the distribution.

MC has no rejections and always samples the entire distribution. You simply don't need to worry about trajectories not going everywhere they should.


This is true, but in my experience sampling can take longer with higher-dimensional models, even with the same number of samples. So from an efficiency point of view, VI is still great compared to sampling.


murbard2 and glial have pointed out some caveats. Another thing that ties down Monte Carlo sampling at times is that it is not always easy to generate the samples required. A common route is to use a Markov chain sampler, hence MCMC, but that's far from a solved problem. It's very hard to reason about and test whether the Markov chain mixes fast, and if it does not, it can take exponential time to generate samples from its stationary distribution. Coming up with schemes better than a naïve Markov chain is an intellectual industry in itself.


HMC has better asymptotics than random-walk MCMC; if your probability surface is differentiable, you should always use it. Riemannian MC takes curvature into account, but you need to start storing Hessians, which isn't practical. Two missing pieces would be:

- An SGD equivalent for sampling "tall" data. The generalized Poisson sampler can do that, but the variance is crappy.

- An L-BFGS like version of Riemannian MC, to get rid of the Hessians.

Parallel tempering should also be exploited more.

That said, at the end of the day, the ideal sampler would be able to reason about the distribution as a program and not just as a mathematical black box. It should build tractable approximations intelligently and use them to bootstrap exact sampling schemes.

I think we're going to see a wealth of better samplers come out in the next decade, following the path of combinatorial optimization towards preserving the structure of the programs.
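
For reference, a bare-bones leapfrog HMC step looks like this (a toy sketch, not Stan's implementation):

    import numpy as np

    def hmc_step(q, log_prob, grad_log_prob, step_size=0.1, n_leapfrog=20, rng=None):
        """One HMC transition with a unit mass matrix and a fixed path length."""
        if rng is None:
            rng = np.random.default_rng()
        p = rng.normal(size=q.shape)                 # resample momentum
        q_new, p_new = q.copy(), p.copy()

        # leapfrog integration of the Hamiltonian dynamics
        p_new += 0.5 * step_size * grad_log_prob(q_new)
        for _ in range(n_leapfrog - 1):
            q_new += step_size * p_new
            p_new += step_size * grad_log_prob(q_new)
        q_new += step_size * p_new
        p_new += 0.5 * step_size * grad_log_prob(q_new)

        # Metropolis accept/reject on the total energy
        h_cur = -log_prob(q) + 0.5 * p @ p
        h_prop = -log_prob(q_new) + 0.5 * p_new @ p_new
        return q_new if np.log(rng.uniform()) < h_cur - h_prop else q

    # example: sample a correlated 2-D Gaussian
    prec = np.linalg.inv(np.array([[1.0, 0.9], [0.9, 1.0]]))
    log_prob = lambda q: -0.5 * q @ prec @ q
    grad_log_prob = lambda q: -prec @ q

    rng = np.random.default_rng(0)
    q, samples = np.zeros(2), []
    for _ in range(2000):
        q = hmc_step(q, log_prob, grad_log_prob, rng=rng)
        samples.append(q)
    print(np.cov(np.array(samples).T))   # should be close to [[1, .9], [.9, 1]]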


> you need to start storing Hessians

aren't there methods for online sparse estimates of the Hessian?

I'd expect a lot of "large" problems for which RMC is useful would have sparse structures.

> ideal sampler would be able to reason about the distribution as program

Are you a developer of Stan? If not, you might be interested.

http://mc-stan.org

_edit_: by an online estimate of the Hessian, I meant an online numerical approximation based on the sequence of Jacobians.


"An L-BFGS like version of Riemannian MC, to get rid of the Hessians." - that's what he said


I'm not a developer of Stan, but I know many of them and I'm a fan of the project in general.




