Stan and PyMC3 both implement automatic differentiation based variational infere...

AlexCoventry · on Nov 4, 2016

There's also the TensorFlow-based Edward: https://github.com/blei-lab/edward

proditus · on Nov 4, 2016

stan and edward dev here. happy to answer any questions.

(shakir's blog posts are amazing; i recommend them all.)

murbard2 · on Nov 4, 2016

Very cool, many questions

1) Why create a project distinct from Stan? Was it the prospect of benefiting from all the work going into TF, so you could focus solely on the sampling procedures rather than autodiff or GPU integration?

2) Are you implementing NUTS?

3) Any plans to implement parallel tempering?

4) Any plans to handle "tall" data using stochastic estimates of the likelihood?

proditus · on Nov 4, 2016

great questions.

1) you touch upon the right strengths of TF; that was certainly one consideration. edward is designed to address two goals that complement stan. the first is to be a platform for inference research: as such, edward is primarily a tool for machine learning researchers. the second is to support a wider class of models than stan (at the cost of not offering a "works out of the box" solution).

our recent whitepaper explains these goals in a bit more detail:

https://arxiv.org/pdf/1610.09787.pdf

2) no immediate plans. but we have HMC and are looking for volunteers :)

3) same answer as above :) should be relatively easy to implement tempering.

4) this is already in the works! stay tuned!

marmaduke · on Nov 4, 2016

Why TF instead of Theano as PyMC3 has done? Shouldn't it be straightforward to port PyMC3 algos over to TF?

My main gripe with Theano is that OpenCL support is near non-existent, but this is also the case with TF.

murbard2 · on Nov 4, 2016

4) which approach are you using? Generalized Poisson Estimator, or estimating the convexity effect of the exponential by looking at the sample variance of the log likelihood? The former is more pure, the latter may be more practical if ugly.

proditus · on Nov 4, 2016

these are great insights.

our first approach is the simplest: stochastic variational inference. consider a likelihood that factorizes over datapoints. stochastic variational inference then computes stochastic gradients of the variational objective function at each iteration by subsampling a "minibatch" of data at random.
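
The subsampling step described above can be sketched in a few lines. This is only a toy illustration, not Edward's code: it assumes a unit-variance Gaussian likelihood with unknown mean `mu`, and the names `full_grad` and `minibatch_grad` are made up for the example. It covers just the likelihood term of the objective, showing that a minibatch gradient rescaled by N / batch_size matches the full-data gradient on average.

```python
import random

def full_grad(mu, data):
    # Gradient w.r.t. mu of sum_i log N(x_i | mu, 1), using every datapoint.
    return sum(x - mu for x in data)

def minibatch_grad(mu, data, batch_size, rng):
    # Subsample a minibatch uniformly at random (without replacement) and
    # rescale by N / batch_size so the estimate is unbiased for full_grad.
    batch = rng.sample(data, batch_size)
    return (len(data) / batch_size) * sum(x - mu for x in batch)

rng = random.Random(0)
data = [rng.gauss(2.0, 1.0) for _ in range(1000)]  # toy dataset (assumption)
mu = 0.0

exact = full_grad(mu, data)
estimates = [minibatch_grad(mu, data, 50, rng) for _ in range(2000)]
mean_estimate = sum(estimates) / len(estimates)
print(exact, mean_estimate)  # the two values should agree closely
```

Each individual estimate is noisy, but because the likelihood factorizes over datapoints the noise averages out, which is what lets a stochastic optimizer use these gradients.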

i reckon the techniques you suggest would work as we move forward!

murbard2 · on Nov 4, 2016

Edit: ah never mind, variational inference, got it! I was thinking stochastic HMC!

---

Ok but that will get an unbiased estimate of the log-likelihood. MCMC or HMC do work with noisy estimators, but they require unbiased estimates of the likelihood.

At the very least, you need to do a convexity adjustment by measuring the variance inside your mini batch. Or you can use the Poisson technique which will get you unbiased estimates of exp(x) from unbiased estimates of x (albeit at the cost of introducing a lot of variance).
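
To make the point concrete, here is a rough sketch comparing three ways of turning a noisy, unbiased estimate of a log-likelihood x into an estimate of exp(x). Everything in it is an assumption for illustration: the noise is Gaussian with known standard deviation `sigma` (in practice you would estimate the variance from the minibatch), and `lam` and `c` are hypothetical tuning constants for the Poisson estimator.

```python
import math
import random

def sample_poisson(lam, rng):
    # Knuth's multiplication method; adequate for small lam.
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def noisy_loglik(x, sigma, rng):
    # Unbiased but noisy estimate of the log-likelihood x (Gaussian noise assumed).
    return x + sigma * rng.gauss(0.0, 1.0)

def poisson_estimate(x, sigma, lam, c, rng):
    # Draw J ~ Poisson(lam), then multiply J independent factors (x_hat - c) / lam.
    # The expectation is exp(x); individual draws can be negative if x_hat < c.
    j = sample_poisson(lam, rng)
    prod = 1.0
    for _ in range(j):
        prod *= (noisy_loglik(x, sigma, rng) - c) / lam
    return math.exp(lam + c) * prod

rng = random.Random(1)
x, sigma = -1.0, 0.5   # true log-likelihood and noise level (assumptions)
lam, c = 2.0, -3.0     # hypothetical Poisson estimator settings
n = 100_000
truth = math.exp(x)

# Naive: exponentiate the noisy estimate; biased upward by Jensen's inequality.
naive_mean = sum(math.exp(noisy_loglik(x, sigma, rng)) for _ in range(n)) / n
# Convexity adjustment: subtract half the variance before exponentiating.
adjusted_mean = sum(
    math.exp(noisy_loglik(x, sigma, rng) - 0.5 * sigma**2) for _ in range(n)
) / n
# Poisson estimator: unbiased without assuming Gaussian noise.
poisson_mean = sum(poisson_estimate(x, sigma, lam, c, rng) for _ in range(n)) / n

print(truth, naive_mean, adjusted_mean, poisson_mean)
```

The adjustment is only exact when the noise really is Gaussian with the stated variance; the Poisson construction does not need that assumption, but its variance depends heavily on the choice of `lam` and `c`.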

proditus · on Nov 7, 2016

great points; yes, the problem becomes considerably more challenging with MCMC!

marmaduke · on Nov 4, 2016

if I can bug you also since you're working with Alp, how does Edward handle ADVI covariance? Is it diagonal or dense or some sparse structure estimated?

proditus · on Nov 4, 2016

you may bug me on this. i work too closely with alp :)

edward does not completely implement advi yet. the piece that is missing is the automated transformation of constrained latent variable spaces to the real coordinate space. however, edward offers much more flexibility in specifying the dependency structure of the posterior approximation. diagonal is, just like in stan, the fastest and easiest. however, introducing structure (e.g. assigning a dense normal to some of the latent variables while assigning a diagonal to others) is much easier in edward.
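
For readers unfamiliar with the idea, a mixed dense/diagonal approximation can be sketched in plain Python as follows. This is not Edward's API; `sample_q`, `chol_dense`, and `diag_scale` are hypothetical names, and the numbers are arbitrary. The first two latent variables share a dense normal (given by a lower-triangular Cholesky factor), while the remaining two get independent normals.

```python
import random

def sample_q(mean, chol_dense, diag_scale, rng):
    # One draw from q: the first k coordinates are jointly normal with
    # covariance L L^T, the rest are independent normals.
    k = len(chol_dense)
    eps = [rng.gauss(0.0, 1.0) for _ in mean]
    dense_part = [
        mean[i] + sum(chol_dense[i][j] * eps[j] for j in range(i + 1))
        for i in range(k)
    ]
    diag_part = [mean[k + i] + s * eps[k + i] for i, s in enumerate(diag_scale)]
    return dense_part + diag_part

def sample_cov(draws, a, b):
    # Empirical covariance between coordinates a and b.
    n = len(draws)
    mean_a = sum(d[a] for d in draws) / n
    mean_b = sum(d[b] for d in draws) / n
    return sum((d[a] - mean_a) * (d[b] - mean_b) for d in draws) / (n - 1)

rng = random.Random(2)
mean = [0.0, 0.0, 0.0, 0.0]
chol_dense = [[1.0, 0.0], [0.8, 0.6]]  # dense block over z0, z1
diag_scale = [0.5, 2.0]                # independent scales for z2, z3

draws = [sample_q(mean, chol_dense, diag_scale, rng) for _ in range(20000)]
cov_01 = sample_cov(draws, 0, 1)  # correlated inside the dense block
cov_02 = sample_cov(draws, 0, 2)  # close to zero across blocks
print(cov_01, cov_02)
```

The dense block costs k(k+1)/2 parameters instead of k, which is why one would typically reserve it for the few variables where correlation matters.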

marmaduke · on Nov 5, 2016

OK I would be interested in seeing how to do that. Are there any examples or hints on how to start? I worked a lot with time series models (think nonlinear autoregressive), where there's strong short term autocorrelation, and the coercion to diagonal covariance seemed inappropriate.

I also have a naïve question: why not use the graphical structure of the model itself to add structure to the covariance? For example, in an AR model, each time point places a prior on the next time point, so why not assume a banded covariance? More generally, one could use a cutoff on shortest path length (through the model's graphical structure) between parameters to decide if they should have nonzero coefficients.
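
The shortest-path cutoff proposed above could be prototyped roughly like this. It is a sketch of the idea only, not something Stan or Edward is known to do; `sparsity_mask` and the edge list are hypothetical. Parameters are graph nodes, model dependencies are edges, and an entry is allowed to be nonzero when the two parameters are within `cutoff` hops of each other.

```python
from collections import deque

def sparsity_mask(edges, n, cutoff):
    # Build an undirected adjacency list from the model's dependency edges.
    adj = [[] for _ in range(n)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    mask = [[False] * n for _ in range(n)]
    for source in range(n):
        # Breadth-first search, truncated at `cutoff` hops from the source.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            if dist[u] == cutoff:
                continue
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for v in dist:
            mask[source][v] = True
    return mask

# AR(1)-style chain: each time point depends on the previous one.
T = 6
ar1_edges = [(t, t + 1) for t in range(T - 1)]
mask = sparsity_mask(ar1_edges, T, cutoff=1)
for row in mask:
    print("".join("x" if m else "." for m in row))
```

For the chain above with `cutoff=1` the mask is tridiagonal, i.e. the banded structure from the AR example; raising the cutoff widens the band.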

marmaduke · on Nov 7, 2016

I came across the examples in the repo and commented on

https://github.com/blei-lab/edward/issues/211

so I'll try to do my homework before asking more questions ;)

marmaduke · on Nov 4, 2016

Whoa cool but the build is failing tsk tsk.

A big issue I ran into with stan even with advi was scaling to large datasets since it (and Eigen) are single threaded. Would Edward answer all my prayers?

When is Riemannian HMC going to arrive?

proditus · on Nov 4, 2016

i'm assuming you're referring to building edward? installation is a bit of a pain because tensorflow is not on pypi yet.

please take a look here: http://edwardlib.org/troubleshooting

edward should answer some of your prayers :) there's still some time until stan goes parallel/gpu, though there's lots of interest there.

riemannian hmc is likely just around the corner!

marmaduke · on Nov 4, 2016

I was shallow-ly referring to the Travis CI badge on the GitHub page..

I've been working with both Stan & PyMC3 on some large datasets and will definitely try Edward on them.

proditus · on Nov 4, 2016

ah ok. i agree. we're working on that. :)

give it a shot and let us know!

eli_gottlieb · on Nov 4, 2016

What's the impediment to porting these variational algorithms over to a Turing-complete probabilistic programming language?

marmaduke · on Nov 4, 2016

Stan is Turing complete or you mean.. inception??

eli_gottlieb · on Nov 4, 2016

So I can write a Church to Stan compiler, in principle?

marmaduke · on Nov 4, 2016

Yes, you would either generate Stan code or rewrite your AST into something the Stan compiler understands.

However, if Church allows you to express non-differentiable models (e.g. if you have a Heaviside function), then they will either fail or not work well with HMC or ADVI (the variational inference algorithm Stan uses), because both assume that the gradient of the posterior can be computed and is informative about the posterior.
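
A tiny example of why a step function is a problem, using a made-up log-density (`log_density` and its constants are purely illustrative): away from the jump the gradient is exactly zero, so it says nothing about where the high-density region is.

```python
import math

def heaviside(x):
    return 1.0 if x >= 0 else 0.0

def log_density(theta):
    # Hypothetical unnormalized log-density with a jump at theta = 1.
    return math.log(0.1 + 0.9 * heaviside(theta - 1.0))

def finite_diff(f, x, h=1e-6):
    # Central finite-difference approximation to df/dx.
    return (f(x + h) - f(x - h)) / (2 * h)

points = (-2.0, 0.0, 0.5, 2.0)
grads = [finite_diff(log_density, t) for t in points]
print(grads)  # zero at every point that is not at the jump
```

A gradient-based sampler or optimizer started at theta = -2 gets no signal pointing it toward theta > 1, even though that is where most of the mass sits.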