1) Why create a project distinct from Stan? Was it the prospect of benefiting from all the work going into TF and focusing solely on the sampling procedures rather than autodiff or GPU integration?
2) Are you implementing NUTS?
3) Any plans to implement parallel tempering?
4) Any plans to handle "tall" data using stochastic estimates of the likelihood?
1. you touch upon the right strengths of TF; that was certainly one consideration. edward is designed to address two goals that complement stan. the first is to be a platform for inference research: as such, edward is primarily a tool for machine learning researchers. the second is to support a wider class of models than stan (at the cost of not offering a "works out of the box" solution).
our recent whitepaper explains these goals in a bit more detail:
4) which approach are you using? Generalized Poisson Estimator, or estimating the convexity effect of the exponential by looking at the sample variance of the log likelihood? The former is more pure, the latter may be more practical if ugly.
our first approach is the simplest: stochastic variational inference. consider a likelihood that factorizes over datapoints. stochastic variational inference then computes stochastic gradients of the variational objective function at each iteration by subsampling a "minibatch" of data at random.
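as a rough sketch (not edward's actual code), the subsampling idea looks like this in plain NumPy: rescaling the minibatch sum by N / batch_size gives an unbiased estimate of the full-data log-likelihood, and the same trick applies to its gradient. the function and argument names here are mine, purely for illustration.

```python
import numpy as np

def minibatch_loglik_estimate(data, loglik_fn, params, batch_size, rng):
    """unbiased estimate of sum_i log p(x_i | params) from a random minibatch.

    loglik_fn(x, params) returns per-datapoint log-likelihoods; rescaling the
    subsampled sum by N / batch_size makes it unbiased for the full-data sum.
    """
    N = len(data)
    idx = rng.choice(N, size=batch_size, replace=False)
    return (N / batch_size) * np.sum(loglik_fn(data[idx], params))

# toy usage: gaussian likelihood with unknown mean
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=10_000)
loglik = lambda x, mu: -0.5 * (x - mu) ** 2 - 0.5 * np.log(2 * np.pi)
print(minibatch_loglik_estimate(data, loglik, params=2.0, batch_size=128, rng=rng))
```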
i reckon the techniques you suggest would work as we move forward!
Edit: ah never mind, variational inference, got it! I was thinking stochastic HMC!
---
Ok, but that will give an unbiased estimate of the log-likelihood. MCMC and HMC do work with noisy estimators, but they require unbiased estimates of the likelihood itself, and an unbiased estimate of the log does not translate into an unbiased estimate of its exponential.
At the very least, you need to do a convexity adjustment by measuring the variance inside your minibatch. Or you can use the Poisson technique, which will get you unbiased estimates of exp(x) from unbiased estimates of x (albeit at the cost of introducing a lot of variance).
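To make the Poisson technique concrete: given i.i.d. unbiased estimates X_i of x, drawing N ~ Poisson(lambda) and returning exp(lambda) * prod_{i<=N} X_i / lambda is unbiased for exp(x). A rough NumPy sketch (function names are purely illustrative, not from Edward or Stan):

```python
import numpy as np

def poisson_estimator(draw_unbiased_x, lam, rng):
    """Unbiased estimate of exp(x) built from unbiased estimates of x.

    Draw N ~ Poisson(lam), then return exp(lam) * prod_{i<=N} X_i / lam with
    X_i i.i.d. unbiased estimates of x; the expectation telescopes to exp(x).
    Larger lam reduces variance but costs more X_i draws per estimate.
    """
    n = rng.poisson(lam)
    xs = np.array([draw_unbiased_x(rng) for _ in range(n)])
    return np.exp(lam) * np.prod(xs / lam)   # empty product (n == 0) is 1

# toy check: noisy estimates of x = 1.5, so the target is exp(1.5) ~= 4.48
rng = np.random.default_rng(0)
draw_x = lambda rng: 1.5 + rng.normal(scale=0.5)
estimates = [poisson_estimator(draw_x, lam=3.0, rng=rng) for _ in range(50_000)]
print(np.mean(estimates), np.std(estimates))  # mean near 4.48, but high spread
```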
If I can bug you as well, since you're working with Alp: how does Edward handle the ADVI covariance? Is it diagonal, dense, or some estimated sparse structure?
you may bug me on this. i work too closely with alp :)
edward does not completely implement advi yet. the piece that is missing is the automated transformation of constrained latent variable spaces to the real coordinate space. however, edward offers much more flexibility in specifying the dependency structure of the posterior approximation. diagonal is, just like in stan, the fastest and easiest. however, introducing structure (e.g. assigning a dense normal to some of the latent variables while assigning a diagonal to others) is much easier in edward.
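as a rough sketch of what mixing a dense and a diagonal approximation might look like (the class and function names below, e.g. MultivariateNormalTriL and ed.KLqp, are assumptions based on edward's 1.x api and may not match the exact version discussed here):

```python
import numpy as np
import tensorflow as tf
import edward as ed
from edward.models import Normal, MultivariateNormalTriL

D = 3                                        # correlated latent block
x_data = np.random.randn(100).astype(np.float32)

# toy model: correlated latents z, a separate scalar latent mu, observed x
z = Normal(loc=tf.zeros(D), scale=tf.ones(D))
mu = Normal(loc=0.0, scale=1.0)
x = Normal(loc=mu + tf.reduce_sum(z), scale=1.0, sample_shape=100)

# dense (full-covariance) approximation for z via a lower-triangular factor
# (a more careful version would also keep the diagonal of L positive)
L = tf.matrix_band_part(tf.Variable(tf.eye(D)), -1, 0)
qz = MultivariateNormalTriL(loc=tf.Variable(tf.zeros(D)), scale_tril=L)

# mean-field (diagonal) approximation for mu
qmu = Normal(loc=tf.Variable(0.0), scale=tf.nn.softplus(tf.Variable(0.0)))

# one variational inference run mixing the two approximations
inference = ed.KLqp({z: qz, mu: qmu}, data={x: x_data})
inference.run(n_iter=500)
```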
OK, I would be interested in seeing how to do that. Are there any examples or hints on how to start? I have worked a lot with time series models (think nonlinear autoregressive), where there's strong short-term autocorrelation, and the coercion to a diagonal covariance seemed inappropriate.
I also have a naïve question: why not use the graphical structure of the model itself to add structure to the covariance? For example, in an AR model each time point places a prior on the next time point, so why not assume a banded covariance? More generally, one could use a cutoff on the shortest path length (through the model's graphical structure) between parameters to decide whether they should have nonzero covariance entries.
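To make the banded idea concrete, a purely illustrative NumPy sketch (not something Stan or Edward does out of the box): mask the Cholesky factor of the Gaussian approximation so that only entries within the graph-distance cutoff are free parameters; the implied covariance L Lᵀ is then banded with the same bandwidth.

```python
import numpy as np

def banded_lower_mask(T, bandwidth):
    """Lower-triangular mask keeping only entries within `bandwidth` of the
    diagonal, i.e. parameter pairs whose graph distance in an AR-style model
    is at most `bandwidth`."""
    d = np.arange(T)[:, None] - np.arange(T)[None, :]
    return ((d >= 0) & (d <= bandwidth)).astype(float)

T, bandwidth = 6, 1                       # AR(1)-like: neighbours only
mask = banded_lower_mask(T, bandwidth)
L = mask * np.random.randn(T, T)          # free parameters only inside the band
Sigma = L @ L.T                           # implied covariance, banded with the
print(np.round(Sigma, 2))                 # same bandwidth
```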
A big issue I ran into with Stan, even with ADVI, was scaling to large datasets, since both it and Eigen are single-threaded. Would Edward answer all my prayers?
(shakir's blog posts are amazing; i recommend them all.)