Hacker News
Doing Bayesian Data Analysis On the GPU: Examples (github.com/uncomplicate)
112 points by dragandj on July 30, 2016 | 28 comments



Characteristic examples from the book Doing Bayesian Data Analysis 2nd edition [1] programmed in Clojure and OpenCL to run on the GPU. Much, much faster than Stan or JAGS!

The library used (Bayadera) is still pre-release and needs much polishing, so consider this a preview. But it is still very useful, and no more complex for programmers than the mainstream Bayesian tools.

[1] https://www.amazon.com/Doing-Bayesian-Data-Analysis-Second/d...


Any hint how much faster? Not looking for defensible benchmarks, but are you talking an order of magnitude? Multiple orders?

Does it make the same probabilistic guarantees as the methods used in Stan etc? Or is it trading validity for speed?


Nothing is universal and guaranteed, of course. YMMV, and all that.

For example, the robust linear regression from chapter 17, which fits 300 points over 4 parameters (easy, but far from trivial), runs in 180 seconds in JAGS and 485 seconds in Stan, in parallel with 4 chains, taking 20,000 samples.

Bayadera takes 276,297,912 samples in 300 milliseconds, giving much finer-grained estimates.

So, depending on how you count the difference, it would be 500-1000 times faster for this particular analysis, while the per-sample ratio is something like 7,000,000 (compared to JAGS).
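A back-of-envelope check of those ratios, using only the numbers quoted above (note the per-sample ratio vs. JAGS comes out closer to 8 million with these exact figures, the same ballpark as the ~7,000,000 quoted):

```python
# Speedup arithmetic from the timings quoted above.
jags_seconds = 180.0
stan_seconds = 485.0
jags_samples = 20_000

bayadera_seconds = 0.3          # 300 milliseconds
bayadera_samples = 276_297_912

# Wall-clock speedup, ignoring sample counts:
wall_clock_vs_jags = jags_seconds / bayadera_seconds   # 600x
wall_clock_vs_stan = stan_seconds / bayadera_seconds   # ~1617x

# Per-sample throughput ratio vs. JAGS:
jags_rate = jags_samples / jags_seconds                # ~111 samples/s
bayadera_rate = bayadera_samples / bayadera_seconds    # ~9.2e8 samples/s
per_sample_ratio = bayadera_rate / jags_rate           # ~8.3 million

print(wall_clock_vs_jags, wall_clock_vs_stan, per_sample_ratio)
```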

Of course, JAGS and Stan are mature software packages, while Bayadera is still pre-release...


Thanks. About the second part of my question - are you doing much the same stuff as JAGS/Stan? Like, they do a lot of work to make sure that their MCMC is validly converging to the posterior - does Bayadera make similar guarantees?

Is the speedup coming from a better implementation, or because GPUs are just way faster, or because it cuts statistical corners? If it's cutting corners, are they sensible?


It uses a different MCMC algorithm: affine invariant ensemble MCMC. The difference comes from the fact that this algorithm is parallelizable, while JAGS/Stan's isn't. So, many GPU cores are the main factor. But the algorithm is also a factor, in the sense that parallel chains always mutually inform each other.
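For the curious: the affine invariant ensemble family is the one introduced by Goodman & Weare (2010) and popularized by emcee. This is not Bayadera's actual code, just a minimal NumPy sketch of the core "stretch move"; the point is that each half of the ensemble can be updated in parallel against the other half, which is what maps well to many GPU cores:

```python
# Minimal sketch of the Goodman & Weare affine-invariant "stretch move".
# Illustrative only - not Bayadera's implementation.
import numpy as np

def stretch_move_step(walkers, log_prob, rng, a=2.0):
    """One full ensemble update: each half of the ensemble moves at once,
    using randomly chosen walkers from the other half as stretch partners."""
    n, dim = walkers.shape
    half = n // 2
    for active, other in [(slice(0, half), slice(half, n)),
                          (slice(half, n), slice(0, half))]:
        x = walkers[active]
        partners = walkers[other][rng.integers(0, half, size=half)]
        # Draw z from g(z) proportional to 1/sqrt(z) on [1/a, a].
        z = ((a - 1.0) * rng.random(half) + 1.0) ** 2 / a
        proposal = partners + z[:, None] * (x - partners)
        # Accept with probability min(1, z^(dim-1) * p(proposal) / p(x)).
        log_accept = (dim - 1) * np.log(z) + log_prob(proposal) - log_prob(x)
        accepted = np.log(rng.random(half)) < log_accept
        walkers[active] = np.where(accepted[:, None], proposal, x)
    return walkers

# Usage: sample a standard 2-D Gaussian with 64 walkers.
log_prob = lambda x: -0.5 * np.sum(x * x, axis=1)
rng = np.random.default_rng(42)
walkers = rng.normal(size=(64, 2))
for _ in range(1000):
    walkers = stretch_move_step(walkers, log_prob, rng)
```

Note that within each half-ensemble update there is no data dependency between walkers, so the whole inner step vectorizes (or, on a GPU, runs one work-item per walker).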

They do a lot of work to make sure that MCMC is validly converging, and Bayadera also does its part on that front, but the truth is, as you'll find in any book on MCMC (Gelman included), that you can never guarantee MCMC convergence.
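For context: the standard cross-chain diagnostic here is the Gelman-Rubin potential scale reduction factor (R-hat), which can flag non-convergence but never prove convergence. A minimal sketch (not Bayadera's or Stan's actual diagnostic code):

```python
# Gelman-Rubin R-hat for one scalar parameter, chains as rows.
# Values near 1 are consistent with convergence; they never guarantee it.
import numpy as np

def gelman_rubin(chains):
    """chains: array of shape (m_chains, n_samples) for one parameter."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)           # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()     # mean within-chain variance
    var_hat = (n - 1) / n * W + B / n         # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(0)
# Four chains drawn from the same distribution: R-hat close to 1.
good = rng.normal(size=(4, 5000))
print(gelman_rubin(good))
# Chains stuck at two different modes: R-hat far above 1.
bad = good + np.array([[0.0], [5.0], [0.0], [5.0]])
print(gelman_rubin(bad))
```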


Looks very nice. I wonder if the upcoming Xeon Phi will make the task of parallel sampling simpler. Or at least compiling and optimising automatic parallel samplers on the fly. Macros might be great for this. That's the ultimate probabilistic programming goal. Write the model and get efficient sampling for free.


Thanks. I doubt that Xeon Phi would be any faster than my old AMD R9 290X, and the 10x price tag is also not inviting.


Can you point me to any good documentation on parallel MCMC algorithms and any info you might have written down on how you parallelized it? This sounds extremely worth porting over to some other probabilistic programming languages.


I'll be glad to send you the paper once it gets accepted.


Thanks!


Validity probably depends on whether you use a professional GPU or a gaming device. I am currently running NMF on a GTX 980, and from time to time the algorithm totally fails, probably due to missing ECC. I hope our new Tesla server will solve the issue.


I can't guarantee it, but this should not be a problem for this particular algorithm (MCMC).


It's not really comparable to Stan if you're not computing gradients for HMC. Of course, maybe for these models you don't need to.


Please point me to models that absolutely need HMC, and I can try to see how Bayadera fares.


Stan's developers in particular use it for hierarchical models, but in general anything with highly correlated parameters works better with HMC than with other MCMC methods, IIRC.

Michael Betancourt (a Stan dev working on the HMC parts) has a pair of YouTube videos which go into the details.

That said, I switched to pymc3 so that I could compute logp via opencl more easily and if there were better ways to do this, I'm happy to see them.


But there ARE hierarchical models in the examples. One is 158-dimensional. With highly correlated parameters. Works like a charm in Bayadera.


My point was mainly that comparing speed between an algorithm that doesn't require gradients and HMC in Stan is apples and oranges.


How's that? The algorithms have the same goal - find the posterior distribution. The time to get to that distribution is what is important and what is compared, provided that both algorithms get proper results. How they do it underneath is irrelevant for the user who waits.

That's like saying that comparing a horse cart and an automobile is comparing apples and oranges.

That being said, there are other things where Stan might fare better. User familiarity, or maturity, or personal taste...


Looks cool. Would love to see the "much, much faster" claim quantified. Both including and excluding compile times. Stan is neat, but the recompile time after every tweak to the model really got to be a drag. If you can improve on that, it would be a real win for me.


Recompile time is under a second for most models that I tried - let's say 150-700 ms. Once you compile it, you can use it many times.

The difference for the inference time is in the post below (but YMMV).


This looks great, and very tastefully implemented on the Clojure side. Have you considered splitting out things like the quil utils?


Thanks! About the quil utils - do you mean get rid of them, or do you think they would be useful on their own? Btw, I only use quil for the sketch setup; the plots are done in low-level Processing. Quil does some (un)boxing, so it would be a drag when plotting millions of sample points.


Ultimately I've been disappointed with Incanter, which seems effectively abandoned at this point, and I'm always on the lookout for alternatives that might one day rival the tools available in R. I use quil for various visualisations, but it (and I guess Processing beneath it) is jarringly mutable. Despite that, it would be interesting to me to see more plotting libraries spring up for Clojure, I suppose that was my point.


Bayadera does not aim to rival or to mimic R tools - it aims to be better! But it is strongly opinionated; either you take the Bayesian path, or you use something else :)

Currently, I do not have the resources to take on the task of creating a general-purpose plotting library. My idea with the plotting toolbox available in Bayadera is to provide a tool that does one job - fast, easy-to-use plotting for exactly those types of plots that are needed for Bayesian work - and does that job well.

Plotting in general is not that hard technically, but the problem is that each user in each situation wants something unique. The level of customizability found in ggplot, matplotlib, and the like is rather high; that's something I do not have the motivation to pursue.


Shameless self promotion: https://github.com/sbelak/huri has a half-decent plotting DSL that uses ggplot underneath.


Just out of curiosity, how hard would it be to write a compiler that takes JAGS or Stan model files and compiles them to the s-exps needed to use the library?


I must be missing something. Is there any output that we are supposed to be seeing? Plots of the fits or something?


Yes. Run the tests in the REPL. Call (analysis) to only invoke the fits and get the timings, or call (display-sketch) and then (reset! all-data (analysis)) to see the plots. It is only done like that in these examples; for actual work, you are free to use any meaningful combination of functions that you like.



