Bayes and Big Data: The Consensus Monte Carlo Algorithm

bazzargh · on Nov 18, 2013

Any hint as to how this differs from other parallel Monte Carlo algorithms? Or is it an application of those methods to big data? (People were doing parallel Monte Carlo at least as far back as the early 90s - my own research topic on exact nuclear calculations was basically obsoleted by the superior scalability of that approach)

tlarkworthy · on Nov 18, 2013

it's first author is a Googler, and its posted on a Google domain.

EDIT: haha I am downvoted for speaking the truth. The parent and I have both read lots of papers like these, there is only a modest contribution of this paper. It basically says we fudged the inference and the end results look similar so its a good optimization. However, there will be probably lots of specific models that "need" the mixing freedom the approximation removes, and hence the algorithm will only work for a specific subspace of MCMC problems. MCMC is basically impossible to debug, so we dunno how well it works overall. THIS IS THE SAME CONCLUSION OF EVERY OTHER MCMC APPROXIMATION PAPER. The only reason this is on HN, is because of its heritage. I do not think this paper is revolutionary (unlike some other papers coming out of Google).

EDIT 2: evidence Modern: "Fully Parallel Inference in Markov Logic Networks" - Max-Planck "Hybrid Parallel Inference for Hierarchical Dirichlet Process" "A split-merge mcmc algorithm for the hierarchical dirichlet process" Old: "Parallel Implementations of Probabilistic Inference" a 1996 review paper!

You might say these papers are not exactly the same, ok, but the final justification for the given paper is:

"Depending on the model, the resulting draws can be nearly indistinguishable"

NOTE kewords: "depending" and "nearly"

ColinWright · on Nov 18, 2013

I wasn't the downvoter, but your initial response didn't answer the question, and was a snide irrelevancy. Was it intended as criticism of the HN audience? So what if it's related to Google?

Your subsequent edit is useful - thank you. I just wish you'd said that in the first place and left out the swipe.

tlarkworthy · on Nov 18, 2013

it wasn't a swipe, compressed? yes. It IS the main reason why this paper is getting more air than the work of hundreds of other machine learning papres int he same league. It was not meant in malice, just a plain ordinary fact.

I thought arguing it has a was a average contribution would be more snidy. It's perfectly good work, just not in the same league as map-reduce or self-driving cars.

ColinWright · on Nov 18, 2013

I really wish you'd said in the first instance that it appeared that the paper wasn't really saying anything new, and left out the implication that it's getting more air purely because it's from Google.

Not least, you might be wrong. It might be getting more air not because it's from Google, but simply because it has more visibilty in general. My feeling is that other, perhaps more scholarly, and perhaps more in-depth articles don't see the light of day simply because they are in more obscure places. You can help by pointing at other sources of similar material, and explaining why they are potentially of more value.

A single line of "It's by a Googler" was unenlightening, and to me came off as pure snark.

Demiurge · on Nov 18, 2013

I have no idea if you're right or not, but I really hate how people downmod on HN instead of stating their disagreements.

ColinWright · on Nov 18, 2013

In general so do I, but in this case the initial reply was a content free snipe. In that case I'm not surprised it got a downvote without investing the time in making a more complete reply.

The subsequent edit was useful, although the attitude is, well, "sharp".

derefr · on Nov 18, 2013

As far as I've been able to tell from years of Reddit/HN, downvoting is basically a signal that says "this comment is not worth the time it'd take to rebut it--but also isn't doing anything actively harmful enough to elicit flagging."

gil · on Nov 18, 2013

"The parent and I have both read lots of papers like these,"

Can you please post the title/author of the ones that are most relevant?

btown · on Nov 18, 2013

Are there any theoretical results on this type of algorithm saying, i.e., amortized over some number of iterations, there is some probability of satisfying detailed balance?

alextp · on Nov 18, 2013

ML PhD student here. The reason why this is different is that the parallel monte carlo simulations are running on different subsets of the data in each machine, and then averaged.

It is not obvious that this can work at all in some cases. Think, for example, a clustering model. If there are two clusters, but one machine calls them A B and the other machine calls them B A, averaging will give you useless results.

So the contribution of this paper is finding a set of models on which naive averaging works, and showing an efficient mapreduce implementation of it.

That said, I don't find the paper particularly interesting.

christopheraden · on Nov 18, 2013

Ph.D student in Stats here. That we can even sample the full posterior with so much data is interesting in and of itself. The method isn't particularly revolutionary, especially compared to a method Zaiyang Huang and Andy Gelman[1] came up with 8 years ago, but it's practical (for a restricted class of models). Steve Scott spoke to my department about a month ago, describing the problem. It's not a solution that works for all posteriors, but it certainly allows more freedom than restricting oneself to conjugate priors and allows computation on large datasets.

Academics tend to trivialize the implementation (we had some pretty strong critics of his talk in my dept), but some kudos are in order for that, even if the algorithm itself isn't revolutionary.

[1]: http://www.stat.columbia.edu/~gelman/research/unpublished/co...

3pt14159 · on Nov 18, 2013

The main reason is that it is so legible and readable, and assumes less of the reader than contemporary ML papers.

dgmdoug · on Nov 18, 2013

I think it's more to do with the application of parallel MC to approximate inference over Bayesian networks. In the normal case of approximate inference, each node would operate over the incoming data to approximate maximum likelihoods (this is the MC bit) then pass them out on each link. The key contribution of this work seems to be to tie the MC algorithms together from difference nodes. Difficult to say any more without reading the paper though.

My key question would be whether there are independence considerations that Parallel MC breaks? (Message passing works because all nodes need only information from their neighbours to summarise the rest of the network, due to the modelled independence).

mathattack · on Nov 18, 2013

Can anyone explain cases where the Monte Carlo optimizations wouldn't work in a distributed manner? Most of the examples of Monte Carlo that come to mind are path dependent pricings of financial instruments. That seems to be an easy one to distribute the computations. As long as each path gets it's own processor, you can average the prices at the end.

What's a good example of where this breaks down? I'm thinking weather simulations, but I don't know the domain well enough.

tristanz · on Nov 18, 2013

Most Monte Carlo methods used to calculate Bayesian posterior distributions are iterative, you can't parallelize each draw since it depends on the previous draw. Moreover, each draw depends on a function of all the data, so if your data is larger than can be stored on a single computer it gets complicated.

This paper doesn't look like it has a magic solution though. It seems to propose an adhoc sharding approach and then shows it sometimes works.

nyan_sandwich · on Nov 18, 2013

MCMC isn't strictly serial; you can run multiple chains in parallel and get decent results (provided your break in period isn't too big), and this is commonly done for multicore.

The "all the data" problem is interesting. It seems solvable if each node summarizes the likelihood function for part of the data. I'll have to read the paper and see if that's what they did.

tristanz · on Nov 18, 2013

All of it's definitely solvable, it's just harder than lots of Monte Carlo problems that are embarrassingly parallel.

mathattack · on Nov 18, 2013

Thanks. This helps. Can you give a specific example of when someone would use this calculation, and where it would go wrong? Or what kind of problem requires the draws to be a function of all the data?

tristanz · on Nov 18, 2013

This is the core calculation to all of Bayesian statistics, so it's used pretty much everywhere. Almost any time you try to compute p(unknown|data) you run into computational issues. It's a high dimensional integration problem that has no closed form solution. There are models that can be simplified, but they are the minority.