
Most Monte Carlo methods used to calculate Bayesian posterior distributions are iterative: you can't parallelize the draws, because each draw depends on the previous one. Moreover, each draw depends on a function of all the data, so if your data is larger than what a single computer can store, things get complicated.
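
For concreteness, here's a minimal random-walk Metropolis sketch in Python (a toy normal-mean model; every name here is illustrative, not from the paper). It shows both pain points: each proposal is centered on the previous draw, and every accept/reject test evaluates the likelihood of the entire dataset.

    import numpy as np

    def log_posterior(theta, data):
        # Toy model: unknown mean, known unit variance, standard normal
        # prior. Note the sum over ALL the data -- every step needs it.
        return -0.5 * np.sum((data - theta) ** 2) - 0.5 * theta ** 2

    def metropolis(data, n_draws=10_000, step=0.5, seed=0):
        rng = np.random.default_rng(seed)
        theta = 0.0
        draws = np.empty(n_draws)
        for i in range(n_draws):
            # The proposal is centered on the PREVIOUS draw, so draw i
            # can't start before draw i-1 finishes: the chain is serial.
            proposal = theta + step * rng.standard_normal()
            log_alpha = (log_posterior(proposal, data)
                         - log_posterior(theta, data))
            if np.log(rng.uniform()) < log_alpha:
                theta = proposal
            draws[i] = theta
        return draws

    draws = metropolis(np.random.default_rng(1).normal(2.0, 1.0, size=500))
    print(draws[1_000:].mean())  # lands near the true mean of 2.0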

This paper doesn't look like it has a magic solution, though. It seems to propose an ad hoc sharding approach and then shows that it sometimes works.


MCMC isn't strictly serial; you can run multiple chains in parallel and get decent results (provided your burn-in period isn't too long), and this is commonly done on multicore machines.
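
A sketch of the multi-chain approach (same toy model as above; n_draws, burn_in, etc. are arbitrary choices, not from the paper). Chains share no state, so they map one-to-one onto cores:

    from multiprocessing import Pool

    import numpy as np

    # Toy dataset; regenerated deterministically in each worker process.
    DATA = np.random.default_rng(0).normal(1.0, 1.0, size=1_000)

    def log_posterior(theta):
        # Normal likelihood with known unit variance, standard normal prior.
        return -0.5 * np.sum((DATA - theta) ** 2) - 0.5 * theta ** 2

    def run_chain(seed, n_draws=20_000, burn_in=5_000, step=0.1):
        # One independent Metropolis chain per worker, seeded differently.
        rng = np.random.default_rng(seed)
        theta, draws = 0.0, np.empty(n_draws)
        for i in range(n_draws):
            proposal = theta + step * rng.standard_normal()
            if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(theta):
                theta = proposal
            draws[i] = theta
        # Throw away the burn-in before pooling draws across chains.
        return draws[burn_in:]

    if __name__ == "__main__":
        with Pool(4) as pool:
            chains = pool.map(run_chain, [1, 2, 3, 4])
        pooled = np.concatenate(chains)
        print(pooled.mean(), pooled.std())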

The "all the data" problem is interesting. It seems solvable if each node summarizes the likelihood function for part of the data. I'll have to read the paper and see if that's what they did.


All of it is definitely solvable; it's just harder than the many Monte Carlo problems that are embarrassingly parallel.
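
For contrast, a textbook embarrassingly parallel Monte Carlo problem -- plain pi estimation, where every draw is independent and worker results combine by simple averaging:

    import numpy as np

    def estimate_pi(n, seed):
        # No draw depends on any other draw, so the work splits freely
        # across workers with no coordination at all.
        rng = np.random.default_rng(seed)
        x, y = rng.uniform(size=n), rng.uniform(size=n)
        return 4.0 * np.mean(x ** 2 + y ** 2 < 1.0)

    # Each call could run on a different machine; results just average.
    print(np.mean([estimate_pi(1_000_000, s) for s in range(8)]))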


Thanks. This helps. Can you give a specific example of when someone would use this calculation, and where it would go wrong? Or what kind of problem requires the draws to be a function of all the data?


This is the core calculation in all of Bayesian statistics, so it's used pretty much everywhere. Almost any time you try to compute p(unknown|data) you run into computational issues: it's a high-dimensional integration problem with no closed-form solution. There are models that can be simplified, but they're the minority.
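
Written out (standard Bayes' rule, nothing specific to the paper):

    p(\theta \mid \mathrm{data})
      = \frac{p(\mathrm{data} \mid \theta)\, p(\theta)}
             {\int p(\mathrm{data} \mid \theta')\, p(\theta')\, \mathrm{d}\theta'}

The denominator is the high-dimensional integral with no closed form; MCMC sidesteps it by drawing samples whose stationary distribution is the left-hand side, which only requires the numerator up to a constant.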