Was just about to say it reminded me of Bret Victor. But someone needs to synthesize technical articles like this with the field of General Semantics ( https://www.youtube.com/playlist?list=PLaoJIXlyLvLkMQUtbiTi1... ) to make them even clearer for the curious complete beginner to mathematical/technical thinking.
A deeper awareness of all the jargon one is using, plus deep intuitive layman explanations, would be valuable.
FTA: "Looking to the future, we anticipate a long-term trend towards a "hardware jungle" of parallel, heterogeneous, distributed and asynchronous computing systems which communicate in a peer-to-peer manner. "
As somebody who used belief propagation a lot before the current neural network renaissance, I'm not yet optimistic about that. (However, I have been wrong enough in the past to be somewhat humble about my ability to predict these things!)
As I see it, the current big machine learning models use data that can't fit well on a single machine, or training regimes that can't fit onto a single machine, but the end model still fits on a single GPU or box.
What I understand them to be advocating for here is a model bigger than can fit on a single chip, one that uses belief propagation to scale to that number of variables. I don't yet know what sort of applications that could address, so that's my primary point of skepticism about this sort of huge model being developed. BUT, my lack of imagination is not much of an argument on its own; it's merely the absence of an argument, not an actual argument against such large models being useful.
Question to HN: is there a dictionary somewhere that explains and relates, in an applicable way, the terms everyone uses without ever explaining them, such as "priors", "posteriors", "marginals", "belief", "marginal distribution", etc.?
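Not a dictionary, but a tiny worked example can at least relate those terms to each other. Below is a hypothetical sketch (all names and numbers invented for illustration): a prior over a weather variable, a likelihood for a sensor reading, the marginal of the reading, and the posterior (the updated "belief") after an observation:

    # Hypothetical two-variable example: Weather in {sunny, rainy},
    # Sensor reading in {dry, wet}. All numbers are made up for illustration.

    prior = {"sunny": 0.7, "rainy": 0.3}  # prior: belief about Weather before any data

    # likelihood: P(reading | weather)
    likelihood = {
        ("dry", "sunny"): 0.9, ("wet", "sunny"): 0.1,
        ("dry", "rainy"): 0.2, ("wet", "rainy"): 0.8,
    }

    # joint: P(reading, weather) = P(reading | weather) * P(weather)
    joint = {(r, w): likelihood[(r, w)] * prior[w] for (r, w) in likelihood}

    # marginal (a.k.a. marginal distribution): P(reading),
    # i.e. the joint summed over the other variable (weather)
    marginal = {r: sum(p for (r2, _), p in joint.items() if r2 == r)
                for r in ("dry", "wet")}

    # posterior: P(weather | reading="wet") via Bayes' rule
    reading = "wet"
    posterior = {w: joint[(reading, w)] / marginal[reading] for w in prior}

    print(posterior)  # the updated "belief" about Weather after seeing a wet sensor
    # {'sunny': 0.2258..., 'rainy': 0.7741...}

In belief propagation parlance, a node's "belief" is its current estimate of exactly this kind of posterior marginal, assembled from the messages it has received.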
Regression and Other Stories by Andrew Gelman et al. is IMHO the best introduction to Bayesian reasoning without being overly technical, but still maintaining rigor and providing lots of insights about how statistics is done in practice.
If you're willing to sink time into a lecture series, you may enjoy Richard McElreath's introductory lectures on Bayesian inference: https://www.youtube.com/watch?v=_NEMHM1wDfI
> Theoretical grounds for applying the same update rules to loopy graphs were later developed that explain loopy BP as an approximate variational inference method in which inference is cast as an optimization problem. Instead of directly minimizing the factor energies (as is done in MAP inference), loopy BP minimizes the KL divergence between the posterior and a variational distribution which we use as a proxy for the marginals after optimization.
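For reference, the objective being described can be written down compactly. This is the standard Bethe formulation from Yedidia, Freeman & Weiss's analysis of loopy BP (my summary, not quoted from the article), with p(x) proportional to the product of factors f_a(x_a), b_a/b_i the factor/variable beliefs, and d_i the degree of variable i:

    KL(q \| p) = \sum_x q(x) \log \frac{q(x)}{p(x)}

    F_{Bethe}(b) = \sum_a \sum_{x_a} b_a(x_a) \left[ \ln b_a(x_a) - \ln f_a(x_a) \right]
                   - \sum_i (d_i - 1) \sum_{x_i} b_i(x_i) \ln b_i(x_i)

Fixed points of loopy BP correspond to stationary points of F_Bethe, and the variable beliefs b_i are the proxies for the marginals mentioned in the quote.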
> As the Bethe free energy is non-convex, loopy BP is not guaranteed to converge, and even when it does it may converge to the wrong marginals. Empirically, however, BP generally converges to the true marginals, although for very loopy graphs it can fail.
If I follow this, in the case where loopy BP converges to something, it sounds like there are at least two components of error:
(i) the error between the true posterior distribution and the best variational approximation; and
(ii) the error between the approximation you actually found with loopy BP and that best approximation.
The fact that loopy BP can fail to converge, or fails to converge to a global optimum, is not that attractive. On the other hand, global optimisation is often difficult or intractable. Maybe in practice, in some cases, the error from failing to find the globally optimal approximation is a lot less than the error introduced by making the approximation in the first place. This doesn't necessarily mean that loopy BP will give useful results -- maybe that approximation error is so bad that you need to do something else!
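To make that concrete, here is a self-contained toy sketch (my own code, not from the article): synchronous sum-product BP on the smallest loopy graph, a 3-variable binary cycle with made-up pairwise potentials, compared against exact marginals computed by brute force. The printed gap is errors (i) and (ii) combined:

    import itertools
    import numpy as np

    # Toy pairwise MRF: 3 binary variables in a cycle (the smallest loopy graph).
    # Edge potentials are arbitrary positive tables; numbers are made up.
    rng = np.random.default_rng(0)
    edges = [(0, 1), (1, 2), (2, 0)]
    pots = {e: rng.uniform(0.5, 2.0, size=(2, 2)) for e in edges}

    # Exact marginals by brute-force enumeration of all 2^3 states.
    p = np.zeros((2, 2, 2))
    for x in itertools.product([0, 1], repeat=3):
        val = 1.0
        for (i, j) in edges:
            val *= pots[(i, j)][x[i], x[j]]
        p[x] = val
    p /= p.sum()
    exact = [p.sum(axis=tuple(a for a in range(3) if a != v)) for v in range(3)]

    # Loopy sum-product BP: directed messages m[(i, j)] from variable i to j.
    msgs = {(i, j): np.ones(2) for (i, j) in edges + [(j, i) for (i, j) in edges]}
    for _ in range(100):  # synchronous updates; may oscillate on harder problems
        new = {}
        for (i, j) in msgs:
            # potential table oriented as [x_i, x_j]
            pot = pots[(i, j)] if (i, j) in pots else pots[(j, i)].T
            incoming = np.ones(2)  # product of messages into i, excluding j's
            for (k, i2) in msgs:
                if i2 == i and k != j:
                    incoming *= msgs[(k, i)]
            new[(i, j)] = pot.T @ incoming  # sum over x_i
            new[(i, j)] /= new[(i, j)].sum()
        msgs = new

    # Beliefs: normalized product of all incoming messages at each variable.
    for v in range(3):
        belief = np.ones(2)
        for (k, v2) in msgs:
            if v2 == v:
                belief *= msgs[(k, v)]
        belief /= belief.sum()
        print(v, "BP:", np.round(belief, 4), "exact:", np.round(exact[v], 4))

On a loop this small, BP usually lands close to the exact marginals; on denser loopy graphs the same updates can oscillate or settle on noticeably wrong beliefs.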
> Other notable candidate methods that are local, probabilistic and iterative are Expectation Propagation (EP) and Barfoot's algorithm. EP is generally not node-wise parallel and simplifies to GBP in the special case when it is node-wise parallel, while Barfoot's algorithm involves extra communication edges and is yet to be applied to real problems.
It's worth reading Barfoot's paper [1] -- especially the latter sections, which discuss related work, conclusions & future work and give another perspective:
> Belief propagation is known to converge to the correct global solution to the inference problem when the graphical model is a tree; however, loopy belief propagation may or may not converge and various sufficient conditions have been established for convergence. Weiss and Freeman (2000) provide the starting point by showing that GaBP will converge (mean only not covariance) under the condition that A is diagonally dominant. To our knowledge, no condition has been derived for GaBP to ensure the covariance quantities will converge to the global solution in the case of a loopy graph. Some of the analyses of GaBP look to view the method as an iterative solver for a linear system of equations (e.g., Jacobi’s algorithm), a connection first pointed out by Weiss and Freeman (2000). However, as there have been approximations made, GaBP cannot be shown in general to be solving the primary (and secondary) linear systems discussed in this paper. Our paper in a sense follows this connection in the opposite direction by starting with the known result of Takahashi et al. (1973) and then showing what it would take to implement this as a local message-passing scheme. While we can guarantee convergence in both the mean and covariance quantities, this comes at the cost of additional communication links, memory, and computation compared to GaBP. [...] It would also be interesting to look more deeply at the connection to Gaussian belief propagation; the methods perform the same on a tree graph, but what approximations are needed in the loopy case for our message-passing scheme to be similar/equivalent to GaBP (e.g., perhaps we simply need to delete the extra communication channels our method requires)?
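As a toy illustration of the Jacobi connection mentioned in that quote (my own sketch, not Barfoot's code): computing the posterior mean of a Gaussian with precision matrix A and information vector b is the linear solve A mu = b, and for a strictly diagonally dominant A -- the same sufficient condition Weiss and Freeman established for GaBP's mean -- plain Jacobi iteration converges to mu using only local, per-entry updates:

    import numpy as np

    # Gaussian inference as a linear solve: for precision matrix A and
    # information vector b, the posterior mean is mu = A^{-1} b.
    # A is strictly diagonally dominant, the sufficient condition cited
    # above for GaBP's mean to converge; it also guarantees Jacobi converges.
    A = np.array([[4.0, 1.0, 0.5],
                  [1.0, 3.0, 1.0],
                  [0.5, 1.0, 5.0]])
    b = np.array([1.0, 2.0, 3.0])

    D = np.diag(A)       # diagonal entries
    R = A - np.diag(D)   # off-diagonal part

    x = np.zeros_like(b)
    for it in range(200):
        x_new = (b - R @ x) / D  # Jacobi: each entry uses only its neighbors' values
        if np.max(np.abs(x_new - x)) < 1e-12:
            break
        x = x_new

    print("Jacobi mean: ", x)
    print("Direct solve:", np.linalg.solve(A, b))  # should agree to ~1e-12

Like GaBP, each Jacobi update is local (entry i reads only entries coupled to it through A), which is what makes the iterative-solver view of message passing natural.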
https://youtu.be/oUaOucZRlmE?t=1175 (Media for Thinking the Unthinkable)