For a practical application of using graphical models to "solve" a Bayesian problem, I recommend Frank Dellaert's whitepaper, which covers Simultaneous Localization and Mapping (SLAM, a robotics algorithm) using similar techniques: https://research.cc.gatech.edu/borg/sites/edu.borg/files/dow...
I'm not an expert in Probabilistic Graphical Models but I do know factor graphs well. I've also read quite a bit of the recent SLAM and VO stuff that came out of Dellaert's group.
Here's the thing: Dellaert loves to write about SLAM from a viewpoint of probabilities and densities. I think he likes to see himself as a mathematician. However, at the end of the day he's working with Gaussian noise assumptions, just like everyone else, and the solution is obtained by some form of Least Squares, again, like everyone else.
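A quick sketch of why the Gaussian assumption collapses to least squares (standard textbook notation, not anything specific to Dellaert's papers): with measurements z_i = h_i(x) + v_i and noise v_i ~ N(0, Σ_i), the MAP estimate is

```latex
\hat{x} = \arg\max_x \prod_i \exp\left(-\tfrac{1}{2}\,\|h_i(x) - z_i\|^2_{\Sigma_i}\right)
        = \arg\min_x \sum_i \tfrac{1}{2}\,(h_i(x) - z_i)^\top \Sigma_i^{-1}\,(h_i(x) - z_i),
```

i.e., exactly a weighted nonlinear Least Squares problem; the densities contribute nothing beyond the quadratic cost.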
I would really like it if he just cut through all the probabilistic thicket and went straight to the core of how his methods improve the SOTA purely in terms of how the friggin' Least Squares problem is set up and solved (e.g., like here [1]). But I guess that would probably take away most of the "magic".
The factor graph representation lends itself to simple, intuitive reasoning about the relationships between poses, measurements, loop closures, etc. Dellaert then uses that intuitive structure to show how to set up the problem in a way that can be efficiently computed using linear algebra.
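As a rough illustration of that "factors become rows of a linear system" step, here's a toy 1D pose chain done as plain least squares. The poses, measurements, and numbers are all made up for illustration; this is a sketch of the mechanism, not GTSAM or Dellaert's actual API.

```python
import numpy as np

# Toy 1D pose graph: three poses x0, x1, x2 with a prior factor on x0
# and two odometry factors. Each factor contributes one row to the
# Jacobian A and one entry to the residual vector b; under Gaussian
# noise the MAP estimate is the least-squares solution.
prior = 0.0          # measured position of pose 0
odom = [1.0, 1.0]    # measured displacements x1 - x0 and x2 - x1

A = np.array([
    [1.0,  0.0, 0.0],   # prior factor:    x0      = 0
    [-1.0, 1.0, 0.0],   # odometry factor: x1 - x0 = 1
    [0.0, -1.0, 1.0],   # odometry factor: x2 - x1 = 1
])
b = np.array([prior, odom[0], odom[1]])

# Solve the normal equations (a Cholesky/QR of A^T A is what most
# SLAM back-ends do, exploiting the factor graph's sparsity).
x = np.linalg.solve(A.T @ A, A.T @ b)
print(x)  # -> [0. 1. 2.]
```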
To use another example: Let's say you understand and prefer to use QR factorization for some linear algebra use case. That doesn't mean that Cholesky or SVD are somehow "worthless" tools -- they just provide other useful insights or applications.
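To make that analogy concrete, all three factorizations recover the same least-squares solution; a quick numpy sketch on random data:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 3))   # overdetermined system
b = rng.normal(size=10)

# QR: A = QR, then solve the triangular system R x = Q^T b.
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)

# Cholesky of the normal equations A^T A x = A^T b.
L = np.linalg.cholesky(A.T @ A)
y = np.linalg.solve(L, A.T @ b)
x_chol = np.linalg.solve(L.T, y)

# SVD-based solve (what np.linalg.lstsq uses internally).
x_svd, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_qr, x_chol), np.allclose(x_qr, x_svd))  # True True
```

Same answer, different numerical trade-offs (stability, speed, rank handling), which is the point: none of the tools is "worthless".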
I personally find Dellaert's decomposition of the problem useful. It doesn't hurt that he was one of the early SLAM "inventors" (along with the likes of Fox, Burgard, et al). And since the topic of the post is (literally) probabilistic graphical models, it seems extremely relevant to this thread.
(I read through the article you linked. I do not agree that it is an "easier" or "simpler" formulation.)
> Dellaert then uses that intuitive structure to show how to set up the problem in a way that can be efficiently computed using linear algebra.
But Dellaert didn't invent Factor Graphs, nor how to map a Factor Graph to a Least Squares problem. Others did that well before him.
My criticism is directed at his style of writing about SLAM in terms of probabilities and densities, which, IMO, is moot since at the end of the day he's just using Gaussians like everyone else, and hence everything boils down to a Least Squares problem again. His probabilistic viewpoint is just a detour, a distraction, at that point.
If he contributed a different kind of solver that could deal with arbitrary distributions, then his probabilistic viewpoint would make a whole lot more sense, but as long as it's just Gaussians I just don't see the point.
You're wrong. What you're suggesting is that there is no merit in going from an O(n^2) sorting algorithm to O(n log n), and then to an O(n^1.2534) one.
SLAM at a multi-city scale can use mathematical trickery to improve runtime. Sure, a new solver would help, but we don't need one as long as there is another trick up the mathematician's sleeve. Oh, and it's harder to come up with a new solver/technique.
You could not be more correct in your criticism. It feels as if there is a group of researchers who are just too enamored with the beauty of "Bayesian stats and graphical models", to the point that finding better solutions to problems has become a secondary task for them.
Plus a segment of academia loves this type of research.
> However at the end of the day he's working with Gaussian noise assumptions, just like everyone else, and the solution is obtained by some form of Least Squares, again, like everyone else.
Well, you need some kind of model to optimize, and finding the perfect one is a vastly more difficult problem that probably begins to incorporate ideas from information theory (which starts to become rather a different field).
I don't actually see the issue with researchers working on algorithms to solve the Gaussian noise versions of these problems — they can usually be extended to some other model relatively easily (for instance, rotation synchronization via Riemannian manifolds can incorporate a Pseudo-Huber loss function without much difficulty).
For a novice, it's hard to assess how important PGMs are currently.
Weighing an investment of my time to learn Deep Learning (CNNs, RNNs) vs. Random Forests vs. PGMs vs. Reinforcement Learning well enough to apply the chosen approach, it seems that PGMs are not high on the list. Is that correct?
Are there Kaggle competitions, in which PGMs have been the best approach?
What are the real-world problem areas in which PGMs currently excel compared to other methods?
I'd say for people with an interest in ML and DL, the main insight from PGMs is intuition about probability theory, especially conditional probabilities and the concept of (un)conditional independence. The state of the art in areas where PGMs would be relevant is currently mostly brute-force learning via stochastic gradient descent (with momentum) in deep directed models/function approximators, but function approximation likely does not solve all problems, as adversarial examples show. There are certainly very important applications of PGMs in all kinds of domains as listed in the link above (perhaps I would have added FastSLAM [1]), and there are also variational autoencoders/Helmholtz machines and (Deep) Boltzmann Machines, which are especially interesting from a research perspective.
Random Forests are not really in the same class of depth (no pun intended) as the other models you've mentioned; if you understand decision trees, random forests are just a way of combining independently trained decision trees that is less prone to overfitting.
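A toy sketch of that "combine independently trained trees" idea, bagging one-split decision stumps on made-up 1D data. Nothing here is a real random-forest library; it just shows the bootstrap-and-vote mechanism:

```python
import random

def fit_stump(data):
    """Pick the threshold and direction that best separate binary labels."""
    best = None
    for t in sorted(x for x, _ in data):
        for sign in (1, -1):
            acc = sum((1 if sign * (x - t) >= 0 else 0) == y for x, y in data)
            if best is None or acc > best[0]:
                best = (acc, t, sign)
    _, t, sign = best
    return lambda x: 1 if sign * (x - t) >= 0 else 0

def bagged_forest(data, n_trees=25, seed=0):
    """Random forest in miniature: majority vote of stumps, each
    trained on a bootstrap resample of the data."""
    rng = random.Random(seed)
    stumps = [fit_stump([rng.choice(data) for _ in data])
              for _ in range(n_trees)]
    return lambda x: int(sum(s(x) for s in stumps) > n_trees / 2)

# Noisy 1D data: true label is 1 when x > 0.5, with one flipped point.
data = [(x / 10, int(x / 10 > 0.5)) for x in range(11)]
data[3] = (0.3, 1)  # mislabeled outlier
model = bagged_forest(data)
print(model(0.1), model(0.9))
```

The vote across bootstrap resamples is what dampens the influence of the mislabeled point; a single stump trained on all the data can be dragged around by it.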
I'd probably rank these areas in order of importance as follows: deep learning, PGMs, reinforcement learning. Deep learning as a framework is pretty general. PGMs, as I have seen them, don't really have any one killer domain -- maybe robotics and areas where you want to explicitly model causality? Applications for reinforcement learning seem more niche, but maybe that's because they haven't been explored to the extent that DL/CNNs/RNNs/PGMs have been.
Tree-based algorithms have their place. I dunno what your definition of depth is, but it all depends on your data.
I'm doing a thesis on tree-based algorithms, and they work great for medical data.
Granted, I have little exposure to NNs, but you can't do that with an NN when clinical trial data is small as hell.
It all depends on the type of data, your resources, and your criteria.
NNs are really hyped and people tend to overlook other algorithms, but NNs are not a silver bullet. I don't get how you ranked those by importance either; it could be bias.
Oh, there's no question that tree-based methods are effective -- random forests/gradient-boosted trees routinely win Kaggle competitions. But I was more referring to how random forests are learnable in a day, whereas deep learning would probably take at least a few weeks to learn properly.
In analysis of scientific data, causality is important. Typically in science you aren't really making a model to predict things (although that can be a good way to test the model) but rather to understand what is going on, and PGMs are great for that.
We have a professor at our college who teaches this subject with great passion; I took two related courses with him. The problem is that he uses his own mental image and notation in lectures and examinations. Even the internet seems to be highly affected by this problem. I think many of these concepts are not hard, but it takes time to get a feeling for them. See the "monads are not burritos" essay, which describes the problem of using analogies to explain monads. This page seems great in the sense that it uses a minimum of the confusing analogies that are common in most resources on Bayes' rule.
"The set of all the outcomes of a random
experiment. Here, each outcome ω can be thought of as a complete
description of the state of the real world
at the end of the experiment."
The "complete description" part
is not needed and even if included
has meaning that is not clear.
Instead, each possible experiment
is one trial
and one element in the
set of all trials Ω.
That's it: Ω is just
a set of trials, and each trial
is just an element of that set.
There is nothing there about the
outcomes of the trials.
Next the text has "The sample space is Ω = {1, 2, 3, 4, 5, 6}." That won't work: too soon one finds that one needs an uncountably infinite sample space. Indeed, an early exercise is that the set of all events cannot be countably infinite. Indeed, a big question was: can there be a sample space big enough to discuss random variables as desired? The answer is yes and is given in the famous Kolmogorov extension theorem.
(2) Second Problem -- Notation
An event A is an element of the set of all events F and a subset of the sample space Ω. Then a probability measure P, or just a probability, is a function P: F --> [0,1], that is, into the closed interval [0,1]. So, we can write the probability of event A as P(A). Fine.

Or, given events A and B, we can consider the event C = A U B and, thus, write P(C) = P(A U B). Fine.
But the notes have P(1,2,3,4), and that is undefined in the notes and, really, in the rest of probability. Why? Because 1,2,3,4 is not an event.

For the set of real numbers R, consider a real random variable X: Ω --> R (that is, measurable with respect to the sigma algebra F and a specified sigma algebra in R, usually the Borel sets, the smallest sigma algebra containing the open sets, or the Lebesgue measurable sets).
Then an event would be "X in {1,2,3,4} subset of R", or the set of all ω in Ω such that X(ω) is in {1,2,3,4}, that is, {ω | X(ω) in {1,2,3,4}}, or the inverse image of {1,2,3,4} under X (one could write this all more clearly with all of D. Knuth's TeX), in which case we could write P(X in {1,2,3,4}).

When the elementary notation is bad, it's a bit tough to take the more advanced parts seriously.
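For the die example, the corrected notation is only a line or two (assuming a fair die for the final number):

```latex
% Fair die: sample space and random variable
\Omega = \{\omega_1,\ldots,\omega_6\}, \qquad X(\omega_i) = i.
% The event is the inverse image of \{1,2,3,4\} under X:
\{X \in \{1,2,3,4\}\} = X^{-1}\!\left(\{1,2,3,4\}\right)
  = \{\,\omega \in \Omega : X(\omega) \in \{1,2,3,4\}\,\},
% so the notes' "P(1,2,3,4)" should read
P\!\left(X \in \{1,2,3,4\}\right) = P\!\left(X^{-1}(\{1,2,3,4\})\right) = \tfrac{4}{6}.
```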
A polished, elegant treatment of these basics appears early in Jacques Neveu, Mathematical Foundations of the Calculus of Probability, Holden-Day, San Francisco, 1965. Neveu was a student of M. Loeve at Berkeley, so one can also see Loeve, Probability Theory, I and II, Springer-Verlag. A fellow student of Neveu at Berkeley under Loeve was L. Breiman, so one can also see Breiman, Probability, SIAM.
These notes are from Stanford. But there have long been people at Stanford, e.g., K. Chung, who have these basics in very clear, solid, and polished terms, e.g.:

Kai Lai Chung, A Course in Probability Theory, Second Edition, ISBN 0-12-174650-X, Academic Press, New York, 1974.

K. L. Chung and R. J. Williams, Introduction to Stochastic Integration, Second Edition, ISBN 0-8176-3386-3, Birkhäuser, Boston, 1990.

Kai Lai Chung, Lectures from Markov Processes to Brownian Motion, ISBN 0-387-90618-5, Springer-Verlag, New York, 1982.
Hmm, discrete only? So, one can't take averages, state the law of large numbers or the central limit theorem, discuss convergence or completeness, do linear regression, etc.
Can't multiply a random variable by a real number. Hmm ....
One will be in trouble when considering a sequence of, say, independent random variables, as in coin flipping.
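For example, the law of large numbers for a sequence of independent fair coin flips, which already needs averages of random variables to even state -- a throwaway sketch:

```python
import random

# Law of large numbers demo: the sample mean of independent fair coin
# flips converges to 0.5. Stating this requires treating the flips as
# real-valued random variables and averaging them, which a strictly
# "discrete only" treatment never quite licenses.
rng = random.Random(42)
flips = [rng.randint(0, 1) for _ in range(100_000)]
mean = sum(flips) / len(flips)
print(mean)  # close to 0.5
```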
Right. And if one doesn't want to get involved in what really are the measure theory foundations of probability, then fine. Nearly all of statistics has been done this way.
But if one does try to give the measure theory foundations, as the OP did, then at least don't make a mess of it.
If one doesn't want to get the measure theory right, then, sure, just leave out the foundations and start with events and random variables. The measure theory foundations are so powerful, so robust, so general that in practice it's tough to get into trouble. One place to get into trouble: in stochastic processes, take the union of uncountably infinitely many points, call that an event, and ask for its probability. Okay, then: don't do that.