For a practical application of using graphical models to "solve" a Bayesian problem, I recommend Frank Dellaert's whitepaper, which covers Simultaneous Localization and Mapping (SLAM, a robotics algorithm) using similar techniques: https://research.cc.gatech.edu/borg/sites/edu.borg/files/dow...
I'm not an expert in Probabilistic Graphical Models but I do know factor graphs well. I've also read quite a bit of the recent SLAM and VO stuff that came out of Dellaert's group.
Here's the thing: Dellaert loves to write about SLAM from a viewpoint of probabilities and densities. I think he likes to see himself as a mathematician. However, at the end of the day he's working with Gaussian noise assumptions, just like everyone else, and the solution is obtained by some form of Least Squares, again, like everyone else.
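A quick sketch of why the Gaussian assumption collapses to least squares (standard textbook notation, not anything specific to Dellaert's papers): with measurements z_i = h_i(x) + v_i and noise v_i ~ N(0, Σ_i), the MAP estimate is

```latex
\hat{x} = \arg\max_x \prod_i \exp\left(-\tfrac{1}{2}\,\|h_i(x) - z_i\|^2_{\Sigma_i}\right)
        = \arg\min_x \sum_i \tfrac{1}{2}\,(h_i(x) - z_i)^\top \Sigma_i^{-1}\,(h_i(x) - z_i),
```

i.e., exactly a weighted nonlinear Least Squares problem; the densities contribute nothing beyond the quadratic cost.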
I would really like it if he just cut through all the probabilistic thicket and went straight to the core of how his methods improve the SOTA purely in terms of how the friggin' Least Squares problem is set up and solved (e.g., like here [1]). But I guess that would probably take away most of the "magic".
The factor graph representation lends itself to simple, intuitive reasoning about the relationships between poses, measurements, loop closures, etc. Dellaert then uses that intuitive structure to show how to set up the problem in a way that can be efficiently computed using linear algebra.
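As a rough illustration of that "factors become rows of a linear system" step, here's a toy 1D pose chain done as plain least squares. The poses, measurements, and numbers are all made up for illustration; this is a sketch of the mechanism, not GTSAM or Dellaert's actual API.

```python
import numpy as np

# Toy 1D pose graph: three poses x0, x1, x2 with a prior factor on x0
# and two odometry factors. Each factor contributes one row to the
# Jacobian A and one entry to the residual vector b; under Gaussian
# noise the MAP estimate is the least-squares solution.
prior = 0.0          # measured position of pose 0
odom = [1.0, 1.0]    # measured displacements x1 - x0 and x2 - x1

A = np.array([
    [1.0,  0.0, 0.0],   # prior factor:    x0      = 0
    [-1.0, 1.0, 0.0],   # odometry factor: x1 - x0 = 1
    [0.0, -1.0, 1.0],   # odometry factor: x2 - x1 = 1
])
b = np.array([prior, odom[0], odom[1]])

# Solve the normal equations (a Cholesky/QR of A^T A is what most
# SLAM back-ends do, exploiting the factor graph's sparsity).
x = np.linalg.solve(A.T @ A, A.T @ b)
print(x)  # -> [0. 1. 2.]
```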
To use another example: Let's say you understand and prefer to use QR factorization for some linear algebra use case. That doesn't mean that Cholesky or SVD are somehow "worthless" tools -- they just provide other useful insights or applications.
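To make that analogy concrete, all three factorizations recover the same least-squares solution; a quick numpy sketch on random data:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 3))   # overdetermined system
b = rng.normal(size=10)

# QR: A = QR, then solve the triangular system R x = Q^T b.
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)

# Cholesky of the normal equations A^T A x = A^T b.
L = np.linalg.cholesky(A.T @ A)
y = np.linalg.solve(L, A.T @ b)
x_chol = np.linalg.solve(L.T, y)

# SVD-based solve (what np.linalg.lstsq uses internally).
x_svd, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_qr, x_chol), np.allclose(x_qr, x_svd))  # True True
```

Same answer, different numerical trade-offs (stability, speed, rank handling), which is the point: none of the tools is "worthless".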
I personally find Dellaert's decomposition of the problem useful. It doesn't hurt that he was one of the early SLAM "inventors" (along with the likes of Fox, Burgard, et al). And since the topic of the post is (literally) probabilistic graphical models, it seems extremely relevant to this thread.
(I read through the article you linked. I do not agree that it is an "easier" or "simpler" formulation.)
> Dellaert then uses that intuitive structure to show how to set up the problem in a way that can be efficiently computed using linear algebra.
But Dellaert didn't invent Factor Graphs, nor how to map a Factor Graph to a Least Squares problem. Others did that well before him.
My criticism is directed at his style of writing about SLAM in terms of probabilities and densities, which, IMO, is moot since at the end of the day he's just using Gaussians like everyone else, and hence everything boils down to a Least Squares problem again. His probabilistic viewpoint is just a detour, a distraction, at that point.
If he contributed a different kind of solver that could deal with arbitrary distributions, then his probabilistic viewpoint would make a whole lot more sense, but as long as it's just Gaussians I just don't see the point.
You're wrong. What you're suggesting is that there is no merit in going from an O(n^2) sorting algorithm to O(n log n), and then to an O(n^1.2534) one.
SLAM at a multi-city scale can use mathematical trickery to improve runtime. Sure, a new solver would help, but we don't need one as long as there is another trick up the mathematician's sleeve. Oh, and it's harder to come up with a new solver/technique.
You could not be more correct in your criticism. It feels as if there is a group of researchers who are just too enamored with the beauty of "Bayesian stats and graphical models", to the point that finding better solutions to problems has become a secondary task for them.
Plus a segment of academia loves this type of research.
> However at the end of the day he's working with Gaussian noise assumptions, just like everyone else, and the solution is obtained by some form of Least Squares, again, like everyone else.
Well, you need some kind of model to optimize, and finding the perfect one is a vastly more difficult problem that probably begins to incorporate ideas from information theory (which starts to become rather a different field).
I don't actually see the issue with researchers working on algorithms to solve the Gaussian noise versions of these problems — they can usually be extended to some other model relatively easily (for instance, rotation synchronization via Riemannian manifolds can incorporate a Pseudo-Huber loss function without much difficulty).
For a novice, it's hard to assess how important PGMs are currently.
Weighing an investment of my time to learn Deep Learning (CNNs, RNNs) vs. Random Forests vs. PGMs vs. Reinforcement Learning well enough to apply the chosen approach, it seems that PGMs are not high on the list. Is that correct?
Are there Kaggle competitions, in which PGMs have been the best approach?
What are the real-world problem areas in which PGMs currently excel compared to other methods?
I'd say for people with an interest in ML and DL, the main insight from PGMs is intuition about probability theory, especially conditional probabilities and the concept of (un)conditional independence. The state of the art in areas where PGMs would be relevant is currently mostly brute-force learning via stochastic gradient descent (with momentum) in deep directed models/function approximators, but function approximation likely does not solve all problems, as adversarial examples show. There are certainly very important applications of PGMs in all kinds of domains as listed in the link above (perhaps I would have added FastSLAM [1]), and there are also variational autoencoders/Helmholtz machines and (Deep) Boltzmann Machines, which are especially interesting from a research perspective.
Random Forests are not really in the same class of depth (no pun intended) as the other models you've mentioned; if you understand decision trees, random forests are just a way of combining independently trained decision trees that is less prone to overfitting.
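A toy sketch of that "combine independently trained trees" idea, bagging one-split decision stumps on made-up 1D data. Nothing here is a real random-forest library; it just shows the bootstrap-and-vote mechanism:

```python
import random

def fit_stump(data):
    """Pick the threshold and direction that best separate binary labels."""
    best = None
    for t in sorted(x for x, _ in data):
        for sign in (1, -1):
            acc = sum((1 if sign * (x - t) >= 0 else 0) == y for x, y in data)
            if best is None or acc > best[0]:
                best = (acc, t, sign)
    _, t, sign = best
    return lambda x: 1 if sign * (x - t) >= 0 else 0

def bagged_forest(data, n_trees=25, seed=0):
    """Random forest in miniature: majority vote of stumps, each
    trained on a bootstrap resample of the data."""
    rng = random.Random(seed)
    stumps = [fit_stump([rng.choice(data) for _ in data])
              for _ in range(n_trees)]
    return lambda x: int(sum(s(x) for s in stumps) > n_trees / 2)

# Noisy 1D data: true label is 1 when x > 0.5, with one flipped point.
data = [(x / 10, int(x / 10 > 0.5)) for x in range(11)]
data[3] = (0.3, 1)  # mislabeled outlier
model = bagged_forest(data)
print(model(0.1), model(0.9))
```

The vote across bootstrap resamples is what dampens the influence of the mislabeled point; a single stump trained on all the data can be dragged around by it.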
I'd probably rank these areas in order of importance as follows: deep learning, PGMs, reinforcement learning. Deep learning as a framework is pretty general. PGMs, as I have seen them, don't really have any one killer domain -- maybe robotics and areas where you want to explicitly model causality? Applications for reinforcement learning seem more niche, but maybe that's because they haven't been explored to the extent that DL/CNNs/RNNs/PGMs have been.
Tree-based algorithms have their place. I dunno what your definition of depth is, but it all depends on your data.
I'm doing a thesis on tree-based algorithms, and they work great for medical data.
Granted, I have little exposure to NNs, but you can't do that with an NN when clinical trial data is small as hell.
It all depends on the type of data, your resources, and your criteria.
NNs are really hyped and people tend to overlook other algorithms, but NNs are not a silver bullet. I don't get how you ranked those by importance either; it could be bias.
Oh, there's no question that tree-based methods are effective -- random forests/gradient-boosted trees routinely win Kaggle competitions. But I was more referring to how random forests are learnable in a day, whereas deep learning would probably take at least a few weeks to learn properly.
In analysis of scientific data, causality is important. Typically in science you aren't really making a model to predict things (although that can be a good way to test the model) but rather to understand what is going on, and PGMs are great for that.
We have a professor at our college who teaches this subject with great passion; I took two related courses with him. The problem is that he uses his own mental image and notation in lectures and examinations. Even the internet seems to be highly affected by this problem. I think many of these concepts are not hard, but it takes time to get a feeling for them. See the "monads are not burritos" essay, which describes the problem of using analogies to explain monads. This page seems great in the sense that it uses a minimum of the confusing analogies that are common in most resources on Bayes' rule.
"The set of all the outcomes of a random
experiment. Here, each outcome ω can be thought of as a complete
description of the state of the real world
at the end of the experiment."
The "complete description" part
is not needed and even if included
has meaning that is not clear.
Instead, each possible experiment
is one trial
and one element in the
set of all trials Ω.
That's it: Ω is just
a set of trials, and each trial
is just an element of that set.
There is nothing there about the
outcomes of the trials.
Next the text has "The sample space is Ω = {1, 2, 3, 4, 5, 6}." That won't work: too soon one finds that one needs an uncountably infinite sample space. Indeed, an early exercise is that the set of all events cannot be countably infinite. Indeed, a big question was: can there be a sample space big enough to discuss random variables as desired? The answer is yes and is given in the famous Kolmogorov extension theorem.
(2) Second Problem -- Notation
An event A is an element of the set of all events F and a subset of the sample space Ω. Then a probability measure P, or just a probability, is a function P: F --> [0,1], that is, into the closed interval [0,1]. So, we can write the probability of event A as P(A). Fine.

Or, given events A and B, we can consider the event C = A U B and, thus, write P(C) = P(A U B). Fine.
But the notes have P(1,2,3,4), and that is undefined in the notes and, really, in the rest of probability. Why? Because 1,2,3,4 is not an event.

For the set of real numbers R, consider a real random variable X: Ω --> R (that is, measurable with respect to the sigma algebra F and a specified sigma algebra in R, usually the Borel sets, the smallest sigma algebra containing the open sets, or the Lebesgue measurable sets).
Then an event would be "X in {1,2,3,4} subset of R", or the set of all ω in Ω such that X(ω) is in {1,2,3,4}, that is, {ω | X(ω) in {1,2,3,4}}, or the inverse image of {1,2,3,4} under X (one could write this all more clearly with all of D. Knuth's TeX), in which case we could write P(X in {1,2,3,4}).

When the elementary notation is bad, it's a bit tough to take the more advanced parts seriously.
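For the die example, the corrected notation is only a line or two (assuming a fair die for the final number):

```latex
% Fair die: sample space and random variable
\Omega = \{\omega_1,\ldots,\omega_6\}, \qquad X(\omega_i) = i.
% The event is the inverse image of \{1,2,3,4\} under X:
\{X \in \{1,2,3,4\}\} = X^{-1}\!\left(\{1,2,3,4\}\right)
  = \{\,\omega \in \Omega : X(\omega) \in \{1,2,3,4\}\,\},
% so the notes' "P(1,2,3,4)" should read
P\!\left(X \in \{1,2,3,4\}\right) = P\!\left(X^{-1}(\{1,2,3,4\})\right) = \tfrac{4}{6}.
```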
A polished, elegant treatment of these basics appears early in Jacques Neveu, Mathematical Foundations of the Calculus of Probability, Holden-Day, San Francisco, 1965. Neveu was a student of M. Loeve at Berkeley, so one can also see Loeve, Probability Theory, I and II, Springer-Verlag. A fellow student of Neveu at Berkeley under Loeve was L. Breiman, so one can also see Breiman, Probability, SIAM.
These notes are from Stanford. But there have long been people at Stanford, e.g., K. Chung, who have these basics in very clear, solid, and polished terms, e.g.:

Kai Lai Chung, A Course in Probability Theory, Second Edition, ISBN 0-12-174650-X, Academic Press, New York, 1974.

K. L. Chung and R. J. Williams, Introduction to Stochastic Integration, Second Edition, ISBN 0-8176-3386-3, Birkhäuser, Boston, 1990.

Kai Lai Chung, Lectures from Markov Processes to Brownian Motion, ISBN 0-387-90618-5, Springer-Verlag, New York, 1982.
Hmm, discrete only? So, one can't take averages, state the law of large numbers or the central limit theorem, discuss convergence or completeness, do linear regression, etc.
Can't multiply a random variable by a real number. Hmm ....
One will be in trouble when considering a sequence of, say, independent random variables, as in coin flipping.
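For example, the law of large numbers for a sequence of independent fair coin flips, which already needs averages of random variables to even state -- a throwaway sketch:

```python
import random

# Law of large numbers demo: the sample mean of independent fair coin
# flips converges to 0.5. Stating this requires treating the flips as
# real-valued random variables and averaging them, which a strictly
# "discrete only" treatment never quite licenses.
rng = random.Random(42)
flips = [rng.randint(0, 1) for _ in range(100_000)]
mean = sum(flips) / len(flips)
print(mean)  # close to 0.5
```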
Right. And if one doesn't want to get involved in what really are the measure theory foundations of probability, then fine. Nearly all of statistics has been done this way.
But if one does try to give the measure theory foundations, as the OP did, then at least don't make a mess of it.
If one doesn't want to get the measure theory right, then, sure, just leave out the foundations and start with events and random variables. The measure theory foundations are so powerful, so robust, so general that in practice it's tough to get into trouble. One place to get into trouble: in stochastic processes, take the union of uncountably infinitely many points, call that an event, and ask for its probability. Okay, then: don't do that.