Introduction to Probability for Data Science (probability4datascience.com)
235 points by mariuz on Jan 24, 2022 | 15 comments



> Some people ask how much money I can make from this book. The answer is ZERO. There is not a single penny that goes to my pocket. Why do I do that? Textbooks today are just ridiculously expensive. [...] Education should be accessible to as many people as possible, especially to those underprivileged families.

B r a v o ! A free, quality education is the foundation for social progress and economic prosperity.


This looks like a fantastic resource. Thanks for sharing!

I really enjoy the Bayesian side of ML, but it's definitely not the most accessible. Erik Bernhardsson cites latent Dirichlet allocation as a big inspiration behind the music recommendation system he originally designed for Spotify, which is apparently still in use today[1]. I still struggle with grokking latent factor models, but it can be so rewarding to build your own and watch it work (even with only moderate success!). A minimal LDA sketch follows at the end of this comment.

Kevin Murphy has been working on a new edition of MLaPP that is now two volumes, with the last volume on advanced topics slated for release next year. However, both the old edition and the drafts for the new edition are available on his website here[2].

The University of Tübingen has a course on probabilistic ml which probably has one of the most thorough walkthroughs of a latent factor model I've found on the Internet. You can find the full playlist of lectures for free here on YouTube[3].

In terms of other resources for deep study on fascinating topics which require some command over stats and probability:

- David Silver's lectures on reinforcement learning are fantastic [4]

- The Machine Learning Summer School lectures are often quite good, with exceptionally talented researchers/practitioners invited to give multi-hour lectures on their domains of expertise, the intended audience being graduate students with intermediate backgrounds in general ML topics. [5]

1: https://www.slideshare.net/erikbern/music-recommendations-ml...
2: https://probml.github.io/pml-book/
3: https://www.youtube.com/playlist?list=PL05umP7R6ij1tHaOFY96m...
4: https://www.youtube.com/playlist?list=PLqYmG7hTraZDM-OYHWgPe...
5: http://mlss.cc
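To make the latent factor idea above concrete, here is a minimal sketch using scikit-learn's LatentDirichletAllocation on a made-up toy corpus; the documents, the two-topic choice, and the seed are illustrative assumptions, not anything from Bernhardsson's actual system:

    # Minimal LDA sketch: toy corpus, two latent topics.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "guitar drums bass guitar rock",
        "piano violin orchestra symphony",
        "drums bass rock concert",
        "violin piano sonata orchestra",
    ]

    vec = CountVectorizer()
    X = vec.fit_transform(docs)  # document-term count matrix

    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
    print(lda.transform(X))  # per-document topic mixtures

    # Top words per latent topic, from the fitted word distributions.
    vocab = vec.get_feature_names_out()
    for k, topic in enumerate(lda.components_):
        top = topic.argsort()[::-1][:3]
        print(f"topic {k}:", [vocab[i] for i in top])

Each document comes back as a mixture over the latent topics; that mixture is the latent factor representation you would feed into a recommender.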


Does anyone know of a way to send some money to the author? I know he says "free" a lot, but this is so awesome I want to treat him to something.


Really happy to see Dr. Chan do this. We were in the same research group at UCSD, and he helped me a ton when I started learning how to write a research paper and how to optimize MATLAB code for image and video algorithms. Now he's impacting exponentially more people.


I do not know how this book is, but if you are looking to learn probability by yourself, a great resource is Introduction to Probability by Dimitri Bertsekas and John Tsitsiklis.

I highly recommend it.


"A random process is a function indexed by a random key."

Not just wrong, wildly bad nonsense.

Go get some data. Now you have the value of a random variable.

We never get fully clear on just what random means, and in random variable we do not assume some element of not knowing; in particular, "truly random" is nonsense.

Suppose we have a non-empty set I and for each i in I we have a random variable X_i (using TeX notation for a subscript). Then the set I together with the collection of all the X_i is a random process, or stochastic process. We might write (X_i, I) or some such notation.

Commonly the set I is an interval subset of the real line and denotes time. Set I might be half of the real line or all of it or just some interval, e.g., [0,1].

The set I might be just the numbers

{1, 2, 3, 4, 5, 6}

for, say, playing with dice with the usual six sides.

I might be the integers in [1, 52] for considering card games.

But the set I might be all the points on the surface of a sphere for considering, say, the weather, maybe the oceans, etc.

The set I might be all 4-tuples (t, x, y, z) where t is a real number denoting time and the other three are coordinates in ordinary 3-space.

A random variable can also be considered a function with domain a probability space O. So for random variable Y, for each w in O, Y(w) is the value of the random variable Y at sample w. Right, the usual notation has capital Greek omega for O and lower case Greek omega for w.

Then for a particular w and stochastic process X with index set I, all the X_t(w) as t varies is a sample path of the process X. E.g., a plot of the stock index DJI for yesterday is part of such a sample path. So, with stochastic processes, what we observe are sample paths.
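To make the (X_i, I) and sample path picture concrete, here is a minimal numpy sketch; the random walk, the index set size, and the seed are just illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)

    # Index set I = {0, ..., T-1}: column t is the random variable X_t,
    # row w is one sample path X_t(w).
    T, n_paths = 100, 5
    steps = rng.choice([-1, 1], size=(n_paths, T))  # i.i.d. +/-1 increments
    X = np.cumsum(steps, axis=1)                    # a simple random walk

    print(X[0, :10])  # fix w, vary t: a sample path
    print(X[:, 10])   # fix t, vary w: the random variable X_10

Fix the row and vary the column: that's a sample path. Fix the column and vary the row: that's one random variable.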

That's a start on stochastic processes. Going deep into the field gets to be difficult quickly. Just quickly, look for the names Kolmogorov, Dynkin, Doob, Ito, Shiryaev, Skorokhod, Rockafellar, Cinlar, Stroock, Varadhan, McKean, Blumenthal, Getoor, Fleming, Bertsekas, Karatzas, Shreve, Neveu, Tulcea(s).

For some of the flavor of probability theory and stochastic processes, see the article on liftings at

https://en.wikipedia.org/wiki/Lifting_theory

I had the main book on liftings, which I'd gotten for $1 at a used book store (not a big seller), but lost it in a recent move.


Nah, the book in the OP has it right (chapter 10, first boxed definition).

The "random key" is the little-omega of standard probability theory. Pick a little-omega, get a realization of the stochastic process.

For example, if the stochastic process is a function of time, then when you pick a little-omega, you get a realization X(t;omega) or just X(t) following the usual shorthand which suppresses the little-omega ("random key").

Once you have fixed that little-omega, you have defined the entire function for all time, X(t), -\infty < t < +\infty.
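A minimal Python sketch of that, with a seed standing in for the little-omega; the particular process here, a random-amplitude random-phase sinusoid, is just an illustrative assumption:

    import numpy as np

    def realization(key):
        # Fixing the "random key" (here a seed standing in for
        # little-omega) pins down the entire function of t.
        rng = np.random.default_rng(key)
        A = rng.standard_normal()
        phi = rng.uniform(0.0, 2.0 * np.pi)
        return lambda t: A * np.sin(t + phi)  # X(t; omega) with omega fixed

    X = realization(key=42)
    t = np.linspace(0.0, 10.0, 5)
    print(X(t))  # same key, same whole path, every time

Pick the key once and the entire function of t is determined; a different key gives a different whole function.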

In his narrative the author is trying to be careful to distinguish between averages over time (say) and averages over the sample space (little-omega).


I objected to the phrasing, the word choice "random key". Just where the author got that, I don't know. I doubt that phrase is anywhere in the writings of any of the people I listed.

With your clarification and explanation of his phrasing "random key", you show that the author was at least trying to be right and was thinking of the right things, in spite of his, say, unique wording.

In an attempt to be clearer, as you were, about the roles of what we have called w and t, I went on to define sample paths.

In some applied work there is a tendency to jump too soon to averages, e.g., expectations. But for doing much with stochastic processes we should also consider the sample paths: e.g., in stochastic optimal control we have the controller control each sample path individually; we do not (A) average the sample paths, (B) design a controller for that average, and (C) apply that controller to each of the sample paths.

So, net, and in line with some of your post, we should at least have the concept and some notation for sample paths.
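A toy numerical sketch of that point, assuming made-up linear dynamics X_{k+1} = X_k + u_k + w_k with a quadratic cost; the horizon and the two controllers are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(1)
    n_paths, T = 2000, 50
    noise = rng.standard_normal((n_paths, T))

    def average_cost(feedback):
        # Dynamics X_{k+1} = X_k + u_k + w_k; cost is the mean of X_k^2.
        x = np.zeros(n_paths)
        cost = 0.0
        for k in range(T):
            u = -x if feedback else 0.0  # per-path feedback vs. open loop
            x = x + u + noise[:, k]
            cost += np.mean(x ** 2)
        return cost / T

    # The averaged path is identically zero, so a controller designed
    # for that average does nothing (u = 0).
    print("control the average:", average_cost(feedback=False))
    print("control each path  :", average_cost(feedback=True))

Controlling each sample path individually keeps the cost near 1; the controller designed for the averaged path leaves a random walk whose cost grows with the horizon.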


I agree that the term "random key" seems his own invention. It worked for me. I wonder if he also used it in the earlier material on plain random variables?


What would you suggest as an introductory reading?


It appears that there are three camps of stochastic processes:

(1) Old. Analyze bumps on railroad tracks, sound noise, the weather over time, ocean waves, .... So, the expectation is fixed, that is, does not vary. The sample paths are bounded. Might assume that the variance is fixed. If increments are independent and identically distributed, then can assume the distributions are Gaussian (have a Gaussian process).

In that case can do interpolation, extrapolation, smoothing, find power spectra, see what happens to sample paths as you run them through a time-invariant linear filter.
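A minimal scipy sketch of that last step, assuming a made-up 4th-order Butterworth low-pass as the time-invariant linear filter and Welch's method for the power spectrum:

    import numpy as np
    from scipy import signal

    rng = np.random.default_rng(0)
    white = rng.standard_normal(2 ** 16)  # stationary Gaussian input

    # A time-invariant linear filter: 4th-order Butterworth low-pass.
    b, a = signal.butter(4, 0.2)
    colored = signal.lfilter(b, a, white)

    # Welch's method estimates the power spectrum from one sample path.
    f, Pxx = signal.welch(colored, nperseg=1024)
    print(f[np.argmax(Pxx)], Pxx.max())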

(2) Newer. Can get into Markov processes and maybe also martingales. There is a good text by Cinlar, long at Princeton; his chapter on the Poisson process is especially good. Cinlar does not mention measure theory, but other texts do, or really are all about measure theory, that is, probability based on measure theory. Can get the measure theory out of Royden, Real Analysis, and Rudin, Real and Complex Analysis. Should have a measure theory background in probability, and for that try any or all of Breiman, Neveu, Loeve, Chung. Can get into potential theory, then stochastic optimal control. May end up with some expensive books from Springer written by Russians. Otherwise I gave some names above.

(3) Queuing theory. Get to do a lot of work with the Poisson process, with the hope of making applications, e.g., in the sense of operations research.
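A minimal sketch of simulating such a Poisson process via its i.i.d. exponential interarrival times; the rate and horizon are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)

    # Rate lam: interarrival times are i.i.d. Exponential with mean 1/lam.
    lam, horizon = 2.0, 1000.0
    gaps = rng.exponential(scale=1.0 / lam, size=int(3 * lam * horizon))
    arrivals = np.cumsum(gaps)
    arrivals = arrivals[arrivals <= horizon]

    # The count on [0, horizon] is Poisson(lam * horizon).
    print(len(arrivals), "arrivals; expected about", lam * horizon)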


In the second paragraph of Chapter 2 (Probability):

> No matter whether you prefer the frequentist's view or the Bayesian's view...

I don't think the intended audience reading this chapter has this preference at all...

Then the set notation uses square brackets instead of curly braces? I cannot get over this for some reason.


You are misrepresenting that quote. It comes after a fairly generic overview of both views, from which someone could form an opinion. One does not need to know the peculiarities of Bayesian reasoning to hold the opinion "you should incorporate prior knowledge". Also, the set notation does use curly braces.


In my mind you cannot be frequentist or Bayesian after reading just the first paragraph of Chapter 2. But fair enough, I am being a bit too critical here.

Also you are right, the set notation does use curly braces; I am relieved :-). I was confused by A = [-1, 1-1/n] (interval notation) on page 8, which I misread as [-1, 1, 1/n]...


> In my mind you cannot be frequentist or Bayesian after reading just the first paragraph of Chapter 2.

I don't think the author is asking you to, at all. They are pointing out that there are two "camps" and that you will see these terms bandied about (e.g., if you google stuff). But then they claim (rightly, I think, for an intro like this) that it doesn't really matter for the material to (immediately) follow, and you are better off focusing on more fundamental ideas of probability.



